GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813
GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813abtom87 wants to merge 1 commit intoapache:mainfrom
Conversation
14851b5 to
f1be8c1
Compare
There was a problem hiding this comment.
Pull request overview
Fixes potential integer-overflow/invalid-length issues in Gandiva string functions by adding overflow-checked allocation sizing and expanding unit tests to cover extreme and negative lengths.
Changes:
- Add overflow-checked allocation calculations for
quote_utf8andto_hex_binary. - Introduce a shared allocation-length helper for
gdv_fn_lower_utf8,gdv_fn_upper_utf8, andgdv_fn_initcap_utf8, including negative-length rejection. - Add/extend GoogleTest coverage for overflow and negative-length scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| cpp/src/gandiva/precompiled/string_ops.cc | Adds overflow-checked allocation sizing for quote_utf8 and to_hex_binary. |
| cpp/src/gandiva/precompiled/string_ops_test.cc | Adds overflow-focused tests for quote_utf8 and to_hex_binary. |
| cpp/src/gandiva/gdv_string_function_stubs.cc | Adds shared allocation-length helper and overflow checks; adjusts some integer types in translate. |
| cpp/src/gandiva/gdv_function_stubs_test.cc | Adds tests for max-length overflow and negative-length handling in upper/lower/initcap and translate. |
Comments suppressed due to low confidence (1)
cpp/src/gandiva/gdv_string_function_stubs.cc:792
- In the multibyte path, the
from_for <= from_lenloop readsfrom[from_for](to computelen_char_from) before checkingfrom_for == from_len. Whenfrom_forreachesfrom_len(the sentinel iteration used to handle the "not found" case), this does a 1-past-the-end read. Reorder the loop so thefrom_for == from_lencase is handled before indexing intofrom(or change the loop bounds to avoid the sentinel iteration and handle the fallback after the loop).
for (int32_t from_for = 0; from_for <= from_len; from_for += len_char_from) {
// Updating len to char in this position
len_char_from = gdv_fn_utf8_char_length(from[from_for]);
// Making copy to std::string with length for this char position
std::string copy_from_compare(from + from_for, len_char_from);
if (from_for == from_len) {
// If it's not in the FROM list, just add it to the map and the result.
std::string insert_copy_value(in + in_for, len_char_in);
// Insert in map to next loops
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20; | ||
| out_str = to_hex_binary(ctx_ptr, binary_string, bad_in_len, &out_len); | ||
| EXPECT_EQ(out_len, 0); | ||
| EXPECT_EQ(out_str, ""); |
There was a problem hiding this comment.
These assertions compare const char* pointers (EXPECT_EQ(out_str, "")) rather than string contents, which can be non-deterministic. Use EXPECT_STREQ(out_str, "") or compare std::string(out_str, out_len) instead, and consider asserting/resetting the context error since the overflow path sets an error message.
| EXPECT_EQ(out_str, ""); | |
| EXPECT_STREQ(out_str, ""); | |
| EXPECT_TRUE(ctx.has_error()); | |
| ctx.Reset(); |
| } | ||
| } else { // If there are no multibytes in the input, work with std::strings | ||
| } else { | ||
| // If there are no multibytes in the input, work with std::strings |
There was a problem hiding this comment.
The comment in the has_multi_byte branch is inverted: this else executes when multibyte UTF-8 is present, but the comment says "If there are no multibytes". Please correct the comment to match the control flow to avoid confusion during future maintenance.
| // If there are no multibytes in the input, work with std::strings | |
| // If there are multibytes in the input, work with std::strings |
| int32_t alloc_length = 0; | ||
| // Check overflow: 2 * in_len | ||
| if (ARROW_PREDICT_FALSE( | ||
| arrow::internal::MultiplyWithOverflow(2, in_len, &alloc_length))) { | ||
| gdv_fn_context_set_error_msg(context, "Would overflow maximum output size"); | ||
| *out_len = 0; | ||
| return ""; | ||
| } | ||
|
|
||
| // try to allocate double size output string (worst case) | ||
| auto out = | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, (in_len * 2) + 2)); | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 2)); | ||
| if (out == nullptr) { |
There was a problem hiding this comment.
quote_utf8 checks overflow for 2 * in_len, but the allocation size is alloc_length + 2. For boundary values (e.g., in_len == INT32_MAX/2), 2 * in_len fits in int32_t while + 2 overflows, which can lead to a negative/incorrect allocation size being passed to gdv_fn_context_arena_malloc. Please include the + 2 in the overflow-safe size computation (or separately check that alloc_length <= INT32_MAX - 2).
| auto ret = | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, text_len * 2 + 1)); | ||
| reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 1)); | ||
|
|
There was a problem hiding this comment.
to_hex_binary only special-cases text_len == 0. If text_len is negative, the overflow check may not trigger and alloc_length can become negative, which then gets passed to gdv_fn_context_arena_malloc (takes an int32_t size) and can result in invalid/huge allocations or crashes. Treat negative lengths as invalid (or at least as empty) by changing the guard to text_len <= 0 (and ideally setting an error for negative input, consistent with other functions).
4f4290a to
d0280e4
Compare
Fixes potential integer-overflow/invalid-length issues in Gandiva string functions by adding overflow-checked allocation sizing and expanding unit tests to cover extreme and negative lengths. Fixed memcpy call in gdv_substring_index function since, the length argument is of type "size_t". Incorporated review comments.
d0280e4 to
8bb980f
Compare
| out_str = to_hex_binary(ctx_ptr, binary_string, neg_in_len, &out_len); | ||
| EXPECT_EQ(out_len, 0); | ||
| EXPECT_STREQ(out_str, ""); | ||
| ctx.Reset(); |
There was a problem hiding this comment.
Could we add a couple more boundary tests here just to make the hardening feel more complete? I’m thinking of cases like quote_utf8 right around the INT32_MAX / 2 boundary, substring_index with negative lengths, and maybe one concat_ws_* case where the combined output size would overflow.
Rationale for this change
Fix the overflow in functions where strings are used
What changes are included in this PR?
Fixes overflow and handles negative lengths
Are these changes tested?
Yes, modified relevant google tests and ran them locally.
Are there any user-facing changes?
No
This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)