Skip to content

GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813

Open
abtom87 wants to merge 1 commit intoapache:mainfrom
abtom87:fix-overflow-in-string-func
Open

GH-49753: [C++][Gandiva] Fix overflow in string functions.#49813
abtom87 wants to merge 1 commit intoapache:mainfrom
abtom87:fix-overflow-in-string-func

Conversation

@abtom87
Copy link
Copy Markdown

@abtom87 abtom87 commented Apr 20, 2026

Rationale for this change

Fix the overflow in functions where strings are used

What changes are included in this PR?

Fixes overflow and handles negative lengths

Are these changes tested?

Yes, modified relevant google tests and ran them locally.

Are there any user-facing changes?

No

This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes potential integer-overflow/invalid-length issues in Gandiva string functions by adding overflow-checked allocation sizing and expanding unit tests to cover extreme and negative lengths.

Changes:

  • Add overflow-checked allocation calculations for quote_utf8 and to_hex_binary.
  • Introduce a shared allocation-length helper for gdv_fn_lower_utf8, gdv_fn_upper_utf8, and gdv_fn_initcap_utf8, including negative-length rejection.
  • Add/extend GoogleTest coverage for overflow and negative-length scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
cpp/src/gandiva/precompiled/string_ops.cc Adds overflow-checked allocation sizing for quote_utf8 and to_hex_binary.
cpp/src/gandiva/precompiled/string_ops_test.cc Adds overflow-focused tests for quote_utf8 and to_hex_binary.
cpp/src/gandiva/gdv_string_function_stubs.cc Adds shared allocation-length helper and overflow checks; adjusts some integer types in translate.
cpp/src/gandiva/gdv_function_stubs_test.cc Adds tests for max-length overflow and negative-length handling in upper/lower/initcap and translate.
Comments suppressed due to low confidence (1)

cpp/src/gandiva/gdv_string_function_stubs.cc:792

  • In the multibyte path, the from_for <= from_len loop reads from[from_for] (to compute len_char_from) before checking from_for == from_len. When from_for reaches from_len (the sentinel iteration used to handle the "not found" case), this does a 1-past-the-end read. Reorder the loop so the from_for == from_len case is handled before indexing into from (or change the loop bounds to avoid the sentinel iteration and handle the fallback after the loop).
        for (int32_t from_for = 0; from_for <= from_len; from_for += len_char_from) {
          // Updating len to char in this position
          len_char_from = gdv_fn_utf8_char_length(from[from_for]);
          // Making copy to std::string with length for this char position
          std::string copy_from_compare(from + from_for, len_char_from);
          if (from_for == from_len) {
            // If it's not in the FROM list, just add it to the map and the result.
            std::string insert_copy_value(in + in_for, len_char_in);
            // Insert in map to next loops

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/src/gandiva/precompiled/string_ops_test.cc Outdated
int32_t bad_in_len = std::numeric_limits<int32_t>::max() / 2 + 20;
out_str = to_hex_binary(ctx_ptr, binary_string, bad_in_len, &out_len);
EXPECT_EQ(out_len, 0);
EXPECT_EQ(out_str, "");
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions compare const char* pointers (EXPECT_EQ(out_str, "")) rather than string contents, which can be non-deterministic. Use EXPECT_STREQ(out_str, "") or compare std::string(out_str, out_len) instead, and consider asserting/resetting the context error since the overflow path sets an error message.

Suggested change
EXPECT_EQ(out_str, "");
EXPECT_STREQ(out_str, "");
EXPECT_TRUE(ctx.has_error());
ctx.Reset();

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
} else { // If there are no multibytes in the input, work with std::strings
} else {
// If there are no multibytes in the input, work with std::strings
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment in the has_multi_byte branch is inverted: this else executes when multibyte UTF-8 is present, but the comment says "If there are no multibytes". Please correct the comment to match the control flow to avoid confusion during future maintenance.

Suggested change
// If there are no multibytes in the input, work with std::strings
// If there are multibytes in the input, work with std::strings

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +1928 to 1940
int32_t alloc_length = 0;
// Check overflow: 2 * in_len
if (ARROW_PREDICT_FALSE(
arrow::internal::MultiplyWithOverflow(2, in_len, &alloc_length))) {
gdv_fn_context_set_error_msg(context, "Would overflow maximum output size");
*out_len = 0;
return "";
}

// try to allocate double size output string (worst case)
auto out =
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, (in_len * 2) + 2));
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 2));
if (out == nullptr) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quote_utf8 checks overflow for 2 * in_len, but the allocation size is alloc_length + 2. For boundary values (e.g., in_len == INT32_MAX/2), 2 * in_len fits in int32_t while + 2 overflows, which can lead to a negative/incorrect allocation size being passed to gdv_fn_context_arena_malloc. Please include the + 2 in the overflow-safe size computation (or separately check that alloc_length <= INT32_MAX - 2).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 2837 to 2853
auto ret =
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, text_len * 2 + 1));
reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, alloc_length + 1));

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_hex_binary only special-cases text_len == 0. If text_len is negative, the overflow check may not trigger and alloc_length can become negative, which then gets passed to gdv_fn_context_arena_malloc (takes an int32_t size) and can result in invalid/huge allocations or crashes. Treat negative lengths as invalid (or at least as empty) by changing the guard to text_len <= 0 (and ideally setting an error for negative input, consistent with other functions).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch 2 times, most recently from 4f4290a to d0280e4 Compare April 22, 2026 17:03
Fixes potential integer-overflow/invalid-length issues in Gandiva string
functions by adding overflow-checked allocation sizing and expanding
unit tests to cover extreme and negative lengths.

Fixed memcpy call in gdv_substring_index function since, the length
argument is of type "size_t".

Incorporated review comments.
@abtom87 abtom87 force-pushed the fix-overflow-in-string-func branch from d0280e4 to 8bb980f Compare April 23, 2026 03:45
out_str = to_hex_binary(ctx_ptr, binary_string, neg_in_len, &out_len);
EXPECT_EQ(out_len, 0);
EXPECT_STREQ(out_str, "");
ctx.Reset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a couple more boundary tests here just to make the hardening feel more complete? I’m thinking of cases like quote_utf8 right around the INT32_MAX / 2 boundary, substring_index with negative lengths, and maybe one concat_ws_* case where the combined output size would overflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants