# Bug: Mixed byte/character units in GCS text offload decision
Validated against: main at 2d61cb69
## Problem
`HybridContentParser._parse_content_object` (CASE C, text handling) conflates two different size concerns, byte-based inline storage limits and character-based truncation limits, in a single mixed-unit comparison.
GCS offload uses bytes (`bigquery_agent_analytics_plugin.py:1433`):

```python
text_len = len(part.text.encode("utf-8"))  # BYTES
```
Threshold mixes bytes and characters (`bigquery_agent_analytics_plugin.py:1436-1438`):

```python
offload_threshold = self.inline_text_limit  # 32KB, bytes
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length, CHARACTERS
```
This computes `min(inline_text_limit, max_length)` across different units. With the default config (`inline_text_limit=32*1024`, `max_content_length=500*1024`), the threshold is 32KB (bytes), but if a user sets `max_content_length=10000`, the threshold becomes 10000 characters and is compared against a byte count.
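The mismatch can be reproduced with a standalone model of the threshold logic above (a sketch, not the plugin's actual method; `buggy_should_offload` and its defaults are illustrative):

```python
def buggy_should_offload(text: str, inline_text_limit: int = 32 * 1024,
                         max_length: int = -1) -> bool:
    """Standalone model of the mixed-unit threshold; not the plugin code."""
    text_len = len(text.encode("utf-8"))  # measured in BYTES
    threshold = inline_text_limit         # bytes
    if max_length != -1 and max_length < threshold:
        threshold = max_length            # CHARACTERS mixed into a byte comparison
    return text_len > threshold

# 5,000 emoji = 5,000 characters but 20,000 UTF-8 bytes. The 10,000-character
# limit is not exceeded, yet the text is offloaded because 20,000 bytes > 10,000.
print(buggy_should_offload("\U0001F600" * 5_000, max_length=10_000))  # True
```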
`_truncate` uses characters (`bigquery_agent_analytics_plugin.py:1370`):

```python
if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS
```
## Impact
For multi-byte content (CJK, emoji, Arabic), byte size diverges from character count:
| Example | Characters | UTF-8 bytes | Default threshold (32KB) |
|---|---|---|---|
| 20K emoji characters | ~20,000 | ~80,000 | Offloaded (80K > 32K) |
| 20K ASCII characters | 20,000 | 20,000 | Stays inline (20K < 32K) |
The same content in different scripts gets different treatment. 20K emoji characters trigger GCS upload today because 80K bytes > 32KB, while 20K ASCII characters stay inline.
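The table's byte counts can be checked directly (a standalone illustration; the emoji and CJK code points are arbitrary examples of 4-byte and 3-byte UTF-8 characters):

```python
samples = {
    "ascii": "a" * 20_000,           # 1 byte per character in UTF-8
    "emoji": "\U0001F600" * 20_000,  # 4 bytes per character in UTF-8
    "cjk": "\u6f22" * 20_000,        # 3 bytes per character in UTF-8
}
for name, text in samples.items():
    print(name, len(text), len(text.encode("utf-8")))
# ascii: 20,000 chars / 20,000 bytes -> stays under the 32KB byte guard
# emoji: 20,000 chars / 80,000 bytes -> exceeds the 32KB byte guard
```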
When `max_content_length` is smaller than `inline_text_limit`, the threshold becomes a character count compared against a byte measurement: a direct unit mismatch.
## Proposed fix
Separate the two decisions instead of computing a mixed-unit `min()`. Each limit should be compared in its own unit:

```python
char_len = len(part.text)
byte_len = len(part.text.encode("utf-8"))

exceeds_inline_byte_limit = byte_len > self.inline_text_limit
exceeds_char_truncation_limit = (
    self.max_length != -1 and char_len > self.max_length
)

if self.offloader and (
    exceeds_inline_byte_limit or exceeds_char_truncation_limit
):
    # offload to GCS
    ...
```
This preserves both intents:

- `inline_text_limit` controls approximate inline storage size (bytes, as intended by the 32KB name)
- `max_content_length` controls truncation semantics (characters, consistent with `_truncate()` and `_recursive_smart_truncate()`)
- No mixed-unit `min()` comparison
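As a self-contained sketch of the two-predicate decision (names mirror the proposal, but `should_offload` and its defaults are illustrative, not the plugin's API):

```python
def should_offload(text: str, inline_text_limit: int = 32 * 1024,
                   max_length: int = -1) -> bool:
    """Each limit compared in its own unit: bytes vs bytes, chars vs chars."""
    exceeds_inline_byte_limit = len(text.encode("utf-8")) > inline_text_limit
    exceeds_char_truncation_limit = max_length != -1 and len(text) > max_length
    return exceeds_inline_byte_limit or exceeds_char_truncation_limit

# 5,000 emoji: 20,000 bytes <= 32KB and 5,000 chars <= 10,000 chars, so the
# mixed-unit false positive from the current threshold logic disappears.
print(should_offload("\U0001F600" * 5_000, max_length=10_000))  # False
# 20,000 emoji: 80,000 bytes > 32KB, so the byte guard still fires.
print(should_offload("\U0001F600" * 20_000))  # True
```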
## Alternative
Convert all limits to characters and rename/re-document `inline_text_limit` so it is no longer described as KB. This is simpler but weakens the inline-size guard for multi-byte text: 32K CJK characters could be ~96KB of UTF-8 payload kept inline.
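The weakness is easy to quantify (standalone illustration; the CJK code point is an arbitrary 3-byte-UTF-8 example):

```python
text = "\u6f22" * (32 * 1024)     # 32K CJK characters
print(len(text))                  # 32768 characters: passes a 32K-character check
print(len(text.encode("utf-8")))  # 98304 bytes (~96KB) kept inline
```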
The two-predicate approach is preferred because it preserves the original intent of each limit.
## Affected code
| Location | Current unit | Intent |
|---|---|---|
| `_parse_content_object:1433` | Bytes (`encode("utf-8")`) | GCS offload size check |
| `_parse_content_object:1436-1438` | Mixed (`min(bytes, chars)`) | Offload threshold |
| `_truncate:1370` | Characters (`len(text)`) | Inline text truncation |
| `_recursive_smart_truncate:309` | Characters (`len(obj)`) | Dict/list string truncation |
| `BigQueryLoggerConfig.max_content_length:570` | Characters | Config (default 500K) |
| `HybridContentParser.inline_text_limit:1367` | Bytes-intended | Hardcoded 32KB |