fix: byte/character unit mismatch in BigQuery analytics plugin GCS text offload #5561

@caohy1988

Description

Bug: Mixed byte/character units in GCS text offload decision

Validated against: main at 2d61cb69

Problem

HybridContentParser._parse_content_object (CASE C, text handling) conflates two different size concerns — byte-based inline storage limits and character-based truncation limits — in a single mixed-unit comparison.

GCS offload uses bytes (bigquery_agent_analytics_plugin.py:1433):

text_len = len(part.text.encode("utf-8"))  # BYTES

Threshold mixes bytes and characters (bigquery_agent_analytics_plugin.py:1436-1438):

offload_threshold = self.inline_text_limit  # 32KB — bytes
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length — CHARACTERS

This computes min(inline_text_limit, max_length) across different units. With default config (inline_text_limit=32*1024, max_content_length=500*1024), the threshold is 32KB (bytes), but if a user sets max_content_length=10000, the threshold becomes 10000 and is compared against a byte count.
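A minimal, self-contained reproduction of the mixed-unit decision (the standalone function and its defaults mirror the attributes above but are illustrative, not the plugin's API):

```python
def mixed_unit_threshold(text, inline_text_limit=32 * 1024, max_length=500 * 1024):
    """Reproduces the buggy decision: size measured in bytes, threshold in mixed units."""
    text_len = len(text.encode("utf-8"))       # BYTES
    offload_threshold = inline_text_limit      # bytes
    if max_length != -1 and max_length < offload_threshold:
        offload_threshold = max_length         # CHARACTERS -- unit mismatch
    return text_len > offload_threshold        # bytes compared against (maybe) characters

# With max_content_length=10000, 9,000 ASCII chars (9,000 bytes) stay inline,
# but 9,000 two-byte chars (18,000 bytes) are offloaded at the same char count.
assert mixed_unit_threshold("a" * 9000, max_length=10000) is False
assert mixed_unit_threshold("é" * 9000, max_length=10000) is True
```

The same character count gets a different offload decision purely because of encoding width.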

_truncate uses characters (bigquery_agent_analytics_plugin.py:1370):

if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS

Impact

For multi-byte content (CJK, emoji, Arabic), byte size diverges from character count:

| Example | Characters | UTF-8 bytes | Default threshold (32 KB) |
|---|---|---|---|
| 20K emoji characters | ~20,000 | ~80,000 | Offloaded (80K > 32K) |
| 20K ASCII characters | 20,000 | 20,000 | Stays inline (20K < 32K) |

The same content in different scripts gets different treatment. 20K emoji characters trigger GCS upload today because 80K bytes > 32KB, while 20K ASCII characters stay inline.

When max_content_length is smaller than inline_text_limit, the threshold becomes a character count compared against a byte measurement — a direct unit mismatch.
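The divergence in the table above can be checked directly (🚀 encodes to 4 UTF-8 bytes; the variable names are illustrative):

```python
emoji = "🚀" * 20_000       # 20K characters, 4 bytes each in UTF-8
ascii_text = "a" * 20_000   # 20K characters, 1 byte each

assert len(emoji) == 20_000                       # character count
assert len(emoji.encode("utf-8")) == 80_000       # byte count diverges 4x
assert len(ascii_text.encode("utf-8")) == 20_000  # byte count == char count

INLINE_TEXT_LIMIT = 32 * 1024
assert len(emoji.encode("utf-8")) > INLINE_TEXT_LIMIT       # offloaded
assert len(ascii_text.encode("utf-8")) < INLINE_TEXT_LIMIT  # stays inline
```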

Proposed fix

Separate the two decisions instead of computing a mixed-unit min(). Each limit should be compared in its own unit:

char_len = len(part.text)
byte_len = len(part.text.encode("utf-8"))

exceeds_inline_byte_limit = byte_len > self.inline_text_limit
exceeds_char_truncation_limit = (
    self.max_length != -1 and char_len > self.max_length
)

if self.offloader and (
    exceeds_inline_byte_limit or exceeds_char_truncation_limit
):
    # offload to GCS
    ...

This preserves both intents:

  • inline_text_limit controls approximate inline storage size — bytes, as intended by the 32KB name
  • max_content_length controls truncation semantics — characters, consistent with _truncate() and _recursive_smart_truncate()
  • No mixed-unit min() comparison
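Wrapped as a standalone function for illustration (the wrapper name and the has_offloader flag are hypothetical; the predicate logic matches the proposed snippet above):

```python
def should_offload(text, inline_text_limit=32 * 1024, max_length=500 * 1024,
                   has_offloader=True):
    """Proposed fix: each limit is compared in its own unit."""
    char_len = len(text)
    byte_len = len(text.encode("utf-8"))

    exceeds_inline_byte_limit = byte_len > inline_text_limit          # bytes vs bytes
    exceeds_char_truncation_limit = (
        max_length != -1 and char_len > max_length                    # chars vs chars
    )
    return has_offloader and (
        exceeds_inline_byte_limit or exceeds_char_truncation_limit
    )

# 20K emoji: 20,000 chars but 80,000 bytes -> exceeds the 32KB byte limit.
assert should_offload("🚀" * 20_000) is True
# 20K ASCII: under both limits -> stays inline.
assert should_offload("a" * 20_000) is False
# max_content_length=10_000 now compares characters to characters:
assert should_offload("é" * 9_000, max_length=10_000) is False
assert should_offload("é" * 11_000, max_length=10_000) is True
```

Under this sketch the emoji/ASCII asymmetry from the table disappears for the character limit, while the byte-based inline guard still catches oversized payloads.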

Alternative

Convert all limits to characters and rename/re-document inline_text_limit so it is no longer described as KB. This is simpler but weakens the inline-size guard for multi-byte text — 32K CJK characters could be ~96KB of UTF-8 payload kept inline.
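The weakened guard is easy to quantify (漢 encodes to 3 UTF-8 bytes):

```python
# Alternative (char-only limit): 32K CJK characters pass a 32K-character
# check yet keep ~96KB of UTF-8 payload inline.
cjk = "漢" * (32 * 1024)        # 32,768 characters
char_limit = 32 * 1024

assert len(cjk) <= char_limit                  # passes a character-based check
assert len(cjk.encode("utf-8")) == 96 * 1024   # but ~96KB would stay inline
```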

The two-predicate approach is preferred because it preserves the original intent of each limit.

Affected code

| Location | Current unit | Intent |
|---|---|---|
| _parse_content_object:1433 | Bytes (encode("utf-8")) | GCS offload size check |
| _parse_content_object:1436-1438 | Mixed (min(bytes, chars)) | Offload threshold |
| _truncate:1370 | Characters (len(text)) | Inline text truncation |
| _recursive_smart_truncate:309 | Characters (len(obj)) | Dict/list string truncation |
| BigQueryLoggerConfig.max_content_length:570 | Characters | Config (default 500K) |
| HybridContentParser.inline_text_limit:1367 | Bytes (intended) | Hardcoded 32 KB |

Labels: bq[Component] This issue is related to Big Query integration