fix: byte/character unit mismatch in BigQuery analytics plugin GCS text offload #5561

@caohy1988

Description

Bug: Mixed byte/character units in GCS text offload decision

Validated against: main at 2d61cb69

Problem

HybridContentParser._parse_content_object (CASE C, text handling) conflates two different size concerns — byte-based inline storage limits and character-based truncation limits — in a single mixed-unit comparison.

GCS offload uses bytes (bigquery_agent_analytics_plugin.py:1433):

text_len = len(part.text.encode("utf-8"))  # BYTES

Threshold mixes bytes and characters (bigquery_agent_analytics_plugin.py:1436-1438):

offload_threshold = self.inline_text_limit  # 32KB — bytes
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length  # max_content_length — CHARACTERS

This computes min(inline_text_limit, max_length) across different units. With default config (inline_text_limit=32*1024, max_content_length=500*1024), the threshold is 32KB (bytes), but if a user sets max_content_length=10000, the threshold becomes 10000 and is compared against a byte count.
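A minimal, self-contained reproduction of the mixed-unit decision (the standalone function and its defaults mirror the attributes above but are illustrative, not the plugin's API):

```python
def mixed_unit_threshold(text, inline_text_limit=32 * 1024, max_length=500 * 1024):
    """Reproduces the buggy decision: size measured in bytes, threshold in mixed units."""
    text_len = len(text.encode("utf-8"))       # BYTES
    offload_threshold = inline_text_limit      # bytes
    if max_length != -1 and max_length < offload_threshold:
        offload_threshold = max_length         # CHARACTERS -- unit mismatch
    return text_len > offload_threshold        # bytes compared against (maybe) characters

# With max_content_length=10000, 9,000 ASCII chars (9,000 bytes) stay inline,
# but 9,000 two-byte chars (18,000 bytes) are offloaded at the same char count.
assert mixed_unit_threshold("a" * 9000, max_length=10000) is False
assert mixed_unit_threshold("é" * 9000, max_length=10000) is True
```

The same character count gets a different offload decision purely because of encoding width.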

_truncate uses characters (bigquery_agent_analytics_plugin.py:1370):

if self.max_length != -1 and len(text) > self.max_length:  # CHARACTERS

Impact

For multi-byte content (CJK, emoji, Arabic), byte size diverges from character count:

| Example | Characters | UTF-8 bytes | Default threshold (32 KB) |
|---|---|---|---|
| 20K emoji characters | ~20,000 | ~80,000 | Offloaded (80K > 32K) |
| 20K ASCII characters | 20,000 | 20,000 | Stays inline (20K < 32K) |

The same content in different scripts gets different treatment. 20K emoji characters trigger GCS upload today because 80K bytes > 32KB, while 20K ASCII characters stay inline.

When max_content_length is smaller than inline_text_limit, the threshold becomes a character count compared against a byte measurement — a direct unit mismatch.
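The divergence in the table above can be checked directly (🚀 encodes to 4 UTF-8 bytes; the variable names are illustrative):

```python
emoji = "🚀" * 20_000       # 20K characters, 4 bytes each in UTF-8
ascii_text = "a" * 20_000   # 20K characters, 1 byte each

assert len(emoji) == 20_000                       # character count
assert len(emoji.encode("utf-8")) == 80_000       # byte count diverges 4x
assert len(ascii_text.encode("utf-8")) == 20_000  # byte count == char count

INLINE_TEXT_LIMIT = 32 * 1024
assert len(emoji.encode("utf-8")) > INLINE_TEXT_LIMIT       # offloaded
assert len(ascii_text.encode("utf-8")) < INLINE_TEXT_LIMIT  # stays inline
```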

Proposed fix

Separate the two decisions instead of computing a mixed-unit min(). Each limit should be compared in its own unit:

char_len = len(part.text)
byte_len = len(part.text.encode("utf-8"))

exceeds_inline_byte_limit = byte_len > self.inline_text_limit
exceeds_char_truncation_limit = (
    self.max_length != -1 and char_len > self.max_length
)

if self.offloader and (
    exceeds_inline_byte_limit or exceeds_char_truncation_limit
):
    # offload to GCS
    ...

This preserves both intents:

  • inline_text_limit controls approximate inline storage size — bytes, as intended by the 32KB name
  • max_content_length controls truncation semantics — characters, consistent with _truncate() and _recursive_smart_truncate()
  • No mixed-unit min() comparison
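Wrapped as a standalone function for illustration (the wrapper name and the has_offloader flag are hypothetical; the predicate logic matches the proposed snippet above):

```python
def should_offload(text, inline_text_limit=32 * 1024, max_length=500 * 1024,
                   has_offloader=True):
    """Proposed fix: each limit is compared in its own unit."""
    char_len = len(text)
    byte_len = len(text.encode("utf-8"))

    exceeds_inline_byte_limit = byte_len > inline_text_limit          # bytes vs bytes
    exceeds_char_truncation_limit = (
        max_length != -1 and char_len > max_length                    # chars vs chars
    )
    return has_offloader and (
        exceeds_inline_byte_limit or exceeds_char_truncation_limit
    )

# 20K emoji: 20,000 chars but 80,000 bytes -> exceeds the 32KB byte limit.
assert should_offload("🚀" * 20_000) is True
# 20K ASCII: under both limits -> stays inline.
assert should_offload("a" * 20_000) is False
# max_content_length=10_000 now compares characters to characters:
assert should_offload("é" * 9_000, max_length=10_000) is False
assert should_offload("é" * 11_000, max_length=10_000) is True
```

Under this sketch the emoji/ASCII asymmetry from the table disappears for the character limit, while the byte-based inline guard still catches oversized payloads.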

Alternative

Convert all limits to characters and rename/re-document inline_text_limit so it is no longer described as KB. This is simpler but weakens the inline-size guard for multi-byte text — 32K CJK characters could be ~96KB of UTF-8 payload kept inline.
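The weakened guard is easy to quantify (漢 encodes to 3 UTF-8 bytes):

```python
# Alternative (char-only limit): 32K CJK characters pass a 32K-character
# check yet keep ~96KB of UTF-8 payload inline.
cjk = "漢" * (32 * 1024)        # 32,768 characters
char_limit = 32 * 1024

assert len(cjk) <= char_limit                  # passes a character-based check
assert len(cjk.encode("utf-8")) == 96 * 1024   # but ~96KB would stay inline
```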

The two-predicate approach is preferred because it preserves the original intent of each limit.

Affected code

| Location | Current unit | Intent |
|---|---|---|
| _parse_content_object:1433 | Bytes (encode("utf-8")) | GCS offload size check |
| _parse_content_object:1436-1438 | Mixed (min(bytes, chars)) | Offload threshold |
| _truncate:1370 | Characters (len(text)) | Inline text truncation |
| _recursive_smart_truncate:309 | Characters (len(obj)) | Dict/list string truncation |
| BigQueryLoggerConfig.max_content_length:570 | Characters | Config (default 500K) |
| HybridContentParser.inline_text_limit:1367 | Bytes (intended) | Hardcoded 32 KB |

Labels: bq[Component] This issue is related to Big Query integration