Skip to content

fix(ai-cache): bypass caching for prompts carrying non-text content#13652

Closed
AlinsRan wants to merge 2 commits into
apache:masterfrom
AlinsRan:fix/ai-cache-multimodal-bypass
Closed

fix(ai-cache): bypass caching for prompts carrying non-text content#13652
AlinsRan wants to merge 2 commits into
apache:masterfrom
AlinsRan:fix/ai-cache-multimodal-bypass

Conversation

@AlinsRan

@AlinsRan AlinsRan commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Description

https://github.com/apache/apisix/actions/runs/28582602012/job/84746279063

Since #13634 the protocol layer's get_messages() flattens structured message content to plain text, dropping image / audio / other non-text typed parts. ai-cache consumed that flattened view, so a text + image prompt became indistinguishable from a text-only one:

  • L1 (exact): the fingerprint is built from the flattened messages, so text+image collapses to the same key as the text-only prompt → cross-modal L1 collision.
  • L2 (semantic): window_has_nontext() inspected the flattened messages (content is already a plain string), so it could never detect the image → the multimodal request was embedded and matched the text-only vector.

Either way a multimodal request was wrongly served a text-only cached response (X-AI-Cache-Status: HIT).

#13634 did not touch ai-cache, so this regressed silently on master. t/plugin/ai-cache-semantic.t TEST 64 fails: got 'HIT', expected 'MISS'.

Fix

Detect non-text content from the raw request body (before get_messages() flattens it) and bypass the cache entirely for such a request: read nothing, write nothing (fingerprint left unset so log() skips the write-back), and report a MISS so it always proceeds to the upstream. This restores the multimodal-bypass invariant guarded by TEST 62-64. The semantic layer's own check is switched to the same raw-body detector as a second line of defence.

A prompt that carries non-text content cannot be faithfully keyed while the canonical form is text-only, so not caching it is the safe behaviour.

Verification

  • t/plugin/ai-cache-semantic.t: 212/212 pass (was 1 failing — TEST 64).
  • Reverting the fix reproduces exactly got 'HIT', expected 'MISS'.
  • luacheck clean on both changed files.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change (existing TEST 62-64 now pass)
  • I have updated the documentation to reflect this change (N/A — behaviour fix)
  • I have verified that this change is backward compatible

Since apache#13634 the protocol get_messages() flattens structured message
content to plain text, dropping image/audio/other typed parts. ai-cache
consumed that flattened view, so a text+image prompt became
indistinguishable from a text-only one: they collided on the same L1
exact key and on the same L2 semantic cell, letting a multimodal request
be wrongly served a text-only cached response (X-AI-Cache-Status: HIT).

Detect non-text content from the RAW request body (before flattening) and
bypass the cache entirely for such a request -- read nothing, write
nothing, report a MISS -- so it always proceeds to the upstream. This
restores the multimodal-bypass invariant guarded by
t/plugin/ai-cache-semantic.t TEST 62-64.
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Jul 3, 2026
nic-6443
nic-6443 previously approved these changes Jul 3, 2026
@membphis

membphis commented Jul 3, 2026

Copy link
Copy Markdown
Member

I do not think this is ready to merge yet.

The OpenAI Chat regression fix is headed in the right direction, but the new raw-body detector still misses non-text content for an already supported protocol: Bedrock Converse.

body_has_nontext() currently only returns true when a content block has a type field and that value is not text:

function _M.body_has_nontext(body)
local messages = body and body.messages
if type(messages) ~= "table" then
return false
end
for _, msg in ipairs(messages) do
local content = type(msg) == "table" and msg.content
if type(content) == "table" then
for _, block in ipairs(content) do
if type(block) == "table" and block.type and block.type ~= "text" then
return true

However, Bedrock Converse content blocks are flattened by reading only block.text, and non-text blocks such as image/document blocks do not need to use this OpenAI-style type field:

-- Append the text of each Bedrock content block ({text = "..."}) into `texts`.
local function append_block_texts(texts, blocks)
if type(blocks) ~= "table" then
return
end
for _, block in ipairs(blocks) do
if type(block) == "table" and type(block.text) == "string" then
core.table.insert(texts, block.text)
end

--- Get messages in canonical {role, content} format.
-- Bedrock content blocks [{text: "..."}] are flattened to plain text.
function _M.get_messages(body)
local messages = {}
local system_texts = {}
append_block_texts(system_texts, body.system)
if #system_texts > 0 then
core.table.insert(messages, {
role = "system",
content = table.concat(system_texts, " "),
})
end
if type(body.messages) == "table" then
for _, message in ipairs(body.messages) do
if type(message) == "table" then
local texts = {}
append_block_texts(texts, message.content)
if #texts > 0 then
core.table.insert(messages, {
role = message.role,
content = table.concat(texts, " "),
})
end
end
end
end
return messages

At the same time, the L1 fingerprint excludes raw messages from params, so a Bedrock text-only request and a text+image/text+document request can still canonicalize to the same cache key:

local params = {}
for k, v in pairs(body) do
if k ~= "messages" and k ~= "model" and k ~= "stream" then
params[k] = v
end

Suggested fix: make the raw-body non-text detector protocol-aware, or at least treat Bedrock content blocks that are not pure { text = ... } blocks as non-text, then add a Bedrock exact-cache regression test for text-only versus same text plus image/document.

body_has_nontext() only recognised OpenAI-style typed parts
(type ~= "text"), so Bedrock Converse content blocks -- which mark
non-text modalities with a distinct key (image/document/toolUse/...) and
no `type` field -- slipped through. get_messages() then flattened them to
plain text, letting a Bedrock text+image request collide with a text-only
one on the same L1 exact key.

Treat a block as text only when it carries a string `text` and no
non-text discriminator, covering both the OpenAI `type` tag and the
Bedrock key-per-modality shape. Adds a Bedrock exact-cache regression
(TEST 66-68).
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jul 3, 2026
@AlinsRan

AlinsRan commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Good catch. Bedrock Converse blocks use {image=...}/{document=...} with no type, so the OpenAI-only type ~= "text" check missed them and a Bedrock text+image request still collided with text-only at L1.

Reworked the detector to be shape-agnostic: a block counts as text only if it has a string text and no non-text discriminator, covering both the OpenAI type tag and Bedrock's key-per-modality blocks. Added a Bedrock exact-cache regression (TEST 66-68) — text-only warms L1, same text + image is a MISS; reverting the detector reproduces the HIT, so the test bites.

The access-phase guard bypasses before the fingerprint is computed, so this raw-body detection also covers the L1 canonicalization gap you pointed at — no key.lua change needed.

@AlinsRan AlinsRan closed this Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants