[Feature]Add output fallback support for OpenAI serving #7942

Open
luukunn wants to merge 11 commits into PaddlePaddle:develop from luukunn:fallback

@luukunn luukunn commented May 27, 2026

Motivation

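OpenAI serving currently has no unified fallback extension point for output handling. When a deployment needs extra processing of model output, for example intercepting, buffering, or truncating part of the output in streaming mode, or post-processing the full text in non-streaming mode, the existing pipeline offers no common abstraction or extensible entry point.

This PR introduces an output fallback framework that gives OpenAI serving a unified way to apply fallback processing to output. It supports:

  • Post-processing the complete output text
  • Processing streaming output increments chunk by chunk
  • Stream control semantics: send / hold / drop / flush / truncate
  • Registering and loading custom fallback strategies as plugins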

Modifications

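This PR mainly includes the following changes:

  1. Add the output fallback framework

    • Add the fastdeploy/output/fallback/ module
    • Add the OutputFallbackStrategy abstract base class, which defines the fallback strategy interface
    • Add OutputFallbackContext, which carries context such as request, request_id, choice_index, stream, and output in one place
    • Add StreamFallbackDecision, which expresses a strategy's decision in streaming mode
    • Add OutputFallbackManager, which handles strategy registration, instantiation, chained execution, state management, and plugin import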
  2. Wire the output fallback manager into the OpenAI serving path

    • fastdeploy/entrypoints/openai/api_server.py
    • fastdeploy/entrypoints/openai/serving_chat.py
    • fastdeploy/entrypoints/openai/serving_completion.py
    • fastdeploy/entrypoints/openai/v1/serving_base.py
    • fastdeploy/entrypoints/openai/v1/serving_chat.py
    • fastdeploy/entrypoints/openai/v1/serving_completion.py
  3. Add startup arguments for output fallback

    • --output-fallback
    • --output-fallback-plugin
    • --output-fallback-config
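  4. Support applying fallback to the complete text in non-streaming mode

    • The final generated text is processed through OutputFallbackManager.apply()
    • Suited to stateless post-processing or logic that works on the complete text
  5. Support applying fallback to incremental output in streaming mode

    • Each delta is processed through on_delta()
    • on_finish() performs the flush when the stream ends
    • The following actions are supported:
      • send: send the current text
      • hold: buffer the current text and output nothing this round
      • drop: discard the current text
      • flush: output the buffered content when the stream ends
      • truncate: send the current text and terminate further generation early
    • When fallback triggers a truncation, the remaining generation for that choice is actively aborted and any residual output is skipped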
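  6. Add a plugin loading mechanism

    • Add fastdeploy.plugins.output_fallback
    • Output fallback plugins are loaded automatically through the plugin group fastdeploy.output_fallback_plugins
    • An external plugin path can also be imported with --output-fallback-plugin
  7. Add tests

    • Add tests/output/test_fallback.py
    • Covers default strategy behavior, chained execution in the manager, streaming state transitions, truncate/flush, cleanup, and plugin import
    • Adds unit tests for fallback truncate in OpenAI chat/completion streaming
    • Adds argument compatibility tests for the related API server / metrics routes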
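
The five streaming actions in item 5 can be illustrated with a small driver loop. This is only a sketch under stated assumptions: `Decision`, `decide`, and `run_stream` are hypothetical stand-ins rather than FastDeploy APIs, and the choice to buffer held text in the driver (instead of inside the strategy) is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for StreamFallbackDecision."""
    action: str  # "send", "hold", "drop", or "truncate"
    text: str = ""


def decide(delta: str) -> Decision:
    """Toy per-delta policy used only to exercise each action."""
    if delta.startswith("#"):
        return Decision("hold", delta)
    if not delta.strip():
        return Decision("drop", delta)
    if "STOP" in delta:
        return Decision("truncate", delta)
    return Decision("send", delta)


def run_stream(deltas, decide_fn):
    """Return (emitted_chunks, truncated) after applying one decision per delta."""
    emitted, held = [], []
    for delta in deltas:
        decision = decide_fn(delta)
        if decision.action == "send":
            emitted.append(decision.text)
        elif decision.action == "hold":
            held.append(decision.text)  # keep for later, emit nothing this round
        elif decision.action == "drop":
            continue  # discard the text entirely
        elif decision.action == "truncate":
            emitted.append(decision.text)  # send the current text, then stop early
            return emitted, True
    if held:
        emitted.append("".join(held))  # flush: release buffered text at stream end
    return emitted, False


print(run_stream(["Hi", "  ", "#note", "!"], decide))
print(run_stream(["a", "STOP", "b"], decide))
```

In the second call the delta after "STOP" is never consumed, which mirrors the described behavior of skipping residual output once a truncation fires.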

Usage or Command

Example: enable a built-in fallback strategy:

--output-fallback your-strategy-name

Example: configure strategy parameters:

--output-fallback-config '{"your-strategy-name": {"key": "value"}}'

Example: load a custom fallback plugin:

--output-fallback-plugin /path/to/custom_fallback.py
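
When these flags are assembled by a launcher script, building the JSON value programmatically avoids shell quoting mistakes. The sketch below assumes the three flags can be combined in one invocation and that `--output-fallback-config` takes a JSON object keyed by strategy name, as in the example above; the strategy name and plugin path are the same placeholders.

```python
import json
import shlex

strategy = "your-strategy-name"  # placeholder name from the usage examples
config = {strategy: {"key": "value"}}

# Argument list suitable for subprocess-style launching (no shell quoting needed).
args = [
    "--output-fallback", strategy,
    "--output-fallback-config", json.dumps(config),
    "--output-fallback-plugin", "/path/to/custom_fallback.py",
]

# Shell-safe string form, for pasting after the server launch command.
command_tail = shlex.join(args)
print(command_tail)
```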

How to add a custom output fallback strategy

To add a custom strategy, subclass OutputFallbackStrategy and register it with OutputFallbackManager.register(...).
Example:

from fastdeploy.output.fallback import (
    OutputFallbackContext,
    OutputFallbackManager,
    OutputFallbackStrategy,
    StreamFallbackDecision,
)


@OutputFallbackManager.register("custom-fallback")
class CustomFallbackStrategy(OutputFallbackStrategy):
    name = "custom-fallback"

    def should_apply(self, text: str, context: OutputFallbackContext) -> bool:
        return "bad" in text

    def apply(self, text: str, context: OutputFallbackContext) -> str:
        return text.replace("bad", "good")

    def on_delta(
        self,
        delta_text: str,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        # In streaming mode, customize this logic as needed
        if "stop" in delta_text:
            return StreamFallbackDecision(action="truncate", text=delta_text)
        return StreamFallbackDecision(action="send", text=delta_text)

    def on_finish(
        self,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        return StreamFallbackDecision(action="flush")

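Notes on custom strategies:

  1. should_apply(text, context)
    - Decides whether fallback should be applied to the current text
  2. apply(text, context)
    - Processes the complete text in non-streaming mode
    - The default on_delta() implementation also reuses these two methods for stateless processing
  3. on_delta(delta_text, context, state)
    - Processes incremental text in streaming mode
    - state is a dict scoped to the current request / choice / strategy and can hold state across chunks
  4. on_finish(context, state)
    - Runs the flush logic after the stream ends and outputs any remaining buffered content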
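
To make the role of `state` concrete, here is a minimal sketch of a stateful strategy that releases text only at line boundaries. It is self-contained on purpose: `Decision` is a hypothetical stand-in for StreamFallbackDecision, `LineBufferStrategy` does not inherit from the real base class, and passing the buffered text through the `text` field of the flush decision is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for StreamFallbackDecision."""
    action: str
    text: str = ""


class LineBufferStrategy:
    """Holds partial lines in `state` and sends only complete lines."""

    def on_delta(self, delta_text, context, state):
        buffered = state.get("buffer", "") + delta_text
        # Split at the last newline: `head + sep` is complete, `tail` is partial.
        head, sep, tail = buffered.rpartition("\n")
        state["buffer"] = tail
        if not sep:
            return Decision(action="hold")
        return Decision(action="send", text=head + sep)

    def on_finish(self, context, state):
        # Release whatever partial line is still buffered.
        return Decision(action="flush", text=state.pop("buffer", ""))


strategy = LineBufferStrategy()
state = {}  # one dict per request / choice / strategy
decisions = [strategy.on_delta(d, None, state) for d in ["| a ", "| b |\n| -", "-- |"]]
final = strategy.on_finish(None, state)
print([(d.action, d.text) for d in decisions], (final.action, final.text))
```

Because the buffer lives in `state` rather than on the strategy instance, one strategy object could serve many concurrent requests without mixing their partial lines.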

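There are two ways to load a strategy:

  1. Load by plugin path:
    - Use --output-fallback-plugin /path/to/custom_fallback.py
  2. Load automatically through the plugin group:
    - Register the plugin under the fastdeploy.output_fallback_plugins entry point group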

Accuracy Tests

This PR does not modify model weights, kernels, or model forward computation, so it does not affect numerical accuracy. No accuracy comparison tests were run.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 27, 2026 10:02

paddle-bot Bot commented May 27, 2026

Thanks for your contribution!


Copilot AI left a comment

Pull request overview

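This PR adds an output fallback framework to the OpenAI-compatible service. It post-processes model output on both streaming and non-streaming paths (fixing the colon in Markdown bold text, repairing Markdown tables, and detecting and truncating repeated output) and supports custom extensions through strategy registration plus a plugin mechanism.

Changes:

  • Adds the fastdeploy/output/fallback/ subpackage: defines the OutputFallbackStrategy base class, OutputFallbackContext, StreamFallbackDecision, and OutputFallbackManager, with three built-in strategies: markdown-bold-colon / markdown-table / repeat-truncate.
  • EngineArgs / api_server accept three new startup arguments, --output-fallback, --output-fallback-plugin, and --output-fallback-config, and inject the manager into the v0 / v1 chat and completion serving classes.
  • The streaming / non-streaming flows call the manager's apply / on_delta / on_finish / cleanup; when repeat-truncate fires, finish_reason is set to repeat_truncate and the corresponding choice is aborted.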

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

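| File | Description |
| --- | --- |
| fastdeploy/output/fallback/__init__.py | Exposes the public classes and imports the three built-in strategies to trigger registration |
| fastdeploy/output/fallback/base.py | Defines the fallback context / decision / abstract base class |
| fastdeploy/output/fallback/manager.py | Registry / plugin loading / apply / on_delta / on_finish / cleanup |
| fastdeploy/output/fallback/markdown_bold_colon.py | Fixes the colon position in **xxx:**, with buffering across deltas |
| fastdeploy/output/fallback/markdown_table.py | Repairs Markdown table separator rows / inconsistent column counts |
| fastdeploy/output/fallback/repeat_truncate.py | Detects repeated output over a token window and triggers truncate |
| fastdeploy/engine/args_utils.py | Adds 3 new CLI arguments |
| fastdeploy/entrypoints/openai/api_server.py | Parses the arguments, builds the manager, and injects it into each handler; /config-info exposes the corresponding fields |
| fastdeploy/entrypoints/openai/serving_chat.py | Wires fallback into the v0 chat streaming / non-streaming paths, including the repeat_truncate finish_reason |
| fastdeploy/entrypoints/openai/serving_completion.py | Wires fallback into the v0 completion streaming / non-streaming paths |
| fastdeploy/entrypoints/openai/v1/serving_base.py | Base class constructor accepts the manager and cleans up state in finally |
| fastdeploy/entrypoints/openai/v1/serving_chat.py | Wires fallback into v1 chat (non-multimodal path) |
| fastdeploy/entrypoints/openai/v1/serving_completion.py | Wires fallback into v1 completion |
| tests/output/test_fallback.py | Covers the manager, built-in strategies, streaming hold/flush/truncate, cleanup, and plugin import |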

choice_completion_tokens = response_ctx.choice_completion_tokens_dict[output.index]
choice.finish_reason = self._calc_finish_reason(request_output, max_tokens, choice_completion_tokens)
if fallback_truncated:
    choice.finish_reason = "repeat_truncate"

if res.get("error_msg") is not None and "Aborted" in res["error_msg"]:
    choices[-1].finish_reason = "abort"
if fallback_truncated:
    choices[-1].finish_reason = "repeat_truncate"

    choice.finish_reason = "abort"

if fallback_truncated:
    choice.finish_reason = "repeat_truncate"
Comment on lines +307 to +308
if fallback_truncated:
    choice.finish_reason = "repeat_truncate"
PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 72.39264% with 90 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a918693). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/output/fallback/manager.py | 68.59% | 26 Missing and 12 partials ⚠️ |
| ...deploy/entrypoints/openai/v1/serving_completion.py | 63.63% | 8 Missing and 4 partials ⚠️ |
| fastdeploy/entrypoints/openai/v1/serving_chat.py | 67.64% | 7 Missing and 4 partials ⚠️ |
| fastdeploy/entrypoints/openai/serving_chat.py | 80.00% | 4 Missing and 3 partials ⚠️ |
| fastdeploy/entrypoints/openai/api_server.py | 25.00% | 4 Missing and 2 partials ⚠️ |
| ...astdeploy/entrypoints/openai/serving_completion.py | 85.29% | 3 Missing and 2 partials ⚠️ |
| fastdeploy/entrypoints/openai/v1/serving_base.py | 44.44% | 3 Missing and 2 partials ⚠️ |
| fastdeploy/plugins/output_fallback/__init__.py | 60.00% | 2 Missing and 2 partials ⚠️ |
| fastdeploy/output/fallback/base.py | 92.59% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7942   +/-   ##
==========================================
  Coverage           ?   67.67%           
==========================================
  Files              ?      471           
  Lines              ?    65505           
  Branches           ?    10075           
==========================================
  Hits               ?    44328           
  Misses             ?    18325           
  Partials           ?     2852           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.87% <72.39%> (?) |
| XPU | 7.06% <3.37%> (?) |



PaddlePaddle-bot commented May 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-30 01:28:28

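The CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

❌ 1 required task failed; it must be fixed before merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42(0) | 42 | 38 | 4 | 0 | 0 | 0 |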

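2 Task status summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be fixed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h26m | PR issue: tests do not mock the engine_client.abort method | Add abort = AsyncMock() to the engine_client mock in both test files | Job | - |
| ✅ | The other 9 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m40s | Job | - |
| ❌ | CI_HPU | 1h5m | Job | - |
| ❌ | Trigger Jenkins for PR | 17s | Job | - |
| ✅ | The other 29 optional tasks passed | - | - | - |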

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: high
  • Root cause summary: the new fallback truncate path calls engine_client.abort, but the tests do not mock the abort method
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
| --- | --- | --- |
| test_serving_chat_v1.py::TestOpenAIServingChat::test_build_stream_response_with_fallback_truncate | AttributeError: Mock has no attribute 'abort' | The engine_client mock does not configure an abort method |
| test_serving_completion_v1.py::TestOpenAIServingCompletion::test_build_stream_response_with_fallback_truncate | AssertionError: 1 != 2 | abort is not mocked, so the exception is caught and the generator yields only 1 error response |

Root cause details:

The PR adds the output fallback truncate feature. At serving_chat.py:345 and serving_completion.py:318, when fallback_truncated=True, the code now calls await self.engine_client.abort(). In both new tests, however, engine_client is created with AsyncMock(spec='AsyncLLM'), and neither configures an abort attribute. In test_serving_chat_v1 the AttributeError propagates directly and the test errors out; in test_serving_completion_v1 the exception is caught by a try-except, so the generator yields only 1 error response instead of the expected 2 normal responses (delta chunk + [DONE]).

Suggested fix:

  1. tests/entrypoints/openai/v1/test_serving_chat_v1.py: before calling test_build_stream_response_with_fallback_truncate, add self.serving_chat.engine_client.abort = AsyncMock()
  2. tests/entrypoints/openai/v1/test_serving_completion_v1.py: before calling test_build_stream_response_with_fallback_truncate, add self.serving_completion.engine_client.abort = AsyncMock()

Related change: serving_chat.py:345 and serving_completion.py:318 add a call to await self.engine_client.abort(...)
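
The failure mode and the suggested fix can be reproduced in isolation with the standard library. Passing a string as `spec` restricts the mock to the attributes of `str`, so `abort` is missing until it is assigned explicitly. The `truncate_choice` coroutine and the `"req-0"` argument are hypothetical; the report does not show the real arguments passed to abort().

```python
import asyncio
from unittest.mock import AsyncMock

# A string spec limits readable attributes to those of `str`, so `abort` is absent.
engine_client = AsyncMock(spec="AsyncLLM")
missing_before = not hasattr(engine_client, "abort")

# The suggested fix: attach an awaitable abort explicitly.
# Assignment is allowed because `spec` (unlike `spec_set`) only restricts reads.
engine_client.abort = AsyncMock()


async def truncate_choice():
    # Hypothetical call site standing in for the serving code's abort call.
    await engine_client.abort("req-0")


asyncio.run(truncate_choice())
engine_client.abort.assert_awaited_once_with("req-0")
print(missing_before)
```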


@luukunn luukunn changed the title [Feature][APIServer] Add output fallback support for OpenAI serving [Feature]Add output fallback support for OpenAI serving May 28, 2026
Copilot AI review requested due to automatic review settings May 28, 2026 12:14


Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-29 15:57:48

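📋 Review summary

PR overview: Introduces a unified output fallback framework for OpenAI serving, supporting control semantics such as post-processing, truncation, and buffering of output in streaming and non-streaming scenarios.
Scope of changes: fastdeploy/output/fallback/, fastdeploy/entrypoints/openai/, fastdeploy/engine/args_utils.py, fastdeploy/plugins/, tests/
Impact tags: [APIServer] [DataProcessor] [FDConfig]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| ❓ Question | fastdeploy/output/fallback/manager.py:155 | When on_finish returns a truncate action, every caller checks only the text field, so the truncate semantics are silently ignored |

Status of earlier findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | The output_fallback type annotation lacks Optional | ✅ Fixed |
| F2 | The v1 streaming path lacks a fallback_truncated_choices guard set | ✅ Fixed (now uses response_ctx.truncated_choices plus unified filtering in serving_base.py) |
| F3 | The v1 completion streaming path lacks the same guard set | ✅ Fixed (same as F2) |
| F4 | repeat_truncate is not a standard OpenAI finish_reason | ✅ Fixed (changed to "length") |
| F5 | When truncate and hold/drop fire together, the truncated text is silently dropped | ✅ Fixed (explicitly returns text=""; blocked_action is not None) |
| F6 | The _calc_finish_reason return type annotation includes "repeat_truncate" | ✅ Fixed (annotation updated to Literal["stop", "length", "tool_calls", "recover_stop"]) |
| F7 | asdict(output) performs a deep copy on every streaming delta | ⚠️ Still present |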
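
The cost behind F7 comes from how `dataclasses.asdict` works: it recurses into nested containers and copies them, so each call scales with the size of the object. The sketch below contrasts it with the shallow field-by-field dictionary described in the Python documentation. `Output` is a hypothetical stand-in for the real output object, and whether a shallow view is safe for the fallback context is a decision for the authors.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import List


@dataclass
class Output:
    """Hypothetical stand-in for a per-delta output object."""
    text: str = ""
    token_ids: List[int] = field(default_factory=list)


out = Output(text="hi", token_ids=list(range(1000)))

# asdict() recursively copies nested containers on every call.
deep = asdict(out)

# Shallow alternative: the same keys, but values are shared references.
shallow = {f.name: getattr(out, f.name) for f in fields(out)}

print(deep["token_ids"] is out.token_ids)     # False: a fresh list was built
print(shallow["token_ids"] is out.token_ids)  # True: no copy was made
```

The shallow form is only appropriate when consumers treat the values as read-only, since mutations would be visible through the original object.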

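📝 PR convention check

The PR title [Feature]Add output fallback support for OpenAI serving contains two tags ([Feature] and [APIServer]); per checklist rule D1, the title may contain only one official tag.

Suggested title (can be copied directly):

  • [Feature] Add output fallback support for OpenAI serving

The PR description is structurally complete: it contains all required sections (Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist), the content is substantive, and the checklist state matches the actual changes. No changes to the description are needed.

Overall assessment

The overall design of this PR is clear. Six of the 7 earlier findings are fixed, and the core framework logic (strategy chain, state management, cleanup) is implemented correctly. One new question remains (the truncate action from on_finish is not consumed at the serving layer); the author should confirm the intended semantics. F7 (asdict performance) still awaits optimization.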

try:
    decision = strategy.on_delta(pending, replace(flush_context, delta_text=pending), state)
except Exception:
    data_processor_logger.exception(
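❓ Question: When on_finish returns action="truncate", every caller (serving_chat.py, serving_completion.py, and so on) checks only finish_decision.text and not finish_decision.action, so the truncate semantics are silently ignored.

If a strategy returns truncate from on_finish, the flush text is still sent but no abort is triggered. Please confirm whether this is intended, or whether callers need to handle action=="truncate".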
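
If the answer is that callers should honor the action, the handling could look roughly like the sketch below. It is an illustration only: `Decision` stands in for StreamFallbackDecision, and `finalize_choice` and its `abort` callback are hypothetical names, not code from this PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for StreamFallbackDecision."""
    action: str
    text: str = ""


async def finalize_choice(finish_decision, abort):
    """Return (text_to_send, truncated), consuming both text and action."""
    truncated = finish_decision.action == "truncate"
    if truncated:
        # Mirror the on_delta truncate path: stop any remaining generation.
        await abort()
    return finish_decision.text, truncated


abort_calls = []


async def fake_abort():
    abort_calls.append("abort")


flush_result = asyncio.run(finalize_choice(Decision("flush", "tail"), fake_abort))
truncate_result = asyncio.run(finalize_choice(Decision("truncate", "tail"), fake_abort))
print(flush_result, truncate_result, abort_calls)
```

Returning the `truncated` flag lets the caller also adjust finish_reason the same way it does for a truncation raised during on_delta.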


5 participants