[Feature] Add output fallback support for OpenAI serving #7942
Thanks for your contribution!
Pull request overview
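This PR adds an output fallback framework to the OpenAI-compatible serving layer. It post-processes model output on both the streaming and non-streaming paths (fixing the colon position in Markdown bold text, repairing Markdown tables, and detecting and truncating repeated output), and supports custom extensions through strategy registration and a plugin mechanism.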
Changes:
- Add the `fastdeploy/output/fallback/` subpackage, which defines the `OutputFallbackStrategy` base class, `OutputFallbackContext`, `StreamFallbackDecision`, and `OutputFallbackManager`, and ships three built-in strategies: `markdown-bold-colon`, `markdown-table`, and `repeat-truncate`.
- Wire three startup arguments (`--output-fallback`, `--output-fallback-plugin`, and `--output-fallback-config`) into `EngineArgs` / api_server, and inject the manager into the v0 / v1 chat and completion serving classes.
- Call the manager's `apply` / `on_delta` / `on_finish` / `cleanup` in the streaming and non-streaming flows; when repeat-truncate fires, set `finish_reason` to `repeat_truncate` and abort the corresponding choice.
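The streaming flow in the last item can be modeled in a few lines. The sketch below is a hypothetical, self-contained illustration rather than FastDeploy code: `Decision`, `HoldUntilNewline`, `StopOnMarker`, and `run_stream` are invented names, and the chaining rule (each strategy sees the previous one's output, any non-`send` action ends the chain for that delta, and text flushed at stream end still passes through later strategies) is an assumption about how such a manager could behave.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Decision:
    # Hypothetical decision record: action is "send", "hold", "drop" or "truncate".
    action: str
    text: str = ""


class HoldUntilNewline:
    """Illustrative strategy: buffers text until a full line is available."""

    def on_delta(self, delta: str, state: Dict) -> Decision:
        buf = state.get("buf", "") + delta
        if "\n" not in buf:
            state["buf"] = buf
            return Decision("hold")
        state["buf"] = ""
        return Decision("send", buf)

    def on_finish(self, state: Dict) -> Decision:
        # Flush whatever is still buffered when the stream ends.
        buf, state["buf"] = state.get("buf", ""), ""
        return Decision("send", buf)


class StopOnMarker:
    """Illustrative strategy: ends the stream once a marker string appears."""

    def __init__(self, marker: str) -> None:
        self.marker = marker

    def on_delta(self, delta: str, state: Dict) -> Decision:
        if self.marker in delta:
            # Keep the text before the marker and stop generation.
            return Decision("truncate", delta.split(self.marker, 1)[0])
        return Decision("send", delta)

    def on_finish(self, state: Dict) -> Decision:
        return Decision("send", "")


def run_stream(deltas: List[str], strategies: list) -> Tuple[List[str], bool]:
    """Return (emitted chunks, truncated?) for one choice."""
    states: List[Dict] = [{} for _ in strategies]  # one state dict per strategy
    emitted: List[str] = []
    for delta in deltas:
        action, text = "send", delta
        for strategy, state in zip(strategies, states):
            decision = strategy.on_delta(text, state)
            action, text = decision.action, decision.text
            if action != "send":
                break  # hold/drop/truncate stop the chain for this delta
        if action in ("send", "truncate") and text:
            emitted.append(text)
        if action == "truncate":
            return emitted, True
    # End of stream: text flushed by one strategy still passes through the
    # strategies after it (a truncate here is simplified to a plain send).
    pending = ""
    for strategy, state in zip(strategies, states):
        if pending:
            decision = strategy.on_delta(pending, state)
            pending = decision.text if decision.action in ("send", "truncate") else ""
        pending += strategy.on_finish(state).text
    if pending:
        emitted.append(pending)
    return emitted, False
```

With `[HoldUntilNewline(), StopOnMarker("STOP")]`, the deltas `["Hel", "lo\n", "wor", "ld"]` come out as `["Hello\n", "world"]`: the first strategy holds partial lines and flushes `"world"` at the end. A delta containing `STOP` instead ends the stream early and reports truncation.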
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
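| `fastdeploy/output/fallback/__init__.py` | Exposes the public classes and imports the three built-in strategies to trigger their registration |
| `fastdeploy/output/fallback/base.py` | Defines the fallback context, decision, and abstract base class |
| `fastdeploy/output/fallback/manager.py` | Registry, plugin loading, `apply` / `on_delta` / `on_finish` / `cleanup` |
| `fastdeploy/output/fallback/markdown_bold_colon.py` | Fixes the colon position in `**xxx:**`, with buffering across deltas |
| `fastdeploy/output/fallback/markdown_table.py` | Repairs Markdown table separator rows and inconsistent column counts |
| `fastdeploy/output/fallback/repeat_truncate.py` | Detects repeated output over a token window and triggers truncate |
| `fastdeploy/engine/args_utils.py` | Adds 3 new CLI arguments |
| `fastdeploy/entrypoints/openai/api_server.py` | Parses the arguments, builds the manager, and injects it into each handler; `/config-info` exposes the corresponding fields |
| `fastdeploy/entrypoints/openai/serving_chat.py` | Integrates fallback into the v0 chat streaming and non-streaming paths, including the `repeat_truncate` finish_reason |
| `fastdeploy/entrypoints/openai/serving_completion.py` | Integrates fallback into the v0 completion streaming and non-streaming paths |
| `fastdeploy/entrypoints/openai/v1/serving_base.py` | The base class constructor receives the manager and cleans up state in `finally` |
| `fastdeploy/entrypoints/openai/v1/serving_chat.py` | Integrates fallback into v1 chat (non-multimodal path) |
| `fastdeploy/entrypoints/openai/v1/serving_completion.py` | Integrates fallback into v1 completion |
| `tests/output/test_fallback.py` | Covers the manager, built-in strategies, streaming hold/flush/truncate, cleanup, and plugin import |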
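The table-repair idea listed for `markdown_table.py` (fixing the separator row and inconsistent column counts) can be illustrated with a small, stateless function. This is a hypothetical sketch, not the PR's implementation: `repair_markdown_table` and its rules (insert a `---` delimiter row when the second line is not one, then pad or trim every row to the header's column count) are assumptions.

```python
import re
from typing import List

# A delimiter cell: hyphens with optional alignment colons, e.g. "---" or ":-:".
_DELIM_CELL = re.compile(r"^:?-+:?$")


def _split_row(line: str) -> List[str]:
    """Split one pipe-table line into stripped cells (escaped pipes not handled)."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|"):
        body = body[:-1]
    return [cell.strip() for cell in body.split("|")]


def repair_markdown_table(table: str) -> str:
    """Ensure a delimiter row exists and every row matches the header width."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    if not lines:
        return table
    header = _split_row(lines[0])
    width = len(header)
    rows = [_split_row(line) for line in lines[1:]]

    # Insert a delimiter row if the second line is missing or is not one.
    if not rows or not all(_DELIM_CELL.match(cell) for cell in rows[0]):
        rows.insert(0, ["---"] * width)

    fixed = [header]
    for i, cells in enumerate(rows):
        filler = "---" if i == 0 else ""  # pad the delimiter row vs. data rows
        fixed.append((cells + [filler] * width)[:width])  # pad short, trim long
    return "\n".join("| " + " | ".join(row) + " |" for row in fixed)
```

Applied to a finished table this is stateless; in streaming mode a strategy like this would presumably need to hold table lines until the table is complete, which is what the `hold` / `flush` semantics make possible.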
```python
choice_completion_tokens = response_ctx.choice_completion_tokens_dict[output.index]
choice.finish_reason = self._calc_finish_reason(request_output, max_tokens, choice_completion_tokens)
if fallback_truncated:
    choice.finish_reason = "repeat_truncate"
```

```python
if res.get("error_msg") is not None and "Aborted" in res["error_msg"]:
    choices[-1].finish_reason = "abort"
if fallback_truncated:
    choices[-1].finish_reason = "repeat_truncate"
```

```python
choice.finish_reason = "abort"
```

```python
if fallback_truncated:
    choice.finish_reason = "repeat_truncate"
```

```python
if fallback_truncated:
    choice.finish_reason = "repeat_truncate"
```
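The `fallback_truncated` flag in these snippets is set when the repeat-truncate strategy fires. The PR describes that strategy only as token-window-based repeat detection, so the following is a hypothetical sketch of one such check: `RepeatDetector`, `max_period`, and `min_repeats` are invented names, and the rule (the tail of the stream is `min_repeats` back-to-back copies of a unit of at most `max_period` tokens) is an assumption.

```python
from typing import List


class RepeatDetector:
    """Flags a token stream whose tail is one short unit repeated back to back."""

    def __init__(self, max_period: int = 8, min_repeats: int = 4) -> None:
        self.max_period = max_period    # longest repeating unit checked, in tokens
        self.min_repeats = min_repeats  # consecutive copies needed to fire
        self.tokens: List[int] = []

    def feed(self, new_tokens: List[int]) -> bool:
        """Add one delta's token ids; return True if truncation should fire."""
        self.tokens.extend(new_tokens)
        # Only the last max_period * min_repeats tokens can matter (the window).
        self.tokens = self.tokens[-self.max_period * self.min_repeats:]
        for period in range(1, self.max_period + 1):
            span = period * self.min_repeats
            if len(self.tokens) < span:
                break  # longer periods need even more tokens
            tail = self.tokens[-span:]
            if tail == tail[:period] * self.min_repeats:
                return True
        return False
```

When `feed` returns `True`, the serving layer would stop emitting for that choice. Per finding F4 in the bot review, the reported `finish_reason` for this case was later changed from `repeat_truncate` to `"length"`.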
Codecov Report

❌ Patch coverage is …

```diff
@@           Coverage Diff            @@
##             develop    #7942   +/-   ##
==========================================
  Coverage           ?   67.67%
==========================================
  Files              ?      471
  Lines              ?    65505
  Branches           ?    10075
==========================================
  Hits               ?    44328
  Misses             ?    18325
  Partials           ?     2852
```

Flags with carried forward coverage won't be shown.
CI report generated from the following code (updated every 30 minutes):

1 Task overview
❌ 1 required task failed and must be fixed before the PR can be merged.

2 Task status summary
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 29/32 passed

3 Failure details (required tasks only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)
Failed cases:
Root cause: The PR adds the output fallback truncate feature; in …
Suggested fix:
Related changes:
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-29 15:57:48
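📋 Review summary
PR overview: Introduces a unified output fallback framework for OpenAI serving, supporting control semantics such as output post-processing, truncation, and buffering in both streaming and non-streaming scenarios.
Scope of changes: fastdeploy/output/fallback/, fastdeploy/entrypoints/openai/, fastdeploy/engine/args_utils.py, fastdeploy/plugins/, tests/
Impact tags: [APIServer] [DataProcessor] [FDConfig]
Issues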
| Level | File | Summary |
|---|---|---|
| ❓ Question | fastdeploy/output/fallback/manager.py:155 | When `on_finish` returns a truncate action, every caller checks only the `text` field, so the truncate semantics are silently ignored |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | The `output_fallback` type annotation is missing `Optional` | ✅ Fixed |
| F2 | The v1 streaming path lacks a `fallback_truncated_choices` guard set | ✅ Fixed (switched to `response_ctx.truncated_choices` plus unified filtering in `serving_base.py`) |
| F3 | The v1 completion streaming path also lacks the guard set | ✅ Fixed (same as F2) |
| F4 | `repeat_truncate` is not a standard OpenAI `finish_reason` | ✅ Fixed (changed to `"length"`) |
| F5 | When truncate fires together with hold/drop, the truncated text is silently discarded | ✅ Fixed (explicitly returns `text=""` when `blocked_action is not None`) |
| F6 | The `_calc_finish_reason` return type annotation includes `"repeat_truncate"` | ✅ Fixed (annotation updated to `Literal["stop", "length", "tool_calls", "recover_stop"]`) |
| F7 | `asdict(output)` performs a deep copy on every streaming delta | |
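Finding F7 is about cost: `dataclasses.asdict` recurses into nested dataclasses, lists, tuples, and dicts and deep-copies any other values, so calling it on every delta copies the whole output object each time. The sketch below contrasts it with a shallow, top-level-only conversion on a hypothetical `Output` dataclass (not FastDeploy's actual output type); `shallow_dict` is an invented helper.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class Output:
    # Hypothetical stand-in for a per-delta output record.
    text: str
    token_ids: List[int] = field(default_factory=list)


def shallow_dict(obj: Any) -> Dict[str, Any]:
    """Top-level fields only: nested values are shared, not copied."""
    return {f.name: getattr(obj, f.name) for f in fields(obj)}


out = Output(text="hi", token_ids=[1, 2, 3])
deep = asdict(out)          # recursively rebuilds lists/dicts/dataclasses
shallow = shallow_dict(out)

out.token_ids.append(4)
print(deep["token_ids"])     # [1, 2, 3]     (copied when asdict ran)
print(shallow["token_ids"])  # [1, 2, 3, 4]  (same list object as out.token_ids)
```

The shallow version avoids the copy but shares nested objects (and leaves nested dataclasses unconverted), so it is only safe when downstream code does not mutate or serialize those nested values expecting plain dicts.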
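📝 PR convention check
The PR title `[Feature]Add output fallback support for OpenAI serving` contains two tags ([Feature] and [APIServer]); per checklist rule D1, a title may contain only one official tag.
Suggested title (ready to copy):
[Feature] Add output fallback support for OpenAI serving
The PR description is well structured and contains all required sections (Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist) with substantial content, and the checklist state matches the actual changes. No changes to the description are needed.
Overall assessment
The overall design of this PR is clear. Six of the seven previous findings have been fixed, and the core framework logic (strategy chain, state management, cleanup) is implemented correctly. One new question remains (the truncate action returned by `on_finish` is not consumed by the serving layer); please confirm the intended semantics. F7 (asdict performance) still needs optimization.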
```python
try:
    decision = strategy.on_delta(pending, replace(flush_context, delta_text=pending), state)
except Exception:
    data_processor_logger.exception(
```

❓ Question: When `on_finish` returns `action="truncate"`, every caller (`serving_chat.py`, `serving_completion.py`, etc.) checks only `finish_decision.text` and never `finish_decision.action`, so the truncate semantics are silently ignored.
If a strategy returns truncate from `on_finish`, the flushed text is still sent, but no abort is triggered. Please confirm whether this is the intended behavior, or whether the callers need additional handling for `action == "truncate"`.
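To make the question concrete, here is a hypothetical caller-side sketch that reads both fields of the final decision. `FinishDecision` and `finalize_choice` are invented for illustration; reporting truncation as `"length"` follows finding F4, and whether an end-of-stream truncate should also abort is the open question above, not settled behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FinishDecision:
    # Hypothetical shape of the value returned by on_finish.
    action: str = "send"        # e.g. "send" or "truncate"
    text: Optional[str] = None  # flushed text, if any


def finalize_choice(decision: Optional[FinishDecision], finish_reason: str) -> Tuple[str, str, bool]:
    """Return (text_to_send, finish_reason, should_abort) for one choice."""
    if decision is None:
        return "", finish_reason, False
    text = decision.text or ""
    if decision.action == "truncate":
        # Still send the flushed text, but surface the truncation instead of dropping it.
        return text, "length", True
    return text, finish_reason, False
```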
Motivation
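OpenAI serving currently lacks a unified fallback extension mechanism for output processing. When the business side wants to apply supplementary processing to model output, for example intercepting, caching, or truncating part of the output in streaming scenarios, or post-processing the complete text in non-streaming scenarios, the existing pipeline offers no unified abstraction or extensible entry point.
This PR introduces an output fallback framework that gives OpenAI serving a unified way to post-process output, supporting:
- streaming control semantics such as `send` / `hold` / `drop` / `flush` / `truncate`

Modifications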
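The main changes in this PR are:

- Add the output fallback framework
  - Add the `fastdeploy/output/fallback/` module
  - Add the `OutputFallbackStrategy` abstract base class, which defines the fallback strategy interface
  - Add `OutputFallbackContext`, which carries context such as request, request_id, choice_index, stream, and output
  - Add `StreamFallbackDecision`, which expresses a strategy's decision in streaming scenarios
  - Add `OutputFallbackManager`, which handles strategy registration, instantiation, chained execution, state management, and plugin import
- Integrate the output fallback manager into the OpenAI serving path
  - `fastdeploy/entrypoints/openai/api_server.py`
  - `fastdeploy/entrypoints/openai/serving_chat.py`
  - `fastdeploy/entrypoints/openai/serving_completion.py`
  - `fastdeploy/entrypoints/openai/v1/serving_base.py`
  - `fastdeploy/entrypoints/openai/v1/serving_chat.py`
  - `fastdeploy/entrypoints/openai/v1/serving_completion.py`
- Add output fallback startup arguments
  - `--output-fallback`
  - `--output-fallback-plugin`
  - `--output-fallback-config`
- Support applying fallback to the complete text in non-streaming scenarios
  - `OutputFallbackManager.apply()` processes the final generated text
- Support applying fallback to incremental output in streaming scenarios
  - `on_delta()` processes each delta
  - `on_finish()` performs a flush when the stream ends
  - `send`: send the current text
  - `hold`: buffer the current text and emit nothing this round
  - `drop`: discard the current text
  - `flush`: emit the buffered content when the stream ends
  - `truncate`: send the current text and terminate further generation early
- Add a plugin loading mechanism
  - `fastdeploy.plugins.output_fallback`
  - Automatically load output fallback plugins via `fastdeploy.output_fallback_plugins`
  - Import an external plugin from the path given by `--output-fallback-plugin`
- Add tests
  - `tests/output/test_fallback.py`

Usage or Command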
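Example of enabling the built-in fallback strategies:

Example of configuring strategy parameters:

```bash
--output-fallback-config '{"your-strategy-name": {"key": "value"}}'
```

Example of loading a custom fallback plugin: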
How to add a custom output fallback strategy
You can define a strategy by subclassing `OutputFallbackStrategy` and registering it with `OutputFallbackManager.register(...)`.
Example:
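A minimal, hypothetical sketch of such a strategy follows. To stay self-contained it does not import FastDeploy: `Strategy`, `register`, and `REGISTRY` are stand-ins for `OutputFallbackStrategy` and `OutputFallbackManager.register(...)`, decisions are plain `(action, text)` tuples, and `context` is omitted; only the method names and the per-choice `state` dict follow the interface described in the notes below. The strategy moves a colon out of bold text (`**Note:**` becomes `**Note**:`), one plausible reading of the `markdown-bold-colon` fix: under CommonMark's flanking rules, a closing `**` preceded by punctuation and followed directly by a letter or CJK character cannot close emphasis, so `**标题:**内容` does not render as bold.

```python
import re
from typing import Dict, Tuple

REGISTRY: Dict[str, type] = {}  # stand-in for the manager's strategy registry


def register(name: str):
    """Hypothetical decorator standing in for OutputFallbackManager.register(...)."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


class Strategy:
    """Stand-in base class; only the method names mirror the PR description."""

    def should_apply(self, text: str) -> bool:
        return True

    def apply(self, text: str) -> str:
        return text


# "**label:**" with an ASCII or full-width colon just before the closing "**".
_BOLD_COLON = re.compile(r"\*\*([^*\n]+?)([::])\*\*")


@register("bold-colon-demo")
class BoldColonDemo(Strategy):
    def should_apply(self, text: str) -> bool:
        return bool(_BOLD_COLON.search(text))

    def apply(self, text: str) -> str:
        # "**Note:**" -> "**Note**:"
        return _BOLD_COLON.sub(r"**\1**\2", text)

    def on_delta(self, delta: str, state: Dict) -> Tuple[str, str]:
        buf = state.get("buf", "") + delta
        # An odd number of "**" means a bold span is still open, and a trailing
        # "*" may be half of a marker: hold until the next delta.
        if buf.count("**") % 2 == 1 or buf.endswith("*"):
            state["buf"] = buf
            return ("hold", "")
        state["buf"] = ""
        return ("send", self.apply(buf))

    def on_finish(self, state: Dict) -> Tuple[str, str]:
        # Flush whatever is still buffered, fixed as far as possible.
        buf, state["buf"] = state.get("buf", ""), ""
        return ("flush", self.apply(buf))
```

Feeding the deltas `"Intro **No"`, `"te:*"`, `"* done"` holds the first two and then sends `"Intro **Note**: done"`. The `on_delta` override exists only because a bold span can be split across deltas; a strategy that never needs cross-chunk state could rely on `should_apply` / `apply` alone.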
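Notes on custom strategies:

- `should_apply(text, context)`: decides whether fallback should be applied to the current text.
- `apply(text, context)`: processes the complete text in non-streaming mode.
  - The default `on_delta()` implementation also reuses these two interfaces for stateless processing.
- `on_delta(delta_text, context, state)`: processes incremental text in streaming mode.
  - `state` is a state dict scoped to the current request/choice/strategy and can be used to cache state across chunks.
- `on_finish(context, state)`: runs after the stream ends and performs the flush logic, emitting any remaining buffered content.

There are two ways to load a plugin:

- Use `--output-fallback-plugin /path/to/custom_fallback.py`.
- Register the plugin under the `fastdeploy.output_fallback_plugins` entry point group.

Accuracy Tests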
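This PR does not modify model weights, kernels, or model forward computation logic, so it does not affect numerical accuracy; no accuracy comparison tests were therefore run.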
Checklist
- [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- … `pre-commit` before commit.
- … `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.