[Feature]Add output fallback support for OpenAI serving by luukunn · Pull Request #7942 · PaddlePaddle/FastDeploy

luukunn · 2026-05-27T10:02:57Z

Motivation

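OpenAI serving currently has no unified fallback extension mechanism for output processing. The business side may want to post-process model output. Examples include intercepting, caching, or truncating part of the output in streaming mode, and post-processing the complete text in non-streaming mode. The existing pipeline offers neither a unified abstraction nor an extension point for this.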

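This PR introduces an output fallback framework that gives OpenAI serving a unified layer for output post-processing. It supports:

Post-processing of the complete output text
Per-chunk processing of streaming output deltas
Streaming control semantics: send / hold / drop / flush / truncate
Registering and loading custom fallback strategies as plugins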

Modifications

This PR makes the following changes:

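Added the output fallback framework
- New module fastdeploy/output/fallback/
- New abstract base class OutputFallbackStrategy, which defines the fallback strategy interface
- New OutputFallbackContext, which carries context such as request, request_id, choice_index, stream, and output in a single object
- New StreamFallbackDecision, which expresses a strategy's decision in streaming mode
- New OutputFallbackManager, responsible for strategy registration, instantiation, chained execution, state management, and plugin import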
Wired the output fallback manager into the OpenAI serving path
- fastdeploy/entrypoints/openai/api_server.py
- fastdeploy/entrypoints/openai/serving_chat.py
- fastdeploy/entrypoints/openai/serving_completion.py
- fastdeploy/entrypoints/openai/v1/serving_base.py
- fastdeploy/entrypoints/openai/v1/serving_chat.py
- fastdeploy/entrypoints/openai/v1/serving_completion.py
Added startup arguments for output fallback
- --output-fallback
- --output-fallback-plugin
- --output-fallback-config
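Applying fallback to the complete text in non-streaming mode
- OutputFallbackManager.apply() processes the final generated text
- Suited to stateless post-processing or logic that needs the full text
Applying fallback to incremental output in streaming mode
- on_delta() processes each delta
- on_finish() performs a flush when the stream ends
- Supported actions:
  - send: emit the current text
  - hold: buffer the current text and emit nothing this round
  - drop: discard the current text
  - flush: emit the buffered content when the stream ends
  - truncate: emit the current text and stop further generation
- When a fallback triggers truncation, the server actively aborts further generation for that choice and skips any residual output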
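Added a plugin loading mechanism
- New fastdeploy.plugins.output_fallback
- Output fallback plugins are loaded automatically from the plugin group fastdeploy.output_fallback_plugins
- An external plugin path can also be imported with --output-fallback-plugin
Added tests
- New tests/output/test_fallback.py
- Covers default strategy behavior, chained execution in the manager, streaming state transitions, truncate/flush, cleanup, and plugin import
- Unit tests for fallback truncate in OpenAI chat/completion streaming
- Compatibility tests for the new parameters in the API server / metrics route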
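The registration and chained-execution behavior of OutputFallbackManager can be illustrated with a self-contained toy. This is a minimal sketch, not FastDeploy's implementation. `ToyFallbackManager`, `ToyStrategy`, the two example strategies, and the constructor signature are all hypothetical. Only the `register(name)` decorator and the `should_apply`/`apply` contract come from this PR's description.

```python
from typing import Callable, Dict, List, Optional, Type


class ToyStrategy:
    """Hypothetical stand-in for OutputFallbackStrategy (non-streaming part only)."""

    name = "base"

    def __init__(self, **config):
        # Per-strategy options, e.g. taken from --output-fallback-config.
        self.config = config

    def should_apply(self, text: str, context: dict) -> bool:
        return False

    def apply(self, text: str, context: dict) -> str:
        return text


class ToyFallbackManager:
    """Hypothetical manager: a class-level registry plus ordered chain execution."""

    _registry: Dict[str, Type[ToyStrategy]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[ToyStrategy]], Type[ToyStrategy]]:
        def decorator(strategy_cls: Type[ToyStrategy]) -> Type[ToyStrategy]:
            cls._registry[name] = strategy_cls
            return strategy_cls

        return decorator

    def __init__(self, names: List[str], configs: Optional[Dict[str, dict]] = None):
        configs = configs or {}
        # Instantiate strategies in the order they were requested.
        self.strategies = [self._registry[n](**configs.get(n, {})) for n in names]

    def apply(self, text: str, context: dict) -> str:
        # Each strategy sees the output of the previous one (a chain).
        for strategy in self.strategies:
            if strategy.should_apply(text, context):
                text = strategy.apply(text, context)
        return text


@ToyFallbackManager.register("replace-bad")
class ReplaceBad(ToyStrategy):
    def should_apply(self, text, context):
        return "bad" in text

    def apply(self, text, context):
        return text.replace("bad", "good")


@ToyFallbackManager.register("upper")
class Upper(ToyStrategy):
    def should_apply(self, text, context):
        return True

    def apply(self, text, context):
        return text.upper()
```

Each strategy receives the previous strategy's output, so order matters. `["replace-bad", "upper"]` turns `"a bad day"` into `"A GOOD DAY"`, while `["upper", "replace-bad"]` yields `"A BAD DAY"`. This assumes the real manager also runs strategies in the order given.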

Usage or Command

Example: enabling a built-in fallback strategy:

--output-fallback your-strategy-name

Example: configuring strategy parameters:

--output-fallback-config '{"your-strategy-name": {"key": "value"}}'
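The sketch below uses `argparse` to show how a JSON value for `--output-fallback-config` might map onto per-strategy options. The real parser in `fastdeploy/engine/args_utils.py` may differ. In particular, accepting several names with `nargs="*"` is a guess, not documented behavior. So is passing the inner object to the strategy as keyword options. `build_parser` and `strategy_kwargs` are hypothetical helpers.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumed shape: one or more strategy names, applied in order.
    parser.add_argument("--output-fallback", nargs="*", default=[])
    # Assumed shape: a JSON object mapping strategy name -> options.
    parser.add_argument("--output-fallback-config", type=json.loads, default={})
    return parser


def strategy_kwargs(args: argparse.Namespace, name: str) -> dict:
    """Return the options for one strategy, or {} if none were configured."""
    config = args.output_fallback_config
    if not isinstance(config, dict):
        raise ValueError("--output-fallback-config must be a JSON object")
    return dict(config.get(name, {}))
```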

Example: loading a custom fallback plugin:

--output-fallback-plugin /path/to/custom_fallback.py
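Importing a plugin from a file path is typically done with `importlib.util`. Executing the module runs its module-level `@OutputFallbackManager.register(...)` decorators, which is what makes the strategy selectable. The helper name `import_plugin_file` and the generated module name are hypothetical. FastDeploy's loader may use a different mechanism.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def import_plugin_file(path: str) -> ModuleType:
    """Import a .py file by path so its module-level register(...) calls run."""
    file_path = Path(path).resolve()
    module_name = f"output_fallback_plugin_{file_path.stem}"
    spec = importlib.util.spec_from_file_location(module_name, str(file_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plugin from {file_path}")
    module = importlib.util.module_from_spec(spec)
    # Register in sys.modules before executing, as the import system does.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```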

How to add a custom output fallback strategy

To add a strategy, subclass OutputFallbackStrategy and register it with OutputFallbackManager.register(...).
Example:

from fastdeploy.output.fallback import (
    OutputFallbackContext,
    OutputFallbackManager,
    OutputFallbackStrategy,
    StreamFallbackDecision,
)


@OutputFallbackManager.register("custom-fallback")
class CustomFallbackStrategy(OutputFallbackStrategy):
    name = "custom-fallback"

    def should_apply(self, text: str, context: OutputFallbackContext) -> bool:
        return "bad" in text

    def apply(self, text: str, context: OutputFallbackContext) -> str:
        return text.replace("bad", "good")

    def on_delta(
        self,
        delta_text: str,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        # Customize the streaming logic here as needed
        if "stop" in delta_text:
            return StreamFallbackDecision(action="truncate", text=delta_text)
        return StreamFallbackDecision(action="send", text=delta_text)

    def on_finish(
        self,
        context: OutputFallbackContext,
        state: dict,
    ) -> StreamFallbackDecision:
        return StreamFallbackDecision(action="flush")

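Notes on custom strategies:

should_apply(text, context)
- Decides whether fallback should be applied to the current text
apply(text, context)
- Processes the complete text in non-streaming mode
- The default on_delta() implementation reuses these two methods for stateless processing
on_delta(delta_text, context, state)
- Processes incremental text in streaming mode
- state is a dict scoped to the current request/choice/strategy and can carry state across chunks
on_finish(context, state)
- Runs the flush logic after the stream ends and emits any remaining buffered content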
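The streaming contract above can be exercised without a server. The sketch assumes a simplified interface with no `context` argument. `Decision` stands in for `StreamFallbackDecision`, and `run_stream` is a hypothetical consumer.

The example shows how `state` lets a strategy hold a partial match such as `"ba"` across chunks. The strategy sends once the match resolves, flushes leftovers in `on_finish`, and stops the stream on truncate. In this toy the `"stop"` check only sees text that has already been released, so a marker split across chunks is not caught.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Decision:
    """Hypothetical stand-in for StreamFallbackDecision."""
    action: str  # "send" | "hold" | "drop" | "truncate" | "flush"
    text: str = ""


class CrossChunkReplace:
    """Replaces `target` with `replacement`, even when `target` spans deltas."""

    def __init__(self, target: str = "bad", replacement: str = "good", stop: str = "stop"):
        self.target, self.replacement, self.stop = target, replacement, stop

    def on_delta(self, delta_text: str, state: dict) -> Decision:
        pending = (state.get("buffer", "") + delta_text).replace(self.target, self.replacement)
        # Hold back the longest suffix that could still grow into `target`.
        keep = 0
        for size in range(len(self.target) - 1, 0, -1):
            if pending.endswith(self.target[:size]):
                keep = size
                break
        state["buffer"] = pending[len(pending) - keep:] if keep else ""
        ready = pending[: len(pending) - keep]
        if self.stop in ready:
            # Release text up to the marker, then ask the caller to stop.
            return Decision("truncate", ready[: ready.index(self.stop)])
        return Decision("send", ready) if ready else Decision("hold")

    def on_finish(self, state: dict) -> Decision:
        # Whatever is still buffered is emitted unchanged.
        return Decision("flush", state.pop("buffer", ""))


def run_stream(strategy, deltas: List[str]) -> Tuple[List[str], str]:
    """Hypothetical consumer that turns decisions into emitted chunks."""
    state: dict = {}
    emitted: List[str] = []
    for delta in deltas:
        decision = strategy.on_delta(delta, state)
        if decision.action in ("send", "truncate") and decision.text:
            emitted.append(decision.text)
        if decision.action == "truncate":
            # The real server would also abort this choice's remaining generation.
            return emitted, "truncated"
        # "hold" and "drop" emit nothing this round.
    final = strategy.on_finish(state)
    if final.text:
        emitted.append(final.text)
    return emitted, "finished"
```

For example, the deltas `["b", "a", "d!"]` produce two holds followed by a single send of `"good!"`.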

There are two ways to load a plugin:

By plugin path:
- Use --output-fallback-plugin /path/to/custom_fallback.py
Automatically, through the plugin group:
- Register the plugin under the fastdeploy.output_fallback_plugins entry point group
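Discovery through the `fastdeploy.output_fallback_plugins` group can be sketched with the standard library's `importlib.metadata`. A package would declare the entry point, for example under `[project.entry-points."fastdeploy.output_fallback_plugins"]` in its `pyproject.toml`. The function below is hypothetical. It assumes that loading the entry point (importing its target) is enough to trigger registration. This PR description does not say whether FastDeploy also calls the loaded object.

```python
from importlib.metadata import entry_points
from typing import List


def load_output_fallback_plugins(group: str = "fastdeploy.output_fallback_plugins") -> List[str]:
    """Load every entry point in `group` and return the names that were loaded."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10+
        selected = eps.select(group=group)
    else:  # Python 3.8/3.9 return a dict of group -> entry points
        selected = eps.get(group, [])
    loaded = []
    for ep in selected:
        # Importing the target runs its module-level register(...) decorators.
        ep.load()
        loaded.append(ep.name)
    return loaded
```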

Accuracy Tests

This PR does not modify model weights, kernels, or model forward computation, so it does not affect numerical accuracy. No accuracy comparison was run.

Checklist

Add at least one tag to the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code and run pre-commit before committing.
Add unit tests. If there are no unit tests, explain why in this PR.
Provide accuracy results.
If the current PR is being submitted to the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-27T10:03:05Z

Thanks for your contribution!

Copilot

Pull request overview

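This PR adds an output fallback framework to the OpenAI-compatible server. The framework post-processes model output on both the streaming and non-streaming paths: it fixes colons inside Markdown bold, repairs Markdown tables, and truncates when repeated output is detected. Strategy registration and a plugin mechanism allow custom extensions.

Changes:

Adds the fastdeploy/output/fallback/ subpackage. It defines the OutputFallbackStrategy base class, OutputFallbackContext, StreamFallbackDecision, and OutputFallbackManager, and ships three built-in strategies: markdown-bold-colon, markdown-table, and repeat-truncate.
Wires three startup arguments into EngineArgs / api_server: --output-fallback, --output-fallback-plugin, and --output-fallback-config. The manager is injected into the v0 / v1 chat and completion serving classes.
Calls the manager's apply / on_delta / on_finish / cleanup in the streaming and non-streaming flows. When repeat-truncate fires, finish_reason is set to repeat_truncate and the corresponding choice is aborted.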
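The PR describes repeat-truncate only as detecting repetition over a token window. One common approach is to check whether the tail of the window is a single block repeated several times. That approach is shown here as an assumption, not as the PR's algorithm or thresholds. `find_repeat` and its default parameters are hypothetical.

```python
from typing import Optional, Sequence


def find_repeat(tokens: Sequence, window: int = 64, min_repeats: int = 3,
                max_period: int = 16) -> Optional[int]:
    """Return the period if the tail of `tokens` is one block repeated
    `min_repeats` times in a row, else None. Only the last `window` tokens are examined."""
    tail = list(tokens[-window:])
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > len(tail):
            break
        block = tail[-period:]
        if tail[-span:] == block * min_repeats:
            return period
    return None
```

The tokens can be token ids or, for a rough text-level check, whitespace-split words. A streaming strategy could append each step's tokens to `state` and return a `truncate` decision once `find_repeat` returns a period.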

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Summary per file:

File	Description
fastdeploy/output/fallback/__init__.py	Exposes the public classes and imports the three built-in strategies to trigger their registration
fastdeploy/output/fallback/base.py	Defines the fallback context / decision / abstract base class
fastdeploy/output/fallback/manager.py	Registry / plugin loading / `apply` / `on_delta` / `on_finish` / `cleanup`
fastdeploy/output/fallback/markdown_bold_colon.py	Fixes the colon position in `xxx：`, with buffering across deltas
fastdeploy/output/fallback/markdown_table.py	Repairs Markdown table separator rows and inconsistent column counts
fastdeploy/output/fallback/repeat_truncate.py	Detects repeated output over a token window and triggers truncate
fastdeploy/engine/args_utils.py	Adds 3 new CLI arguments
fastdeploy/entrypoints/openai/api_server.py	Parses the arguments, builds the manager, injects it into each handler, and exposes the corresponding fields in `/config-info`
fastdeploy/entrypoints/openai/serving_chat.py	Wires fallback into the v0 chat streaming/non-streaming paths, including the repeat_truncate finish_reason
fastdeploy/entrypoints/openai/serving_completion.py	Wires fallback into the v0 completion streaming/non-streaming paths
fastdeploy/entrypoints/openai/v1/serving_base.py	The base-class constructor accepts the manager and cleans up state in `finally`
fastdeploy/entrypoints/openai/v1/serving_chat.py	Wires fallback into v1 chat (non-multimodal path)
fastdeploy/entrypoints/openai/v1/serving_completion.py	Wires fallback into v1 completion
tests/output/test_fallback.py	Covers the manager, built-in strategies, streaming hold/flush/truncate, cleanup, and plugin import
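The table says markdown-bold-colon fixes the colon position in `xxx：`. The PR text does not spell out the motivation, so what follows is an assumption. Under CommonMark's flanking rules, the closing `**` in `**标题：**正文` is preceded by punctuation and followed by a non-punctuation character, so it is not right-flanking and the bold does not render. Moving the colon outside, as in `**标题**：正文`, fixes that. Below is a non-streaming sketch with a hypothetical regex and function name.

```python
import re

# Hypothetical pattern: "**label：**" / "**label:**" -> "**label**：" / "**label**:".
_BOLD_COLON = re.compile(r"\*\*([^*\n]+?)([:：])\*\*")


def move_colon_outside_bold(text: str) -> str:
    """Move a trailing half- or full-width colon out of a bold span."""
    return _BOLD_COLON.sub(r"**\1**\2", text)
```

The cross-delta buffering mentioned in the table is omitted here. A streaming version would need to hold text while a `**` pair is still open.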

            choice_completion_tokens = response_ctx.choice_completion_tokens_dict[output.index]
            choice.finish_reason = self._calc_finish_reason(request_output, max_tokens, choice_completion_tokens)
+            if fallback_truncated:
+                choice.finish_reason = "repeat_truncate"


                        if res.get("error_msg") is not None and "Aborted" in res["error_msg"]:
                            choices[-1].finish_reason = "abort"
+                        if fallback_truncated:
+                            choices[-1].finish_reason = "repeat_truncate"


                            choice.finish_reason = "abort"

+                        if fallback_truncated:
+                            choice.finish_reason = "repeat_truncate"


+                if fallback_truncated:
+                    choice.finish_reason = "repeat_truncate"


codecov-commenter · 2026-05-27T10:43:25Z

Codecov Report

❌ Patch coverage is 72.39264% with 90 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a918693). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/output/fallback/manager.py	68.59%	26 Missing and 12 partials ⚠️
...deploy/entrypoints/openai/v1/serving_completion.py	63.63%	8 Missing and 4 partials ⚠️
fastdeploy/entrypoints/openai/v1/serving_chat.py	67.64%	7 Missing and 4 partials ⚠️
fastdeploy/entrypoints/openai/serving_chat.py	80.00%	4 Missing and 3 partials ⚠️
fastdeploy/entrypoints/openai/api_server.py	25.00%	4 Missing and 2 partials ⚠️
...astdeploy/entrypoints/openai/serving_completion.py	85.29%	3 Missing and 2 partials ⚠️
fastdeploy/entrypoints/openai/v1/serving_base.py	44.44%	3 Missing and 2 partials ⚠️
fastdeploy/plugins/output_fallback/__init__.py	60.00%	2 Missing and 2 partials ⚠️
fastdeploy/output/fallback/base.py	92.59%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7942   +/-   ##
==========================================
  Coverage           ?   67.67%           
==========================================
  Files              ?      471           
  Lines              ?    65505           
  Branches           ?    10075           
==========================================
  Hits               ?    44328           
  Misses             ?    18325           
  Partials           ?     2852

Flag	Coverage Δ
GPU	`77.87% <72.39%> (?)`
XPU	`7.06% <3.37%> (?)`

Flags with carried forward coverage won't be shown.


PaddlePaddle-bot · 2026-05-27T10:59:04Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-30 01:28:28

This CI report was generated from the following code (updated every 30 minutes):

PR commit: 6355b4a
Merge base: a918693 (branch: develop)

1 Task overview

❌ 1 required task failed. It must be fixed before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
42(0)	42	38	4	0	0	0

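2 Task status summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be fixed first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h26m	PR issue: the tests do not mock engine_client.abort	Add abort = AsyncMock() to the engine_client mock in both test files	Job	-
✅	The other 9 required tasks passed	-	-	-	-	-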

2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; their failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	1m40s	Job	-
❌	`CI_HPU`	1h5m	Job	-
❌	`Trigger Jenkins for PR`	17s	Job	-
✅	The other 29 optional tasks passed	-	-	-

3 Failure details (required tasks only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: Test failure
Confidence: High
Root cause summary: the new fallback truncate path calls engine_client.abort, but the tests do not mock the abort method
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_serving_chat_v1.py::TestOpenAIServingChat::test_build_stream_response_with_fallback_truncate`	AttributeError: Mock has no attribute 'abort'	The engine_client mock does not configure an abort method
`test_serving_completion_v1.py::TestOpenAIServingCompletion::test_build_stream_response_with_fallback_truncate`	AssertionError: 1 != 2	With abort unmocked, the exception is caught and the generator yields only 1 error response

Root cause details:

The PR's output fallback truncate feature adds a call to await self.engine_client.abort() at serving_chat.py:345 and serving_completion.py:318 when fallback_truncated=True. However, both new tests create engine_client with AsyncMock(spec='AsyncLLM'), and neither configures an abort attribute. In test_serving_chat_v1, the AttributeError propagates and the test errors out. In test_serving_completion_v1, a try-except catches the exception, so the generator yields only 1 error response. The test expects 2 normal responses (the delta chunk + [DONE]).

Suggested fix:

tests/entrypoints/openai/v1/test_serving_chat_v1.py: in test_build_stream_response_with_fallback_truncate, add self.serving_chat.engine_client.abort = AsyncMock() before the call
tests/entrypoints/openai/v1/test_serving_completion_v1.py: in test_build_stream_response_with_fallback_truncate, add self.serving_completion.engine_client.abort = AsyncMock() before the call

Related changes: new await self.engine_client.abort(...) calls at serving_chat.py:345 and serving_completion.py:318
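The failure can be reproduced in isolation. As we read `unittest.mock`'s rules, passing the string `'AsyncLLM'` as `spec` makes the mock mirror the attributes of that `str` object rather than the `AsyncLLM` class. That is why `abort` is missing. Assigning an `AsyncMock` explicitly, as the suggested fix does, restores it. `stream_with_truncate` is a hypothetical stand-in for the serving code path, not FastDeploy code.

```python
import asyncio
from unittest.mock import AsyncMock


async def stream_with_truncate(engine_client, request_id: str):
    """Hypothetical streaming path: emit a delta, abort on truncate, then finish."""
    yield "data: {delta}"
    # Mirrors the new call site: await self.engine_client.abort(...)
    await engine_client.abort(request_id)
    yield "data: [DONE]"


async def collect(engine_client) -> list:
    return [chunk async for chunk in stream_with_truncate(engine_client, "req-0")]
```

With `client = AsyncMock(spec="AsyncLLM")`, accessing `client.abort` raises `AttributeError`. After `client.abort = AsyncMock()`, `asyncio.run(collect(client))` yields both chunks and the abort is awaited once.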


Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-29 15:57:48

📋 Review summary

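PR overview: introduces a unified output fallback framework for OpenAI serving. It supports output post-processing, truncation, buffering, and other control semantics in both streaming and non-streaming modes.
Scope of changes: fastdeploy/output/fallback/, fastdeploy/entrypoints/openai/, fastdeploy/engine/args_utils.py, fastdeploy/plugins/, tests/
Affected-area tags: [APIServer] [DataProcessor] [FDConfig]

Issues

Level	File	Summary
❓ Question	`fastdeploy/output/fallback/manager.py:155`	When `on_finish` returns a `truncate` action, every caller checks only the `text` field, so the `truncate` semantics are silently ignored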

Status of earlier findings

Finding	Issue	Status
F1	The type annotation of `output_fallback` lacks `Optional`	✅ Fixed
F2	The v1 streaming path lacks a `fallback_truncated_choices` guard set	✅ Fixed (now uses `response_ctx.truncated_choices`, with unified filtering in `serving_base.py`)
F3	The v1 completion streaming path also lacks the guard set	✅ Fixed (same as F2)
F4	`repeat_truncate` is not a standard OpenAI finish_reason	✅ Fixed (changed to `"length"`)
F5	When `truncate` fires together with `hold/drop`, the truncated text is silently discarded	✅ Fixed (explicitly returns `text=""` when `blocked_action is not None`)
F6	The return type annotation of `_calc_finish_reason` includes `"repeat_truncate"`	✅ Fixed (annotation updated to `Literal["stop", "length", "tool_calls", "recover_stop"]`)
F7	`asdict(output)` performs a deep copy on every streaming delta	⚠️ Still present
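Finding F2 concerns requests with several choices (n > 1). Once one choice is truncated and aborted, late outputs for that choice must be dropped while the other choices keep streaming. Below is a minimal sketch of that filtering. `ResponseCtx` and `filter_outputs` are hypothetical stand-ins for `response_ctx.truncated_choices` and the filtering in `serving_base.py`.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class ResponseCtx:
    """Hypothetical per-request context holding truncated choice indices."""
    truncated_choices: Set[int] = field(default_factory=set)


def filter_outputs(ctx: ResponseCtx, outputs: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Drop outputs for choices that were already truncated; keep the rest in order."""
    kept = []
    for index, text in outputs:
        if index in ctx.truncated_choices:
            continue  # residual output that arrived after the abort
        kept.append((index, text))
    return kept
```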

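📝 PR conventions check

The PR title [Feature]Add output fallback support for OpenAI serving contains two tags ([Feature] and [APIServer]). Per checklist rule D1, a title may contain only one official tag.

Suggested title (ready to copy):

[Feature] Add output fallback support for OpenAI serving

The PR description is well structured. It includes all required sections (Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist) with substantive content, and the checklist state matches the actual changes. The description needs no changes.

Overall assessment

The PR is clearly designed overall. Six of the seven earlier findings have been fixed, and the core framework logic (strategy chain, state management, cleanup) is implemented correctly. One new question remains: the truncate action returned by on_finish is not consumed by the serving layer. Please confirm the intended semantics. F7 (asdict performance) still needs optimization.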

PaddlePaddle-bot · 2026-05-29T08:04:09Z

+                try:
+                    decision = strategy.on_delta(pending, replace(flush_context, delta_text=pending), state)
+                except Exception:
+                    data_processor_logger.exception(


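❓ Question: when on_finish returns action="truncate", every caller (serving_chat.py, serving_completion.py, etc.) checks only finish_decision.text, not finish_decision.action. As a result, the truncate semantics are silently ignored.

If a strategy returns truncate from on_finish, the flushed text is still sent, but no abort is triggered. Please confirm whether this is intended, or whether callers should also handle action=="truncate".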
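Here is one possible answer to this question, sketched under assumptions. Generation has already ended when `on_finish` runs, so there is nothing left to abort. A caller could therefore still emit the flushed text and change only the reported finish reason, using `"length"` to stay consistent with F4. `FinishDecision` and `finish_reason_after_flush` are hypothetical, not FastDeploy code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishDecision:
    """Hypothetical stand-in for the decision returned by on_finish."""
    action: str  # e.g. "flush" or "truncate"
    text: str = ""


def finish_reason_after_flush(decision: FinishDecision, base_reason: str, emitted: List[str]) -> str:
    # Flushed text is emitted either way, matching the current behavior.
    if decision.text:
        emitted.append(decision.text)
    if decision.action == "truncate":
        # Generation is already over, so no abort is needed; only the reason changes.
        return "length"
    return base_reason
```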

luukunn added 3 commits May 26, 2026 11:47

first commit

4594e20

add markdown&repeat

475342d

fix

ed019d0

Copilot AI review requested due to automatic review settings May 27, 2026 10:02

luukunn had a problem deploying to Metax_ci May 27, 2026 10:03 — with GitHub Actions Failure

Copilot started reviewing on behalf of luukunn May 27, 2026 10:03

Copilot AI reviewed May 27, 2026

fix review & unit test

fb81e76

luukunn had a problem deploying to Metax_ci May 27, 2026 12:10 — with GitHub Actions Failure


fix review

09a9344

luukunn had a problem deploying to Metax_ci May 27, 2026 12:45 — with GitHub Actions Failure

Merge branch 'develop' into fallback

70527e5

EmmonsCurse had a problem deploying to Metax_ci May 27, 2026 12:50 — with GitHub Actions Failure

luukunn requested a review from Copilot May 27, 2026 12:57

Copilot started reviewing on behalf of luukunn May 27, 2026 12:58

luukunn added 2 commits May 28, 2026 14:47

add unit test

630f519

Merge branch 'fallback' of https://github.com/luukunn/FastDeploy into…

7ba1b73

… fallback

luukunn had a problem deploying to Metax_ci May 28, 2026 06:48 — with GitHub Actions Failure


luukunn requested a review from Copilot May 28, 2026 07:09

Copilot started reviewing on behalf of luukunn May 28, 2026 07:09

luukunn changed the title ~~[Feature][APIServer] Add output fallback support for OpenAI serving~~ [Feature]Add output fallback support for OpenAI serving May 28, 2026

fix review

36e8309

luukunn had a problem deploying to Metax_ci May 28, 2026 09:37 — with GitHub Actions Failure


fix review

8a41fb8

Copilot AI review requested due to automatic review settings May 28, 2026 12:14

luukunn had a problem deploying to Metax_ci May 28, 2026 12:14 — with GitHub Actions Failure

Copilot started reviewing on behalf of luukunn May 28, 2026 12:14

luukunn requested a review from Copilot May 28, 2026 12:25

Copilot started reviewing on behalf of luukunn May 28, 2026 12:26

Copilot AI reviewed May 28, 2026

add unit test

6355b4a

luukunn had a problem deploying to Metax_ci May 29, 2026 07:55 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 29, 2026


if fallback_truncated:
    choice.finish_reason = "repeat_truncate"


5 participants
