[XPU] [model]support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip layer mix quant by zccjjj · Pull Request #7924 · PaddlePaddle/FastDeploy

zccjjj · 2026-05-25T13:54:25Z

…yer mix quant

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot · 2026-05-25T13:54:36Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-25T14:12:13Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 11:10:14

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 0c310a5
Merge base: cc413e0 (branch: develop)
View full diff
CI details

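1 Job overview

One Required job failed, so merging is not recommended for now. The first priority is the diff coverage shortfall in Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage. Optional failures do not block merging but can be reviewed as needed.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 35 | 6 | 0 | 0 | 0 |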

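2 Job status summary

2.1 Required jobs: 9/10 passed

Required jobs block merging; their failures must be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | `Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage` | 1h23m | PR issue: diff coverage is 29%; newly added branches are not covered | Add tests for the uncovered lines in the 3 changed files | Job | - |
| ✅ | The other 9 required jobs passed | - | - | - | - | - |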

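2.2 Optional jobs: 26/31 passed

Optional jobs do not block merging; their failures are for reference only. In this round, only the failed Required job was analyzed in depth.

| Status | Job | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | `xpu_coverage_report / xpu_coverage_combine` | 45s | Job | - |
| ❌ | `Run iluvatar Tests / run_iluvatar_cases` | 2m48s | Job | - |
| ❌ | `Check PR Template` | 20s | Job | - |
| ❌ | `CI_HPU` | 1h4m | Job | - |
| ❌ | `Trigger Jenkins for PR` | 1m58s | Job | - |
| ✅ | The other 26 optional jobs passed | - | - | - |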

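3 Failure details (Required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: coverage below threshold
Confidence: high
Root cause summary: diff coverage is 29%; newly added branches are not covered
Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases: none. The log shows that the unit tests passed (TEST_EXIT_CODE=0); the failure occurred in the Verify Code Coverage Threshold (80%) step.

Root cause details:
After generating the diff coverage report, CI exited with COVERAGE_EXIT_CODE=9. diff_coverage.json shows that this PR's diff coverage is 29%, below the 80% threshold. The uncovered lines are concentrated in three files added or modified by this PR: fastdeploy/model_executor/layers/quantization/__init__.py, fastdeploy/model_executor/utils.py, and fastdeploy/model_executor/layers/moe/moe.py. The failure is therefore attributed to missing unit test coverage for the logic this PR adds.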

Coverage breakdown:

| File | Diff coverage | Uncovered lines |
|---|---|---|
| `fastdeploy/model_executor/layers/quantization/__init__.py` | 50.0% | 283, 287 |
| `fastdeploy/model_executor/utils.py` | 37.5% | 443, 568, 572, 573, 575 |
| `fastdeploy/model_executor/layers/moe/moe.py` | 16.67% | 312-316, 319-321, 357-358 |

Key log lines:

All tests passed
Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
"total_num_lines": 24,
"total_num_violations": 17,
"total_percent_covered": 29,
"num_changed_lines": 154
##[error]Process completed with exit code 9.
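The 29% figure can be cross-checked against the per-file breakdown. The sketch below is not the CI's own script; the `FileDiffCoverage` type and the per-file changed-line counts (4, 8, 12) are assumptions, inferred from the reported percentages and uncovered-line lists so that they sum to the logged `total_num_lines` of 24 and `total_num_violations` of 17.

```python
from dataclasses import dataclass


@dataclass
class FileDiffCoverage:
    """Hypothetical record for one changed file in a diff coverage report."""
    path: str
    changed_lines: int    # executable lines touched by the PR (inferred)
    uncovered_lines: int  # of those, lines that no test executed


def diff_coverage_percent(files):
    """Return the share of changed lines that are covered, in percent."""
    total = sum(f.changed_lines for f in files)
    missed = sum(f.uncovered_lines for f in files)
    if total == 0:
        return 100.0  # nothing to cover
    return 100.0 * (total - missed) / total


# Changed-line counts are inferred, e.g. 2 uncovered lines at 50.0% implies 4 lines.
FILES = [
    FileDiffCoverage("fastdeploy/model_executor/layers/quantization/__init__.py", 4, 2),
    FileDiffCoverage("fastdeploy/model_executor/utils.py", 8, 5),
    FileDiffCoverage("fastdeploy/model_executor/layers/moe/moe.py", 12, 10),
]

THRESHOLD = 80.0  # threshold named in the CI step title
percent = diff_coverage_percent(FILES)
passed = percent >= THRESHOLD
print(f"diff coverage: {percent:.2f}% -> {'pass' if passed else 'fail'}")
```

Under these assumptions 7 of 24 changed lines are covered, which agrees with the logged value of 29 once the fraction is dropped.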

Suggested fixes:

- In tests/quantization/test_quantization_init.py, add a test case for the XPU platform branch: mock/patch current_platform.is_xpu() to return true and verify that get_quantization_config("kvcache") returns XPUKvCacheQuantConfig, covering __init__.py L283/L287.
- Add or extend unit tests for fastdeploy/model_executor/utils.py to cover the branch where v1_loader_support accepts w4a8 on XPU (L443), and the branch where rename_offline_ckpt_suffix_to_fd_suffix maps quant_weight and activation_scale to weight and in_scale under MoE w4a8/w4afp8 (L568/L572-L575).
- Extend the W4A8 MoE loading assertions in tests/layers/test_w4a8_moe.py to cover the reshape/cast/copy path of _load_in_scale_weight and the branch that enters the in_scale loader when SHARD_ID_TO_SHARDED_DIM is None (moe.py L312-L321, L357-L358).
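The first suggestion (patching the platform check so the XPU branch runs on any CI machine) could look roughly like the sketch below. It is self-contained on purpose: `_Platform`, `KvCacheQuantConfig`, and this simplified `get_quantization_config` are stand-ins written for illustration, not the FastDeploy implementations, and a real test would patch the project's own `current_platform` object instead.

```python
import unittest
from unittest import mock


class _Platform:
    """Stand-in for the real platform object; only is_xpu() matters here."""

    def is_xpu(self) -> bool:
        return False


current_platform = _Platform()


class KvCacheQuantConfig:
    """Stand-in for the default KV cache quantization config."""


class XPUKvCacheQuantConfig(KvCacheQuantConfig):
    """Stand-in for the XPU-specific config named in the report."""


def get_quantization_config(name: str):
    """Simplified stand-in: pick the config class for a quantization method."""
    if name != "kvcache":
        raise ValueError(f"unsupported quantization method: {name}")
    if current_platform.is_xpu():
        return XPUKvCacheQuantConfig
    return KvCacheQuantConfig


class TestKvCacheConfigSelection(unittest.TestCase):
    def test_xpu_branch(self):
        # Force the XPU branch without needing XPU hardware.
        with mock.patch.object(current_platform, "is_xpu", return_value=True):
            self.assertIs(get_quantization_config("kvcache"), XPUKvCacheQuantConfig)

    def test_default_branch(self):
        # Outside the patch, the default config is still selected.
        self.assertIs(get_quantization_config("kvcache"), KvCacheQuantConfig)
```

Patching with a context manager keeps the override local to one test, so the default-branch test is unaffected by the XPU one.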

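Suggested fix summary: add tests for the uncovered lines in the 3 changed files

Related changes: fastdeploy/model_executor/layers/quantization/__init__.py L279-L287; fastdeploy/model_executor/utils.py L439-L443, L565-L575; fastdeploy/model_executor/layers/moe/moe.py L310-L321, L355-L358

Link: view log

Note: the failed Required job in this round hit the cached historical analysis, so the full log was not downloaded again. The relevant changed files and test files were additionally read for cross-checking, and the conclusion is consistent with the current code.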

codecov-commenter · 2026-05-25T14:34:47Z

Codecov Report

❌ Patch coverage is 5.26316% with 72 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cc413e0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...cutor/layers/backends/xpu/quantization/kv_cache.py | 0.00% | 44 Missing ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 8.33% | 10 Missing and 1 partial ⚠️ |
| ...odel_executor/layers/backends/xpu/moe/fused_moe.py | 0.00% | 6 Missing ⚠️ |
| fastdeploy/model_executor/utils.py | 25.00% | 5 Missing and 1 partial ⚠️ |
| ...loy/model_executor/layers/quantization/__init__.py | 25.00% | 2 Missing and 1 partial ⚠️ |
| ...oy/model_executor/layers/backends/xpu/attention.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7924   +/-   ##
==========================================
  Coverage           ?   63.98%           
==========================================
  Files              ?      467           
  Lines              ?    65023           
  Branches           ?     9973           
==========================================
  Hits               ?    41605           
  Misses             ?    20592           
  Partials           ?     2826

| Flag | Coverage Δ |
|---|---|
| GPU | `73.13% <16.66%> (?)` |
| XPU | `7.07% <0.00%> (?)` |


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 18:50:20

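📋 Review summary

PR overview: adds W4A8 C8/C16 KV Cache quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, fixes a weight loading error in the TP4EP4 + PD Disaggregation scenario, and fixes the suffix mapping logic for skip-layer mixed quantization.
Scope of changes: layers/backends/xpu/, layers/moe/moe.py, layers/quantization/__init__.py, models/ernie4_5_moe.py, utils.py
Impact tags: [XPU] [Quantization] [Models]

Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | `kv_cache.py:62` | The new `self.has_zero_point` attribute is not consumed by `create_weights` and looks redundant |
| 🟡 Suggestion | `kv_cache.py:239` | The `process_weights_after_loading` formula changed from `1/scale` to `max_bound/scale`; this changes behavior on the C16 path, so kernel support needs to be confirmed |
| ❓ Question | `moe.py:309` | Typo in comment: `spport` → `support` |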

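📝 PR convention check

In the PR description, the four sections Motivation, Modifications, Usage or Command, and Accuracy Tests are all empty (only the template placeholders remain), and no Checklist item is ticked. This does not meet the PR description template requirements.

Suggested title (can be copied as-is):

[XPU] Support ERNIE4.5-MoE w4a8 C8/C16 kvcache quant + TP4EP4 PD disaggregation + skip-layer mixed quant

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):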

## Motivation
Add W4A8 quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, specifically:
1. C8 (channel-wise with zero point) and C16 (channel-wise without zero point) KV Cache quantization, with sharded loading of scale/zp in the TP4EP4 scenario;
2. Fix the bug where `cache_k_zp`/`cache_v_zp` were mistakenly read from `self` in the TP4EP4 + PD Disaggregation scenario;
3. Fix the weight suffix mapping logic for skip-layer mixed quantization (skip-layer mix quant) (if-if → if-elif).

## Modifications
- `attention.py`: read `cache_k_zp`/`cache_v_zp` from `layer` instead of `self` (Bug Fix); in the C8 case, cast zp to bfloat16 before passing it to the kernel
- `kv_cache.py`: refactor `create_weights`; add `_tp_shard_along_kv_heads` to load channel-wise scale/zp in shards under TP; make `process_weights_after_loading` use the `max_bound / scale` formula throughout; store `has_zero_point` in `XPUKvCacheQuantConfig.__init__`
- `fused_moe.py` (XPU): in the W4A8 case, add a `weight_loader` for the `up_gate_proj`/`down_proj` weights and scales; mark in_scale with `SHARD_ID_TO_SHARDED_DIM={"gate":None,"up":None,"down":None}`
- `moe.py`: add a `_load_in_scale_weight` method that loads MoE in_scale by `expert_id`; in `weight_loader`, take the in_scale path when all shard dimensions are None
- `quantization/__init__.py`: on the XPU platform, replace the `kvcache` quantization config with `XPUKvCacheQuantConfig`
- `ernie4_5_moe.py`: add the `down_proj_in_scale` → `down_proj.in_scale` weight mapping
- `utils.py`: add a `math.prod` equality pre-check to the reshape condition; remove the w4a8-unsupported restriction on XPU; change the suffix mapping from if-if to an if-elif chain and add a w4a8 mapping branch

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

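Overall assessment

The overall approach is clear and the bug fixes go in the right direction (self → layer, if → elif). Please confirm whether the XPU Attention Kernel has been updated in step with the process_weights_after_loading formula change on the C16 non-channel-wise path, and clarify whether the redundant self.has_zero_point field has a planned use. There are no blocking issues; the PR can be merged once the questions above are confirmed.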
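Why an if-if chain is risky for suffix mapping can be shown with a small sketch. This is not the code from `utils.py`: the first two rules use the suffix pairs named in the CI report (`quant_weight` → `weight`, `activation_scale` → `in_scale`), while the third rule and both function names are invented purely to trigger the failure mode.

```python
# Ordered (old_suffix, new_suffix) rules.
RULES = [
    ("quant_weight", "weight"),        # pair named in the CI report
    ("activation_scale", "in_scale"),  # pair named in the CI report
    ("weight", "weight_fd"),           # hypothetical rule, for illustration only
]


def rename_if_if(key: str) -> str:
    """Independent checks: every matching rule fires, one after another."""
    for old, new in RULES:
        if key.endswith(old):
            key = key[: -len(old)] + new  # later rules see the renamed key
    return key


def rename_if_elif(key: str) -> str:
    """Mutually exclusive checks: only the first matching rule fires."""
    for old, new in RULES:
        if key.endswith(old):
            return key[: -len(old)] + new
    return key


key = "experts.0.up_gate_proj.quant_weight"
print(rename_if_if(key))    # renamed twice
print(rename_if_elif(key))  # renamed once
```

With independent checks, the output of the first rule matches the third rule and is renamed a second time. The exclusive chain stops after the first match, which is the behavior an if-elif chain guarantees.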

PaddlePaddle-bot · 2026-05-26T10:55:20Z

        super().__init__()
        self.kv_cache_quant_type = kv_cache_quant_type
        self.is_channel_wise = is_channel_wise
+        self.has_zero_point = has_zero_point


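❓ Question: the newly added self.has_zero_point = has_zero_point attribute looks redundant.

The create_weights method still uses self.cache_quant_config.has_zero_point rather than self.has_zero_point, so the new field is never consumed. Please confirm:

Is it reserved for future use? If so, consider adding a comment that says so.

Or should create_weights be changed to use self.has_zero_point consistently?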

PaddlePaddle-bot · 2026-05-26T10:55:21Z

        # cache_k_out_scale is the reciprocal of cache_k_scale
        if layer.cache_k_scale._is_initialized():
-            layer.cache_k_out_scale.set_value(1 / layer.cache_k_scale)  # cache_k_out_scale
+            layer.cache_k_out_scale.set_value(


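🟡 Suggestion: in process_weights_after_loading, the formula changed from 1 / scale to max_bound / scale, which is a behavior change for the non-channel-wise (is_channel_wise=False) path.

C8 (channel-wise) path: _kv_scale_weight_loader already stores cache_k_scale as max_bound / raw_scale, so applying max_bound / scale here restores raw_scale; the logic is self-consistent.

C16 (non-channel-wise) path: cache_k_scale holds the original raw_scale, so the result here becomes max_bound / raw_scale, which no longer matches the previous 1 / raw_scale.

Please confirm that the XPU Attention Kernel that consumes cache_k_out_scale on the C16 path has been adapted to the new formula, to avoid inference accuracy anomalies.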
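The arithmetic behind this concern can be checked with plain numbers. The sketch below only mirrors the two formulas described in the comment; `MAX_BOUND = 127.0` and `raw_scale = 0.5` are assumed example values, and the helper names are hypothetical rather than taken from `kv_cache.py`.

```python
MAX_BOUND = 127.0  # assumed bound for illustration; the real value depends on the quant type


def out_scale_old(stored_scale: float) -> float:
    """Previous formula: reciprocal of the stored scale."""
    return 1.0 / stored_scale


def out_scale_new(stored_scale: float, max_bound: float = MAX_BOUND) -> float:
    """New formula: max_bound divided by the stored scale."""
    return max_bound / stored_scale


raw_scale = 0.5  # assumed example value

# C8 path: the loader is described as storing max_bound / raw_scale.
c8_stored = MAX_BOUND / raw_scale
c8_out = out_scale_new(c8_stored)    # recovers raw_scale

# C16 path: the stored value is described as the raw scale itself.
c16_stored = raw_scale
c16_old = out_scale_old(c16_stored)  # previous behavior
c16_new = out_scale_new(c16_stored)  # behavior after the change

print(f"C8  out_scale: {c8_out}")
print(f"C16 out_scale: old={c16_old}, new={c16_new}, ratio={c16_new / c16_old}")
```

Under these assumptions the C8 result equals `raw_scale`, while the C16 result grows by a factor of `max_bound` relative to the old formula, which is why the consuming kernel needs to expect the new convention.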

PaddlePaddle-bot · 2026-05-26T10:55:21Z

@@ -307,6 +307,19 @@ def __init__(
            tp_size={self.tp_size}."
        )



❓ Question: typo in comment: only spport ernie now → only support ernie now.

[XPU] support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip la…

0c310a5

…yer mix quant

zccjjj had a problem deploying to Metax_ci with GitHub Actions on May 25, 2026 13:54 (status: Failure)

zccjjj changed the title ~~[XPU] support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip la…~~ [XPU] [model]support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip layer mix quant May 26, 2026


PaddlePaddle-bot reviewed May 26, 2026

zccjjj wants to merge 1 commit into PaddlePaddle:develop from zccjjj:skipdev


3 participants
