
[XPU] [model]support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip layer mix quant #7924

Open
zccjjj wants to merge 1 commit into PaddlePaddle:develop from zccjjj:skipdev

Conversation

@zccjjj
Contributor

@zccjjj zccjjj commented May 25, 2026

…yer mix quant

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 25, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented May 25, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 11:10:14

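This CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

One Required task failed, so merging is not recommended for now. First fix the diff-coverage shortfall in Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage. Optional failures do not block merging but can be looked at as needed.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 35 | 6 | 0 | 0 | 0 |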

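2 Task status summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; fix their failures first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h23m | PR issue: diff coverage is 29%; newly added branches are not covered | Add tests for the under-covered lines in the 3 changed files | Job | - |
| ✅ | The other 9 required tasks passed | - | - | - | - | - |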

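2.2 Optional tasks: 26/31 passed

Optional tasks do not block merging and their failures are informational only; this round analyzes only the failed Required task in depth.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | xpu_coverage_report / xpu_coverage_combine | 45s | Job | - |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 2m48s | Job | - |
| ❌ | Check PR Template | 20s | Job | - |
| ❌ | CI_HPU | 1h4m | Job | - |
| ❌ | Trigger Jenkins for PR | 1m58s | Job | - |
| ✅ | The other 26 optional tasks passed | - | - | - |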

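3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: diff coverage is 29%; newly added branches are not covered
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases: none. The log shows that the unit tests passed (TEST_EXIT_CODE=0); the failure occurred in the Verify Code Coverage Threshold (80%) step.

Root cause details:
CI exited with COVERAGE_EXIT_CODE=9 after generating the diff coverage report. diff_coverage.json shows that this PR's diff coverage is 29%, below the 80% threshold. The uncovered lines are concentrated in three files added or modified by this PR: fastdeploy/model_executor/layers/quantization/__init__.py, fastdeploy/model_executor/utils.py, and fastdeploy/model_executor/layers/moe/moe.py. The failure is therefore attributed to missing unit-test coverage for the PR's new logic.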

Coverage details:

| File | Diff coverage | Uncovered lines |
|---|---|---|
| fastdeploy/model_executor/layers/quantization/__init__.py | 50.0% | 283, 287 |
| fastdeploy/model_executor/utils.py | 37.5% | 443, 568, 572, 573, 575 |
| fastdeploy/model_executor/layers/moe/moe.py | 16.67% | 312-316, 319-321, 357-358 |

Key log lines:

All tests passed
Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
"total_num_lines": 24,
"total_num_violations": 17,
"total_percent_covered": 29,
"num_changed_lines": 154
##[error]Process completed with exit code 9.
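The figures in the coverage table and in the log can be cross-checked with a little arithmetic. The sketch below counts the uncovered lines listed per file and recomputes the overall percentage from the log's `total_num_lines`. The per-file range tuples are transcribed from the table; whether the coverage tool truncates or rounds is an assumption that does not matter here, since 7/24 is about 29.17% either way.

```python
# Uncovered line ranges per file, transcribed from the coverage table above.
UNCOVERED = {
    "fastdeploy/model_executor/layers/quantization/__init__.py": [(283, 283), (287, 287)],
    "fastdeploy/model_executor/utils.py": [(443, 443), (568, 568), (572, 573), (575, 575)],
    "fastdeploy/model_executor/layers/moe/moe.py": [(312, 316), (319, 321), (357, 358)],
}


def count_lines(ranges):
    """Number of lines covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


violations = {path: count_lines(ranges) for path, ranges in UNCOVERED.items()}
total_violations = sum(violations.values())  # should match "total_num_violations"

total_lines = 24  # "total_num_lines" from the log
covered = total_lines - total_violations
percent = int(covered / total_lines * 100)  # truncation assumed; rounding agrees

print(violations)
print(total_violations, percent)
```

The per-file counts (2, 5 and 10 uncovered lines) add up to the 17 violations in the log, and 7 covered lines out of 24 gives the reported 29%.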

Suggested fixes:

  1. In tests/quantization/test_quantization_init.py, add a test case for the XPU platform branch: mock/patch current_platform.is_xpu() to return true and verify that get_quantization_config("kvcache") returns XPUKvCacheQuantConfig, covering __init__.py L283/L287.
  2. Add or extend unit tests for fastdeploy/model_executor/utils.py to cover the branch where v1_loader_support supports w4a8 on XPU (L443), and the branch where rename_offline_ckpt_suffix_to_fd_suffix maps quant_weight and activation_scale to weight and in_scale under MoE w4a8/w4afp8 (L568/L572-L575).
  3. Extend the W4A8 MoE loading assertions in tests/layers/test_w4a8_moe.py to cover the reshape/cast/copy path of _load_in_scale_weight and the branch that enters the in_scale loader when SHARD_ID_TO_SHARDED_DIM is None (moe.py L312-L321, L357-L358).
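The first suggestion, patching the platform check so the XPU branch runs on any machine, could look roughly like the sketch below. It is self-contained and does not import FastDeploy: `_Platform`, `DefaultKvCacheQuantConfig`, and the toy `get_quantization_config` dispatcher are hypothetical stand-ins, and only the names `current_platform.is_xpu`, `get_quantization_config`, and `XPUKvCacheQuantConfig` come from the report. A real test would patch the actual module attribute instead.

```python
from unittest import mock


class _Platform:
    """Hypothetical stand-in for the object behind current_platform."""

    def is_xpu(self) -> bool:
        return False


class DefaultKvCacheQuantConfig:
    """Hypothetical name for the non-XPU KV cache config."""


class XPUKvCacheQuantConfig:
    """Stand-in that shares only its name with the class in the report."""


current_platform = _Platform()


def get_quantization_config(name: str):
    """Toy dispatcher for illustration, not FastDeploy's implementation."""
    if name != "kvcache":
        raise ValueError(f"unknown quantization method: {name}")
    if current_platform.is_xpu():
        return XPUKvCacheQuantConfig
    return DefaultKvCacheQuantConfig


def test_kvcache_config_on_xpu():
    # Force the XPU branch without needing XPU hardware.
    with mock.patch.object(current_platform, "is_xpu", return_value=True):
        assert get_quantization_config("kvcache") is XPUKvCacheQuantConfig
    # Outside the patch, the original behaviour is restored.
    assert get_quantization_config("kvcache") is DefaultKvCacheQuantConfig


test_kvcache_config_on_xpu()
```

Patching the method on the shared platform object keeps the test independent of the hardware the CI runner happens to have, which is what the GPU coverage job needs in order to reach the XPU-only lines.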

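Fix summary: add tests for the under-covered lines in the 3 changed files

Related changes: fastdeploy/model_executor/layers/quantization/__init__.py L279-L287; fastdeploy/model_executor/utils.py L439-L443, L565-L575; fastdeploy/model_executor/layers/moe/moe.py L310-L321, L355-L358

Link: View log


Note: the failed Required task in this round hit the historical analysis cache, so the full log was not downloaded again. The relevant changed files and test files were read additionally as a cross-check, and the conclusion is consistent with the current code.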

@codecov-commenter

codecov-commenter commented May 25, 2026

Codecov Report

❌ Patch coverage is 5.26316% with 72 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cc413e0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...cutor/layers/backends/xpu/quantization/kv_cache.py | 0.00% | 44 Missing ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 8.33% | 10 Missing and 1 partial ⚠️ |
| ...odel_executor/layers/backends/xpu/moe/fused_moe.py | 0.00% | 6 Missing ⚠️ |
| fastdeploy/model_executor/utils.py | 25.00% | 5 Missing and 1 partial ⚠️ |
| ...loy/model_executor/layers/quantization/__init__.py | 25.00% | 2 Missing and 1 partial ⚠️ |
| ...oy/model_executor/layers/backends/xpu/attention.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7924   +/-   ##
==========================================
  Coverage           ?   63.98%           
==========================================
  Files              ?      467           
  Lines              ?    65023           
  Branches           ?     9973           
==========================================
  Hits               ?    41605           
  Misses             ?    20592           
  Partials           ?     2826           
| Flag | Coverage Δ |
|---|---|
| GPU | 73.13% <16.66%> (?) |
| XPU | 7.07% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@zccjjj zccjjj changed the title from "[XPU] support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip la…" to "[XPU] [model]support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip layer mix quant" on May 26, 2026


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-26 18:50:20

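📋 Review summary

PR overview: adds W4A8 C8/C16 KV cache quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, fixes a weight-loading error in the TP4EP4 + PD Disaggregation scenario, and fixes the suffix mapping logic for skip-layer mixed quantization.
Changed scope: layers/backends/xpu/, layers/moe/moe.py, layers/quantization/__init__.py, models/ernie4_5_moe.py, utils.py
Impact tags: [XPU] [Quantization] [Models]

Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | kv_cache.py:62 | The new attribute self.has_zero_point is not consumed by create_weights and looks redundant |
| 🟡 Suggestion | kv_cache.py:239 | process_weights_after_loading changes the formula from 1/scale to max_bound/scale; this changes behavior on the C16 path, so kernel support needs to be confirmed |
| ❓ Question | moe.py:309 | Comment typo: spport → support |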

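📝 PR convention check

The four sections Motivation, Modifications, Usage or Command, and Accuracy Tests in the PR description are all empty (only the template placeholders remain), and no Checklist item is ticked, so the description does not meet the PR template requirements.

Suggested title (can be copied directly):

  • [XPU] Support ERNIE4.5-MoE w4a8 C8/C16 kvcache quant + TP4EP4 PD disaggregation + skip-layer mixed quant

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):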

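## Motivation
Add W4A8 quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, specifically:
1. C8 (channel-wise with zero point) and C16 (channel-wise without zero point) KV cache quantization, with sharded loading of scale/zp in the TP4EP4 scenario;
2. Fix a bug in the TP4EP4 + PD Disaggregation scenario where `cache_k_zp`/`cache_v_zp` were mistakenly read from `self`;
3. Fix the weight suffix mapping logic (if-if → if-elif) in the skip-layer mixed quantization (skip-layer mix quant) scenario.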

## Modifications
- `attention.py`: read `cache_k_zp`/`cache_v_zp` from `layer` instead of `self` (bug fix); in the C8 scenario, convert zp to bfloat16 before passing it to the kernel
- `kv_cache.py`: refactor `create_weights` and add `_tp_shard_along_kv_heads` to shard-load channel-wise scale/zp under TP; `process_weights_after_loading` now uniformly uses the `max_bound / scale` formula; `XPUKvCacheQuantConfig.__init__` additionally stores `has_zero_point`
- `fused_moe.py` (XPU): in the W4A8 scenario, add a `weight_loader` for the `up_gate_proj`/`down_proj` weights and scales; mark in_scale with `SHARD_ID_TO_SHARDED_DIM={"gate":None,"up":None,"down":None}`
- `moe.py`: add the `_load_in_scale_weight` method, which loads MoE in_scale by `expert_id`; `weight_loader` takes the in_scale path when every shard dimension is None
- `quantization/__init__.py`: on the XPU platform, replace the `kvcache` quantization config with `XPUKvCacheQuantConfig`
- `ernie4_5_moe.py`: add the `down_proj_in_scale` → `down_proj.in_scale` weight mapping
- `utils.py`: add a `math.prod` equality pre-check to the reshape condition; remove the XPU restriction that w4a8 is unsupported; fix the suffix mapping from if-if to an if-elif chain and add a w4a8 mapping branch

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
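
The suggested description mentions `_tp_shard_along_kv_heads`, which shards channel-wise scale/zp under tensor parallelism. The PR's actual implementation is not shown in this thread, so the following is only a minimal sketch of the general idea under stated assumptions: the channel-wise values are laid out as a flat sequence of `num_kv_heads * head_dim` entries, KV heads divide evenly across ranks, and each rank keeps a contiguous block of heads. The function name and all parameters are hypothetical.

```python
def shard_along_kv_heads(values, num_kv_heads, head_dim, tp_rank, tp_size):
    """Return the slice of per-channel values owned by one TP rank.

    Assumes a flat [num_kv_heads * head_dim] layout and contiguous
    head ownership per rank (both are assumptions for illustration).
    """
    if num_kv_heads % tp_size != 0:
        raise ValueError("num_kv_heads must be divisible by tp_size")
    if len(values) != num_kv_heads * head_dim:
        raise ValueError("values must hold one entry per (head, channel)")
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank out of range")

    heads_per_rank = num_kv_heads // tp_size
    start = tp_rank * heads_per_rank * head_dim
    end = start + heads_per_rank * head_dim
    return values[start:end]


# 8 KV heads with 2 channels each, split across 4 ranks: 2 heads per rank.
full_scale = list(range(16))
rank1 = shard_along_kv_heads(full_scale, num_kv_heads=8, head_dim=2, tp_rank=1, tp_size=4)
print(rank1)
```

With this layout, rank 1 owns heads 2 and 3, that is, the entries at flat indices 4 through 7.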

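Overall assessment

The overall approach is clear and the bug fixes go in the right direction (self → layer, if → elif). Please confirm whether the XPU attention kernel has been adapted to the changed process_weights_after_loading formula on the C16 non-channel-wise path, and clarify whether the redundant self.has_zero_point field has a later use. There are no blocking issues; the PR can be merged once these questions are resolved.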
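
Why an if → if chain can misbehave where an if → elif chain does not is easiest to see in a toy rename function. The sketch below is not the code from `utils.py`: the `quant_weight` → `weight` and `activation_scale` → `in_scale` pairs come from the CI report, while the second rule (`.weight` → `.qweight`) is a made-up suffix chosen purely to show how a name rewritten by one branch can be caught again by the next.

```python
def _swap_suffix(name, old, new):
    """Replace a trailing `old` with `new` (caller guarantees the match)."""
    return name[: -len(old)] + new


def rename_if_if(name):
    """Independent ifs: a later check can re-match an already renamed key."""
    if name.endswith(".quant_weight"):
        name = _swap_suffix(name, ".quant_weight", ".weight")
    if name.endswith(".weight"):  # also fires on the name rewritten above
        name = _swap_suffix(name, ".weight", ".qweight")  # hypothetical rule
    return name


def rename_if_elif(name):
    """elif chain: at most one rule is applied per key."""
    if name.endswith(".quant_weight"):
        name = _swap_suffix(name, ".quant_weight", ".weight")
    elif name.endswith(".activation_scale"):
        name = _swap_suffix(name, ".activation_scale", ".in_scale")
    elif name.endswith(".weight"):
        name = _swap_suffix(name, ".weight", ".qweight")  # hypothetical rule
    return name


key = "experts.0.down_proj.quant_weight"
print(rename_if_if(key))
print(rename_if_elif(key))
```

In the if → if version the key is renamed twice and ends in `.qweight`; in the elif version it stops at `.weight`, which is the behaviour a one-rule-per-key mapping needs.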

super().__init__()
self.kv_cache_quant_type = kv_cache_quant_type
self.is_channel_wise = is_channel_wise
self.has_zero_point = has_zero_point

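❓ Question: the newly added attribute self.has_zero_point = has_zero_point looks redundant.

The create_weights method still uses self.cache_quant_config.has_zero_point rather than self.has_zero_point, so the new field is never consumed. Please confirm:

  • Is it reserved for future use? If so, add a comment saying so.
  • Or should create_weights be changed to use self.has_zero_point consistently?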

# cache_k_out_scale is the reciprocal of cache_k_scale
if layer.cache_k_scale._is_initialized():
layer.cache_k_out_scale.set_value(1 / layer.cache_k_scale) # cache_k_out_scale
layer.cache_k_out_scale.set_value(

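🟡 Suggestion: in process_weights_after_loading the formula changes from 1 / scale to max_bound / scale, which is a behavior change for the non-channel-wise (is_channel_wise=False) path.

  • C8 (channel-wise) path: _kv_scale_weight_loader already stores cache_k_scale as max_bound / raw_scale, so applying max_bound / scale here restores raw_scale, which is self-consistent.
  • C16 (non-channel-wise) path: cache_k_scale stores the original raw_scale, so the result here becomes max_bound / raw_scale, which no longer matches the previous 1 / raw_scale.

Please confirm that the XPU attention kernel consuming cache_k_out_scale on the C16 path has been adapted to the new formula, to avoid inference accuracy problems.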
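
The two cases in this comment can be checked numerically. The sketch below follows the comment's description of what each path stores; `max_bound = 127.0` and `raw_scale = 0.5` are illustrative values chosen here, not values taken from the PR.

```python
max_bound = 127.0  # assumed bound for illustration only
raw_scale = 0.5    # arbitrary example scale

# C8 (channel-wise) path: the loader is described as storing max_bound / raw_scale.
stored_c8 = max_bound / raw_scale
out_c8_new = max_bound / stored_c8  # new formula recovers raw_scale

# C16 (non-channel-wise) path: the raw scale is described as stored unchanged.
stored_c16 = raw_scale
out_c16_old = 1 / stored_c16          # previous formula
out_c16_new = max_bound / stored_c16  # new formula

print(out_c8_new, out_c16_old, out_c16_new)
print(out_c16_new / out_c16_old)  # the two C16 results differ by a factor of max_bound
```

On the C8 path the new formula returns the raw scale, while on the C16 path the output scale grows by a factor of `max_bound` compared with the old formula, which is why the consuming kernel has to expect the new convention.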

@@ -307,6 +307,19 @@ def __init__(
tp_size={self.tp_size}."
)


❓ Question: comment typo: only spport ernie now → only support ernie now
