[XPU] [model] support yiyan model w4a8C8/C16+TP4EP4/PD disaggregation+skip layer mix quant #7924
zccjjj wants to merge 1 commit into
Thanks for your contribution!
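CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
One Required task has failed, so merging is not recommended for now; the failure must be handled first.
2 Task status summary
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 26/31 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)
Failed cases: none. The log shows the unit tests passed (
Root cause details: Coverage breakdown:
Key logs: Fix suggestions:
Fix suggestion summary: add tests for the under-covered lines in the 3 changed files. Related changes: Link: view logs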
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7924 +/- ##
==========================================
Coverage ? 63.98%
==========================================
Files ? 467
Lines ? 65023
Branches ? 9973
==========================================
Hits ? 41605
Misses ? 20592
Partials ? 2826
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-26 18:50:20
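📋 Review summary
PR overview: Adds W4A8 C8/C16 KV Cache quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, fixes a weight-loading error in the TP4EP4 + PD Disaggregation scenario, and fixes the suffix mapping logic for skip-layer mixed quantization.
Change scope: layers/backends/xpu/, layers/moe/moe.py, layers/quantization/__init__.py, models/ernie4_5_moe.py, utils.py
Impact tags: [XPU] [Quantization] [Models]
Issues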
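| Level | File | Summary |
|---|---|---|
| ❓ Question | kv_cache.py:62 | The new self.has_zero_point attribute is not consumed by create_weights and appears redundant |
| 🟡 Suggestion | kv_cache.py:239 | The process_weights_after_loading formula changed from 1/scale to max_bound/scale; the behavior change on the C16 path requires confirming that the kernel has been adapted |
| ❓ Question | moe.py:309 | Comment typo: spport → support |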
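📝 PR convention check
In the PR description, the Motivation, Modifications, Usage or Command, and Accuracy Tests sections are all empty (only the template placeholders remain), and no Checklist item is ticked, so the description does not meet the PR description template requirements.
Suggested title (can be copied directly):
[XPU] Support ERNIE4.5-MoE w4a8 C8/C16 kvcache quant + TP4EP4 PD disaggregation + skip-layer mixed quant
Suggested PR description (can be copied directly; it must replicate the full structure of the checklist §D2 template):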
## Motivation
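Add W4A8 quantization support for the Wenxin ERNIE 4.5 MoE model on the Kunlunxin XPU platform, specifically:
1. C8 (channel-wise with zero point) and C16 (channel-wise without zero point) KV Cache quantization, with sharded loading of scale/zp in the TP4EP4 scenario;
2. Fix a bug in the TP4EP4 + PD Disaggregation scenario where `cache_k_zp`/`cache_v_zp` were mistakenly read from `self`;
3. Fix the weight suffix mapping logic in the skip-layer mixed quantization (skip-layer mix quant) scenario (if-if → if-elif).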
## Modifications
- `attention.py`: read `cache_k_zp`/`cache_v_zp` from `layer` instead of `self` (bug fix); in the C8 scenario, convert zp to bfloat16 before passing it to the kernel
- `kv_cache.py`: refactor `create_weights`; add `_tp_shard_along_kv_heads` to shard the loading of channel-wise scale/zp under TP; `process_weights_after_loading` now uniformly uses the `max_bound / scale` formula; `XPUKvCacheQuantConfig.__init__` additionally stores `has_zero_point`
- `fused_moe.py` (XPU): in the W4A8 scenario, add a `weight_loader` for the `up_gate_proj`/`down_proj` weights and scales; set the `SHARD_ID_TO_SHARDED_DIM={"gate":None,"up":None,"down":None}` marker for in_scale
- `moe.py`: add the `_load_in_scale_weight` method, which loads MoE in_scale by `expert_id`; `weight_loader` takes the in_scale path when it detects that all shard dimensions are None
- `quantization/__init__.py`: on the XPU platform, replace the `kvcache` quantization config with `XPUKvCacheQuantConfig`
- `ernie4_5_moe.py`: add the `down_proj_in_scale` → `down_proj.in_scale` weight mapping
- `utils.py`: add a `math.prod` equality pre-check to the reshape condition; remove the restriction that w4a8 is unsupported on XPU; fix the suffix mapping from if-if to an if-elif chain and add a w4a8 mapping branch
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
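- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The overall implementation approach is clear, and the bug fixes go in the right direction (self → layer, if → elif). Please confirm whether the process_weights_after_loading formula change on the C16 non-channel-wise path has been matched by a corresponding adaptation of the XPU Attention Kernel, and clarify whether the redundant self.has_zero_point field has a planned use. There are no blocking issues; the PR can be merged once these questions are confirmed.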
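To illustrate why the if-if → if-elif change matters, here is a minimal sketch. It is not the code from `utils.py`: the function names, the `is_skipped_layer` flag, and the suffix strings are all hypothetical and chosen only to show the pattern. With two independent `if` statements, a later condition can overwrite the mapping chosen by an earlier one; an if-elif chain selects exactly one branch.

```python
# Hypothetical sketch; names and suffix strings are illustrative only,
# not taken from the PR.

def map_suffix_if_if(quant_type: str, is_skipped_layer: bool) -> str:
    """Buggy pattern: both `if` statements are evaluated independently."""
    suffix = "weight"
    if is_skipped_layer:
        suffix = "weight"  # a skipped layer should keep its plain weight
    if quant_type == "w4a8":
        suffix = "quant_weight"  # runs even for skipped layers and overwrites
    return suffix


def map_suffix_if_elif(quant_type: str, is_skipped_layer: bool) -> str:
    """Fixed pattern: the first matching branch wins, later ones are skipped."""
    if is_skipped_layer:
        suffix = "weight"
    elif quant_type == "w4a8":
        suffix = "quant_weight"
    else:
        suffix = "weight"
    return suffix


if __name__ == "__main__":
    # A skipped layer inside a w4a8 model is where the two versions disagree.
    print(map_suffix_if_if("w4a8", True))
    print(map_suffix_if_elif("w4a8", True))
```

The two functions agree on every input except a skipped layer in a w4a8 model, which is the mixed-quantization case the review refers to.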
    super().__init__()
    self.kv_cache_quant_type = kv_cache_quant_type
    self.is_channel_wise = is_channel_wise
    self.has_zero_point = has_zero_point
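❓ Question: The newly added self.has_zero_point = has_zero_point attribute appears redundant.
The create_weights method still uses self.cache_quant_config.has_zero_point rather than self.has_zero_point, so the new field is never consumed. Please confirm:
- Is it reserved for future use? If so, consider adding a comment that says so.
- Or should create_weights be changed to use self.has_zero_point consistently?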
    # cache_k_out_scale is the reciprocal of cache_k_scale
    if layer.cache_k_scale._is_initialized():
        layer.cache_k_out_scale.set_value(1 / layer.cache_k_scale)  # cache_k_out_scale
        layer.cache_k_out_scale.set_value(
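🟡 Suggestion: In process_weights_after_loading, the formula changed from 1 / scale to max_bound / scale, which is a behavior change for the non-channel-wise (is_channel_wise=False) path.
- C8 (channel-wise) path: _kv_scale_weight_loader already stores cache_k_scale as max_bound / raw_scale, so applying max_bound / scale here recovers raw_scale; the logic is self-consistent.
- C16 (non-channel-wise) path: cache_k_scale stores the original raw_scale, so the result here becomes max_bound / raw_scale, which no longer matches the previous 1 / raw_scale.
Please confirm that the XPU Attention Kernel consuming cache_k_out_scale on the C16 path has been adapted to the new formula, to avoid inference accuracy anomalies.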
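The arithmetic behind the two bullets can be checked with a small worked example. This is a sketch under stated assumptions, not the PR's code: `MAX_BOUND = 127.0` is an assumed int8-style bound (the actual value used in kv_cache.py is not shown in this review), and `out_scale_old` / `out_scale_new` are hypothetical helpers standing in for the old and new formulas.

```python
import math

MAX_BOUND = 127.0  # assumption: int8-style bound; the PR's real value is not shown here


def out_scale_old(stored_scale: float) -> float:
    """Old formula: out_scale = 1 / scale."""
    return 1.0 / stored_scale


def out_scale_new(stored_scale: float, max_bound: float = MAX_BOUND) -> float:
    """New formula: out_scale = max_bound / scale."""
    return max_bound / stored_scale


raw_scales = [0.5, 2.0, 4.0]  # made-up example values

# C8 path as described in the review: the loader stores max_bound / raw_scale,
# so the new formula maps the stored value back to raw_scale.
c8_stored = [MAX_BOUND / s for s in raw_scales]
c8_out = [out_scale_new(s) for s in c8_stored]

# C16 path as described in the review: the loader stores raw_scale itself,
# so old and new formulas differ by a constant factor of max_bound.
c16_old = [out_scale_old(s) for s in raw_scales]
c16_new = [out_scale_new(s) for s in raw_scales]
ratios = [new / old for new, old in zip(c16_new, c16_old)]

if __name__ == "__main__":
    print("C8 round trip:", c8_out)
    print("C16 new/old ratio:", ratios)
```

Under these assumptions, a kernel on the C16 path that still expects 1 / raw_scale would receive values larger by a factor of max_bound, which is the mismatch the suggestion asks the author to rule out.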
@@ -307,6 +307,19 @@ def __init__(
    tp_size={self.tp_size}."
)
❓ Question: Comment typo: only spport ernie now → only support ernie now.
…yer mix quant
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist