[Metax] support FLASH_ATTN #7914
(cherry picked from commit 8130e7c5a77ba39fdb47cce4db586257a3cf10e0)
# Conflicts:
#   custom_ops/metax_ops/apply_rope_qkv.cu
#   custom_ops/metax_ops/maca_version.h
#   fastdeploy/spec_decode/mtp.py
#   fastdeploy/worker/input_batch.py
#   fastdeploy/worker/metax_model_runner.py
(cherry picked from commit 49a405b5ab0867d297c1a74643fdf83e3bb1bed5)
support cuda graph (cherry picked from commit f78cbfbe0b69eac20bad4f5b1ed7aec25f12ce73)
(cherry picked from commit a0ca9aef03a1e7fa50a205c1737dcdf084f18685)
(cherry picked from commit e8bfe916642e78ff317a398317b846a7bd448772)
# Conflicts:
#   fastdeploy/envs.py
#   fastdeploy/worker/input_batch.py
(cherry picked from commit 0890acc6f4f94740c14e9788903ada9bbdaaf469)
(cherry picked from commit 712fd9c106109e54a7cba4e93ee90e8181d87a3d)
Thanks for your contribution!
Pull request overview
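This PR migrates the Metax (MACA) platform from rel2.5. It adds or replaces the attention backends and the related custom operators, and wires the new forward meta and input cache fields into the Worker and SpecDecode paths to support the new FlashAttention/Triton Attention compute path.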
Changes:
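- Adds/switches the attention backends on the Metax platform (FlashAttention + Triton) and extends MetaxForwardMeta to support rotary_embs_bf16.
- Adds rope_emb_bf16 and routing replay initialization to the Worker / MTP inference path, and links MTP reorder/insert with index_to_batch_id.
- Extends and integrates several Metax custom operators (RoPE, KV cache write, FlashAttention), and adjusts the compile and link arguments for custom ops.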
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/worker/metax_worker.py | Initializes the routing replay manager according to the config during cache initialization |
| fastdeploy/worker/metax_model_runner.py | Switches to MetaxForwardMeta, adds rope_emb_bf16, and adjusts the MTP call arguments |
| fastdeploy/worker/input_batch.py | Disables some pin_memory usage under MACA; ProposerInputBatch gains pre_ids and a platform check |
| fastdeploy/spec_decode/mtp.py | Conditionally imports MetaxForwardMeta and rope_emb_bf16 under MACA |
| fastdeploy/spec_decode/mtp_cuda.py | Under MACA, forward_meta uses MetaxForwardMeta and is passed rotary_embs_bf16 |
| fastdeploy/platforms/maca.py | Extends the selectable attention backends (FLASH/TRITON) and updates the log messages |
| fastdeploy/platforms/base.py | Adds TRITON_ATTN to the _Backend enum |
| fastdeploy/model_executor/layers/backends/metax/attention/triton_attn_metax_backend.py | Adds the Metax Triton attention backend (Python-side wrapper) |
| fastdeploy/model_executor/layers/backends/metax/attention/triton_attn_kernels.py | Adds Triton kernels: unified attention (prefill/decode) |
| fastdeploy/model_executor/layers/backends/metax/attention/flash_attn_metax_backend.py | Adds the Metax FlashAttention backend (two PD modes: split and mix) |
| fastdeploy/model_executor/layers/backends/metax/__init__.py | Exports the new Metax Flash/Triton attention backends |
| fastdeploy/model_executor/forward_meta.py | Adds MetaxForwardMeta with the extra rotary_embs_bf16 field |
| fastdeploy/envs.py | Adds a Metax FA split switch and a KV cache lock switch |
| custom_ops/setup_ops.py | Adds the new Metax operator source files and the library/header paths |
| custom_ops/metax_ops/write_cache_kv.cu | New: operator that writes K/V into the paged KV cache |
| custom_ops/metax_ops/write_cache_kv_with_rope.cu | New: cache-write operator with RoPE (including a speculate branch) |
| custom_ops/metax_ops/rotary_position_embedding.cu | New: RoPE operator supporting variable-length, Neox, and partial rotary |
| custom_ops/metax_ops/flash_attention.cu | New: varlen/kvcache forward operators that call into mcFlashAttn |
| custom_ops/metax_ops/maca_version.h | Removed: MACA version macro header |
| custom_ops/metax_ops/fused_moe_gemm_kernels.h | Removes the MACA_VERSION conditional branches and unifies the call argument types |
| custom_ops/metax_ops/apply_rope_qkv.cu | Removed: the old apply_rope_qkv implementation |
| custom_ops/gpu_ops/gelu_tanh.cu | Fixes the per-block thread count calculation (avoids exceeding 1024) |
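The gelu_tanh.cu entry is about clamping the thread count of a block; the review summary further down describes the change as `std::max` → `std::min`. A minimal Python sketch of that arithmetic follows. It assumes a limit of 1024 threads per block (the figure quoted in the table) and a hypothetical per-row element count `d`. The function names are illustrative and do not come from the kernel.

```python
MAX_THREADS_PER_BLOCK = 1024  # limit quoted in the review table; assumed here


def block_size_buggy(d: int) -> int:
    # max() turns the limit into a floor: any d above 1024 passes through
    # unchanged, and small d is inflated to 1024.
    return max(d, MAX_THREADS_PER_BLOCK)


def block_size_fixed(d: int) -> int:
    # min() clamps the thread count so it never exceeds the limit.
    return min(d, MAX_THREADS_PER_BLOCK)


for d in (256, 1024, 4096):
    print(d, block_size_buggy(d), block_size_fixed(d))
```

With `d = 4096` the `max` variant asks for 4096 threads, which is above the assumed limit, while the `min` variant caps the request at 1024.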
Comments suppressed due to low confidence (1)
custom_ops/metax_ops/flash_attention.cu:400
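- Same as above: the error is not actually thrown here either, so on failure execution silently continues, which may lead to follow-on problems such as NaN values or out-of-bounds accesses. Consider switching to PD_THROW so the call terminates immediately and surfaces the error code.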
if (status != MCFLASHATTN_STATUS_SUCCESS) {
phi::errors::External("Error in McFlashAttn, error code is %d", status);
}
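CI report generated from the following code (updated every 30 minutes):
1 Task overview
2 Required tasks are still failing: one is a coverage-threshold failure and one needs manual Approval. Resolve the failing Required tasks before merging.
2 Task status summary
Log column: failed tasks link directly to the log; running tasks link to the Job.
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 28/32 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage threshold (confidence: high)
Failed test cases: none. The log shows
Root cause:
Key log:
Suggested fix:
Fix summary: add unit tests for MACA FLASH_ATTN
Related changes:
Approval: manual approval (confidence: high). This Job requires manual Approval; CI continues only after it is approved.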
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7914 +/- ##
==========================================
Coverage ? 63.60%
==========================================
Files ? 468
Lines ? 65244
Branches ? 9987
==========================================
Hits ? 41496
Misses ? 20945
Partials ? 2803
if (status != MCFLASHATTN_STATUS_SUCCESS) {
  phi::errors::External("Error in McFlashAttn, error code is %d", status);
}
if num_requests < self.max_num_seqs:
    self.block_tables_buffer[num_requests:] = self.block_tables_buffer[num_requests - 1]
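The excerpt above pads the unused rows of the block-table buffer with a copy of the last active row. Below is a minimal sketch of that pattern using plain Python lists instead of tensors. `pad_block_tables` and the sample data are hypothetical. The guard for `num_requests == 0` is an assumption added for the sketch (an index of -1 would otherwise select the last row of the buffer); the PR is not known to include it.

```python
from typing import List


def pad_block_tables(buffer: List[List[int]], num_requests: int) -> None:
    """Fill rows [num_requests:] with copies of row num_requests - 1, in place."""
    max_num_seqs = len(buffer)
    # Assumed guard: with zero active requests there is no "last active row".
    if 0 < num_requests < max_num_seqs:
        last_active = buffer[num_requests - 1]
        for row in range(num_requests, max_num_seqs):
            buffer[row] = list(last_active)  # copy, so rows stay independent


tables = [[3, 7], [5, 9], [0, 0], [0, 0]]
pad_block_tables(tables, num_requests=2)
print(tables)  # [[3, 7], [5, 9], [5, 9], [5, 9]]
```

Copying the row rather than aliasing it keeps later in-place updates to one slot from leaking into the padded slots.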
    return "fastdeploy.model_executor.layers.attention.PaddleNativeAttnBackend"
elif selected_backend == _Backend.APPEND_ATTN:
    logger.info("Using FLASH ATTN backend to instead of attend attention.")
    logger.info("Using FLASH ATTN backend to instead of APPEND ATTN.")
extra_compile_args=metax_extra_compile_args,
library_dirs=[os.path.join(maca_path, "lib")],
extra_link_args=["-lruntime_cu", "-lmctlassEx"],
extra_link_args=["-lruntime_cu", "-lmctlassEx", "-lmcFlashAttn"],
bde3c09 to c5bf2d1
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-27 10:32:53
📋 Review summary
PR overview: adds Flash Attention support for Metax (沐曦) GPUs, replaces the old RoPE implementation, and fixes the block size calculation error in gelu_tanh.
Scope of changes: custom_ops/metax_ops/, fastdeploy/worker/, fastdeploy/model_executor/layers/backends/metax/
Impact tags: [Metax] [OP]
Issues
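| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | custom_ops/metax_ops/flash_attention.cu:182 | phi::errors::External(...) only constructs an error object and does not throw; execution silently continues after the mha call fails |
| 🔴 Bug | custom_ops/metax_ops/flash_attention.cu | The same problem exists in flash_attn_kvcache_forward (nothing is thrown after mha_fwd_kvcache fails) |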
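📝 PR convention check
The PR title lacks a functional tag ([Metax] alone is semantically incomplete; consider adding [Feature]), and every section of the PR description is still the empty template with no actual content.
Suggested title (can be copied directly):
[Metax][Feature] Support Flash Attention for Metax GPU
Suggested PR description (can be copied directly):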
## Motivation
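Add Flash Attention support for Metax (沐曦) GPUs, using the McFlashAttn library to replace the previous custom RoPE+Attention implementation and improve inference performance. Also fix the block size upper-bound calculation in the gelu_tanh kernel (`std::max` → `std::min`) and remove the obsolete MACA version compatibility code (the minimum version requirement is raised to > 3.3.2.0).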
## Modifications
- `custom_ops/metax_ops/flash_attention.cu`: adds the Flash Attention operators, supporting two modes: `flash_attn_varlen_forward` (variable-length sequence prefill) and `flash_attn_kvcache_forward` (decode stage with KV cache)
- `custom_ops/metax_ops/rotary_position_embedding.cu`: adds the RoPE position-embedding kernel (GQA support, with neox/partial variants), replacing the old `apply_rope_qkv.cu`
- `custom_ops/metax_ops/apply_rope_qkv.cu`: removes the old RoPE implementation
- `custom_ops/metax_ops/maca_version.h`: removes the version compatibility header
- `custom_ops/metax_ops/fused_moe_gemm_kernels.h`: removes the MACA version conditional-compilation branches
- `custom_ops/gpu_ops/gelu_tanh.cu`: fixes the block size calculation error (`std::max` → `std::min`)
- `fastdeploy/model_executor/layers/backends/metax/attention/flash_attn_metax_backend.py`: adds the Python-side wrapper that calls the Flash Attention backend
- `fastdeploy/worker/metax_model_runner.py`, `metax_worker.py`: adapt to the new attention backend
- `custom_ops/setup_ops.py`: updates the list of source files to compile
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The new Metax Flash Attention feature is clearly structured overall, but the two phi::errors::External(...) calls have a serious error-handling defect that must be fixed before merging. The PR description also needs to be filled in.
if (status != MCFLASHATTN_STATUS_SUCCESS) {
  phi::errors::External("Error in McFlashAttn, error code is %d", status);
}
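🔴 Bug: phi::errors::External(...) only constructs the error object and does not throw it. After mha_varlen_fwd fails, the program silently continues: the subsequent release_tensor calls run normally, but the output contains invalid data.
The same problem exists after the mha_fwd_kvcache call in flash_attn_kvcache_forward.
Suggested fix: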
if (status != MCFLASHATTN_STATUS_SUCCESS) {
PADDLE_THROW(phi::errors::External(
"McFlashAttn failed with error code %d", status));
}
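The failure mode is easy to reproduce outside C++. The sketch below is only a Python analogy, not Paddle code: `STATUS_SUCCESS`, `check_status_buggy`, and `check_status_fixed` are hypothetical names. Building an exception object without raising it has no effect on control flow, which mirrors calling phi::errors::External(...) without passing the result to a throwing macro.

```python
STATUS_SUCCESS = 0  # hypothetical status code, for illustration only


def check_status_buggy(status: int) -> None:
    if status != STATUS_SUCCESS:
        # The exception object is created and immediately discarded;
        # the caller never learns that the call failed.
        RuntimeError(f"attention kernel failed, error code {status}")


def check_status_fixed(status: int) -> None:
    if status != STATUS_SUCCESS:
        # Raising stops execution and surfaces the error code.
        raise RuntimeError(f"attention kernel failed, error code {status}")


check_status_buggy(3)  # returns normally despite the failure status
try:
    check_status_fixed(3)
except RuntimeError as exc:
    print("caught:", exc)
```

In the buggy variant the caller would go on to read whatever the failed kernel left in the output buffer, which is the silent-corruption scenario the review describes.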
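CI report generated from the following code (updated every 30 minutes):
1 Task overview
2 required tasks have failed and must be resolved before merging.
2 Task status summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 28/32 passed
3 Failure details (required only)
Approval: manual approval required (confidence: high). This Job requires manual Approval; CI continues only after it is approved.
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage threshold (confidence: high)
Failed test cases: none. The log shows
Root cause:
Key log:
Suggested fix:
Related changes: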
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
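- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.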