[codex] perf: fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X#1778
Oseltamivir wants to merge 1 commit into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @Oseltamivir's task in 5m 13s. PR Review: Fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X
LGTM - no blocking issues found. This is a well-engineered PR with thorough integrity checking. The deferred allreduce pattern, AITER fused primitive invocation, pinned wheel installation, and SHA256 fingerprint enforcement are all sound. The A/B experiment design ( Key observations:
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9f83809.
TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
DURATION: ${{ inputs.duration }}
M3_AITER_AR_RMS_MODE: ${{ inputs.m3-aiter-ar-rms-mode }}
EXPERIMENT_SUFFIX: ${{ inputs.m3-aiter-ar-rms-mode != 'off' && format('_m3ar-{0}', inputs.m3-aiter-ar-rms-mode) || '' }}
Result suffix ignores concurrency override
Medium Severity
When m3-aiter-ar-rms-mode is fused and concurrency is 1, the MI300X recipe forces M3_AITER_AR_RMS_MODE to off, but EXPERIMENT_SUFFIX and RESULT_FILENAME still use the workflow input (_m3ar-fused). Stored artifacts and job labels can describe a fused run while the server actually used the default path.
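One way to avoid the mismatch described above is to derive the suffix from the effective mode rather than from the raw workflow input. The following is a minimal sketch, not the PR's code: the function names `effective_mode` and `experiment_suffix` are hypothetical, and the only behavior taken from the PR text is that `fused` with concurrency 1 falls back to `off` and that the suffix has the form `_m3ar-<mode>` when the mode is not `off`.

```python
VALID_MODES = ("off", "control", "fused")


def effective_mode(requested: str, concurrency: int) -> str:
    """Return the mode the server would actually run with.

    Hypothetical helper: mirrors the recipe rule that a fused request
    at concurrency 1 is forced back to the default (off) path.
    """
    if requested not in VALID_MODES:
        raise ValueError(f"unknown mode: {requested!r}")
    if requested == "fused" and concurrency == 1:
        return "off"
    return requested


def experiment_suffix(mode: str) -> str:
    """Build the result-filename suffix from a mode (empty when off)."""
    return "" if mode == "off" else f"_m3ar-{mode}"


# Label the artifact with what actually ran, not with what was requested.
print(repr(experiment_suffix(effective_mode("fused", 1))))
print(repr(experiment_suffix(effective_mode("fused", 16))))
```

With this ordering, a fused request at concurrency 1 produces an empty suffix, so stored artifacts no longer claim a fused run that did not happen.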


Summary
m3-aiter-ar-rms-mode switch for MiniMax M3 vLLM jobs on MI300X

This PR is intentionally isolated from the profiling branch. It does not include profile workflow changes, trace analysis, sparse-index tuning, MXFP8 changes, or the profiling report.
Why
MiniMax M3 uses Gemma-style RMSNorm, and TP decode profiles show the separate custom allreduce and norm boundaries dominating low/medium-concurrency decode. vLLM already exposes AITER's fused_ar_rms primitive, but M3 cannot use the normal torch.compile pattern-matching path. This change invokes that primitive explicitly while preserving an off control path.

Only the AITER custom-allreduce dependency is enabled. Other independently selectable AITER attention, linear, MoE, RMSNorm, BMM, RoPE, and shared-expert paths remain disabled.
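For readers unfamiliar with the operation being fused, here is a pure-Python reference sketch of the math involved. It is not AITER's kernel and does not call vLLM: it assumes the fused primitive is numerically equivalent to a tensor-parallel sum-allreduce followed by Gemma-style RMSNorm (which scales by `1 + weight`), and it ignores residual handling and dtype details. All function names are hypothetical.

```python
import math


def allreduce_sum(partials):
    """Elementwise sum of per-rank partial vectors (stand-in for a TP allreduce)."""
    return [sum(vals) for vals in zip(*partials)]


def gemma_rms_norm(x, weight, eps=1e-6):
    """Gemma-style RMSNorm: divide by the RMS, then scale by (1 + weight)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


def unfused(partials, weight, eps=1e-6):
    """Two separate steps: allreduce first, then normalize."""
    return gemma_rms_norm(allreduce_sum(partials), weight, eps)


def fused_reference(partials, weight, eps=1e-6):
    """Single pass: reduce each element and accumulate its square together."""
    n = len(weight)
    reduced = [0.0] * n
    sum_sq = 0.0
    for i in range(n):
        s = sum(p[i] for p in partials)
        reduced[i] = s
        sum_sq += s * s
    inv_rms = 1.0 / math.sqrt(sum_sq / n + eps)
    return [reduced[i] * inv_rms * (1.0 + weight[i]) for i in range(n)]


# Two simulated TP ranks, each holding a partial sum of the hidden state.
ranks = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.5, -0.5, 0.75]]
gamma = [0.1, 0.0, -0.2, 0.3]
a = unfused(ranks, gamma)
b = fused_reference(ranks, gamma)
print(all(math.isclose(x, y, rel_tol=1e-12) for x, y in zip(a, b)))
```

The point of the comparison is that fusion changes where the work happens (one pass instead of two kernel boundaries), not what is computed, which is why an off control path is useful for A/B checks.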
Validation
165 passed: installer, runtime patch, recipe policy, and matrix-generator tests
4a560dd8db67c270f5e2afb614558271b76f2294
bash -n

The isolated branch validation ran only the TP8/c16 1k1k and 8k1k configurations and completed successfully:
The earlier graph smoke measured 218.96 tok/s/GPU at 1k1k/c16, 1.87% above the same MXFP8 baseline. The broader experimental branch measured positive c16/c256 deltas but also contained separate index-launch work; those changes are not present in this PR.
GSM8K strict exact match in the combined experimental run was 0.95830 versus 0.95679 before the change.
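The figures quoted above can be sanity-checked with a couple of lines of arithmetic. This sketch assumes the 1.87% gain is relative to the baseline throughput; the helper names are hypothetical, and the derived baseline is an implied value, not a number reported in this PR.

```python
def implied_baseline(measured: float, pct_gain: float) -> float:
    """Baseline implied by a measurement that is pct_gain percent above it."""
    return measured / (1.0 + pct_gain / 100.0)


def absolute_delta(after: float, before: float) -> float:
    """Plain difference between two scores."""
    return after - before


# Inputs are the values quoted in the Validation section.
baseline_tok_s = implied_baseline(218.96, 1.87)
gsm8k_delta = absolute_delta(0.95830, 0.95679)

print(f"implied baseline: {baseline_tok_s:.2f} tok/s/GPU")
print(f"GSM8K delta: {gsm8k_delta:+.5f}")
```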
Upstream
The implementation builds on the vLLM AITER fused-allreduce/RMSNorm work in:
Note
High Risk
Mutates installed vLLM sources and model forward/TP reduction behavior at runtime; incorrect fusion or patch drift could affect numerical correctness and decode performance on MI300X.
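The patch-drift risk noted above is the kind of failure a source fingerprint check is meant to catch. Below is a minimal sketch of such a guard, assuming a simple text-substitution patch; it is not the PR's installer. `FingerprintMismatch`, `sha256_of`, and `apply_guarded_patch` are hypothetical names, and only the idea of refusing to patch when the SHA256 of the installed source differs from a pinned value comes from the PR description.

```python
import hashlib
from pathlib import Path


class FingerprintMismatch(RuntimeError):
    """Raised when the target source is not the version the patch expects."""


def sha256_of(path: Path) -> str:
    """Hex SHA256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def apply_guarded_patch(path: Path, expected_sha256: str, old: str, new: str) -> None:
    """Replace `old` with `new` only if the file matches the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise FingerprintMismatch(
            f"{path}: expected {expected_sha256}, found {actual}; refusing to patch"
        )
    text = path.read_text(encoding="utf-8")
    # Require exactly one match so the edit cannot land in the wrong place.
    if text.count(old) != 1:
        raise FingerprintMismatch(f"{path}: patch anchor not found exactly once")
    path.write_text(text.replace(old, new, 1), encoding="utf-8")
```

Failing closed here means an upstream change to the patched file stops the run instead of silently producing a half-applied or misapplied patch.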
Overview
Adds an opt-in m3-aiter-ar-rms-mode (off/control/fused) on benchmark and e2e workflows, wired through M3_AITER_AR_RMS_MODE and a distinct result filename suffix when not off. E2e matrix generation rejects non-MiniMax M3 vLLM MI300X jobs when the mode is enabled.

The MI300X MiniMax M3 recipe applies pinned AITER wheels (fused path only), SHA256-guarded runtime patches to fused_allreduce_gemma_rms_norm, and a deferred-FFN vLLM model patch so TP reductions fuse with the next Gemma RMSNorm via fused_ar_rms (AITER custom AR only; fused + concurrency 1 falls back to off). New installer/patch utilities and unit tests cover wheel pins and source fingerprint drift.

Reviewed by Cursor Bugbot for commit 9f83809. Bugbot is set up for automated code reviews on this repo.
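The matrix-generation rule in the overview (reject jobs that are not MiniMax M3 on vLLM on MI300X whenever the mode is enabled) could look roughly like the sketch below. The job schema is an assumption: the keys `model`, `framework`, and `hardware` and their values are hypothetical and are not taken from the repository's matrix format.

```python
def validate_mode_for_job(job: dict, mode: str) -> None:
    """Raise ValueError if a non-off mode is requested for an unsupported job.

    Hypothetical schema: job is a dict with 'model', 'framework', 'hardware'.
    """
    if mode == "off":
        return  # The default path is valid for every job.
    is_target = (
        job.get("model") == "minimax-m3"
        and job.get("framework") == "vllm"
        and job.get("hardware") == "mi300x"
    )
    if not is_target:
        raise ValueError(
            f"m3-aiter-ar-rms-mode={mode!r} is only supported for "
            f"MiniMax M3 vLLM MI300X jobs, got {job!r}"
        )


# A supported job passes silently; an unsupported one is rejected up front.
validate_mode_for_job(
    {"model": "minimax-m3", "framework": "vllm", "hardware": "mi300x"}, "fused"
)
```

Rejecting at matrix-generation time keeps an unsupported combination from ever reaching a runner, where the failure would be slower and harder to attribute.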