[codex] perf: fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X by Oseltamivir · Pull Request #1778 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-15T11:23:36Z

Summary

add an opt-in m3-aiter-ar-rms-mode switch for MiniMax M3 vLLM jobs on MI300X
call AITER's fused custom-allreduce + Gemma RMSNorm primitive directly from M3's existing helper
defer attention and FFN/MoE tensor-parallel reductions into the following Gemma RMSNorm boundary
initialize the AITER communicator before HIP graph capture
install pinned, checksummed AITER/FlyDSL wheels and reject unexpected vLLM source drift
keep graph-mode concurrency 1 on the existing path after a measured 2.2% regression

This PR is intentionally isolated from the profiling branch. It does not include profile workflow changes, trace analysis, sparse-index tuning, MXFP8 changes, or the profiling report.

Why

MiniMax M3 uses Gemma-style RMSNorm, and TP decode profiles show the separate custom allreduce and norm boundaries dominating low/medium-concurrency decode. vLLM already exposes AITER's fused_ar_rms primitive, but M3 cannot use the normal torch.compile pattern-matching path. This change invokes that primitive explicitly while preserving an off control path.
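As a rough reference for what is being fused at each boundary, the sketch below spells out the unfused math in plain Python: sum the per-rank partial outputs, add the residual, then apply Gemma-style RMSNorm with its (1 + w) scaling. The function names are illustrative and not AITER's or vLLM's API, and folding the residual add into the boundary is an assumption based on how fused add-RMSNorm kernels are commonly structured.

```python
import math

def allreduce_sum(partials):
    """Element-wise sum over per-rank partial vectors (stand-in for a TP allreduce)."""
    return [sum(vals) for vals in zip(*partials)]

def gemma_rms_norm(x, weight, eps=1e-6):
    """Gemma-style RMSNorm: divide by the RMS, then scale by (1 + weight)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]

def reference_fused_boundary(partials, residual, weight, eps=1e-6):
    """Hypothetical reference: reduce, add residual (assumed), normalize.

    Returns (normalized_output, new_residual).
    """
    reduced = allreduce_sum(partials)
    new_residual = [r + h for r, h in zip(residual, reduced)]
    return gemma_rms_norm(new_residual, weight, eps), new_residual
```

A fused kernel is expected to match this reference up to floating-point tolerance while avoiding a separate kernel launch and memory round trip between the reduction and the norm.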

Only the AITER custom-allreduce dependency is enabled. Other independently selectable AITER attention, linear, MoE, RMSNorm, BMM, RoPE, and shared-expert paths remain disabled.

Validation

165 passed: installer, runtime patch, recipe policy, and matrix-generator tests
both runtime transformations apply to pristine vLLM 4a560dd8db67c270f5e2afb614558271b76f2294
patched source SHA256 values match the recipe's pinned fingerprints
patched Python files compile
benchmark script passes bash -n
workflow YAML parses successfully
isolated branch validation: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27543360103
graph-capture smoke: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27537660155
combined experimental six-point run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27538604485
c1 fallback validation: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27540379158

The isolated branch validation ran only the TP8/c16 1k1k and 8k1k configurations and completed successfully:

Sequence	Total tok/s/GPU	Output tok/s/GPU	Mean TTFT	Mean TPOT
1k1k	208.09	103.49	0.210 s	0.019 s
8k1k	678.21	75.15	0.942 s	0.025 s

The earlier graph smoke measured 218.96 tok/s/GPU at 1k1k/c16, 1.87% above the same MXFP8 baseline. The broader experimental branch measured positive c16/c256 deltas but also contained separate index-launch work; those changes are not present in this PR.

GSM8K strict exact match in the combined experimental run was 0.95830 versus 0.95679 before the change.
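The relative figures above can be sanity-checked with a few lines of arithmetic. The baseline throughput is not stated in this PR, so the value computed here is only what the quoted 1.87% delta implies; the helper names are illustrative.

```python
def implied_baseline(measured, pct_above):
    """Baseline implied by a measurement that is pct_above percent higher."""
    return measured / (1.0 + pct_above / 100.0)

def pct_delta(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Graph smoke: 218.96 tok/s/GPU, reported as 1.87% above the MXFP8 baseline.
baseline = implied_baseline(218.96, 1.87)
print(f"implied baseline: {baseline:.2f} tok/s/GPU")

# GSM8K strict exact match: 0.95830 after versus 0.95679 before.
gsm8k_abs = 0.95830 - 0.95679
gsm8k_rel = pct_delta(0.95830, 0.95679)
print(f"GSM8K delta: {gsm8k_abs:+.5f} absolute, {gsm8k_rel:+.3f}% relative")
```

The GSM8K difference is small enough that it is better read as "no observed accuracy regression" than as an improvement.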

Upstream

The implementation builds on the vLLM AITER fused-allreduce/RMSNorm work in:

Note

High Risk
Mutates installed vLLM sources and model forward/TP reduction behavior at runtime; incorrect fusion or patch drift could affect numerical correctness and decode performance on MI300X.

Overview
Adds an opt-in m3-aiter-ar-rms-mode (off / control / fused) on benchmark and e2e workflows, wired through M3_AITER_AR_RMS_MODE and a distinct result filename suffix when not off. E2e matrix generation rejects non–MiniMax M3 vLLM MI300X jobs when the mode is enabled.
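The matrix-generation guard described above might look roughly like the following. This is a minimal sketch: the job dictionary keys and their values are hypothetical placeholders, not the field names used by the actual matrix generator.

```python
VALID_MODES = ("off", "control", "fused")

def validate_m3_mode(job, mode):
    """Reject jobs that enable the mode outside its supported target.

    `job` is a hypothetical dict with "model", "framework" and "hardware" keys.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown m3-aiter-ar-rms-mode: {mode!r}")
    if mode == "off":
        return  # The default path places no restriction on the job.
    is_target = (
        job.get("model") == "minimax-m3"
        and job.get("framework") == "vllm"
        and job.get("hardware") == "mi300x"
    )
    if not is_target:
        raise ValueError(
            f"m3-aiter-ar-rms-mode={mode} is only valid for "
            "MiniMax M3 vLLM jobs on MI300X"
        )
```

Failing at matrix-generation time keeps an unsupported combination from ever being scheduled, instead of surfacing as a runtime error on the runner.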

The MI300X MiniMax M3 recipe applies pinned AITER wheels (fused path only), SHA256-guarded runtime patches to fused_allreduce_gemma_rms_norm, and a deferred-FFN vLLM model patch so TP reductions fuse with the next Gemma RMSNorm via fused_ar_rms (AITER custom AR only; fused + concurrency 1 falls back to off). New installer/patch utilities and unit tests cover wheel pins and source fingerprint drift.
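A SHA256-guarded patch of this kind can be sketched with the standard library alone. The function names and the bytes-in, bytes-out transform signature are assumptions for illustration; the PR's own patch utilities may be structured differently.

```python
import hashlib
from pathlib import Path

def sha256_hex(data):
    """Hex SHA256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()

def apply_guarded_patch(path, expected_before, transform, expected_after):
    """Patch a file only if both the input and the output match pinned digests.

    `transform` is a hypothetical callable mapping original bytes to patched bytes.
    """
    original = Path(path).read_bytes()
    if sha256_hex(original) != expected_before:
        raise RuntimeError(f"source drift detected in {path}")
    patched = transform(original)
    if sha256_hex(patched) != expected_after:
        raise RuntimeError(f"patched output for {path} does not match its pin")
    # Write only after both checks pass, so a failed patch leaves the file untouched.
    Path(path).write_bytes(patched)
```

Checking the digest before and after means an unexpected upstream change is rejected outright, and a transform that silently produces different output is caught before it reaches the installed package.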

Reviewed by Cursor Bugbot for commit 9f83809.

github-actions · 2026-06-15T11:23:46Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-15T14:51:47Z

Claude finished @Oseltamivir's task in 5m 13s.

PR Review: Fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X

Read repository guidelines
Gather context and read all changed files
Review benchmark script and workflow changes
Review installer and patcher utilities
Review tests
Cross-reference with vLLM source (MoERunner, FusedMoE, AITER ops)
Post review findings

LGTM - no blocking issues found.

This is a well-engineered PR with thorough integrity checking. The deferred allreduce pattern, AITER fused primitive invocation, pinned wheel installation, and SHA256 fingerprint enforcement are all sound. The A/B experiment design (control vs fused) is clean, and the c1 fallback for the measured 2.2% regression is a good defensive measure.

Key observations:

All source and patched file SHA256 checks provide strong drift detection across both the model patch and the runtime helper patch.
The MiniMaxM3DeferredMoERunner correctly skips the TP allreduce in _maybe_reduce_final_output, deferring it to the next layer's fused_allreduce_gemma_rms_norm boundary. The final model norm handles the last layer correctly.
The AITER invocation uses use_1stage=False (two-stage, graph-safe), which matches the pinned 0.1.15 tag that lacks the one-stage exit barrier fix.
Gamma caching (_inferencex_aiter_gamma) correctly adds 1.0 for GemmaRMSNorm's (1 + w) convention and is guarded by a type check on the norm class.
The installer's retry logic, torch version invariant check, and use_1stage signature verification are solid.
The control mode safely falls through to the standard allreduce path (AITER code is only entered when M3_AITER_AR_RMS_MODE == "fused"), making it a proper baseline experiment.
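The gamma caching observation can be illustrated with a small sketch. The GemmaRMSNorm class here is a hypothetical stand-in with a plain list as its weight, not vLLM's layer; only the _inferencex_aiter_gamma attribute name comes from the review text.

```python
class GemmaRMSNorm:
    """Hypothetical stand-in for a Gemma-style norm layer storing w, applied as (1 + w)."""

    def __init__(self, weight):
        self.weight = list(weight)

def aiter_gamma(norm):
    """Return the effective scale (1 + w), computing it once and caching it on the layer."""
    if not isinstance(norm, GemmaRMSNorm):
        # A kernel expecting a plain gamma must not receive (1 + w) for other norm types.
        raise TypeError(f"expected GemmaRMSNorm, got {type(norm).__name__}")
    cached = getattr(norm, "_inferencex_aiter_gamma", None)
    if cached is None:
        cached = [1.0 + w for w in norm.weight]
        norm._inferencex_aiter_gamma = cached
    return cached
```

Caching avoids recomputing the shifted weight on every decode step, and the type guard keeps the +1.0 shift from being applied to a norm that already stores its full scale.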

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9f83809.

cursor · 2026-06-15T14:53:24Z

  TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
  DURATION: ${{ inputs.duration }}
+  M3_AITER_AR_RMS_MODE: ${{ inputs.m3-aiter-ar-rms-mode }}
+  EXPERIMENT_SUFFIX: ${{ inputs.m3-aiter-ar-rms-mode != 'off' && format('_m3ar-{0}', inputs.m3-aiter-ar-rms-mode) || '' }}


Result suffix ignores concurrency override

Medium Severity

When m3-aiter-ar-rms-mode is fused and concurrency is 1, the MI300X recipe forces M3_AITER_AR_RMS_MODE to off, but EXPERIMENT_SUFFIX and RESULT_FILENAME still use the workflow input (_m3ar-fused). Stored artifacts and job labels can describe a fused run while the server actually used the default path.

Additional Locations (2)

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh#L38-L44

.github/workflows/benchmark-tmpl.yml#L186-L187
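One way to address the mismatch is to derive the suffix from the mode the server actually ran with, not from the requested input. The sketch below is an illustration of that idea, not the fix adopted in this PR; both function names are hypothetical, and only the _m3ar- prefix and the fused-at-concurrency-1 fallback come from the thread.

```python
def effective_mode(requested, concurrency):
    """Mode the recipe actually uses: fused at concurrency 1 falls back to off."""
    if requested == "fused" and concurrency == 1:
        return "off"
    return requested

def experiment_suffix(mode):
    """Result filename suffix; empty for the default path."""
    return "" if mode == "off" else f"_m3ar-{mode}"

# Label artifacts with the effective mode so filenames match server behavior.
for requested, concurrency in [("fused", 1), ("fused", 16), ("control", 1)]:
    mode = effective_mode(requested, concurrency)
    print(requested, concurrency, repr(experiment_suffix(mode)))
```

With this ordering, a fused request at concurrency 1 produces an unsuffixed result, so it cannot be mistaken for a fused measurement.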

Reviewed by Cursor Bugbot for commit 9f83809.

perf(mi300x): fuse M3 allreduce Gemma RMSNorm

9f83809

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

Oseltamivir added the full-sweep-enabled label Jun 15, 2026

Oseltamivir marked this pull request as ready for review June 15, 2026 14:51

Oseltamivir requested a review from a team June 15, 2026 14:51

cursor Bot reviewed Jun 15, 2026

Oseltamivir wants to merge 1 commit into main from codex/minimax-m3-mi300x-aiter-rmsnorm
