[codex] perf: fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X #1778

Open
Oseltamivir wants to merge 1 commit into main from codex/minimax-m3-mi300x-aiter-rmsnorm

@Oseltamivir Oseltamivir commented Jun 15, 2026

Collaborator

Summary

  • add an opt-in m3-aiter-ar-rms-mode switch for MiniMax M3 vLLM jobs on MI300X
  • call AITER's fused custom-allreduce + Gemma RMSNorm primitive directly from M3's existing helper
  • defer attention and FFN/MoE tensor-parallel reductions into the following Gemma RMSNorm boundary
  • initialize the AITER communicator before HIP graph capture
  • install pinned, checksummed AITER/FlyDSL wheels and reject unexpected vLLM source drift
  • keep graph-mode concurrency 1 on the existing path after a measured 2.2% regression

This PR is intentionally isolated from the profiling branch. It does not include profile workflow changes, trace analysis, sparse-index tuning, MXFP8 changes, or the profiling report.

Why

MiniMax M3 uses Gemma-style RMSNorm, and TP decode profiles show the separate custom allreduce and norm boundaries dominating low/medium-concurrency decode. vLLM already exposes AITER's fused_ar_rms primitive, but M3 cannot use the normal torch.compile pattern-matching path. This change invokes that primitive explicitly while preserving an off control path.
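
As a rough numerical reference for what the fusion replaces, the sketch below compares "allreduce, then Gemma RMSNorm" with a single pass that produces the same result. It is plain Python over lists, not the AITER kernel: the function names, the `eps` default, and the omission of any residual add the real primitive may perform are all assumptions made for illustration.

```python
import math

def gemma_rms_norm(x, w, eps=1e-6):
    """Gemma-style RMSNorm: the learned weight is applied as (1 + w)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + wi) for v, wi in zip(x, w)]

def allreduce_sum(partials):
    """Stand-in for a tensor-parallel allreduce: elementwise sum over ranks."""
    return [sum(vals) for vals in zip(*partials)]

def unfused_reduce_then_norm(partials, w, eps=1e-6):
    # Two separate boundaries: the reduction, then the norm.
    reduced = allreduce_sum(partials)
    return gemma_rms_norm(reduced, w, eps)

def fused_reduce_and_norm(partials, w, eps=1e-6):
    # One pass: accumulate the reduced values and their sum of squares together,
    # then scale. Illustrative only; it does not model the real kernel's staging.
    reduced, sum_sq = [], 0.0
    for vals in zip(*partials):
        s = sum(vals)
        reduced.append(s)
        sum_sq += s * s
    inv_rms = 1.0 / math.sqrt(sum_sq / len(reduced) + eps)
    return [s * inv_rms * (1.0 + wi) for s, wi in zip(reduced, w)]
```

Calling both functions on the same per-rank partial outputs should agree to floating-point tolerance, which is the property an off/fused comparison relies on.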

Only the AITER custom-allreduce dependency is enabled. Other independently selectable AITER attention, linear, MoE, RMSNorm, BMM, RoPE, and shared-expert paths remain disabled.

Validation

The isolated branch validation ran only the TP8/c16 1k1k and 8k1k configurations and completed successfully:

| Sequence | Total tok/s/GPU | Output tok/s/GPU | Mean TTFT | Mean TPOT |
| --- | --- | --- | --- | --- |
| 1k1k | 208.09 | 103.49 | 0.210 s | 0.019 s |
| 8k1k | 678.21 | 75.15 | 0.942 s | 0.025 s |

The earlier graph-mode smoke test measured 218.96 tok/s/GPU at 1k1k/c16, 1.87% above the same MXFP8 baseline. The broader experimental branch measured positive c16/c256 deltas but also contained separate index-launch work; those changes are not present in this PR.

GSM8K strict exact match in the combined experimental run was 0.95830 versus 0.95679 before the change.

Upstream

The implementation builds on the vLLM AITER fused-allreduce/RMSNorm work in:


Note

High Risk
Mutates installed vLLM sources and model forward/TP reduction behavior at runtime; incorrect fusion or patch drift could affect numerical correctness and decode performance on MI300X.

Overview
Adds an opt-in m3-aiter-ar-rms-mode (off / control / fused) on benchmark and e2e workflows, wired through M3_AITER_AR_RMS_MODE and a distinct result filename suffix when not off. E2e matrix generation rejects non–MiniMax M3 vLLM MI300X jobs when the mode is enabled.

The MI300X MiniMax M3 recipe applies pinned AITER wheels (fused path only), SHA256-guarded runtime patches to fused_allreduce_gemma_rms_norm, and a deferred-FFN vLLM model patch so TP reductions fuse with the next Gemma RMSNorm via fused_ar_rms (AITER custom AR only; fused + concurrency 1 falls back to off). New installer/patch utilities and unit tests cover wheel pins and source fingerprint drift.
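
A SHA256-guarded patch of the kind described above could look roughly like the following. This is a minimal sketch rather than the PR's patcher: `SourceDriftError`, `sha256_of`, and `apply_guarded_patch` are hypothetical names, and the single-occurrence replacement is an assumed patching strategy.

```python
import hashlib
from pathlib import Path

class SourceDriftError(RuntimeError):
    """Raised when a file does not match any expected fingerprint."""

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def apply_guarded_patch(path: Path, expected_before: str,
                        old: bytes, new: bytes, expected_after: str) -> bool:
    """Patch `path` only if its content matches a pinned fingerprint.

    Returns True if the file was patched, False if it was already patched.
    """
    current = sha256_of(path)
    if current == expected_after:
        return False  # idempotent: nothing to do
    if current != expected_before:
        raise SourceDriftError(f"{path}: unexpected source sha256 {current}")

    data = path.read_bytes()
    if data.count(old) != 1:
        raise SourceDriftError(f"{path}: patch anchor not found exactly once")

    patched = data.replace(old, new)
    # Verify the result before writing, so a bad patch never lands on disk.
    if hashlib.sha256(patched).hexdigest() != expected_after:
        raise SourceDriftError(f"{path}: patched content has unexpected sha256")
    path.write_bytes(patched)
    return True
```

Checking both the pre-patch and post-patch digests means an upstream change to the target file fails loudly instead of producing a silently different model forward pass.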

Reviewed by Cursor Bugbot for commit 9f83809. Bugbot is set up for automated code reviews on this repo.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 15, 2026 14:51
@Oseltamivir Oseltamivir requested a review from a team June 15, 2026 14:51
claude Bot commented Jun 15, 2026

Contributor

Claude finished @Oseltamivir's task in 5m 13s.


PR Review: Fuse MiniMax M3 allreduce and Gemma RMSNorm on MI300X

  • Read repository guidelines
  • Gather context and read all changed files
  • Review benchmark script and workflow changes
  • Review installer and patcher utilities
  • Review tests
  • Cross-reference with vLLM source (MoERunner, FusedMoE, AITER ops)
  • Post review findings

LGTM - no blocking issues found.

This is a well-engineered PR with thorough integrity checking. The deferred allreduce pattern, AITER fused primitive invocation, pinned wheel installation, and SHA256 fingerprint enforcement are all sound. The A/B experiment design (control vs fused) is clean, and the c1 fallback for the measured 2.2% regression is a good defensive measure.

Key observations:

  • All source and patched file SHA256 checks provide strong drift detection across both the model patch and the runtime helper patch.
  • The MiniMaxM3DeferredMoERunner correctly skips the TP allreduce in _maybe_reduce_final_output, deferring it to the next layer's fused_allreduce_gemma_rms_norm boundary. The final model norm handles the last layer correctly.
  • The AITER invocation uses use_1stage=False (two-stage, graph-safe) which matches the pinned 0.1.15 tag that lacks the one-stage exit barrier fix.
  • Gamma caching (_inferencex_aiter_gamma) correctly adds 1.0 for GemmaRMSNorm's (1 + w) convention and is guarded by a type check on the norm class.
  • The installer's retry logic, torch version invariant check, and use_1stage signature verification are solid.
  • The control mode safely falls through to the standard allreduce path (AITER code is only entered when M3_AITER_AR_RMS_MODE == "fused"), making it a proper baseline experiment.
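
The gamma-caching observation can be illustrated with a small sketch. Only the attribute name `_inferencex_aiter_gamma` and the `(1 + w)` convention come from the review above; the two norm classes are simplified stand-ins rather than vLLM's modules, and the helper name `cached_gamma` is hypothetical.

```python
class GemmaRMSNorm:
    """Stand-in for a Gemma-style norm whose effective scale is (1 + weight)."""
    def __init__(self, weight):
        self.weight = list(weight)

class PlainRMSNorm:
    """Stand-in for a conventional norm whose effective scale is weight."""
    def __init__(self, weight):
        self.weight = list(weight)

def cached_gamma(norm):
    """Return the scale a plain-RMSNorm kernel needs to reproduce Gemma's norm."""
    # Guard on the norm class: adding 1.0 to a conventional weight would be wrong.
    if not isinstance(norm, GemmaRMSNorm):
        raise TypeError(f"expected GemmaRMSNorm, got {type(norm).__name__}")
    gamma = getattr(norm, "_inferencex_aiter_gamma", None)
    if gamma is None:
        # Computed once and reused; assumes weights are frozen at inference time.
        gamma = [1.0 + w for w in norm.weight]
        norm._inferencex_aiter_gamma = gamma
    return gamma
```

Caching on the module avoids recomputing the shifted weight on every decode step, at the cost of assuming the weights do not change after the first call.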

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
DURATION: ${{ inputs.duration }}
M3_AITER_AR_RMS_MODE: ${{ inputs.m3-aiter-ar-rms-mode }}
EXPERIMENT_SUFFIX: ${{ inputs.m3-aiter-ar-rms-mode != 'off' && format('_m3ar-{0}', inputs.m3-aiter-ar-rms-mode) || '' }}

Result suffix ignores concurrency override

Medium Severity

When m3-aiter-ar-rms-mode is fused and concurrency is 1, the MI300X recipe forces M3_AITER_AR_RMS_MODE to off, but EXPERIMENT_SUFFIX and RESULT_FILENAME still use the workflow input (_m3ar-fused). Stored artifacts and job labels can describe a fused run while the server actually used the default path.
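
One way to avoid this kind of mismatch is to derive the suffix from the effective mode rather than the requested one. The sketch below is a hypothetical illustration of that idea in Python, not the PR's workflow or recipe code; `effective_mode` and `experiment_suffix` are assumed names, and the fallback rule mirrors the behavior described in this thread.

```python
VALID_MODES = ("off", "control", "fused")

def effective_mode(requested: str, concurrency: int) -> str:
    """Resolve the mode the server will actually run with."""
    if requested not in VALID_MODES:
        raise ValueError(f"unknown m3-aiter-ar-rms-mode: {requested!r}")
    # Per the PR description, fused at concurrency 1 falls back to the off path.
    if requested == "fused" and concurrency == 1:
        return "off"
    return requested

def experiment_suffix(requested: str, concurrency: int) -> str:
    """Build the result-filename suffix from the effective mode."""
    mode = effective_mode(requested, concurrency)
    return "" if mode == "off" else f"_m3ar-{mode}"
```

With a single resolver shared by the recipe and the naming step, `experiment_suffix("fused", 1)` yields an empty suffix, so an artifact is never labeled as fused when the default path ran.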

Additional Locations (2)
