feat(blog): MiniMax M3 day-0 — H200 beats B200 at low concurrency on vLLM FP8#454

Open
Oseltamivir wants to merge 3 commits into master from blog/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency

Oseltamivir (Contributor) commented Jun 13, 2026

Summary

Day-0 MiniMax M3 benchmark post for the InferenceX blog. Anchors the headline on the H200-vs-B200 low-concurrency inversion seen in the launch-window discussion and answers why it happens.

Headline: On vLLM FP8, 1024/1024, non-MTP, H200 delivers up to 3.5x the throughput per GPU of B200 at 70 tok/s/user. At a matched recipe (TP=8, conc 4), H200 runs 113 tok/s/GPU at 8.4 ms TPOT vs B200's 59 tok/s/GPU at 16.2 ms: 1.9x the throughput at roughly half the latency, on the weaker chip.
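The matched-recipe comparison can be reproduced from the four quoted numbers. The sketch below is only arithmetic on the headline figures; the `per_user_rate` helper is an illustrative name, and it assumes the per-user decode rate is approximately 1000 / TPOT, which ignores time to first token.

```python
# Matched-recipe figures quoted in the headline (TP=8, concurrency 4).
H200 = {"tok_s_gpu": 113.0, "tpot_ms": 8.4}
B200 = {"tok_s_gpu": 59.0, "tpot_ms": 16.2}


def per_user_rate(tpot_ms: float) -> float:
    """Approximate decode rate seen by one user in tok/s (assumes rate = 1000 / TPOT)."""
    return 1000.0 / tpot_ms


throughput_ratio = H200["tok_s_gpu"] / B200["tok_s_gpu"]
latency_ratio = H200["tpot_ms"] / B200["tpot_ms"]

print(f"H200/B200 throughput per GPU: {throughput_ratio:.2f}x")
print(f"H200/B200 TPOT: {latency_ratio:.2f}x")
print(f"H200 per-user rate: {per_user_rate(H200['tpot_ms']):.0f} tok/s")
print(f"B200 per-user rate: {per_user_rate(B200['tpot_ms']):.0f} tok/s")
```

The "1.9x" and "half the latency" in the headline are roundings of these two ratios.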

Root cause (the learning): vLLM defaults Blackwell's FP8 block-scale MoE GEMM to DeepGEMM (large-batch-tuned, high fixed-latency floor at small batch); Hopper runs Marlin (low-concurrency-tuned). Identical M3 weights / MSA / routing on both SKUs — the low-batch spread is kernel selection. On-paper specs confirm it's software: B200 carries 2.27x H200's dense FP8 FLOPS and 1.67x the HBM bandwidth yet returns half the tokens at low batch. Fix in flight: flashinfer PR #3504 (MXFP8 MoE SwiGLU gated-activation params).
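One way to see the size of the gap between paper specs and the measurement is to compare the quoted spec ratios with the observed throughput ratio. This is a rough sketch that assumes, as a simplification, that throughput would scale in proportion to a single spec if the software paths were equivalent; the `shortfall` helper is hypothetical and not part of the post.

```python
# Spec ratios quoted in the PR description (B200 relative to H200).
SPEC_FP8_FLOPS_RATIO = 2.27
SPEC_HBM_BW_RATIO = 1.67

# Measured at the matched recipe (TP=8, concurrency 4), in tok/s/GPU.
H200_TOK_S_GPU = 113.0
B200_TOK_S_GPU = 59.0

observed_ratio = B200_TOK_S_GPU / H200_TOK_S_GPU


def shortfall(spec_ratio: float, observed: float) -> float:
    """Factor by which the observed ratio falls below a spec-proportional expectation."""
    return spec_ratio / observed


print(f"Observed B200/H200 throughput: {observed_ratio:.2f}x")
print(f"Shortfall vs HBM bandwidth ratio: {shortfall(SPEC_HBM_BW_RATIO, observed_ratio):.1f}x")
print(f"Shortfall vs FP8 FLOPS ratio: {shortfall(SPEC_FP8_FLOPS_RATIO, observed_ratio):.1f}x")
```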

Honest crossover: the 8192/1024 table shows H200 winning conc 4–16 and B200 retaking conc 32–64 (1.34x at conc 64) where DeepGEMM amortizes. MTP raises the crossover to ~60 tok/s/user.
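The amortization argument can be illustrated with a toy cost model in which each decode step costs a fixed floor plus a per-sequence term. Every constant below is hypothetical and chosen only to show the shape of a crossover; none of them are measured DeepGEMM or Marlin figures, and `KernelCost` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelCost:
    """Toy model: step time = fixed floor + per-sequence cost * batch size."""
    floor_ms: float
    per_seq_ms: float

    def step_ms(self, batch: int) -> float:
        return self.floor_ms + self.per_seq_ms * batch

    def tokens_per_s(self, batch: int) -> float:
        # One token per sequence per step.
        return batch * 1000.0 / self.step_ms(batch)


def crossover_batch(low_floor: KernelCost, high_floor: KernelCost) -> float:
    """Batch size at which both kernels take the same time per step."""
    return (high_floor.floor_ms - low_floor.floor_ms) / (
        low_floor.per_seq_ms - high_floor.per_seq_ms
    )


# Hypothetical constants, not measurements.
small_batch_kernel = KernelCost(floor_ms=1.0, per_seq_ms=0.10)
large_batch_kernel = KernelCost(floor_ms=3.0, per_seq_ms=0.02)

for batch in (4, 16, 64):
    a = small_batch_kernel.tokens_per_s(batch)
    b = large_batch_kernel.tokens_per_s(batch)
    print(f"batch={batch:>2}  low-floor={a:8.0f} tok/s  high-floor={b:8.0f} tok/s")
```

Below the crossover batch the low-floor kernel wins; above it the high-floor kernel's smaller per-sequence cost dominates, which is the qualitative pattern the 8192/1024 table describes.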

Data provenance

MCP DB tools and the GitHub data dump were unavailable (dump predates M3), so numbers were pulled live from the production API (/api/v1/benchmarks?model=MiniMax-M3, run 27451860491, date 2026-06-13). Iso-interactivity figures run through the bundled spline helper so they match the dashboard chart.
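An iso-interactivity figure is a throughput value read off a (tok/s/user, tok/s/GPU) curve at a fixed tok/s/user. The sketch below shows the idea with plain linear interpolation, which is a simpler stand-in for the bundled spline helper and will not reproduce its exact values. The sample points are the rows printed in the review snippet of this PR, used here only as stand-in inputs, and the column assignment (third column H200, fourth column B200) is an assumption based on the Bugbot comment.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Linearly interpolate y at x from (x, y) pairs; refuses to extrapolate."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError(f"{x} is outside the sampled range {xs[0]}..{xs[-1]}")
    i = bisect_right(xs, x)
    if i == len(xs):  # x is exactly the last sample
        return pts[-1][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# (tok/s/user, tok/s/GPU) stand-in samples taken from the printed table rows.
h200 = [(50, 560), (60, 444), (70, 371), (80, 317)]
b200 = [(50, 278), (60, 179), (70, 106), (80, 32)]

target = 65  # tok/s/user
ratio = interpolate(h200, target) / interpolate(b200, target)
print(f"H200/B200 at {target} tok/s/user (linear estimate): {ratio:.2f}x")
```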

⚠️ Before merge

  • Chart images not yet added: packages/app/public/images/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency/benchmark-{light,dark}.png. The two <Figure> blocks reference these paths; they will 404 on the preview until the PNGs are dropped in.
  • Verify the engineer attributions in Acknowledgments (Roger Wang, Thien, Yongye Zhu).
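The missing-image item can be checked mechanically before merge. This is a minimal sketch, assuming it is run from the repository root and that only the two `benchmark-{light,dark}.png` files named above need checking; the `missing_assets` helper is hypothetical and not part of the repo.

```python
from pathlib import Path

# Directory and file names taken from the checklist item above.
IMAGE_DIR = Path(
    "packages/app/public/images/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency"
)
EXPECTED = ("benchmark-light.png", "benchmark-dark.png")


def missing_assets(repo_root: Path) -> list[Path]:
    """Return the expected image paths (relative to repo_root) that do not exist."""
    return [
        IMAGE_DIR / name
        for name in EXPECTED
        if not (repo_root / IMAGE_DIR / name).is_file()
    ]


if __name__ == "__main__":
    for path in missing_assets(Path.cwd()):
        print(f"missing: {path}")
```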

Overlay support: N/A — this is a static MDX blog post, not an inference-chart feature.


Note

Low Risk
Content-only addition (static MDX); no application logic, auth, or data pipeline changes. Main merge risk is broken image paths until assets are committed.

Overview
Adds a new InferenceX blog post (minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency.mdx) covering day-0 MiniMax M3 benchmarks on vLLM FP8 (2026-06-13 run), with the headline that H200 can deliver up to ~3.5× B200 throughput per GPU in the high-interactivity / low-concurrency regime while B200/B300 still win at high batch.

The post explains the inversion as a vLLM kernel-selection gap: Blackwell defaults FP8 block-scale MoE to DeepGEMM (high fixed latency at small batch) vs Hopper’s Marlin path, with identical M3 weights. It includes TP=8 tables for 1024/1024 and 8192/1024, iso-interactivity throughput and $/M token comparisons, MTP crossover notes, links to InferenceX and flashinfer PR #3504, plus DashboardCTA, Figure assets (light/dark, including 8k/1k charts), and FAQ JsonLd.

Before merge: chart PNGs under public/images/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency/ are referenced but not in this diff (preview 404s until added).

Reviewed by Cursor Bugbot for commit 6aeb395. Bugbot is set up for automated code reviews on this repo.

@Oseltamivir Oseltamivir requested a review from adibarra as a code owner June 13, 2026 08:10
vercel Bot commented Jun 13, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | Jun 13, 2026 8:25am |

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e16a580.

| 50 | 368 | 560 | 278 | 463 | 2.02x |
| 60 | 297 | 444 | 179 | 336 | 2.48x |
| **70** | **241** | **371** | **106** | **209** | **3.50x** |
| 80 | 199 | 317 | 32 | 80 | 10.00x |

Iso table ratio mismatch

Medium Severity

At 80 tok/s/user, the iso-interactivity table shows the H200 / B200 ratio as 10.00x, while the same row lists 317 and 32 tok/s/GPU. Dividing those printed values gives about 9.91x, so the ratio does not match the displayed throughputs, unlike other rows (e.g. 3.50x at 70).
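The mismatch can be reproduced by recomputing each ratio from the printed throughputs. This sketch assumes the third and fourth table columns are the H200 and B200 values (as the comment above implies) and uses a tolerance of 0.01 because the printed throughputs are themselves rounded; `mismatched_rows` is an illustrative helper name.

```python
# (tok/s/user, H200 tok/s/GPU, B200 tok/s/GPU, printed H200/B200 ratio)
ROWS = [
    (50, 560, 278, 2.02),
    (60, 444, 179, 2.48),
    (70, 371, 106, 3.50),
    (80, 317, 32, 10.00),
]


def mismatched_rows(rows, tolerance=0.01):
    """Return rows whose printed ratio differs from h200 / b200 by more than tolerance."""
    flagged = []
    for tok_s_user, h200, b200, printed in rows:
        recomputed = h200 / b200
        if abs(recomputed - printed) > tolerance:
            flagged.append((tok_s_user, round(recomputed, 2), printed))
    return flagged


for tok_s_user, recomputed, printed in mismatched_rows(ROWS):
    print(f"{tok_s_user} tok/s/user: printed {printed:.2f}x, recomputed {recomputed:.2f}x")
```

With this tolerance only the 80 tok/s/user row is flagged; the small difference in the 50 tok/s/user row stays within what rounding of the inputs can explain.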
