feat(blog): MiniMax M3 day-0 — H200 beats B200 at low concurrency on vLLM FP8#454
Open
Oseltamivir wants to merge 3 commits into
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e16a580. Configure here.
| 50 | 368 | 560 | 278 | 463 | 2.02x |
| 60 | 297 | 444 | 179 | 336 | 2.48x |
| **70** | **241** | **371** | **106** | **209** | **3.50x** |
| 80 | 199 | 317 | 32 | 80 | 10.00x |
Iso table ratio mismatch
Medium Severity
At 80 tok/s/user, the iso-interactivity table reports H200 / B200 as 10.00x, but the same row lists 317 and 32 tok/s/GPU. Dividing those printed values gives about 9.91x, so the ratio does not match the displayed throughputs, unlike other rows (e.g. 3.50x at 70).
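The check Bugbot describes can be reproduced with a short script. This is a minimal sketch, not part of the PR: it assumes the third and fourth table columns are the H200 and B200 tok/s/GPU values (inferred from the 317 and 32 quoted above), and the function name and the 0.02 tolerance are hypothetical choices for illustration.

```python
# Rows copied from the diff snippet above.
# Assumed column mapping (inferred from the review comment, not stated in the diff):
# (tok/s/user, H200 tok/s/GPU, B200 tok/s/GPU, printed H200/B200 ratio)
ROWS = [
    (50, 560, 278, 2.02),
    (60, 444, 179, 2.48),
    (70, 371, 106, 3.50),
    (80, 317, 32, 10.00),
]


def find_ratio_mismatches(rows, tol=0.02):
    """Return rows whose printed ratio differs from the recomputed one by more than tol."""
    flagged = []
    for interactivity, h200, b200, printed in rows:
        recomputed = round(h200 / b200, 2)
        # A small tolerance allows for ratios computed from unrounded throughputs.
        if abs(recomputed - printed) > tol:
            flagged.append((interactivity, recomputed, printed))
    return flagged


print(find_ratio_mismatches(ROWS))
```

With this tolerance only the 80 tok/s/user row is flagged. The 50 row recomputes to 2.01x against a printed 2.02x, which is consistent with the ratio having been derived from unrounded throughputs.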


Summary
Day-0 MiniMax M3 benchmark post for the InferenceX blog. Anchors the headline on the H200-vs-B200 low-concurrency inversion seen in the launch-window discussion and answers why it happens.
Headline: On vLLM FP8, 1024/1024, non-MTP, H200 delivers up to 3.5x the throughput per GPU of B200 at 70 tok/s/user. At matched recipe (TP=8, conc 4) H200 runs 113 tok/s/GPU @ 8.4ms TPOT vs B200's 59 @ 16.2ms — 1.9x throughput, half the latency, on the weaker chip.
Root cause (the learning): vLLM defaults Blackwell's FP8 block-scale MoE GEMM to DeepGEMM (large-batch-tuned, high fixed-latency floor at small batch); Hopper runs Marlin (low-concurrency-tuned). Identical M3 weights / MSA / routing on both SKUs — the low-batch spread is kernel selection. On-paper specs confirm it's software: B200 carries 2.27x H200's dense FP8 FLOPS and 1.67x the HBM bandwidth yet returns half the tokens at low batch. Fix in flight: flashinfer PR #3504 (MXFP8 MoE SwiGLU gated-activation params).
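The ratios quoted in the headline and root-cause paragraphs can be sanity-checked with a few divisions. This is a sketch only: the throughput and TPOT figures are the ones quoted in this description, while the spec values (dense FP8 TFLOPS and HBM bandwidth) are assumptions taken from public datasheet figures, not from this PR.

```python
# Matched-recipe figures quoted in the summary (TP=8, conc 4).
H200_RUN = {"tok_s_gpu": 113, "tpot_ms": 8.4}
B200_RUN = {"tok_s_gpu": 59, "tpot_ms": 16.2}

# ASSUMPTION: public datasheet values, not stated in this PR.
SPECS = {
    "H200": {"fp8_dense_tflops": 1979, "hbm_tb_s": 4.8},
    "B200": {"fp8_dense_tflops": 4500, "hbm_tb_s": 8.0},
}


def ratio(a, b):
    """Ratio a/b rounded to two decimals, as the tables print it."""
    return round(a / b, 2)


# Measured: H200 relative to B200 at the matched recipe.
throughput_ratio = ratio(H200_RUN["tok_s_gpu"], B200_RUN["tok_s_gpu"])
tpot_ratio = ratio(H200_RUN["tpot_ms"], B200_RUN["tpot_ms"])

# On paper: B200 relative to H200.
flops_ratio = ratio(SPECS["B200"]["fp8_dense_tflops"], SPECS["H200"]["fp8_dense_tflops"])
bandwidth_ratio = ratio(SPECS["B200"]["hbm_tb_s"], SPECS["H200"]["hbm_tb_s"])

print(throughput_ratio, tpot_ratio, flops_ratio, bandwidth_ratio)
```

Under these assumptions the measured advantage (about 1.9x throughput at roughly half the TPOT) points the opposite way from the on-paper advantage (about 2.27x FLOPS and 1.67x bandwidth), which is the gap the description attributes to kernel selection.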
Honest crossover: the 8192/1024 table shows H200 winning conc 4–16 and B200 retaking conc 32–64 (1.34x at conc 64) where DeepGEMM amortizes. MTP raises the crossover to ~60 tok/s/user.
Data provenance
MCP DB tools and the GitHub data dump were unavailable (the dump predates M3), so numbers were pulled live from the production API (/api/v1/benchmarks?model=MiniMax-M3, run 27451860491, date 2026-06-13). Iso-interactivity figures run through the bundled spline helper so they match the dashboard chart.
Chart PNGs: packages/app/public/images/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency/benchmark-{light,dark}.png. The two <Figure> blocks reference these paths; they will 404 on the preview until the PNGs are dropped in.
Overlay support: N/A. This is a static MDX blog post, not an inference-chart feature.
Note
Low Risk
Content-only addition (static MDX); no application logic, auth, or data pipeline changes. Main merge risk is broken image paths until assets are committed.
Overview
Adds a new InferenceX blog post (minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency.mdx) covering day-0 MiniMax M3 benchmarks on vLLM FP8 (2026-06-13 run), with the headline that H200 can deliver up to ~3.5× B200 throughput per GPU in the high-interactivity / low-concurrency regime while B200/B300 still win at high batch.
The post explains the inversion as a vLLM kernel-selection gap: Blackwell defaults FP8 block-scale MoE to DeepGEMM (high fixed latency at small batch) vs Hopper’s Marlin path, with identical M3 weights. It includes TP=8 tables for 1024/1024 and 8192/1024, iso-interactivity throughput and $/M token comparisons, MTP crossover notes, links to InferenceX and flashinfer PR #3504, plus DashboardCTA, Figure assets (light/dark, including 8k/1k charts), and FAQ JsonLd.
Before merge: chart PNGs under public/images/minimax-m3-vllm-fp8-h200-vs-b200-low-concurrency/ are referenced but not in this diff (preview 404s until added).
Reviewed by Cursor Bugbot for commit 6aeb395. Bugbot is set up for automated code reviews on this repo.