[Klaud Cold][Experimental][DNM] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1) #1762
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.
If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515117946
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515119215
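First sweep failure: diagnosed & fixed
The first disagg sweep (run 27515119215) failed, but not because of a recipe bug. The day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL before the engine started.
Fix: add an on-demand download fallback to the vllm-disagg model-resolution block.
Scoped to the vllm-disagg branch; pre-staged models (M2.5/Kimi) never reach this path. Re-running the sweep.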
8118fa3 to a4f66bd
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519206250
a4f66bd to 409561f
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27520697241
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27521167091
…g smoke test
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.
Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.
Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).
perf-changelog.yaml and amd-master.yaml contain only M3 changes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The first MI355X disagg sweep (run 27515119215) failed: the day-zero
MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so
job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found.
Searched: ...") before the engine ever started. The single-node recipes
hf-download inside the serving container, but the disagg path historically
required ops to pre-stage checkpoints.
Add an on-demand fallback to the vllm-disagg model-resolution block: when the
checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name
-> org/name) and download into MODEL_DIR in HF cache layout, then resolve the
snapshot as MODEL_PATH. Staging into MODEL_DIR keeps MODEL_PATH under the dir
that is bind-mounted into the serving container as /models, so the existing
-v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.
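The repo-id derivation described above (models--org--name -> org/name) can be sketched as follows. This is an illustration in Python rather than the job.slurm shell code; the function name repo_id_from_hf_dir is hypothetical, and it assumes an org-qualified repo.

```python
def repo_id_from_hf_dir(hf_dir: str) -> str:
    """Map an HF cache directory name (models--org--name) to a repo id (org/name)."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF cache model dir: {hf_dir!r}")
    # Split only on the first "--" after the prefix, so a "--" that appears
    # inside the model name is kept as-is rather than turned into "/".
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"expected models--org--name, got {hf_dir!r}")
    return f"{org}/{name}"


if __name__ == "__main__":
    print(repo_id_from_hf_dir("models--MiniMaxAI--MiniMax-M3-MXFP8"))
```

Rejecting names without the models-- prefix keeps a typo in hf_dir from turning into a download of an unintended repo.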
Implementation notes:
- The host has no hf CLI, so the download runs in a one-shot container of the
serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
- flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a
re-check of snapshots/ under the lock makes it idempotent (resumable).
- hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed
through for gated repos.
- Scoped to the vllm-disagg branch only; pre-staged models never reach this
path (the search finds them first), so sglang/atom and existing vLLM disagg
models (M2.5/Kimi) are unaffected.
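The lock, re-check, and retry behaviour in the notes above can be sketched like this. It is a simplified Python illustration, not the shell code from the commit: ensure_downloaded, the lockfile name, and the download callback are hypothetical stand-ins.

```python
import fcntl
import os
import time
from typing import Callable


def ensure_downloaded(model_dir: str, hf_dir: str, download: Callable[[], None],
                      retries: int = 3, delay_s: float = 0.0) -> str:
    """Serialize concurrent callers with flock and download at most once."""
    snapshots = os.path.join(model_dir, hf_dir, "snapshots")
    lock_path = os.path.join(model_dir, f".{hf_dir}.lock")  # hypothetical lockfile name
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another caller holds the lock
        try:
            # Re-check under the lock: another node may already have finished.
            if not (os.path.isdir(snapshots) and os.listdir(snapshots)):
                for attempt in range(1, retries + 1):
                    try:
                        download()
                        break
                    except Exception:
                        if attempt == retries:
                            raise  # give up after the last attempt
                        time.sleep(delay_s)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return snapshots
```

The re-check inside the critical section is what makes the second caller a no-op: it waits on the lock, then finds a non-empty snapshots/ directory and skips the download.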
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container).
Add --entrypoint "" (as the serving container does) so bash actually runs hf download.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
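To make the ENTRYPOINT fix concrete, here is a small Python sketch that assembles the one-shot docker run argument list. The helper name download_container_argv is hypothetical, and the inner command is passed in verbatim because the commit message elides its arguments.

```python
from typing import List


def download_container_argv(image: str, model_dir: str, inner_cmd: str) -> List[str]:
    """Build a one-shot `docker run` argv that runs bash instead of the image ENTRYPOINT."""
    return [
        "docker", "run", "--rm",
        "--entrypoint", "",            # clear the image ENTRYPOINT so bash is the command
        "-v", f"{model_dir}:/models",  # same bind mount the serving container uses
        image,
        "bash", "-lc", inner_cmd,
    ]
```

Without the empty --entrypoint, everything after the image name is handed to the image's own entrypoint as arguments, which is the failure mode the commit describes.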
…wnload
Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:
export MODEL_PATH=/it-share/hf-hub-cache
submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.
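Resolving a snapshot under an HF-cache-style MODEL_DIR might look like the sketch below. It is written in Python for illustration, whereas the real lookup lives in job.slurm. The function name resolve_snapshot and the choice of the most recently modified snapshot are assumptions, not something the commit states.

```python
import os
from typing import Optional


def resolve_snapshot(model_dir: str, hf_dir: str) -> Optional[str]:
    """Return a snapshot directory under model_dir/hf_dir/snapshots, or None if absent."""
    snapshots = os.path.join(model_dir, hf_dir, "snapshots")
    if not os.path.isdir(snapshots):
        return None
    candidates = [
        path
        for path in (os.path.join(snapshots, name) for name in os.listdir(snapshots))
        if os.path.isdir(path)
    ]
    if not candidates:
        return None
    # Assumption: prefer the most recently modified snapshot when several exist.
    return max(candidates, key=os.path.getmtime)
```

With MODEL_DIR=/it-share/hf-hub-cache and hf_dir=models--MiniMaxAI--MiniMax-M3-MXFP8, the result would be a path under the directory named in the commit message, provided a snapshot exists there.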
This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tness)
The conc-1 1k1k smoke test never triggered an eval: the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test.
Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
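The eval-marking rule described above can be illustrated with a simplified sketch. This is not the repository's mark_eval_entries: the row fields scenario and conc are hypothetical, and the real policy may apply further selection. Only the 8k1k and conc >= 16 condition and the run-eval / eval-conc keys come from the commit message.

```python
MIN_EVAL_CONC = 16  # threshold quoted in the commit message


def mark_eval_entries_sketch(entries):
    """Mark 8k1k entries with conc >= MIN_EVAL_CONC as run-eval=true."""
    marked = []
    for entry in entries:
        entry = dict(entry)  # copy so the caller's rows are not mutated
        eligible = (
            entry.get("scenario") == "8k1k"
            and entry.get("conc", 0) >= MIN_EVAL_CONC
        )
        entry["run-eval"] = eligible
        if eligible:
            entry["eval-conc"] = entry["conc"]
        marked.append(entry)
    return marked


if __name__ == "__main__":
    rows = [{"scenario": "1k1k", "conc": 1}, {"scenario": "8k1k", "conc": 16}]
    for row in mark_eval_entries_sketch(rows):
        print(row)
```

Under this rule the original conc-1 1k1k row can never be marked, which is why the commit adds a separate 8k1k conc-16 row.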
7b33cf1 to 01ed5b8
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 01ed5b8. Configure here.
logger.error("Transfer %s failed: %s", status, e)
raise"""
new_wait = """ def waiting_for_transfer_complete(self):
Patches removed for pinned images
Medium Severity
This PR drops all runtime MoRI-IO vLLM patches from setup_deps.sh and switches Kimi/M2.5 serve flags to all2all-backend mori_low_latency, while amd-master.yaml still pins minimaxm2.5-fp8-mi355x-vllm-disagg and kimik2.5-fp4-mi355x-vllm-disagg to older nightly digests. Those jobs share the same vllm-disagg path, so they may hit unfixed hangs/assertions or unsupported CLI values without an image bump in this change.
Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27525928087


What
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3):
- minimaxm3-fp8-mi355x-vllm-disagg
Layered on #1585 (remove vLLM-disagg MoRI patches)
This PR brings in #1585's MoRI-patch-removal infra (that PR is very stale vs main, so the changes are applied selectively rather than by merge):
- amd_utils/{setup_deps.sh, server_vllm.sh, submit.sh, models_vllm.yaml}: taken from #1585, "[Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks" (main is untouched here since the merge-base, so these equal main + the mori removal). Includes --all2all-backend mori → mori_low_latency for the existing M2.5/Kimi entries.
- amd_utils/job.slurm: #1585's two vLLM-disagg hunks applied onto current main (keeping main's atom-disagg support): vllm-router image nightly-20260511-e667ebb → nightly-20260603-e667ebb, and drop the VLLM_MORIIO_CONNECTOR_READ_MODE env from the vllm-disagg container block.
M3 recipe
- benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh: model-agnostic disagg boilerplate (byte-identical to the M2.5 disagg script; the launcher resolves the per-SKU script by name).
- models_vllm.yaml MiniMax-M3-MXFP8: per-worker serve flags: --block-size 128 (MSA sparse/index cache), --language-model-only (text-only benchmark), --kv-cache-dtype fp8 (gfx950), --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe). Env: VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ENGINE_READY_TIMEOUT_S=3600.
Scope guard
perf-changelog.yaml and .github/configs/amd-master.yaml contain only M3 changes vs main.
Validation
- bash -n on the disagg script ✓
- generate_sweep_configs test-config → exactly 1 disagg config (exp-name minimaxm3_1k1k, runner mi355x-disagg, 1P TP8 + 1D TP8, conc 1) ✓
- minimaxm3 / fp8 / vllm-disagg → benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh ✓
- process_changelog.py selects minimaxm3-fp8-mi355x-vllm-disagg ✓
🤖 Generated with Claude Code
Note
Medium Risk
Removes MoRI workarounds and changes read-mode wiring for all vLLM-disagg jobs, which could regress existing Kimi/M2.5 disagg runs if the newer image/router assumptions are wrong; scope is benchmark/infra only.
Overview
Adds MiniMax-M3 MXFP8 vLLM prefill/decode benchmarking on MI355X: new config key minimaxm3-fp8-mi355x-vllm-disagg, runner minimaxm3_fp8_mi355x_vllm-disagg.sh (model weights via /it-share/hf-hub-cache), and MiniMax-M3-MXFP8 serve flags in models_vllm.yaml. Sweeps conc 1–16 at 1k/1k and 8k/1k with 1×TP8 prefill + 1×TP8 decode on vllm/vllm-openai-rocm:minimax-m3.
Refactors shared vLLM-disagg plumbing to match upstream MoRI behavior: drops runtime vLLM MoRI-IO Python patches from setup_deps.sh, enables KV transfer read_mode: true in server_vllm.sh instead of VLLM_MORIIO_CONNECTOR_READ_MODE, bumps vllm-router to nightly-20260603-e667ebb, and switches Kimi/M2.5 decode MoE all2all-backend from mori to mori_low_latency. Documents the new config in perf-changelog.yaml.
Reviewed by Cursor Bugbot for commit 5778199.