11 changes: 9 additions & 2 deletions .github/workflows/benchmark-tmpl.yml
@@ -87,6 +87,11 @@ on:
required: false
type: string
default: '1800'
m3-aiter-ar-rms-mode:
description: "MiniMax M3 MI300X AITER AR+RMS experiment mode"
required: false
type: string
default: 'off'
env:
RANDOM_RANGE_RATIO: 0.8
HF_TOKEN: ${{ secrets.INFERENCEX_OFFICIAL_RO_HF_TOKEN }}
@@ -114,6 +119,8 @@ env:
OFFLOADING: ${{ inputs.offloading }}
TOTAL_CPU_DRAM_GB: ${{ inputs.total-cpu-dram-gb }}
DURATION: ${{ inputs.duration }}
M3_AITER_AR_RMS_MODE: ${{ inputs.m3-aiter-ar-rms-mode }}
EXPERIMENT_SUFFIX: ${{ inputs.m3-aiter-ar-rms-mode != 'off' && format('_m3ar-{0}', inputs.m3-aiter-ar-rms-mode) || '' }}

Result suffix ignores concurrency override

Medium Severity

When m3-aiter-ar-rms-mode is fused and concurrency is 1, the MI300X recipe forces M3_AITER_AR_RMS_MODE to off, but EXPERIMENT_SUFFIX and RESULT_FILENAME still use the workflow input (_m3ar-fused). Stored artifacts and job labels can describe a fused run while the server actually used the default path.


Reviewed by Cursor Bugbot for commit 9f83809.
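One way to avoid the mismatch described in the review comment is to derive the suffix from the mode the recipe will actually run with, rather than from the raw workflow input. The sketch below is illustrative only and is not part of this PR: `effective_mode` and `experiment_suffix` are hypothetical helper names, and the concurrency-1 override simply mirrors the rule in the MI300X recipe script.

```python
VALID_MODES = {"off", "control", "fused"}


def effective_mode(requested: str, conc: int) -> str:
    """Return the mode the server would actually use (hypothetical helper)."""
    if requested not in VALID_MODES:
        raise ValueError(f"Invalid M3_AITER_AR_RMS_MODE: {requested}")
    # Mirrors the recipe: "fused" is forced back to "off" at concurrency 1.
    if requested == "fused" and conc == 1:
        return "off"
    return requested


def experiment_suffix(requested: str, conc: int) -> str:
    """Build the result-file suffix from the effective mode, not the input."""
    mode = effective_mode(requested, conc)
    return "" if mode == "off" else f"_m3ar-{mode}"


if __name__ == "__main__":
    for requested, conc in [("fused", 1), ("fused", 8), ("control", 1), ("off", 4)]:
        print(requested, conc, repr(experiment_suffix(requested, conc)))
```

With this shape, a fused request at concurrency 1 produces an empty suffix, so the stored artifact name would agree with the path the server took. Whether the same rule belongs in the workflow expression or in the recipe is a design decision for the PR author.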

RESULT_DIR: /workspace/results
PYTHONDONTWRITEBYTECODE: '1'
PYTHONPYCACHEPREFIX: /tmp/inferencex-pycache
@@ -125,7 +132,7 @@ jobs:
benchmark:
runs-on: ${{ inputs.runner }}
timeout-minutes: 500
name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.eval-only && ' | eval-only' || (inputs.run-eval && ' | eval' || '') }}"
name: "${{ inputs.exp-name }} ${{ inputs.precision }} ${{ inputs.runner }} ${{ inputs.framework }} | tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} | disagg-${{ inputs.disagg }} spec-${{ inputs.spec-decoding }} conc-${{ inputs.conc }}${{ inputs.m3-aiter-ar-rms-mode != 'off' && format(' | m3ar-{0}', inputs.m3-aiter-ar-rms-mode) || '' }}${{ inputs.eval-only && ' | eval-only' || (inputs.run-eval && ' | eval' || '') }}"
steps:
- name: Resource cleanup (pre-run)
run: &resource-cleanup |
@@ -177,7 +184,7 @@ jobs:
RUNNER_NAME: ${{ runner.name }}
RUNNER_TYPE: ${{ inputs.runner }}
# Hash uniquely on {EXP_NAME}_{PRECISION}_{FRAMEWORK}_tp{}-ep{}-dpa{}_disagg-{}_spec-{}_conc{}_{runner}
RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}-ep${{ env.EP_SIZE }}-dpa${{ env.DP_ATTENTION }}_disagg-${{ env.DISAGG }}_spec-${{ env.SPEC_DECODING }}_conc${{ env.CONC }}_${{ runner.name }}
RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}-ep${{ env.EP_SIZE }}-dpa${{ env.DP_ATTENTION }}_disagg-${{ env.DISAGG }}_spec-${{ env.SPEC_DECODING }}_conc${{ env.CONC }}_${{ runner.name }}${{ env.EXPERIMENT_SUFFIX }}
# Suppress per-job eval markdown from being appended to the step summary.
# We'll publish a single combined eval table in the collection job instead.
GITHUB_STEP_SUMMARY: ''
38 changes: 38 additions & 0 deletions .github/workflows/e2e-tests.yml
@@ -21,6 +21,15 @@ on:
required: false
type: string
default: ""
m3-aiter-ar-rms-mode:
description: "MiniMax M3 MI300X AITER AR+RMS experiment mode"
required: false
type: choice
options:
- "off"
- control
- fused
default: "off"
workflow_call:
inputs:
generate-cli-command:
@@ -40,6 +49,11 @@ on:
required: false
type: string
default: ""
m3-aiter-ar-rms-mode:
description: "MiniMax M3 MI300X AITER AR+RMS experiment mode"
required: false
type: string
default: "off"

jobs:
get-jobs:
@@ -65,10 +79,32 @@ jobs:
ref: ${{ github.sha }}

- id: get-jobs
env:
M3_AITER_AR_RMS_MODE: ${{ inputs.m3-aiter-ar-rms-mode }}
run: |
pip install pydantic
CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
if [ "$M3_AITER_AR_RMS_MODE" != "off" ]; then
python3 -c '
import json
import sys

data = json.load(sys.stdin)
invalid = [
item
for item in data
if item.get("runner") != "mi300x"
or item.get("model-prefix") != "minimaxm3"
or item.get("framework") != "vllm"
]
if invalid:
raise SystemExit(
"M3 AITER AR+RMS mode only supports MiniMax M3 "
"vLLM jobs on MI300X"
)
' <<< "$CONFIG_JSON"
fi
AGENTIC=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if x.get('scenario-type') == 'agentic-coding' and 'prefill' not in x]))")
MULTI_AGENTIC=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if x.get('scenario-type') == 'agentic-coding' and 'prefill' in x]))")
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('scenario-type') != 'agentic-coding' and not x.get('eval-only', False)]))")
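The inline guard added to the `get-jobs` step can also be read as a standalone function. This is a minimal sketch, assuming each sweep entry is a dict carrying the `runner`, `model-prefix`, and `framework` keys used in the diff. `find_unsupported_jobs` is a hypothetical name, and the sample entries (including the `h100` runner) are made up for illustration.

```python
import json

# Key/value pairs the experiment mode requires, taken from the workflow guard.
REQUIRED = {"runner": "mi300x", "model-prefix": "minimaxm3", "framework": "vllm"}


def find_unsupported_jobs(configs: list[dict]) -> list[dict]:
    """Return every entry that fails at least one required key/value pair."""
    return [
        item
        for item in configs
        if any(item.get(key) != value for key, value in REQUIRED.items())
    ]


if __name__ == "__main__":
    sample = json.loads(
        '[{"runner": "mi300x", "model-prefix": "minimaxm3", "framework": "vllm"},'
        ' {"runner": "h100", "model-prefix": "minimaxm3", "framework": "vllm"}]'
    )
    invalid = find_unsupported_jobs(sample)
    if invalid:
        print(f"{len(invalid)} job(s) unsupported for M3 AITER AR+RMS mode")
```

Returning the offending entries, instead of only raising, would make it possible to name them in the error message. The workflow as written reports only the generic failure.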
@@ -263,6 +299,7 @@ jobs:
spec-decoding: ${{ matrix.config.spec-decoding }}
disagg: ${{ matrix.config.disagg }}
run-eval: false
m3-aiter-ar-rms-mode: ${{ inputs.m3-aiter-ar-rms-mode }}
ref: ${{ inputs.ref }}

test-sweep-evals:
@@ -294,6 +331,7 @@ jobs:
disagg: ${{ matrix.config.disagg }}
run-eval: true
eval-only: true
m3-aiter-ar-rms-mode: ${{ inputs.m3-aiter-ar-rms-mode }}
ref: ${{ inputs.ref }}

collect-results:
83 changes: 83 additions & 0 deletions benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
@@ -5,6 +5,7 @@
# is mandatory for MSA sparse attention. Keep the default BF16 KV cache on
# gfx942: the checkpoint has no calibrated q/prob scales for ROCm FP8
# attention, and vLLM's fallback scale of 1.0 corrupts model accuracy.
# Target image vLLM revision: 4a560dd8db67c270f5e2afb614558271b76f2294.

source "$(dirname "$0")/../../benchmark_lib.sh"

Expand Down Expand Up @@ -34,6 +35,88 @@ SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

M3_AITER_AR_RMS_MODE="${M3_AITER_AR_RMS_MODE:-off}"
if [ "$M3_AITER_AR_RMS_MODE" = "fused" ] && [ "$CONC" -eq 1 ]; then
# The graph-safe two-stage AITER primitive regresses same-node c1 by 2.2%.
M3_AITER_AR_RMS_MODE=off
echo "M3 AITER AR+RMS graph policy: using off for concurrency 1"
fi
export M3_AITER_AR_RMS_MODE
case "$M3_AITER_AR_RMS_MODE" in
off)
;;
control|fused)
VLLM_PACKAGE_ROOT="$(
python - <<'PY'
from pathlib import Path
import vllm
print(Path(vllm.__file__).resolve().parent.parent)
PY
)"

# Enable only the AITER custom all-reduce dependency. M3 does not
# support torch.compile, so the runtime patch invokes this primitive
# directly from the existing allreduce+Gemma RMSNorm helper.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_PAGED_ATTN=0
export VLLM_ROCM_USE_AITER_LINEAR=0
export VLLM_ROCM_USE_AITER_LINEAR_HIPBMM=0
export VLLM_ROCM_USE_AITER_MOE=0
export VLLM_ROCM_USE_AITER_RMSNORM=0
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=0
export VLLM_ROCM_USE_AITER_TRITON_ROPE=0
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
export VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_TRITON_GEMM=0

if [ "$M3_AITER_AR_RMS_MODE" = "fused" ]; then
# The image's AITER build predates the two-stage memory-ordering fix.
python3 /workspace/utils/install_minimaxm3_aiter.py
fi

python3 /workspace/utils/patch_minimaxm3_aiter_ar_rms.py

DEFERRED_FFN_AR_PATCH="$(dirname "$0")/minimaxm3_mi300x_deferred_ffn_ar.patch"
M3_MODEL_SOURCE="$VLLM_PACKAGE_ROOT/vllm/models/minimax_m3/amd/model.py"
M3_MODEL_SOURCE_SHA256="91d81f8613e32f7afbd65c289f7885c5371263f70503bd053f97880989bf7536"
M3_MODEL_PATCHED_SHA256="d26aa77cfce7c6162b0d1ebe2b403b854f5abe8f656b3a8deda2db1d89318ea8"
m3_model_sha256="$(sha256sum "$M3_MODEL_SOURCE" | awk '{print $1}')"
if [ "$m3_model_sha256" = "$M3_MODEL_SOURCE_SHA256" ]; then
if ! patch --batch --dry-run -d "$VLLM_PACKAGE_ROOT" -p1 \
< "$DEFERRED_FFN_AR_PATCH"; then
echo "Failed to validate the M3 deferred FFN allreduce patch" >&2
exit 1
fi
if ! patch --batch -d "$VLLM_PACKAGE_ROOT" -p1 \
< "$DEFERRED_FFN_AR_PATCH"; then
echo "Failed to apply the M3 deferred FFN allreduce patch" >&2
exit 1
fi
elif [ "$m3_model_sha256" != "$M3_MODEL_PATCHED_SHA256" ]; then
echo "M3 model source fingerprint mismatch: $m3_model_sha256" >&2
exit 1
fi
python3 -m py_compile "$M3_MODEL_SOURCE"
m3_model_sha256="$(sha256sum "$M3_MODEL_SOURCE" | awk '{print $1}')"
if [ "$m3_model_sha256" != "$M3_MODEL_PATCHED_SHA256" ]; then
echo "M3 model patched fingerprint mismatch: $m3_model_sha256" >&2
exit 1
fi
echo "M3 deferred FFN allreduce patch ready: $m3_model_sha256"
echo "M3 AITER AR+RMS experiment mode: $M3_AITER_AR_RMS_MODE"
;;
*)
echo "Invalid M3_AITER_AR_RMS_MODE: $M3_AITER_AR_RMS_MODE" >&2
exit 2
;;
esac

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
fi
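The fingerprint gate in the recipe above makes the patch step idempotent: a pristine file is patched, an already patched file is accepted, and anything else aborts. The decision can be sketched as follows. `sha256_of` and `patch_action` are hypothetical names, and the returned strings are labels chosen for this illustration rather than anything the script prints.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's bytes, like `sha256sum FILE | awk '{print $1}'`."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def patch_action(current: str, source_sha: str, patched_sha: str) -> str:
    """Decide what to do with the model file given its current fingerprint."""
    if current == source_sha:
        return "apply"  # pristine file: dry-run the patch, then apply it
    if current == patched_sha:
        return "skip"   # already patched: nothing to do
    # Unknown contents: refuse to patch a file we cannot identify.
    raise RuntimeError(f"M3 model source fingerprint mismatch: {current}")
```

The recipe then recomputes the hash after patching and compares it with the patched fingerprint again, so a patch that applies with fuzz or offsets and yields different bytes is still rejected.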
141 changes: 141 additions & 0 deletions benchmarks/single_node/fixed_seq_len/minimaxm3_mi300x_deferred_ffn_ar.patch
@@ -0,0 +1,141 @@
diff --git a/vllm/models/minimax_m3/amd/model.py b/vllm/models/minimax_m3/amd/model.py
index 27650c8e6..ac5e260a8 100644
--- a/vllm/models/minimax_m3/amd/model.py
+++ b/vllm/models/minimax_m3/amd/model.py
@@ -17,6 +17,7 @@ The MiniMax-M3-preview config selects a single set of branches:
"index" attention branch.
"""

+import os
from collections.abc import Iterable

import torch
@@ -37,10 +38,12 @@ from vllm.model_executor.layers.attention_layer_base import AttentionLayerBase
from vllm.model_executor.layers.fused_allreduce_gemma_rms_norm import (
fused_allreduce_gemma_rms_norm,
+ initialize_m3_aiter_allreduce,
)
from vllm.model_executor.layers.fused_moe import (
FusedMoE,
GateLinear,
+ MoERunner,
fused_moe_make_expert_params_mapping,
)
)
from vllm.model_executor.layers.linear import (
@@ -117,6 +119,17 @@ def _is_moe_layer(config: PretrainedConfig, layer_id: int) -> bool:
return moe_layer_freq[layer_id] != 0


+def _defer_ffn_allreduce() -> bool:
+ """Whether M3 FFN reductions are completed by the following Gemma norm."""
+ parallel_config = get_current_vllm_config().parallel_config
+ return (
+ os.getenv("M3_AITER_AR_RMS_MODE") in {"control", "fused"}
+ and parallel_config.tensor_parallel_size > 1
+ and parallel_config.pipeline_parallel_size == 1
+ and parallel_config.data_parallel_size == 1
+ )
+
+
def _build_rotary_emb(config: PretrainedConfig, head_dim: int):
"""Build the (partial NeoX) RoPE, honoring an optional ``rope_scaling`` config.

@@ -243,6 +256,25 @@ class MiniMaxM3MLP(nn.Module):
return x


+class MiniMaxM3DeferredMoERunner(MoERunner):
+ """Leave the M3 MoE output rank-local for the following fused AR+RMSNorm."""
+
+ def _maybe_reduce_final_output(
+ self,
+ states: torch.Tensor,
+ trunc_size: int | None,
+ ) -> torch.Tensor:
+ if self._fused_output_is_reduced:
+ raise RuntimeError(
+ "M3 deferred MoE allreduce requires an unreduced MoE backend"
+ )
+ if self.moe_config.is_sequence_parallel:
+ raise RuntimeError(
+ "M3 deferred MoE allreduce does not support sequence parallelism"
+ )
+ return states[..., :trunc_size] if trunc_size is not None else states
+
+
class MiniMaxM3MoE(nn.Module):
"""Sigmoid-routed MoE block with a routing-bias correction and a shared
expert."""
@@ -316,6 +348,9 @@ class MiniMaxM3MoE(nn.Module):
shared_experts=self.shared_experts,
quant_config=quant_config,
prefix=f"{prefix}.experts",
+ runner_cls=MiniMaxM3DeferredMoERunner
+ if _defer_ffn_allreduce()
+ else None,
)

@staticmethod
@@ -732,6 +767,8 @@ class MiniMaxM3DecoderLayer(nn.Module):
# with the layer's index.
layer_id = int(prefix.split(sep=".")[-1])
self.layer_id = layer_id
+ self.defer_ffn_allreduce = _defer_ffn_allreduce()
+ self.fuse_input_allreduce = self.defer_ffn_allreduce and layer_id > 0

is_sparse_attention_layer = (
force_sparse_attn or layer_id in _sparse_attention_layer_ids(config)
@@ -769,6 +806,7 @@ class MiniMaxM3DecoderLayer(nn.Module):
config=config,
intermediate_size=config.dense_intermediate_size,
quant_config=quant_config,
+ reduce_results=not self.defer_ffn_allreduce,
prefix=f"{prefix}.mlp",
)

@@ -787,11 +825,16 @@ class MiniMaxM3DecoderLayer(nn.Module):
residual: torch.Tensor | None,
) -> tuple[torch.Tensor, torch.Tensor]:
# Self Attention
- if residual is None:
- residual = hidden_states
- hidden_states = self.input_layernorm(hidden_states)
+ if self.fuse_input_allreduce and residual is not None:
+ hidden_states, residual = fused_allreduce_gemma_rms_norm(
+ hidden_states, residual, self.input_layernorm
+ )
else:
- hidden_states, residual = self.input_layernorm(hidden_states, residual)
+ if residual is None:
+ residual = hidden_states
+ hidden_states = self.input_layernorm(hidden_states)
+ else:
+ hidden_states, residual = self.input_layernorm(hidden_states, residual)
hidden_states = self.self_attn(
positions=positions,
hidden_states=hidden_states,
@@ -815,6 +858,9 @@ class MiniMaxM3Model(nn.Module):
cache_config = vllm_config.cache_config
quant_config = vllm_config.quant_config
self.config = config
+ self.defer_ffn_allreduce = _defer_ffn_allreduce()
+ if self.defer_ffn_allreduce:
+ initialize_m3_aiter_allreduce()

self.vocab_size = config.vocab_size

@@ -856,7 +902,12 @@ class MiniMaxM3Model(nn.Module):
for layer in self.layers[self.start_layer : self.end_layer]:
hidden_states, residual = layer(positions, hidden_states, residual)

- hidden_states, _ = self.norm(hidden_states, residual)
+ if self.defer_ffn_allreduce:
+ hidden_states, _ = fused_allreduce_gemma_rms_norm(
+ hidden_states, residual, self.norm
+ )
+ else:
+ hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states

def get_expert_mapping(self) -> list[tuple[str, str, int, str]]:
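The patch above leaves each FFN/MoE output rank-local and lets the following Gemma RMSNorm step perform the tensor-parallel reduction. The toy model below is a sketch of why the ordering change should not alter the result. It uses plain Python lists in place of tensors, and `all_reduce_sum`, `eager_path`, and `deferred_path` are hypothetical names. It is not vLLM or AITER code and says nothing about the fused kernel's numerics or performance.

```python
import math


def gemma_rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Gemma-style RMSNorm: x / rms(x) * (1 + weight)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


def all_reduce_sum(partials: list[list[float]]) -> list[float]:
    """Stand-in for a tensor-parallel all-reduce: elementwise sum over ranks."""
    return [sum(column) for column in zip(*partials)]


def eager_path(partials, residual, weight):
    """Baseline: the FFN reduces its own output, then the next norm runs."""
    hidden = all_reduce_sum(partials)  # reduction happens inside the FFN
    new_residual = [h + r for h, r in zip(hidden, residual)]
    return gemma_rms_norm(new_residual, weight), new_residual


def deferred_path(partials, residual, weight):
    """Deferred: the FFN output stays rank-local; the norm step reduces it."""
    new_residual = [sum(column) + r for column, r in zip(zip(*partials), residual)]
    return gemma_rms_norm(new_residual, weight), new_residual


if __name__ == "__main__":
    partials = [[1.0, 2.0], [3.0, 4.0]]  # one row per simulated rank
    residual = [0.5, -0.5]
    weight = [0.0, 0.1]
    print(eager_path(partials, residual, weight))
    print(deferred_path(partials, residual, weight))
```

Both paths compute the same sum, residual add, and normalization, only in a different place. The real benefit claimed by the PR comes from performing the reduction and the norm in one primitive, which this toy does not model.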