feat/step3p5 by HAOCHENYE · Pull Request #1870 · InternLM/xtuner

HAOCHENYE · 2026-06-03T08:05:07Z

[Feature] Add MHA head-wise output gate and Step3.5 clipped SwiGLU
[Feature] Add Step-3.5-Flash (step3p5) MoE model
[Test] Add Step-3.5-Flash parity tests
[CI] Add Step-3.5-Flash training config
[Docs] Note in add_hf_model skill: avoid the deprecated RopeScalingConfig

Add a per-head attention output gate (head_gate) to MHAConfig/MultiHeadAttention: a dedicated g_proj: Linear(hidden, num_heads) whose per-head sigmoid scales the attention output before o_proj. This is distinct from the existing per-element with_gate, which is fused into a doubled q_proj; the two options are mutually exclusive.

Add a swiglu_clip activation, silu(gate).clamp(max=limit) * up.clamp(-limit, +limit), and an optional swiglu_limit on MoEMLP for models that clip the SwiGLU on a subset of layers. Unlike clipped_swiglu (gpt-oss: sigmoid-GLU with (up + 1)), swiglu_clip applies the clamp after the silu activation.
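The two operations described above can be illustrated with a minimal, framework-free sketch. It is not the XTuner implementation: the function names, the list-based tensor layout, and the absence of a bias term in g_proj are assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def swiglu_clip(gate, up, limit):
    """Element-wise silu(gate).clamp(max=limit) * up.clamp(-limit, +limit)."""
    out = []
    for g, u in zip(gate, up):
        act = min(silu(g), limit)          # upper clamp only, applied after silu
        lin = max(-limit, min(u, limit))   # symmetric clamp on the up branch
        out.append(act * lin)
    return out


def head_gate(attn_out, hidden, g_proj_weight):
    """Scale each head's output by sigmoid(g_proj(hidden)[head]).

    attn_out:      [num_heads][head_dim], attention output before o_proj
    hidden:        [hidden_size], the token's hidden state
    g_proj_weight: [num_heads][hidden_size] (assumed bias-free here)
    """
    gated = []
    for head_out, w_row in zip(attn_out, g_proj_weight):
        logit = sum(w * h for w, h in zip(w_row, hidden))
        scale = sigmoid(logit)             # one scalar per head
        gated.append([scale * v for v in head_out])
    return gated
```

Because the gate produces one scalar per head, g_proj only needs num_heads output features, whereas a per-element gate needs as many gate values as attention output elements.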

Port stepfun-ai/Step-3.5-Flash, a trust_remote_code MoE: 45 layers (first 3 dense, rest MoE), 288 routed experts top-8 + 1 shared expert, hybrid attention (full vs sliding with different head counts), head-wise attention gate, qk_norm, zero-centered RMSNorm, sigmoid NoAuxRouter with per-expert bias and scaling, and per-layer SwiGLU clipping on the last layers.

Two architecture aspects are contained in the model rather than generalized into the base for now (to be generalized after precision alignment): the two attention profiles are selected per layer via layers_type in an overridden build_layers, and the per-profile RoPE (full: theta 5e6 / partial 0.5 / llama3; sliding: theta 1e4 / partial 1.0 / default) lives inside Step3.5-specific decoder layers that recompute position embeddings from seq_ctx.position_ids.

HF stores experts as separate gate_proj/up_proj 3-D tensors; safetensors_to_params de-interleaves them into XTuner's fused expert-major grouped-linear weight on load. hf_config returns None (no built-in transformers config class), so save_hf copies the source config/tokenizer/modeling files. Design doc: docs/design/model/step3p5.md.
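The per-layer selection described above can be pictured as a small lookup, sketched below. The RoPE values and the dense/MoE split come from the description; the profile keys ("full_attention", "sliding_attention"), the function name, and the example layers_type pattern are hypothetical and do not reflect the real checkpoint's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RopeProfile:
    theta: float
    partial_rotary_factor: float
    rope_type: str


# Values taken from the PR description; the dictionary keys are assumed labels.
ROPE_PROFILES = {
    "full_attention": RopeProfile(5e6, 0.5, "llama3"),
    "sliding_attention": RopeProfile(1e4, 1.0, "default"),
}

NUM_LAYERS = 45
NUM_DENSE_LAYERS = 3  # the first 3 layers are dense, the rest are MoE


def plan_layers(layers_type):
    """Return one entry per layer with its attention profile, RoPE and MLP kind."""
    if len(layers_type) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} entries, got {len(layers_type)}")
    plan = []
    for idx, kind in enumerate(layers_type):
        plan.append(
            {
                "index": idx,
                "attention": kind,
                "rope": ROPE_PROFILES[kind],
                "mlp": "dense" if idx < NUM_DENSE_LAYERS else "moe",
            }
        )
    return plan


# Hypothetical pattern for illustration only: every 4th layer uses full attention.
example_layers_type = [
    "full_attention" if i % 4 == 0 else "sliding_attention"
    for i in range(NUM_LAYERS)
]
layer_plan = plan_layers(example_layers_type)
```

Keeping the RoPE profile next to the attention profile mirrors the choice in this PR, where each Step3.5 decoder layer recomputes its own position embeddings instead of sharing one model-level rotary module.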

Single-rank baseline parity vs HuggingFace (STEP3P5_PATH env var):
- rotary inv_freq bitwise vs the canonical default / llama3 formulas;
- attention sub-block bitwise (head gate + qk_norm + per-layer RoPE + partial rotary + sliding window) for full / sliding / clamp layers under XTUNER_HF_IMPL;
- full MoE decoder layer within tolerance (bf16 grouped GEMM vs HF fp32 expert loop) with exact router top-k indices.

The shipped modeling_step3p5.py is incompatible with the installed transformers (its rotary init crashes), so the test replaces only the HF rotary with a version-independent implementation using the canonical formulas.
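For reference, the canonical default and llama3 inverse-frequency formulas can be sketched in plain Python as follows. This is an illustration in float64, not the test's replacement rotary and not a bitwise reproduction of any tensor implementation; the head_dim and the llama3 scaling parameters used in the example are hypothetical, not values read from the Step-3.5-Flash config.

```python
import math


def default_inv_freq(head_dim, theta, partial_rotary_factor=1.0):
    """inv_freq[k] = 1 / theta^(2k / rotary_dim) over the rotated sub-dimension."""
    rotary_dim = int(head_dim * partial_rotary_factor)
    return [1.0 / (theta ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]


def llama3_inv_freq(inv_freq, factor, low_freq_factor, high_freq_factor,
                    original_max_position):
    """llama3-style scaling: keep high frequencies, divide low ones by `factor`,
    and interpolate linearly in between."""
    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            scaled.append(f)                      # short wavelength: unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(f / factor)             # long wavelength: fully scaled
        else:
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


# Hypothetical parameters for illustration only.
base = default_inv_freq(head_dim=128, theta=5e6, partial_rotary_factor=0.5)
scaled = llama3_inv_freq(base, factor=8.0, low_freq_factor=1.0,
                         high_freq_factor=4.0, original_max_position=8192)
```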

Drop-in TrainerConfig for Step-3.5-Flash. `load_from`/`tokenizer_path` read STEP3P5_PATH, which must point at the split / per-expert checkpoint produced by `.dev_scripts/convert_step3p5_to_split.py` (the released fused-expert layout cannot be sharded across ranks). Uses expert parallelism (ep_size=8, all2all); torch.compile is left off (a §8 optimization for the hybrid per-layer-RoPE decoder layers). A reduced 8-GPU overfit smoke (5 layers, full 288 experts) confirmed the forward -> loss -> backward -> FSDP-reduce -> optimizer-step loop runs and the loss descends; a full convergence run on the ~200B model needs a multi-node cluster.
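How the config picks up the checkpoint location can be sketched as below. This uses a plain dictionary rather than XTuner's TrainerConfig: the function name, the assumption that the split checkpoint is a directory, and the "dispatcher" and "compile" keys are hypothetical; only STEP3P5_PATH, load_from, tokenizer_path, ep_size=8, all2all, and compile being off come from the description.

```python
import os
from pathlib import Path


def resolve_step3p5_settings(env=None):
    """Read STEP3P5_PATH and return the settings the training config relies on."""
    env = os.environ if env is None else env
    raw = env.get("STEP3P5_PATH")
    if not raw:
        raise RuntimeError(
            "STEP3P5_PATH is not set; it must point at the split / per-expert "
            "checkpoint produced by .dev_scripts/convert_step3p5_to_split.py"
        )
    path = Path(raw)
    if not path.is_dir():  # assumption: the converted checkpoint is a directory
        raise FileNotFoundError(f"STEP3P5_PATH does not exist: {path}")
    return {
        "load_from": str(path),
        "tokenizer_path": str(path),
        "ep_size": 8,              # expert parallelism across 8 ranks
        "dispatcher": "all2all",   # hypothetical key name
        "compile": False,          # torch.compile is left off; hypothetical key
    }
```

Failing early when the variable is missing keeps a misconfigured job from reaching checkpoint loading before the problem surfaces.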

Record the caveat surfaced by the Step-3.5 port: RopeScalingConfig is deprecated (use RopeParametersConfig), and when a per-layer value only selects one module behavior (e.g. partial_rotary_factor -> apply_rotary_emb), set that behavior directly on the module instead of threading a (deprecated) rope config through build.
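The second point can be illustrated with a small sketch in which the partial rotary factor is an attribute of the module itself, so no rope config object has to be passed through the build path. The class name and list-based layout are hypothetical; the rotation uses the common rotate-half convention, which is an assumption about what apply_rotary_emb does.

```python
import math


class PartialRotary:
    """Rotates only the first `rotary_dim` channels of a head; the rest pass through."""

    def __init__(self, head_dim, theta, partial_rotary_factor):
        # The per-layer behavior is set directly on the module.
        self.rotary_dim = int(head_dim * partial_rotary_factor)
        self.inv_freq = [
            1.0 / (theta ** (i / self.rotary_dim))
            for i in range(0, self.rotary_dim, 2)
        ]

    def apply(self, x, position):
        rot, passthrough = x[: self.rotary_dim], x[self.rotary_dim:]
        half = self.rotary_dim // 2
        x1, x2 = rot[:half], rot[half:]
        out1, out2 = [], []
        for a, b, f in zip(x1, x2, self.inv_freq):
            angle = position * f
            c, s = math.cos(angle), math.sin(angle)
            out1.append(a * c - b * s)   # x * cos + rotate_half(x) * sin, first half
            out2.append(b * c + a * s)   # second half
        return out1 + out2 + passthrough
```

With partial_rotary_factor=0.5 on an 8-channel head, only the first four channels are rotated and the last four are returned unchanged, whatever the position.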

HAOCHENYE added 5 commits June 2, 2026 06:19

HAOCHENYE wants to merge 5 commits into InternLM:main from HAOCHENYE:feat/step3p5
