feat/step3p5#1870

Open
HAOCHENYE wants to merge 5 commits into InternLM:main from HAOCHENYE:feat/step3p5

@HAOCHENYE

Collaborator
  • [Feature] Add MHA head-wise output gate and Step3.5 clipped SwiGLU
  • [Feature] Add Step-3.5-Flash (step3p5) MoE model
  • [Test] Add Step-3.5-Flash parity tests
  • [CI] Add Step-3.5-Flash training config
  • [Docs] Note in add_hf_model skill: avoid the deprecated RopeScalingConfig

HAOCHENYE added 5 commits June 2, 2026 06:19
Add a per-head attention output gate (head_gate) to MHAConfig/MultiHeadAttention:
a dedicated g_proj: Linear(hidden, num_heads) whose per-head sigmoid scales the
attention output before o_proj. Distinct from the existing per-element with_gate
(fused into a doubled q_proj); the two are mutually exclusive.

Add a swiglu_clip activation (silu(gate).clamp(max=limit) * up.clamp(+/-limit))
and an optional swiglu_limit on MoEMLP, for models that clip the SwiGLU on a
subset of layers. Unlike clipped_swiglu (gpt-oss, sigmoid-GLU + (up+1)), the
clamp is applied after the silu activation.
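
The formula quoted above can be written out for a single scalar pair. This is a minimal sketch: the name `swiglu_clip` is taken from the commit message, but the scalar signature is an assumption for illustration and the PR presumably operates on tensors.

```python
import math


def silu(x: float) -> float:
    """SiLU activation x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_clip(gate: float, up: float, limit: float) -> float:
    """silu(gate).clamp(max=limit) * up.clamp(-limit, limit)."""
    # The gate branch is clamped from above only, after the activation.
    act = min(silu(gate), limit)
    # The up branch is clamped symmetrically to [-limit, limit].
    up_clamped = max(-limit, min(up, limit))
    return act * up_clamped
```

With `limit=7.0`, a large pair such as `(100.0, -50.0)` saturates both branches, while small inputs pass through as an ordinary SwiGLU product.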

Port stepfun-ai/Step-3.5-Flash, a trust_remote_code MoE: 45 layers (first 3
dense, rest MoE), 288 routed experts top-8 + 1 shared expert, hybrid attention
(full vs sliding with different head counts), head-wise attention gate, qk_norm,
zero-centered RMSNorm, sigmoid NoAuxRouter with per-expert bias and scaling, and
per-layer SwiGLU clipping on the last layers.
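
For readers unfamiliar with sigmoid routers that use a per-expert bias, here is a rough sketch of one common formulation. It is an assumption that Step-3.5-Flash follows this exact recipe: in this sketch the bias affects only which experts are selected, the selected sigmoid scores are renormalized to sum to 1, and the scaling factor is applied last. The function name `noaux_route` and its arguments are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def noaux_route(logits, bias, top_k, scaling):
    """Pick top_k experts for one token and return (indices, weights).

    logits: [num_experts] router logits for the token
    bias:   [num_experts] per-expert selection bias
    """
    scores = [sigmoid(x) for x in logits]
    # Assumption: the bias shifts the ranking but not the mixing weights.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Assumption: weights come from the unbiased scores, renormalized then scaled.
    picked = [scores[i] for i in chosen]
    total = sum(picked)
    weights = [scaling * p / total for p in picked]
    return chosen, weights
```

In the ported model the same idea would run with 288 routed experts and `top_k=8`, with the shared expert applied outside the router.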

Two architecture aspects are contained in the model rather than generalized into
the base for now (to be generalized after precision alignment): the two attention
profiles are selected per layer via layers_type in an overridden build_layers, and
the per-profile RoPE (full: theta 5e6 / partial 0.5 / llama3; sliding: theta 1e4 /
partial 1.0 / default) lives inside Step3.5-specific decoder layers that recompute
position embeddings from seq_ctx.position_ids.
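
To make the two RoPE profiles concrete, the sketch below computes the standard ("default") inverse frequencies for each one. The theta and partial-rotary values come from the description above; the `head_dim` of 128, the dictionary keys, and the helper name `default_inv_freq` are assumptions for illustration. The llama3 frequency rescaling used by the full-attention profile is deliberately left out.

```python
def default_inv_freq(head_dim, theta, partial_rotary_factor=1.0):
    """Standard RoPE inverse frequencies: theta ** (-2i / rotary_dim)."""
    # Only the first rotary_dim channels of each head are rotated.
    rotary_dim = int(head_dim * partial_rotary_factor)
    return [theta ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]


# Per-profile settings quoted in the PR description (llama3 scaling omitted).
ROPE_PROFILES = {
    "full": {"theta": 5e6, "partial_rotary_factor": 0.5},
    "sliding": {"theta": 1e4, "partial_rotary_factor": 1.0},
}

HEAD_DIM = 128  # hypothetical value, not taken from the PR

inv_freq_by_profile = {
    name: default_inv_freq(HEAD_DIM, **params)
    for name, params in ROPE_PROFILES.items()
}
```

Under these assumptions the full-attention profile rotates half as many channels as the sliding profile, which is why the two cannot share one set of position embeddings.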

HF stores experts as separate gate_proj/up_proj 3-D tensors; safetensors_to_params
de-interleaves them into XTuner's fused expert-major grouped-linear weight on load.
hf_config returns None (no built-in transformers config class), so save_hf copies
the source config/tokenizer/modeling files.
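
The weight rearrangement on load can be pictured with nested lists. This is a sketch under a stated assumption: that "fused expert-major" means each expert's gate rows followed by that expert's up rows, stacked expert by expert. The actual layout XTuner uses may differ, and `fuse_experts` is a hypothetical helper, not the PR's `safetensors_to_params`.

```python
def fuse_experts(gate_proj, up_proj):
    """Fuse separate per-expert gate/up weights into one 2-D weight.

    gate_proj, up_proj: [num_experts][intermediate][hidden]
    returns:            [num_experts * 2 * intermediate][hidden]
    """
    assert len(gate_proj) == len(up_proj), "expert counts must match"
    fused = []
    for gate_e, up_e in zip(gate_proj, up_proj):
        # Assumed layout: all gate rows of an expert, then all of its up rows.
        fused.extend(gate_e)
        fused.extend(up_e)
    return fused
```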

Design doc: docs/design/model/step3p5.md.

Single-rank baseline parity vs HuggingFace (STEP3P5_PATH env var):
- rotary inv_freq bitwise vs the canonical default / llama3 formulas;
- attention sub-block bitwise (head gate + qk_norm + per-layer RoPE + partial
  rotary + sliding window) for full / sliding / clamp layers under XTUNER_HF_IMPL;
- full MoE decoder layer within tolerance (bf16 grouped GEMM vs HF fp32 expert
  loop) with exact router top-k indices.

The shipped modeling_step3p5.py is incompatible with the installed transformers
(its rotary init crashes), so the test replaces only the HF rotary with a
version-independent implementation using the canonical formulas.

Drop-in TrainerConfig for Step-3.5-Flash. `load_from`/`tokenizer_path` read
STEP3P5_PATH, which must point at the split / per-expert checkpoint produced by
`.dev_scripts/convert_step3p5_to_split.py` (the released fused-expert layout
cannot be sharded across ranks). Uses expert parallelism (ep_size=8, all2all);
torch.compile is left off (a §8 optimization for the hybrid per-layer-RoPE
decoder layers). A reduced 8-GPU overfit smoke (5 layers, full 288 experts)
confirmed the forward -> loss -> backward -> FSDP-reduce -> optimizer-step loop
runs and the loss descends; a full convergence run on the ~200B model needs a
multi-node cluster.
…nfig

Record the caveat surfaced by the Step-3.5 port: RopeScalingConfig is deprecated
(use RopeParametersConfig), and when a per-layer value only selects one module
behavior (e.g. partial_rotary_factor -> apply_rotary_emb), set that behavior
directly on the module instead of threading a (deprecated) rope config through
build.
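
As an illustration of why a partial rotary factor is a property of the rotary application itself, the sketch below rotates only the leading channels covered by `inv_freq` and passes the rest through. The function name `apply_partial_rotary` and the half-split ("rotate_half") pairing convention are assumptions, not XTuner's `apply_rotary_emb`.

```python
import math


def apply_partial_rotary(x, position, inv_freq):
    """Rotate the first 2 * len(inv_freq) channels of one head vector.

    x:        [head_dim] query or key vector for a single head
    position: token position
    inv_freq: [rotary_dim // 2] inverse frequencies
    """
    rotary_dim = 2 * len(inv_freq)
    half = rotary_dim // 2
    rot, passthrough = x[:rotary_dim], x[rotary_dim:]
    out = [0.0] * rotary_dim
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        # Half-split pairing: channel i is rotated together with channel i + half.
        a, b = rot[i], rot[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    # Channels beyond rotary_dim carry no positional rotation.
    return out + list(passthrough)
```

With a factor of 1.0 the pass-through slice is empty and the whole head is rotated, so both attention profiles could use one function with different `inv_freq` inputs.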