feat/step3p5#1870
Open
HAOCHENYE wants to merge 5 commits into
HAOCHENYE
commented
Jun 3, 2026
Collaborator
- [Feature] Add MHA head-wise output gate and Step3.5 clipped SwiGLU
- [Feature] Add Step-3.5-Flash (step3p5) MoE model
- [Test] Add Step-3.5-Flash parity tests
- [CI] Add Step-3.5-Flash training config
- [Docs] Note in add_hf_model skill: avoid the deprecated RopeScalingConfig
Add a per-head attention output gate (head_gate) to MHAConfig/MultiHeadAttention: a dedicated g_proj: Linear(hidden, num_heads) whose per-head sigmoid scales the attention output before o_proj. Distinct from the existing per-element with_gate (fused into a doubled q_proj); the two are mutually exclusive.

Add a swiglu_clip activation (silu(gate).clamp(max=limit) * up.clamp(+/-limit)) and an optional swiglu_limit on MoEMLP, for models that clip the SwiGLU on a subset of layers. Unlike clipped_swiglu (gpt-oss, sigmoid-GLU + (up+1)), the clamp is applied after the silu activation.
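To make the two formulas above concrete, here is a minimal pure-Python sketch on scalars and nested lists. It is not the XTuner implementation: the function names `swiglu_clip` and `apply_head_gate` are illustrative, and `gate_logits` stands in for the output of `g_proj`, which is assumed to yield one logit per head.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def swiglu_clip(gate: float, up: float, limit: float) -> float:
    """Scalar form of silu(gate).clamp(max=limit) * up.clamp(-limit, limit)."""
    activated = min(silu(gate), limit)        # upper clamp only, applied after silu
    clipped_up = max(-limit, min(up, limit))  # symmetric clamp on the up branch
    return activated * clipped_up


def apply_head_gate(attn_out, gate_logits):
    """Scale each head's output by the sigmoid of that head's gate logit.

    attn_out:    num_heads lists of head_dim floats (one token)
    gate_logits: num_heads floats (hypothetical stand-in for g_proj(hidden))
    """
    if len(attn_out) != len(gate_logits):
        raise ValueError("expected one gate logit per attention head")
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(attn_out, gate_logits)
    ]
```

The head gate uses one scalar per head, which is what separates it from a per-element gate: every channel inside a head is scaled by the same factor.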
Port stepfun-ai/Step-3.5-Flash, a trust_remote_code MoE: 45 layers (first 3 dense, rest MoE), 288 routed experts top-8 + 1 shared expert, hybrid attention (full vs sliding with different head counts), head-wise attention gate, qk_norm, zero-centered RMSNorm, sigmoid NoAuxRouter with per-expert bias and scaling, and per-layer SwiGLU clipping on the last layers.

Two architecture aspects are contained in the model rather than generalized into the base for now (to be generalized after precision alignment): the two attention profiles are selected per layer via layers_type in an overridden build_layers, and the per-profile RoPE (full: theta 5e6 / partial 0.5 / llama3; sliding: theta 1e4 / partial 1.0 / default) lives inside Step3.5-specific decoder layers that recompute position embeddings from seq_ctx.position_ids.

HF stores experts as separate gate_proj/up_proj 3-D tensors; safetensors_to_params de-interleaves them into XTuner's fused expert-major grouped-linear weight on load. hf_config returns None (no built-in transformers config class), so save_hf copies the source config/tokenizer/modeling files. Design doc: docs/design/model/step3p5.md.
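The layer layout and the two RoPE profiles can be sketched as plain data, assuming nothing beyond the numbers quoted above. `RopeProfile`, `ROPE_PROFILES`, `mlp_kind`, and the `head_dim` argument are hypothetical names for illustration. The inverse-frequency helper implements only the canonical default formula, so for the full-attention profile it yields the base frequencies before any llama3 rescaling, whose parameters are not given here.

```python
from dataclasses import dataclass

NUM_LAYERS = 45       # from the description
NUM_DENSE_LAYERS = 3  # first 3 layers are dense, the rest are MoE


@dataclass(frozen=True)
class RopeProfile:
    theta: float
    partial_rotary_factor: float
    rope_type: str  # "default" or "llama3"


# Values quoted in the description; the dict keys are illustrative.
ROPE_PROFILES = {
    "full": RopeProfile(theta=5e6, partial_rotary_factor=0.5, rope_type="llama3"),
    "sliding": RopeProfile(theta=1e4, partial_rotary_factor=1.0, rope_type="default"),
}


def mlp_kind(layer_idx: int) -> str:
    """Dense MLP for the first NUM_DENSE_LAYERS layers, MoE afterwards."""
    return "dense" if layer_idx < NUM_DENSE_LAYERS else "moe"


def rotary_dim(head_dim: int, profile: RopeProfile) -> int:
    """Number of channels per head that receive rotary embedding."""
    return int(head_dim * profile.partial_rotary_factor)


def default_inv_freq(head_dim: int, profile: RopeProfile) -> list:
    """Canonical default RoPE: inv_freq[i] = 1 / theta ** (2 * i / rotary_dim)."""
    dim = rotary_dim(head_dim, profile)
    return [1.0 / (profile.theta ** (2 * i / dim)) for i in range(dim // 2)]
```

With a hypothetical `head_dim` of 128, the full profile rotates 64 channels per head and the sliding profile rotates all 128.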
Single-rank baseline parity vs HuggingFace (STEP3P5_PATH env var):
- rotary inv_freq bitwise vs the canonical default / llama3 formulas;
- attention sub-block bitwise (head gate + qk_norm + per-layer RoPE + partial rotary + sliding window) for full / sliding / clamp layers under XTUNER_HF_IMPL;
- full MoE decoder layer within tolerance (bf16 grouped GEMM vs HF fp32 expert loop) with exact router top-k indices.

The shipped modeling_step3p5.py is incompatible with the installed transformers (its rotary init crashes), so the test replaces only the HF rotary with a version-independent implementation using the canonical formulas.
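The three kinds of check used above (bitwise equality, equality within tolerance, exact top-k indices) can be illustrated with a small stdlib-only sketch. These helpers are hypothetical and are not the test code in this PR, which operates on tensors; the tie-breaking rule in `topk_indices` is an assumption made so the result is deterministic.

```python
import struct


def bitwise_equal(a, b) -> bool:
    """True when both float sequences have identical IEEE-754 bit patterns."""
    if len(a) != len(b):
        return False
    return all(struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b))


def within_tolerance(actual, expected, rtol: float, atol: float) -> bool:
    """Elementwise |actual - expected| <= atol + rtol * |expected|."""
    if len(actual) != len(expected):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(actual, expected))


def topk_indices(scores, k: int) -> list:
    """Indices of the k largest scores; ties go to the lower index (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]
```

Bitwise comparison is stricter than `==` on floats: it treats `0.0` and `-0.0` as different, which is why it suits the rotary and attention checks, while the MoE layer, computed in a different precision, is compared within tolerance.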
Drop-in TrainerConfig for Step-3.5-Flash. `load_from`/`tokenizer_path` read STEP3P5_PATH, which must point at the split / per-expert checkpoint produced by `.dev_scripts/convert_step3p5_to_split.py` (the released fused-expert layout cannot be sharded across ranks). Uses expert parallelism (ep_size=8, all2all); torch.compile is left off (a §8 optimization for the hybrid per-layer-RoPE decoder layers). A reduced 8-GPU overfit smoke (5 layers, full 288 experts) confirmed the forward -> loss -> backward -> FSDP-reduce -> optimizer-step loop runs and the loss descends; a full convergence run on the ~200B model needs a multi-node cluster.
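As a rough illustration of the two constraints above (the checkpoint path comes from STEP3P5_PATH, and 288 experts are spread over ep_size=8 ranks), here is a stdlib-only sketch. The helper names are hypothetical and do not come from the TrainerConfig in this PR, and the contiguous expert-to-rank assignment is an assumption; the actual sharding layout may differ.

```python
import os


def resolve_checkpoint_path(env=None) -> str:
    """Read the checkpoint directory from STEP3P5_PATH, failing loudly if unset."""
    env = os.environ if env is None else env
    path = env.get("STEP3P5_PATH")
    if not path:
        raise RuntimeError(
            "STEP3P5_PATH must point at the split / per-expert checkpoint"
        )
    return path


def experts_on_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Contiguous block of expert ids owned by one expert-parallel rank.

    Assumes an even, contiguous split; this is an illustration only.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank out of range")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)
```

With 288 routed experts and ep_size=8, each rank holds 36 experts under this split.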
…nfig Record the caveat surfaced by the Step-3.5 port: RopeScalingConfig is deprecated (use RopeParametersConfig), and when a per-layer value only selects one module behavior (e.g. partial_rotary_factor -> apply_rotary_emb), set that behavior directly on the module instead of threading a (deprecated) rope config through build.