Add Gemma 4 Megatron model support by FurtherAI · Pull Request #736 · OpenPipe/ART

FurtherAI · 2026-06-23T03:15:06Z

Summary

Adds Gemma 4 MoE support to ART's Megatron backend.

The branch also brings in the supporting Megatron/vLLM runtime infrastructure this model family needs: sliding-window attention (SWA) masks, including under context parallelism (CP); RL tokenization based on vLLM token ids; native vLLM LoRA support for Gemma 4 MoE; train/inference mismatch validation; and the length trainability workflow, which replaces the yes/no trainability test.

Semantic Change Groups

Gemma 4 model handler and Megatron bridge support

Adds the Gemma 4 model-support handler and registry wiring. The handler covers Gemma4-specific provider setup, layer-family discovery, proportional RoPE handling, router replay, fused expert loading, shared expert overlap, full activation recompute, and LoRA export.

This was needed because Gemma 4 differs from the existing Qwen handlers in several important ways: K-equals-V behavior, a fused expert layout, tuple rotary outputs, a mix of SWA and global attention layers, and different bridge/runtime config expectations.

Sliding-window attention and context parallelism

Adds ART flex-attention SWA mask support and wires it through shared-prefix state and CP mask preparation. The CP path now prepares masks up front and the forward path requires the prepared mask, avoiding host-side work and accidental runtime mask construction.

This was needed so Gemma 4's SWA layers match HF/vLLM behavior while preserving the GPU-only forward path and keeping CP planning outside the hot model forward.
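As a rough illustration of what an SWA mask encodes, here is a minimal sketch in plain Python. It is not the ART implementation: the helper names are hypothetical, and the window convention (a query sees itself plus the `window - 1` positions before it) is an assumption that should be checked against the reference model. The predicate takes the same `(b, h, q_idx, kv_idx)` arguments as a flex-attention `mask_mod`, but works on plain ints; a tensor version would combine the two conditions with `&` rather than `and`.

```python
def causal_mask(b, h, q_idx, kv_idx):
    """Global-attention layers: a query sees itself and every earlier position."""
    return kv_idx <= q_idx


def make_sliding_window_mask(window):
    """Build a predicate for SWA layers.

    Assumption: a query attends to itself and the `window - 1` positions
    before it. Implementations can differ by one on this boundary.
    """
    if window < 1:
        raise ValueError("window must be >= 1")

    def mask_mod(b, h, q_idx, kv_idx):
        return kv_idx <= q_idx and q_idx - kv_idx < window

    return mask_mod


def dense_mask(mask_mod, seq_len):
    """Materialize the predicate as a seq_len x seq_len 0/1 grid (inspection only)."""
    return [
        [int(mask_mod(0, 0, q, k)) for k in range(seq_len)]
        for q in range(seq_len)
    ]


swa = make_sliding_window_mask(window=3)
for row in dense_mask(swa, 5):
    print(row)
```

Building the mask once from a predicate like this, outside the forward pass, is the general shape of the "prepare up front, require in forward" pattern described above.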

RL tokenization via vLLM token ids

Cuts the RL tokenization path over to vLLM-returned token ids. ART now requests vLLM token ids and stores the native vLLM fields on Choice.model_extra:

prompt_token_ids
token_ids

Tokenization then uses those ids directly, including append-only multi-turn collapse when prompt ids prove equivalence.

This removes fragile chat-template re-rendering for RL trajectories and makes multi-turn/tool/thinking behavior follow the actual serving prompt seen by vLLM.
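The append-only collapse can be sketched as a prefix check on token ids. This is only an illustration under assumptions: the `Turn` class and `collapse_append_only` function are hypothetical names, not ART's API, and the only fields taken from the description are `prompt_token_ids` and `token_ids`. A later turn is merged into one sequence only when its prompt ids begin with everything served and generated so far.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt_token_ids: list  # ids vLLM reported for the prompt it actually served
    token_ids: list         # ids vLLM reported for the generated completion


def collapse_append_only(turns):
    """Return (ids, trainable_mask) if every turn extends the previous one, else None.

    A turn extends the previous one when its prompt ids start with the previous
    prompt ids followed by the previous completion ids.
    """
    if not turns:
        return None
    ids = list(turns[0].prompt_token_ids)
    mask = [0] * len(ids)  # prompt tokens are not trained on
    for i, turn in enumerate(turns):
        if i > 0:
            prompt = turn.prompt_token_ids
            if prompt[: len(ids)] != ids:
                return None  # history was re-rendered; keep the turns separate
            extra = prompt[len(ids):]  # e.g. tool output or the next user message
            ids.extend(extra)
            mask.extend([0] * len(extra))
        ids.extend(turn.token_ids)
        mask.extend([1] * len(turn.token_ids))  # generated tokens are trainable
    return ids, mask


turns = [
    Turn(prompt_token_ids=[1, 2, 3], token_ids=[4, 5]),
    Turn(prompt_token_ids=[1, 2, 3, 4, 5, 6, 7], token_ids=[8]),
]
print(collapse_append_only(turns))
```

When the prefix check fails, the sketch returns `None` so the caller can fall back to treating each turn as its own sequence instead of guessing at equivalence.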

@Kovbo This should mostly have changed only the RL path, but SFT now has its own simple tokenize path. Can you check this out?

vLLM runtime and LoRA serving

Upgrades the ART vLLM runtime to vLLM 0.23.0 and updates runtime patches for the new API surface. Adds an isolated Gemma4 MoE LoRA patch in vllm_runtime so native LoRA serving works for Gemma4 MoE until upstream support exists.

The branch also adds compact LoRA delta publishing and merged-weight transfer improvements (send the LoRA to vLLM, then merge and apply it there), but after validation Gemma 4 is now configured to use native vLLM LoRA by default.
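For readers unfamiliar with what "merge and apply" means for a LoRA adapter, the standard merge is W' = W + (alpha / r) * B A, where r is the adapter rank. The sketch below shows that arithmetic on tiny nested lists. It says nothing about ART's actual transfer format or vLLM's internals; the function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is out x in, lora_a is rank x in, lora_b is out x rank.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank 1, input dim 2
lora_b = [[0.5], [0.0]]  # output dim 2, rank 1
print(merge_lora(weight, lora_a, lora_b, alpha=2.0))
```

Sending only `lora_a` and `lora_b` is far smaller than sending a full merged weight, which is the usual motivation for publishing the adapter and merging on the serving side.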

Train/inference mismatch validation

Extends the real-path train/inference mismatch stage for Gemma 4, including long prompts so that SWA is exercised, routed-expert replay, CP scoring, and native-LoRA rollout settings.
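One common way to quantify train/inference mismatch is to compare the per-token log-probabilities the trainer assigns to the sampled tokens with the ones the inference engine reported. The sketch below assumes that approach; the metric names and the threshold value are hypothetical and are not the ones used by the stage described here.

```python
def mismatch_stats(train_logprobs, inference_logprobs):
    """Summarize per-token log-probability disagreement between two backends."""
    if len(train_logprobs) != len(inference_logprobs):
        raise ValueError("both backends must score the same tokens")
    if not train_logprobs:
        raise ValueError("need at least one token")
    diffs = [abs(t - i) for t, i in zip(train_logprobs, inference_logprobs)]
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }


def within_threshold(stats, max_mean_abs_diff):
    """Hypothetical pass/fail rule: only the mean difference is gated."""
    return stats["mean_abs_diff"] <= max_mean_abs_diff


stats = mismatch_stats([-1.0, -2.0], [-1.5, -2.0])
print(stats, within_threshold(stats, max_mean_abs_diff=0.3))
```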

Trainability workflow

Replaces the awkward yes/no trainability default with the length trainability workflow. The new test trains only on generated-length error, uses a dedicated Megatron/PipelineTrainer mode, stops early once the target error is reached, and has explicit initial/final error assertions.

This gives a cleaner trainability signal for Gemma 4 and avoids relying on a prompt/task shape on which Gemma 4 starts unusually high (it was getting a step-0 reward of 0.9375, which left no training signal).
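A length-based trainability signal could look roughly like the sketch below. The error definition, the reward shape, and every name here are assumptions made for illustration; the description does not specify how the actual test computes them. The loop shows the early-stop behaviour and the kind of initial/final assertions mentioned above.

```python
def length_error(generated_len, target_len):
    """Relative length error; 0.0 means the completion hit the target exactly."""
    return abs(generated_len - target_len) / target_len


def length_reward(generated_len, target_len):
    """Reward falls as the error grows; clipped so it stays in [0, 1]."""
    return max(0.0, 1.0 - length_error(generated_len, target_len))


def run_until_target(step_errors, target_error, max_steps):
    """Record per-step mean errors, stopping once the target error is reached."""
    history = []
    for error in step_errors:
        if len(history) >= max_steps:
            break
        history.append(error)
        if error <= target_error:
            break  # early stop: no need to keep training
    return history


# Made-up per-step mean errors standing in for a real training run.
history = run_until_target([0.8, 0.5, 0.2, 0.1], target_error=0.25, max_steps=10)
assert history[0] > 0.25, "initial error should leave room to improve"
assert history[-1] <= 0.25, "final error should reach the target"
print(history)
```

Unlike a task where the model already scores near the maximum at step 0, a length target can be chosen so the initial error is large, which leaves visible room for the reward to improve.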

Runtime config and sequence-length handling

Adds an explicit Megatron runtime config singleton and removes scattered topology/packed-length mutation paths. Model max sequence length is derived from model config, while packed sequence length is treated as a runtime packing capacity rather than a model-context constraint.

This removes the annoying and misleading need to set max sequence length to the packed sequence length, and the singleton design prevents subtle recompilation and throughput regressions caused by runtime topology or packed-length changes.

All model-support workflow stages passed. Full-model throughput was also measured: 22k+ tok/s, compared to 27k tok/s for Qwen 3.6 (CP4 EP4, 5k + 16x100, repeated to ~196k).
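The set-once idea behind a runtime config singleton can be sketched as follows. The class, field, and function names are hypothetical and the packed length is an illustrative value, not ART's actual config. The point is only that topology and packing capacity are fixed once per process, so nothing can mutate them later and trigger a recompile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeConfig:
    tensor_parallel: int
    expert_parallel: int
    context_parallel: int
    packed_sequence_length: int  # packing capacity, not a model-context limit


_CONFIG: Optional[RuntimeConfig] = None


def set_runtime_config(config):
    """Install the process-wide config; re-setting a different value is an error."""
    global _CONFIG
    if _CONFIG is not None and _CONFIG != config:
        raise RuntimeError("runtime config is already set and cannot be changed")
    _CONFIG = config
    return _CONFIG


def get_runtime_config():
    """Return the installed config, failing loudly if none was set."""
    if _CONFIG is None:
        raise RuntimeError("runtime config has not been set")
    return _CONFIG


config = set_runtime_config(
    RuntimeConfig(
        tensor_parallel=1,
        expert_parallel=4,
        context_parallel=4,
        packed_sequence_length=8192,  # illustrative value
    )
)
assert get_runtime_config() is config
```

Freezing the dataclass and rejecting a second, different config turns an accidental mid-run topology change into an immediate error rather than a silent throughput regression.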


FurtherAI added 30 commits May 15, 2026 06:54

Fail fast on Megatron job failures

a889a30

Fix CP exchange collective participation

a35be61

Fix CP empty-rank collective participation

6cc1e49

Add streaming weight offload validation hooks

b03853b

Fix qwen35 gdn compile boundaries

83da657

Narrow expert lora compile boundary

0c527d8

Fix oracle routing trace for variable micros

c325859

Refine oracle LoRA reference controls

646b748

Use native Megatron MoE routing replay

c872195

Add production MoE routing replay plumbing

f6a369f

Expose trajectory routing replay train flag

a5d6a26

Make expert replay a backend setting

211b7e2

Add real-path train inf mismatch test

f3f619c

Disable async scheduling for expert replay

9ab0308

Forward false vLLM runtime flags

f5f1714

Use nonzero advantages in real mismatch test

3b84202

Align real mismatch rollout chat template

45627c8

Allow replay to omit terminal generated route

200494c

Replay known routes and live-route terminal gaps

cde0316

Gather TP logits in mismatch extractor

2d043df

Run real mismatch test without opt-in env

cb815e4

Make routing replay native and cp2 by default

3f3cc5f

Fix mismatch test topology world size

3470a2b

Restore tp2 ep2 mismatch defaults

b72a01a

Fix CP attention backward grad layout

8125f8a

Wire weight offload config into attention oracle

d9dbdb6

Document mismatch threshold diagnostics

f61d43c

Fix CP flash grad handoff

bec322b

Default oracle validation to Qwen3.5

85583fb

Allow streaming offload with compiled layers

75a4abb

FurtherAI added 30 commits June 18, 2026 20:44

Decode vLLM routed expert responses

17b0915

Derive max sequence length from model config

2db6f19

Fix dense shared topology expectation

d479300

Add explicit Megatron runtime config

b01b641

Lazy import optional Unsloth service in runtime test

ef6d9f3

Keep live length trainability varied

d1ccaea

Stop length trainability after target error

081e911

Mark Gemma4 native vLLM LoRA wip

f09bf76

Use native LoRA default for Gemma4 trainability

f6f0407

Patch Gemma4 MoE LoRA metadata in vLLM runtime

fa6e65a

Tie Gemma4 k-eq-v LoRA export for vLLM

749ee9b

Tighten Gemma4 train-inf mismatch thresholds

e5454d5

Enable Gemma4 LoRA grads through preprocess

e955012

Support Gemma4 full activation recompute

f260f6b

Use length trainability in model support workflow

799bad0

Set Gemma 4 mismatch thresholds

08f30ef

Clean up Gemma 4 model support branch

e4c8eaa

Merge remote-tracking branch 'origin/main' into austin/gemma_4_model_…

f25ca8e

…support # Conflicts: # src/art/megatron/service.py

Restore strict CP block mask preparation

9d98029

Use backend-only Triton flex options

5faa770

Use direct vLLM token metadata fields

41742d6

Clean up Gemma4 branch typing diagnostics

ab605bb

Preserve chat template kwargs coverage

086eb15

Fix tinker token id return type

edee967

Probe Gemma4 HF text token types

3654bbd

Use Gemma4-compatible backend deps

b9f279e

Revert "Use Gemma4-compatible backend deps"

239d612

This reverts commit b9f279e.

Revert "Probe Gemma4 HF text token types"

f2947f9

This reverts commit 3654bbd.

Split Unsloth and Megatron dependency extras

9f818a2

Remove duplicate Unsloth extra

ef085ef

FurtherAI wants to merge 498 commits into main from austin/gemma_4_model_support
