feat: add TrainerRank #731

Draft
bradhilton wants to merge 11 commits into main from feat/trainer-rank-gdn-tree
@bradhilton

Collaborator

Summary

  • add art.megatron.trainer_rank plus a minimal dev/trainer_rank.py torchrun demo
  • add shared-prefix packing/tree helpers and unify GDN execution around the generic tree path
  • add TrainerRank request-head support for target logprobs, multi-target labels, top-k, logits, and hidden states
  • add topology/perf/parity dev harnesses and unit/integration coverage
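
For readers unfamiliar with the term, here is a minimal sketch of what shared-prefix packing into a tree can look like. It assumes the goal is to store tokens shared by several sequences (for example, one prompt with many completions) only once. The names `PrefixNode`, `build_prefix_tree`, and `pack_tree` are hypothetical and are not taken from this PR; the actual helpers may differ in layout and in how they handle duplicate sequences or sequence ends.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PrefixNode:
    """Hypothetical trie node: one node per distinct token position."""
    token: Optional[int] = None  # None marks the root
    children: Dict[int, "PrefixNode"] = field(default_factory=dict)


def build_prefix_tree(sequences: List[List[int]]) -> PrefixNode:
    """Insert every sequence into a trie so common prefixes share nodes."""
    root = PrefixNode()
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node.children:
                node.children[tok] = PrefixNode(tok)
            node = node.children[tok]
    return root


def pack_tree(root: PrefixNode) -> Tuple[List[int], List[int]]:
    """Flatten the trie depth-first.

    Returns (tokens, parents), where parents[i] is the packed index of the
    token that precedes tokens[i] in its sequence, or -1 for a first token.
    """
    tokens: List[int] = []
    parents: List[int] = []
    # Push children in reverse so they are popped in insertion order.
    stack = [(child, -1) for child in reversed(list(root.children.values()))]
    while stack:
        node, parent = stack.pop()
        idx = len(tokens)
        tokens.append(node.token)
        parents.append(parent)
        stack.extend(
            (child, idx) for child in reversed(list(node.children.values()))
        )
    return tokens, parents


# Three sequences sharing the prefix [1, 2]; two of them also share [1, 2, 3].
sequences = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6]]
tokens, parents = pack_tree(build_prefix_tree(sequences))
print(tokens)   # [1, 2, 3, 4, 5, 6]
print(parents)  # [-1, 0, 1, 2, 2, 1]
```

In this toy case the packed form holds 6 tokens instead of the 11 in the flattened batch, and the `parents` array records the tree structure that an attention mask or recurrent-state lookup would need. Depth here corresponds to how many times the sequences branch.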

Validation

  • uv run ruff check src/art/megatron/context_parallel/builder.py src/art/megatron/shared_prefix_state.py tests/unit/test_shared_prefix_attention_builder.py dev/trainer_rank_perf.py
  • uv run pytest tests/unit/test_shared_prefix_packing.py tests/unit/test_shared_prefix_tree.py tests/unit/test_shared_prefix_grad_parity.py tests/unit/test_trainer_rank_validation.py tests/unit/test_shared_prefix_attention_builder.py (34 passed, 8 skipped locally; Megatron-only attention builder passes on H200)
  • H200: shared-prefix attention builder 7 passed
  • H200: GDN CP packed correctness 7 passed, 2 skipped
  • H200: real GDN/native FLA CP 4 passed, 2 skipped
  • H200: Qwen35 full-model CP1 packed vs flattened 1 passed
  • H200: TrainerRank topology matrix 120/120 passed across DP/TP/CP <= 4 and depths 0..3
  • H200 35B/A3B CP=4 EP=4 perf guards: Austin 198k, depth-3 random, no-sharing 90k, mixed hidden/logits/top-k outputs

@bradhilton changed the title from "feat: add TrainerRank and generic tree GDN" to "feat: add TrainerRank" on Jun 22, 2026
@bradhilton force-pushed the feat/trainer-rank-gdn-tree branch from 2cecc5a to 94afb0f on June 22, 2026 19:01