Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment #1376
Codecov Report ❌

| | main | #1376 | +/- |
|---|---|---|---|
| Coverage | 76.93% | 76.88% | -0.05% |
| Files | 474 | 474 | |
| Lines | 51506 | 51538 | +32 |
| Hits | 39625 | 39624 | -1 |
| Misses | 11881 | 11914 | +33 |
### What does this PR do?
Type of change: New feature, new tests, documentation.
OMNIML-4108: Extends the Minitron NAS pruner to support pruning by
**active parameter count** (`active_params`) and **memory footprint**
(`memory_mb`) in addition to the existing total parameter count
(`params`) constraint. Also adds standalone utilities for analytical
model stats.
#### Changes
**New pruning constraint keys**
- `active_params`: prune to a target number of active (routed) params —
useful for MoE models where total ≫ active; when present,
`active_params` is the **primary sort/display metric** for candidates
(priority: `active_params` > `params` > `memory_mb`)
- `memory_mb`: prune to fit a memory budget (BF16 weights + KV-cache +
Mamba state at a given sequence length and batch size)
- Constraints can be combined (AND logic): e.g. `{"params": 6e9,
"memory_mb": 12288}`
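
The AND logic and the metric priority described in the list above can be sketched as follows. This is a minimal illustration, not the pruner's implementation: the helper names `satisfies` and `primary_metric` are hypothetical, the candidate numbers are made up, and treating each constraint as an upper bound ("at or below the budget") is an assumption.

```python
# Hypothetical helpers for illustration only; not part of the ModelOpt API.
PRIORITY = ("active_params", "params", "memory_mb")


def satisfies(stats: dict[str, float], constraints: dict[str, float]) -> bool:
    """AND logic: every constrained metric must be at or below its budget (assumed)."""
    return all(stats[key] <= limit for key, limit in constraints.items())


def primary_metric(constraints: dict[str, float]) -> str:
    """Return the first constrained key in priority order."""
    for key in PRIORITY:
        if key in constraints:
            return key
    raise ValueError("no supported constraint key given")


constraints = {"params": 6e9, "memory_mb": 12288}
candidate_ok = {"params": 5.8e9, "memory_mb": 11900.0}   # illustrative numbers
candidate_big = {"params": 5.8e9, "memory_mb": 13000.0}  # exceeds the memory budget

print(satisfies(candidate_ok, constraints))   # both budgets met
print(satisfies(candidate_big, constraints))  # memory budget violated
print(primary_metric(constraints))            # "params" outranks "memory_mb"
```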
**New standalone utilities**
(`modelopt.torch.nas.plugins.megatron_model_stats`)
- `mcore_param_count`: analytically computes total and active parameter
counts for GPT and Mamba/hybrid MCore models
- `mcore_memory_footprint_mb`: estimates memory in MB (weights +
KV-cache + Mamba state)
- `print_mcore_model_stats`: rich-formatted model stats panel
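
To make the memory estimate concrete, here is a standalone arithmetic sketch. It does not call `mcore_memory_footprint_mb` (its signature is not shown in this PR); the function names below are hypothetical. It assumes 2 bytes per BF16 parameter and 1 MB = 2^20 bytes, an assumption that appears consistent with the weight figures in the stats panels under Testing.

```python
BYTES_PER_BF16 = 2   # BF16 stores each parameter in 2 bytes
MIB = 1024 ** 2      # assumed meaning of "MB" in the stats panel


def bf16_weights_mb(num_params: float) -> float:
    """Approximate weight memory for a BF16 model with `num_params` parameters."""
    return num_params * BYTES_PER_BF16 / MIB


def total_memory_mb(weights_mb: float, kv_cache_mb: float, mamba_state_mb: float) -> float:
    """Total footprint as the sum of the three components listed above."""
    return weights_mb + kv_cache_mb + mamba_state_mb


# 31.58B is the rounded total parameter count reported for the original model,
# so the result only approximates the reported weight figure.
print(f"weights: {bf16_weights_mb(31.58e9):.1f} MB")
print(f"total:   {total_memory_mb(60230.1, 48.0, 23.8):.1f} MB")
```

The KV-cache and Mamba-state terms depend on sequence length and batch size, which is why the reported totals are qualified with `seq_length` and `batch_size`.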
**Rich-formatted pruning logs** — search space, top-k candidate tables,
and best subnet panel printed on rank 0
**`prune_score_func` format update** — now `mmlu_<N>pct_bs<bs>` (e.g.
`mmlu_10pct_bs32`) to explicitly control batch size for MMLU evaluation;
old `mmlu_<N>pct` format removed
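
A name in the new format can be split into its two parts with a small regular expression. The parser below is a hypothetical sketch, not the code used by `prune_minitron.py`; it assumes `<N>` and `<bs>` are plain positive integers.

```python
import re

# Matches names such as "mmlu_10pct_bs32"; the old "mmlu_10pct" form does not match.
_SCORE_FUNC_RE = re.compile(r"mmlu_(\d+)pct_bs(\d+)")


def parse_prune_score_func(name: str) -> tuple[float, int]:
    """Return (fraction of MMLU to evaluate, batch size) for a score function name."""
    match = _SCORE_FUNC_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"expected 'mmlu_<N>pct_bs<bs>', got {name!r}")
    percent, batch_size = int(match.group(1)), int(match.group(2))
    return percent / 100, batch_size


fraction, batch_size = parse_prune_score_func("mmlu_10pct_bs32")
print(fraction, batch_size)
```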
**Infrastructure**
- NeMo container bumped to `nvcr.io/nvidia/nemo:26.04` in CI and docs
- Added `examples/megatron_bridge/requirements.txt` with
`transformers<5.0` (required for saving some Nemotron-3-Nano models)
### Usage
```python
import modelopt.torch.prune as mtp

# `model`, `ss_config`, and `pruning_config` are assumed to be defined by the caller.
# Prune to 3B active params (MoE-aware); active_params is the primary sort metric
mtp.prune(model, mode=[("mcore_minitron", ss_config)], constraints={"active_params": 3e9}, config=pruning_config)
# Alternative: prune to fit a 12 GB memory budget
mtp.prune(model, mode=[("mcore_minitron", ss_config)], constraints={"memory_mb": 12288}, config=pruning_config)
```
### Testing
Pruned Nemotron-3-Nano-30B-A3B (31.6B total, 3.6B active parameters) down to
3.0B active parameters. The run takes under 1 hour on 8x H100 (more details in #1376).
```bash
torchrun --nproc_per_node 8 examples/megatron_bridge/prune_minitron.py \
--pp_size 8 \
--hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust_remote_code \
--prune_target_params 28e9 \
--prune_target_active_params 3e9 \
--hparams_to_skip num_attention_heads \
--seq_length 8192 \
--output_hf_path pruned/Nemotron-3-Nano-30B-A3B-Pruned-28B-A3B-top20-max15depth-max30width-mmlu_10pct_bs32 \
--top_k 20 \
--max_depth_pruning 0.15 \
--max_width_pruning 0.30 \
--prune_score_func mmlu_10pct_bs32 \
--num_layers_in_first_pipeline_stage 5 \
--num_layers_in_last_pipeline_stage 5
```
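
The two `--num_layers_in_*_pipeline_stage` flags in the command above give the first and last pipeline stages fewer layers than the middle ones. The sketch below shows the resulting split under two assumptions: that the unpruned model has 52 layers (the largest `num_layers` value in the candidate table), and that the remaining layers are divided evenly across the middle stages. The helper name is hypothetical.

```python
def layers_per_stage(num_layers: int, pp_size: int, first: int, last: int) -> list[int]:
    """Split layers over pipeline stages with fixed first/last stage sizes."""
    middle_stages = pp_size - 2
    middle_layers = num_layers - first - last
    if middle_stages <= 0 or middle_layers < 0 or middle_layers % middle_stages:
        raise ValueError("middle layers must divide evenly across the middle stages")
    return [first] + [middle_layers // middle_stages] * middle_stages + [last]


# 52 layers is an assumption; pp_size and stage sizes come from the command above.
print(layers_per_stage(num_layers=52, pp_size=8, first=5, last=5))
# → [5, 7, 7, 7, 7, 7, 7, 5]
```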
```
╭──────────────────────────────────────────────────── Original Model Stats ─────────────────────────────────────────────────────╮
│ Total Parameters 31.58B │
│ Active Parameters 3.58B │
│ Memory (BF16, seq_length=8192, batch_size=1) weights: 60230.1 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 60301.9 MB │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Top 20 Candidates with Scores
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ # ┃ export_config ┃ active_params ┃ params ┃ score ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ 1 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 120, │ 3.00B │ 27.06B │ 0.3399 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 2 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.4650 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 3 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2343 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 4 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 56, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2552 │
│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 5 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 21.61B │ 0.2601 │
│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 6 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 19.28B │ 0.3762 │
│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
│ 7 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 22.28B │ 0.4783 │
│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 8 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 21.99B │ 0.2420 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
│ 9 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2399 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
│ 10 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 26.17B │ 0.2601 │
│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
│ 11 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2503 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 12 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.4329 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 13 │ {'num_layers': 46, 'hidden_size': 2688, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 128, │ 3.00B │ 26.17B │ 0.2587 │
│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 2816} │ │ │ │
│ 14 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2336 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 15 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2559 │
│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 16 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 20.70B │ 0.4608 │
│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
│ 17 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2455 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
│ 18 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 24.42B │ 0.2503 │
│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
│ 19 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 120, │ 3.00B │ 27.92B │ 0.2587 │
│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
│ 20 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2469 │
│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
└────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────┴────────┘
╭──────────────────────────────────────────────────────────────────────── Best Subnet ─────────────────────────────────────────────────────────────────────────╮
│ export_config {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, 'moe_ffn_hidden_size': 1856, │
│ 'moe_shared_expert_intermediate_size': 3072} │
│ active_params 3.00B │
│ params 22.28B │
│ score 0.4783 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────── Pruned Model Stats ──────────────────────────────────────────────────────╮
│ Total Parameters 22.28B │
│ Active Parameters 3.00B │
│ Memory (BF16, seq_length=8192, batch_size=1) weights: 42489.7 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 42561.6 MB │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
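
The "Best Subnet" panel corresponds to the candidate with the highest score in the table. A minimal sketch of that selection, using four rows copied from the log above (the dictionary layout is an assumption made for illustration, not the pruner's internal data structure):

```python
# Rows 2, 7, 12, and 16 from the "Top 20 Candidates with Scores" table.
candidates = [
    {"index": 2, "active_params": 3.00e9, "params": 25.37e9, "score": 0.4650},
    {"index": 7, "active_params": 3.00e9, "params": 22.28e9, "score": 0.4783},
    {"index": 12, "active_params": 3.00e9, "params": 23.68e9, "score": 0.4329},
    {"index": 16, "active_params": 3.00e9, "params": 20.70e9, "score": 0.4608},
]

# Pick the candidate with the highest MMLU score.
best = max(candidates, key=lambda c: c["score"])

# Relative reduction in total parameters versus the original 31.58B model.
reduction = 1 - best["params"] / 31.58e9

print(best["index"], best["score"], f"{reduction:.1%}")
```

All 20 candidates meet the 3.00B active-parameter target, so the score is what separates them; total parameter counts range from 19.28B to 27.92B.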
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…tron-Nano-9B-v2 docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?
Type of change: example/tutorial
Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment