Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment by kevalmorabia97 · Pull Request #1376 · NVIDIA/Model-Optimizer

kevalmorabia97 · 2026-04-30T10:53:48Z

What does this PR do?

Type of change: example/tutorial

Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment

copy-pr-bot · 2026-04-30T10:53:59Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-04-30T10:54:38Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbf2c667-7f7d-4a35-97b3-6e1cc3e4382e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


github-actions · 2026-04-30T10:57:35Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1376/
Built to branch `gh-pages` at 2026-05-17 12:16 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-04-30T11:15:09Z

Codecov Report

❌ Patch coverage is 0% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.88%. Comparing base (7038dec) to head (9aed4f2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pt/torch/utils/plugins/megatron_preprocess_data.py | 0.00% | 35 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1376      +/-   ##
==========================================
- Coverage   76.93%   76.88%   -0.05%     
==========================================
  Files         474      474              
  Lines       51506    51538      +32     
==========================================
- Hits        39625    39624       -1     
- Misses      11881    11914      +33

| Flag | Coverage Δ | |
|---|---|---|
| unit | `52.60% <0.00%> (-0.05%)` | ⬇️ |
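
The percentages in the coverage diff can be reproduced from the hit and line counts. This is a minimal sketch, assuming project coverage is simply hits divided by total lines; the helper name `coverage_pct` is hypothetical and not part of Codecov.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return line coverage as a percentage, rounded to two decimals."""
    return round(100.0 * hits / lines, 2)


# Counts taken from the coverage diff above.
base = coverage_pct(39625, 51506)  # main
head = coverage_pct(39624, 51538)  # this PR
delta = round(head - base, 2)

print(base, head, delta)
```

Under that assumption the drop comes almost entirely from the new uncovered lines: the line count grows by 32 while hits fall by 1, so misses grow by 33.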


### What does this PR do?

Type of change: New feature, new tests, documentation.

OMNIML-4108: Extends the Minitron NAS pruner to support pruning by **active parameter count** (`active_params`) and **memory footprint** (`memory_mb`) in addition to the existing total parameter count (`params`) constraint. Also adds standalone utilities for analytical model stats.

#### Changes

**New pruning constraint keys**

- `active_params`: prune to a target number of active (routed) params, useful for MoE models where total ≫ active. When present, `active_params` is the **primary sort/display metric** for candidates (priority: `active_params` > `params` > `memory_mb`)
- `memory_mb`: prune to fit a memory budget (BF16 weights + KV-cache + Mamba state at a given sequence length and batch size)
- Constraints can be combined (AND logic): e.g. `{"params": 6e9, "memory_mb": 12288}`

**New standalone utilities** (`modelopt.torch.nas.plugins.megatron_model_stats`)

- `mcore_param_count`: analytically computes total and active parameter counts for GPT and Mamba/hybrid MCore models
- `mcore_memory_footprint_mb`: estimates memory in MB (weights + KV-cache + Mamba state)
- `print_mcore_model_stats`: rich-formatted model stats panel

**Rich-formatted pruning logs**: search space, top-k candidate tables, and best subnet panel printed on rank 0

**`prune_score_func` format update**: now `mmlu_<N>pct_bs<bs>` (e.g. `mmlu_10pct_bs32`) to explicitly control batch size for MMLU evaluation; old `mmlu_<N>pct` format removed

**Infrastructure**

- NeMo container bumped to `nvcr.io/nvidia/nemo:26.04` in CI and docs
- Added `examples/megatron_bridge/requirements.txt` with `transformers<5.0` (required for saving some Nemotron-3-Nano models)

### Usage

```python
import modelopt.torch.prune as mtp

# `model`, `ss_config`, and `pruning_config` are assumed to be defined earlier.

# Prune to 3B active params (MoE-aware); active_params is the primary sort metric
mtp.prune(
    model,
    mode=[("mcore_minitron", ss_config)],
    constraints={"active_params": 3e9},
    config=pruning_config,
)

# Prune to fit a 12 GB memory budget
mtp.prune(
    model,
    mode=[("mcore_minitron", ss_config)],
    constraints={"memory_mb": 12288},
    config=pruning_config,
)
```

### Testing

Pruned Nemotron-3-Nano-30B-A3B (31.6B, A3.6B) --> A3.0B. Takes <1hr on 8x H100 (more details in #1376)

```bash
torchrun --nproc_per_node 8 examples/megatron_bridge/prune_minitron.py \
    --pp_size 8 \
    --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust_remote_code \
    --prune_target_params 28e9 \
    --prune_target_active_params 3e9 \
    --hparams_to_skip num_attention_heads \
    --seq_length 8192 \
    --output_hf_path pruned/Nemotron-3-Nano-30B-A3B-Pruned-28B-A3B-top20-max15depth-max30width-mmlu_10pct_bs32 \
    --top_k 20 \
    --max_depth_pruning 0.15 \
    --max_width_pruning 0.30 \
    --prune_score_func mmlu_10pct_bs32 \
    --num_layers_in_first_pipeline_stage 5 \
    --num_layers_in_last_pipeline_stage 5
```

```
╭──────────────────────────────────────────────────── Original Model Stats ─────────────────────────────────────────────────────╮
│ Total Parameters                              31.58B                                                                          │
│ Active Parameters                             3.58B                                                                           │
│ Memory (BF16, seq_length=8192, batch_size=1)  weights: 60230.1 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 60301.9 MB │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                                                             Top 20 Candidates with Scores
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ #  ┃ export_config                                                                                                ┃ active_params ┃ params ┃ score  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ 1  │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 120, │ 3.00B         │ 27.06B │ 0.3399 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 2  │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B         │ 25.37B │ 0.4650 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 3  │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B         │ 25.37B │ 0.2343 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 4  │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 56, 'mamba_head_dim': 48, 'num_moe_experts': 96,  │ 3.00B         │ 20.09B │ 0.2552 │
│    │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 5  │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B         │ 21.61B │ 0.2601 │
│    │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 6  │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 64, 'num_moe_experts': 96,  │ 3.00B         │ 19.28B │ 0.3762 │
│    │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3712}                                    │               │        │        │
│ 7  │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B         │ 22.28B │ 0.4783 │
│    │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 8  │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 96,  │ 3.00B         │ 21.99B │ 0.2420 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3328}                                    │               │        │        │
│ 9  │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B         │ 25.37B │ 0.2399 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712}                                    │               │        │        │
│ 10 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B         │ 26.17B │ 0.2601 │
│    │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328}                                    │               │        │        │
│ 11 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 112, │ 3.00B         │ 25.37B │ 0.2503 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 12 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B         │ 23.68B │ 0.4329 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 13 │ {'num_layers': 46, 'hidden_size': 2688, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 128, │ 3.00B         │ 26.17B │ 0.2587 │
│    │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 2816}                                    │               │        │        │
│ 14 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B         │ 23.68B │ 0.2336 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 15 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 96,  │ 3.00B         │ 20.09B │ 0.2559 │
│    │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 16 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 96,  │ 3.00B         │ 20.70B │ 0.4608 │
│    │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
│ 17 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B         │ 23.68B │ 0.2455 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712}                                    │               │        │        │
│ 18 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B         │ 24.42B │ 0.2503 │
│    │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328}                                    │               │        │        │
│ 19 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 120, │ 3.00B         │ 27.92B │ 0.2587 │
│    │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3712}                                    │               │        │        │
│ 20 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B         │ 23.68B │ 0.2469 │
│    │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072}                                    │               │        │        │
└────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────┴────────┘
╭────────────────────────────────────────────────────────────────────── Best Subnet ───────────────────────────────────────────────────────────────────────╮
│ export_config  {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, 'moe_ffn_hidden_size': 1856, │
│                'moe_shared_expert_intermediate_size': 3072}                                                                                              │
│ active_params  3.00B                                                                                                                                     │
│ params         22.28B                                                                                                                                    │
│ score          0.4783                                                                                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────── Pruned Model Stats ──────────────────────────────────────────────────────╮
│ Total Parameters                              22.28B                                                                          │
│ Active Parameters                             3.00B                                                                           │
│ Memory (BF16, seq_length=8192, batch_size=1)  weights: 42489.7 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 42561.6 MB │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
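
The combined-constraint behaviour (AND logic) and the sort-metric priority described under "New pruning constraint keys" can be sketched in plain Python. This is an illustration only: `satisfies`, `primary_metric`, and the candidate dictionaries are hypothetical, and treating each constraint as an upper bound is an assumption rather than something the commit message confirms.

```python
# Hypothetical helpers for illustration; not the ModelOpt implementation.
PRIORITY = ("active_params", "params", "memory_mb")


def satisfies(stats: dict, constraints: dict) -> bool:
    """AND logic: every constrained metric must be at or below its limit (assumed)."""
    return all(stats[key] <= limit for key, limit in constraints.items())


def primary_metric(constraints: dict) -> str:
    """Pick the sort/display metric: active_params > params > memory_mb."""
    return next(key for key in PRIORITY if key in constraints)


# Values copied from the log above (two candidates plus the original model).
candidates = [
    {"active_params": 3.00e9, "params": 27.06e9},
    {"active_params": 3.00e9, "params": 22.28e9},
    {"active_params": 3.58e9, "params": 31.58e9},
]
constraints = {"params": 28e9, "active_params": 3e9}

feasible = [c for c in candidates if satisfies(c, constraints)]
print(primary_metric(constraints), len(feasible))
```

With both `params` and `active_params` set, as in the test command, the original model fails both limits and `active_params` becomes the metric shown first in the candidate table.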
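
The memory figures in the stats panels can be cross-checked against the parameter counts. This is a minimal sketch, assuming BF16 weights take 2 bytes per parameter and that the log reports MB as 2**20 bytes. Both assumptions are inferred from the numbers, not stated in the source, and `params_from_weight_mb` is a hypothetical helper rather than one of the new utilities.

```python
BYTES_PER_BF16 = 2
BYTES_PER_MB = 1024**2  # assumption: "MB" in the log means 2**20 bytes


def params_from_weight_mb(weights_mb: float) -> float:
    """Back out the parameter count implied by a BF16 weight footprint."""
    return weights_mb * BYTES_PER_MB / BYTES_PER_BF16


original = params_from_weight_mb(60230.1)  # weights of the original model
pruned = params_from_weight_mb(42489.7)    # weights of the pruned model
print(f"{original / 1e9:.2f}B", f"{pruned / 1e9:.2f}B")  # prints 31.58B 22.28B

# The reported total is weights + KV-cache + Mamba state.
total_mb = round(60230.1 + 48.0 + 23.8, 1)
print(total_mb)  # prints 60301.9
```

Under these assumptions the weight footprints match the 31.58B and 22.28B totals in the panels, and the KV-cache and Mamba state are the same before and after pruning at 48.0 MB and 23.8 MB.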
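
The new `prune_score_func` naming scheme, `mmlu_<N>pct_bs<bs>`, can be illustrated with a small parser. This sketch is hypothetical: `parse_score_func` is not a ModelOpt function, and it assumes both `<N>` and `<bs>` are plain integers.

```python
import re

# Hypothetical pattern for the documented format; ModelOpt's own parsing may differ.
_SCORE_FUNC = re.compile(r"mmlu_(\d+)pct_bs(\d+)")


def parse_score_func(name: str) -> tuple[int, int]:
    """Return (percent of MMLU used, batch size) for names like 'mmlu_10pct_bs32'."""
    match = _SCORE_FUNC.fullmatch(name)
    if match is None:
        raise ValueError(f"expected 'mmlu_<N>pct_bs<bs>', got {name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_score_func("mmlu_10pct_bs32"))  # prints (10, 32)
```

A strict full match also rejects the removed `mmlu_<N>pct` form, which mirrors the note that the old format is no longer accepted.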


kevalmorabia97 requested review from a team as code owners April 30, 2026 10:53

kevalmorabia97 requested review from AAnoosheh and ChenhanYu April 30, 2026 10:53

kevalmorabia97 marked this pull request as draft April 30, 2026 10:53

kevalmorabia97 removed request for AAnoosheh and ChenhanYu April 30, 2026 10:54

kevalmorabia97 force-pushed the docs/nemotron3-nano-pruning-tutorial branch from 3d9b66d to b20a4d9 Compare April 30, 2026 11:01

kevalmorabia97 mentioned this pull request May 1, 2026

Enable active-param and memory based Minitron pruning constraint #1377

Merged

kevalmorabia97 force-pushed the docs/nemotron3-nano-pruning-tutorial branch 3 times, most recently from a8bdf39 to 314bd84 Compare May 4, 2026 19:35

kevalmorabia97 force-pushed the docs/nemotron3-nano-pruning-tutorial branch 6 times, most recently from e4e789a to ed37d85 Compare May 11, 2026 11:04

kevalmorabia97 force-pushed the docs/nemotron3-nano-pruning-tutorial branch from ed37d85 to 687a883 Compare May 13, 2026 17:21

Add Nemotron-3-Nano-30B-A3B-BF16 e2e pruning tutorial and update Nemo…

9aed4f2

…tron-Nano-9B-v2 docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

kevalmorabia97 force-pushed the docs/nemotron3-nano-pruning-tutorial branch from 687a883 to 9aed4f2 Compare May 17, 2026 12:12

kevalmorabia97 wants to merge 1 commit into main from docs/nemotron3-nano-pruning-tutorial
