Skip to content

feat(utils): pack=True calibration mode for get_dataset_dataloader#1501

Open
kevalmorabia97 wants to merge 10 commits into
mainfrom
kmorabia/fix-mcore-minitron-hybrid-and-qwen-te-spec
Open

feat(utils): pack=True calibration mode for get_dataset_dataloader#1501
kevalmorabia97 wants to merge 10 commits into
mainfrom
kmorabia/fix-mcore-minitron-hybrid-and-qwen-te-spec

Conversation

@kevalmorabia97
Copy link
Copy Markdown
Collaborator

@kevalmorabia97 kevalmorabia97 commented May 15, 2026

Summary

Add pack: bool option to modelopt.torch.utils.dataset_utils.get_dataset_dataloader. When True, raw samples from each source are concatenated into a per-source token stream (separated by tokenizer.eos_token_id) and sliced into uniform max_sample_length chunks, contiguous by source (preserving the requested per-source ratio in num_samples). Default False for back-compat; opt in via pack=True for cleaner calibration statistics.

The non-pack fixes originally bundled here (HybridModel pruning, GPT-family fused-TE-spec import/export, warn_rank_0 stacklevel, mmlu_lower_bound bump, etc.) have been split to #1518.

Why

The default pack=False path tokenizes each sample with truncate-and-pad, which has two problems for amax / sensitivity calibration:

  • Truncation dominates on long-document corpora. At production sizes (1024 samples × 512 max_sample_length) on cnn_dailymail, ~70 % of articles get truncated at token 512 — the long-range context the pruner / quantizer cares about is silently dropped.
  • Padding contaminates activation statistics. ~8 % of forward-pass positions are pad tokens whose activations get mixed into amax aggregation.

pack=True keeps 100 % of the budget as real tokens — every chunk is naturally-length context, document boundaries are explicit eos separators (a token, not a mask boundary), and the per-source ratio in num_samples is preserved instead of letting one long-document source dominate.

Measured on Qwen3-8B Minitron prune to 30L / 3584 / 11776 (cnn_dailymail, 256 samples, seq_length=512):

pack=False pack=True
MMLU 0.486 0.544 (+5.8 pts)
Megatron-Bridge reference 0.563

Behavior details

  • num_samples semantics shift between modes (documented in the docstring): with pack=False it's the number of raw samples to fetch and tokenize; with pack=True it's the number of max_sample_length-token chunks to produce per source. Migrating a call site to pack=True may need a different value to hit the same total-token calibration budget.
  • No BOS at position 0 of packed chunks (consequence of add_special_tokens=False). Fine for amax / sensitivity calibration where boundary tokens are statistically dominated; callers that need BOS-prefixed sequences should keep pack=False.
  • attention_mask is unconditionally all-ones — attention crosses document boundaries (eos is a token, not a mask boundary). Documented in the docstring; asserted in the blending test so a future boundary-aware-mask change is forced to update the contract.
  • Lazy chunked tokenization: long-document sources break out of the inner loop once the chunk budget is full instead of paying tokenization cost for the full num_samples * 2 oversample.
  • Raises ValueError when no source yields enough tokens for one chunk.
  • Warns when a source under-fills (rank-0 only) or when tokenizer.eos_token_id is None (documents would concatenate without a separator).
  • create_forward_loop forwards pack so internal PTQ callers can opt in.

Consumer

Megatron-LM PR NVIDIA/Megatron-LM#4807 calls get_dataset_dataloader(..., pack=True) from prune.py. Once 0.45 ships and the modelopt pin is bumped, the inline WAR shipped against released 0.44 collapses to one kwarg.

Related: #1518 (Minitron HybridModel + GPT-family fused-TE-spec import/export, also targeting main).

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 15, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a packing option to dataloader producing EOS-separated fixed-length token chunks, introduces preferred fused LayerNorm mapping keys and importer fallbacks, conditionally registers HybridModel for NAS/pruning, updates MCore pruning hybrid detection/defaults, and tweaks generation and MMLU utilities.

Changes

0.45 Release Updates

Layer / File(s) Summary
Dataset Packing Feature
modelopt/torch/utils/dataset_utils.py, tests/unit/torch/utils/test_dataset_utils.py, CHANGELOG.rst
New pack: bool parameter enables concatenation of raw samples into EOS-separated token streams chunked to max_sample_length. Non-packed mode emits a rank-0 warning. Tests updated to validate packed-chunk behavior, fixed sequence lengths, and all-ones attention masks.
Megatron Export: Fused LayerNorm Rule Keys
modelopt/torch/export/plugins/megatron_importer.py, modelopt/torch/export/plugins/mcore_qwen.py, CHANGELOG.rst
Fused TE LayerNorm weight loading now prefers dedicated rule keys (fused_input_layernorm, fused_pre_mlp_layernorm) when available, falling back to fused_norm. Qwen3/25 model mappings added to support the new keys.
Megatron NAS: HybridModel Support
modelopt/torch/nas/plugins/megatron.py
Conditionally imports HybridModel, registers it in SUPPORTED_MODELS with HAS_HYBRID flag, and registers in_features in the TE QKV dynamic layer for forward-shape consistency under pruning.
MCore Minitron pruning: hybrid detection & defaults
modelopt/torch/prune/plugins/mcore_minitron.py
Removes MambaModel import, adds _get_hybrid_pattern_key(model) helper for hybrid pattern attribute detection, updates pruning and metric code to use the helper, and extends default pruning divisors for HybridModel when available.
Utility Enhancements
modelopt/torch/utils/plugins/megatron_mmlu.py, modelopt/torch/utils/plugins/megatron_generate.py
megatron_mmlu accepts mmlu_dataset parameter. megatron_prefill detaches and makes logits contiguous on final pipeline stage before broadcast.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 68.42% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title states 'pack=True calibration mode' but the PR includes significant pruning fixes (HybridModel support, fused-TE-spec importer fixes, dynamic in_features tracking, contiguity fixes) and an MMLU dataset parameter, not just calibration packing. Revise the title to better reflect the full scope, such as: 'fix(prune,export,utils): HybridModel support, fused-TE-spec fixes, and pack=True calibration' or similar to capture the main pruning and importer contributions.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All 8 modified files pass security audit. No unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, # nosec, or new non-permissive dependencies found in the PR changes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmorabia/fix-mcore-minitron-hybrid-and-qwen-te-spec

Comment @coderabbitai help to get the list of available commands and usage tips.

@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/ok to test df7ab63

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1501/

Built to branch gh-pages at 2026-05-19 09:21 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

kevalmorabia97 added a commit to kevalmorabia97/Megatron-LM that referenced this pull request May 15, 2026
Two WARs for modelopt <= 0.44 (fixed upstream in
NVIDIA/Model-Optimizer#1501):

- `prune.py`: after `import_mcore_gpt_from_hf` returns, walk the model
  and copy `model.layers.{i}.input_layernorm.weight` and
  `model.layers.{i}.post_attention_layernorm.weight` from HF into the
  fused `TELayerNormColumnParallelLinear.layer_norm_weight` parameters
  on `attention.linear_qkv` and `mlp.linear_fc1`. Without this the
  fused LayerNorm weights stay at random init for GPT-family models
  (Qwen3, Llama, ...) since modelopt 0.44's importer only loads
  `fused_norm` for Nemotron-H, leaving post-prune MMLU at chance.
  The WAR fails soft on missing HF keys, so it is a no-op on
  Nemotron-H (which uses `backbone.layers.{i}.norm.weight`).

- `mmlu.py`: load `modelopt.torch.utils.plugins.megatron_generate` via
  `importlib.import_module` to grab the submodule rather than the
  function the package re-exports under the same name. The previous
  `from ... import megatron_generate as _mg_plugin` form raised
  `AttributeError: 'function' object has no attribute
  'broadcast_from_last_pipeline_stage'` at import time.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 91.11111% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.32%. Comparing base (81c4fb2) to head (cd3a9e0).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/utils/dataset_utils.py 91.11% 4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (81c4fb2) and HEAD (cd3a9e0). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (81c4fb2) HEAD (cd3a9e0)
unit 2 1
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1501      +/-   ##
==========================================
- Coverage   76.92%   67.32%   -9.60%     
==========================================
  Files         473      474       +1     
  Lines       51439    52625    +1186     
==========================================
- Hits        39567    35428    -4139     
- Misses      11872    17197    +5325     
Flag Coverage Δ
examples 41.58% <22.22%> (+0.88%) ⬆️
gpu 27.15% <4.44%> (-33.20%) ⬇️
regression 14.97% <4.44%> (+0.06%) ⬆️
unit 52.66% <88.88%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kevalmorabia97 added a commit to kevalmorabia97/Megatron-LM that referenced this pull request May 15, 2026
`get_dataset_dataloader` tokenizes each calibration sample individually
with `padding=True, truncation=True, max_length=512`. For long-document
datasets like cnn_dailymail (typical article: ~700-1000 tokens), that
truncates most of each article and pads short ones — feeding the
importance estimator a heavily padded, contextually-impoverished batch.

Pack samples into uniform `calib_max_sequence_length` chunks from the
concatenated token stream (with `eos_token_id` as document separator)
the way Megatron-Bridge's calibration loop does. This exposes the model
to many more distinct contexts per `calib_size` samples and eliminates
padding-token contamination of activation statistics.

Measured impact on Qwen3-8B pruned to 30L/3584/11776 (5.99B params):
  before (trunc+pad):    MMLU 0.486
  after  (packed):       MMLU 0.544   (+5.8 pts, M-Bridge ref 0.563)

The proper upstream fix is to add a `pack` mode to
`get_dataset_dataloader` in modelopt
(NVIDIA/Model-Optimizer#1501); this inline
change makes prune.py work today against released modelopt 0.44.0.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 changed the title fix(prune): support HybridModel in mcore_minitron + Qwen3 fused-TE import fix(prune): mcore_minitron HybridModel + Qwen3 fused-TE import + calibration packing May 15, 2026
@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/ok to test 20d3c5b

@kevalmorabia97 kevalmorabia97 marked this pull request as ready for review May 15, 2026 20:24
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 15, 2026 20:24
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/fix-mcore-minitron-hybrid-and-qwen-te-spec branch from 20d3c5b to 8b537ce Compare May 15, 2026 20:26
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/prune/plugins/mcore_minitron.py`:
- Around line 59-60: The current enablement enables HybridModel defaults but
pattern updates and metric accounting are still gated by isinstance(...,
MambaModel), causing plain HybridModel instances to skip hybrid-pattern logic;
update the checks in the functions that perform hybrid pattern updates and
candidate metric accounting (the places currently using isinstance(obj,
MambaModel)) to detect HybridModel instead (e.g., isinstance(obj, HybridModel))
so plain HybridModel instances get the same pattern handling and metric updates,
and if any MambaModel-specific behavior is required keep a secondary
isinstance(obj, MambaModel) branch for those special cases.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 492-526: The code can silently produce zero-length output when
token_stream is too short: after computing n_chunks from token_stream, check for
the edge cases and fail fast or warn; specifically, in the packing branch after
computing n_chunks (and before building input_ids/batch_encoded) add logic that
raises a clear exception if n_chunks == 0 (mentioning tokenizer.encode,
token_stream, total_chunks, and max_sample_length) and log/warn when n_chunks <
total_chunks to inform the caller they received fewer chunks than requested;
ensure the exception/warning uses the existing logging mechanism or raises a
ValueError so callers cannot proceed with empty tensors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d3ed2625-6222-473c-80cd-32fb4d6fbd4c

📥 Commits

Reviewing files that changed from the base of the PR and between a451a2b and 20d3c5b.

📒 Files selected for processing (9)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/mcore_qwen.py
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/nas/plugins/megatron.py
  • modelopt/torch/prune/plugins/mcore_minitron.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/plugins/megatron_generate.py
  • modelopt/torch/utils/plugins/megatron_mmlu.py
  • tests/unit/torch/utils/test_dataset_utils.py

Comment thread modelopt/torch/prune/plugins/mcore_minitron.py Outdated
Comment thread modelopt/torch/utils/dataset_utils.py
device: torch.device | None = None,
include_labels: bool = False,
apply_chat_template: bool = False,
pack: bool = False,
Copy link
Copy Markdown
Collaborator Author

@kevalmorabia97 kevalmorabia97 May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjluo-nv @ChenhanYu should we make this default? Will quantization results see any improvement with packing? I am seeing big improvement in M-LM pruning accuracy if I enable packing

@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Claude review summary

Findings: CRITICAL: 1, IMPORTANT: 2, SUGGESTION: 2

Most impactful

  1. Depth-pruning HybridModel silently keeps stale hybrid_override_pattern (mcore_minitron.py:372, :689). The PR registers HybridModel as a supported NAS root so convert_to_dynamic builds a real search space — but the depth-pruning post-processing still gates on isinstance(self.model, MambaModel). Per the PR description, newer Megatron-LM instantiates Nemotron-H et al. as plain HybridModel (a sibling/parent of MambaModel), so those checks miss the model and the saved checkpoint's layer pattern won't agree with the pruned config.num_layers. Width-only prune workflows (the path the PR description's Nemotron-H validation seems to exercise) are unaffected; depth pruning is the at-risk path.

  2. Qwen2.5 (and other GPT-family) import paths still hit the chance-accuracy bug. The fused-TE-spec rule keys (fused_input_layernorm / fused_pre_mlp_layernorm) are added only to qwen3_causal_lm_import; qwen25_causal_lm_import, mcore_llama, mcore_deepseek, mcore_gptoss still fall back to the old fused_norm-only path. HF naming is identical, so the same MMLU-at-chance failure is one config flip away on those archs.

  3. pack=True silently produces fewer chunks than requested when raw-text supply runs short of 2 × num_samples. A warn_rank_0 (or grow-the-oversample-factor loop) keeps this from being an undebuggable downstream ablation surprise.

Strengths

  • The fused-norm rule-key fallback (fused_input_layernormfused_norm) is the right shape — it's strictly additive for Nemotron-H and explicit-per-norm for GPT-family models, no behavior change for existing import paths.
  • The _DynamicTEQKVLayerNormColumnParallelLinear.in_features registration is consistent with the existing _DynamicTEParallelLinear pattern (both bind in_features to mod.input_size).
  • The .contiguous() fix on the megatron_prefill logits slice has a clear root-cause comment (SP padding + broadcast contiguity assert) and is a one-line, reversible change.
  • CHANGELOG entries are placed under correct sections in 0.45 with accurate scope.

Risk assessment

Medium. The Qwen3 / Nemotron-H prune paths the PR explicitly validated end-to-end look solid, and the calibration-packing primitive is well-isolated behind an opt-in flag. The pre-existing isinstance(model, MambaModel) checks in depth-pruning are the main concern — they should be widened to cover plain HybridModel in the same PR that promotes HybridModel to a first-class supported type, otherwise the bug "fixed" for convert_to_dynamic re-emerges in the depth-prune codepath.

Comment thread modelopt/torch/prune/plugins/mcore_minitron.py Outdated
Comment thread modelopt/torch/export/plugins/mcore_qwen.py Outdated
Comment thread modelopt/torch/utils/dataset_utils.py Outdated
Comment thread tests/unit/torch/utils/test_dataset_utils.py
Comment thread modelopt/torch/utils/dataset_utils.py Outdated
@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude review summary

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 3

Traced the algorithmic and structural changes end-to-end:

  • Importer fused-norm fallback (megatron_importer.py + plugin maps): the per-context fused_input_layernorm / fused_pre_mlp_layernorm keys with fallback to legacy fused_norm are wired symmetrically on import (_import_transformer_layer) and export (_get_fused_norm_weight). Importantly, the importer now raises KeyError instead of silently leaving linear_qkv.layer_norm_weight at random init when a fused-TE model has no rule registered — this is the right call (the previous behavior shipped chance-accuracy checkpoints). The Mamba path correctly retains the legacy "fused_norm" in self.rules gate since Nemotron-H configs do register it.
  • HybridModel registration (nas/plugins/megatron.py): correctly slotted after MambaModel so DMRegistry's first-match dispatch for MambaModel instances still resolves to MambaModel (_get_registered_nn_class filter at opt/dynamic.py:937 checks nn_cls.forward is nn_cls_.forward, which holds when MambaModel inherits forward unchanged). The _HYBRID_DIVISORS constant dedupes what was a copy-pasted block. Comment in megatron.py:82-86 is the right level of detail; see inline note about the missing assertion the PR description references.
  • in_features dynamic attribute on _DynamicTEQKVLayerNormColumnParallelLinear: correct fix — TE's forward asserts inp_shape[-1] == self.in_features, and pruning hidden_size mutates input_size but previously left in_features stale.
  • .contiguous() on logits slice in megatron_prefill: correct — SP padding seq_length to a multiple of TP creates a non-contiguous view stride that broadcast_from_last_pipeline_stage rejects.
  • _gated_mlp_merging _extra_state handling: see inline note — comment says "preserve" but code overwrites with None.
  • pack=True packing logic: math is correct; per-source ratio is preserved by extending per_source_chunks source-by-source (no interleave). EOS separator added even after the last sample within a chunk budget — fine for amax. Two minor readability nudges in the inline comment.
  • warn_rank_0 stacklevel auto-bump: works for kwarg-style stacklevel=N but would TypeError if a caller passed stacklevel as a 3rd positional arg (warnings.warn signature is (message, category, stacklevel, source)). Realistically no caller does this; not flagged inline.
  • mmlu_dataset parameter + mmlu_lower_bound: 0.68 → 0.75: clean additions, validated by PR description's end-to-end MMLU sweep.

Risk: low. All three suggestions are clarity / forward-compat nudges, not blockers. The importer's switch from silent-pass to KeyError on missing fused-norm rules is a deliberate breaking change for any out-of-tree plugin that uses fused TE without registering a norm rule, but the PR description owns this and the legacy fused_norm fallback covers the in-tree Nemotron-H path.

LGTM aside from the inline notes.

Comment thread modelopt/torch/nas/plugins/megatron.py Outdated
Comment thread modelopt/torch/export/plugins/megatron_importer.py Outdated
Comment thread modelopt/torch/utils/dataset_utils.py
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Bundle of independent Minitron-pruning fixes for Megatron-LM compat: HybridModel registration, fused-TE-spec import/export rules per-context (Qwen3/Llama/DeepSeek/GptOss), in_features dynamic-attr tracking, .contiguous() before pp broadcast, pack=True calibration packing, and a few quality-of-life tweaks (mmlu_dataset kwarg, warn_rank_0 stacklevel auto-bump). Each change is well-explained and small. Spot checks:

  • The HybridModel registration relies on _DMRegistryCls._get_registered_nn_class returning the first match in self._registry insertion order, with MambaModel registered before HybridModel via SUPPORTED_MODELS dict insertion order. Verified — Mamba dispatch is preserved if/when MambaModel still inherits HybridModel's forward.
  • _get_fused_norm_weight returns a 2-tuple now; all call sites updated correctly. Mamba path explicitly sticks with the legacy fused_norm key (correct: Nemotron-H still ships a single shared norm slot).
  • warn_rank_0 stacklevel default change (1→2) correctly points warnings at user call sites; existing callers passing explicit stacklevel get bumped consistently.
  • _build_packed_input_ids 2x oversample with chunk-bounded tokenization is reasonable; add_special_tokens=False rationale is documented including the no-BOS caveat.
  • .contiguous() before broadcast_from_last_pipeline_stage is correct for SP-padded slices.

Why nudge instead of approve:

  1. Design-review gate fired (9 dirs). Apart from the pack feature there's no architectural change — the rest extends existing systems. But:
  2. The pruning-side fixes have no unit tests in this repo; they're integration-validated only by the linked Megatron-LM PR #4807 and the Qwen3-8B launcher's MMLU bound bump (0.68→0.75). For HybridModel and fused-TE-spec, that's fine but worth a human eyeballing the consumer wiring.
  3. PR body claims "MambaModel.forward is HybridModel.forward is asserted at import" — I don't see such an assert in the diff. Either the assert was dropped, or the description is stale. Worth confirming this is intentional, since the comment in megatron.py calls the inheritance "load-bearing".
  4. pack=False now emits a FutureWarning on every call. Intentional ("will become default in a future release"), but may surface as test-suite churn for downstream users.

PR size 395/-120 is on the higher side for a "fixes" PR but the changes are individually small and cohesive across the Minitron pipeline.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/claude review

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review of #1501. All previously-flagged critical/important comments have been addressed in the latest diff:

Addressed (critical):

  • HybridModel pruning pattern stale-string bug (isinstance(MambaModel) gates at lines 372/689) — resolved via new _get_hybrid_pattern_key() helper using hasattr detection on both hybrid_override_pattern and hybrid_layer_pattern. Decoupled from class hierarchy as suggested.
  • pack=True zero-chunk silent-empty-tensor footgun — now raises ValueError on n_chunks == 0 and emits warn_rank_0 on n_chunks < total_chunks.
  • Fused-TE-spec importer demotion to warning — now raises KeyError for unambiguous misconfig (matches what previous review explicitly requested).
  • Cross-arch coverage gap — fused_input_layernorm/fused_pre_mlp_layernorm rules added to mcore_llama.py, mcore_deepseek.py, mcore_gptoss.py, mcore_llama4, qwen25_causal_lm_import (not just qwen3 as in the original PR).
  • Importer/exporter asymmetry on fused-norm fallback — unified_export_megatron._get_fused_norm_weight now takes a primary_key arg with the same fallback chain, called per-context (fused_input_layernorm for attn, fused_pre_mlp_layernorm for MLP, fused_norm for Mamba).
  • _extra_state zeroing in qkv-merging — now preserves module_state_dict.get("_extra_state") instead of unconditionally None.

Author-declined with sound reasoning (visible in inline replies): import-time MambaModel.forward is HybridModel.forward assert (would break consumers on plausible MCore changes; dispatch is correct either way per _get_registered_nn_class's forward is forward filter), hard-fail on eos_token_id is None, setdefault semantics for warn_rank_0 stacklevel.

Why nudge instead of approve:

  1. PR body still claims MambaModel.forward is HybridModel.forward is "asserted at import" — that assertion is not present (intentionally, per author's reply). Description should be updated for the next reader.
  2. CHANGELOG silently changes 0.44 release date 2026-05-18 → 2026-05-14 — outside the advertised PR scope; needs owner confirmation that this is intentional and not a merge artifact.
  3. Pruning-side fixes (HybridModel registration, in_features dyn-attr, .contiguous()) are integration-validated only via consumer Megatron-LM PR #4807 and the Qwen3-8B launcher's MMLU bound bump (0.68→0.75); no in-repo unit tests added for the pruning paths.
  4. mmlu_lower_bound tightening to 0.75 turns CI red on routine numerical drift if post-PTQ Qwen3-8B doesn't clear with margin; only the prune-time MMLU (0.544) is documented in the PR body.
  5. Bundle-of-fixes PR spanning 9 directories (design-review gate fired); the only architectural addition is pack=True, which is well-isolated and uses no new abstractions, but a human reviewer with cross-cutting MCore context should still sign off.

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: all previously-flagged critical issues are addressed in the latest diff.

Addressed (critical):

  • HybridModel pruning: _get_hybrid_pattern_key() helper uses hasattr detection on both hybrid_override_pattern and hybrid_layer_pattern; replaces both isinstance(MambaModel) gates (lines ~372 and ~689). Decoupled from class hierarchy.
  • HybridModel dynamic-conversion registration: HAS_HYBRID flag in nas/plugins/megatron.py; HybridModel registered in SUPPORTED_MODELS after MambaModel (insertion order + DMRegistry's forward is forward filter make dispatch unambiguous either way).
  • pack=True zero-chunk footgun: raises ValueError; n_chunks < total_chunks emits warn_rank_0.
  • Fused-TE-spec importer: KeyError (not warn) when neither fused_input_layernorm / fused_pre_mlp_layernorm nor legacy fused_norm is registered — matches what the previous review explicitly requested.
  • Cross-arch coverage: fused-norm rules added to mcore_qwen.py, mcore_llama.py (incl. llama4), mcore_deepseek.py, mcore_gptoss.py, qwen25_causal_lm_import.
  • Importer/exporter fused-norm symmetry: _get_fused_norm_weight takes primary_key arg with the same fallback chain; called per-context (attn / MLP / Mamba).
  • _extra_state preserved in qkv-merging instead of unconditionally None.
  • in_features registered as a dynamic attribute on _DynamicTEQKVLayerNormColumnParallelLinear.
  • .contiguous() on logits slice before broadcast_from_last_pipeline_stage.

Per operator instructions, treated as resolved:

  • PR body's stale MambaModel.forward is HybridModel.forward assertion claim has been removed; the current PR body correctly states dispatch is robust either way.
  • CHANGELOG 0.44 release-date shift 2026-05-18 → 2026-05-14 is intentional.
  • mmlu_lower_bound 0.68 → 0.75 tightening is appropriate (previous bound was too loose to catch regressions).
  • Pruning-side fixes are integration-validated via Megatron-LM PR #4807, which has been thoroughly tested out-of-tree.

Design review (9-dir gate): Only architectural addition is pack=True, which is a feature flag on an existing function — no new subsystem, abstraction, DSL, or registry. Implementation lives in one helper (_build_packed_input_ids). No existing in-repo system or already-imported dependency would simplify this; it's a tokenization-flow change. Design protocol satisfied.

Complex PR: spans 9 directories (≥ 5). Looping in a human for approval.

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude Review Summary

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 4

Overall assessment

Low-risk, well-considered PR. The algorithmic fixes are sound and the back-compat story is explicit:

  • Fused-TE-spec norm import — correctly raises KeyError instead of silently random-init when a model uses TELayerNormColumnParallelLinear with no fused-norm rule. The branch only fires when linear_qkv.layer_norm_weight/linear_fc1.layer_norm_weight exists, so a missing rule is unambiguously a plugin misconfig. Legacy fused_norm fallback preserves Nemotron-H. Importer/exporter sides are symmetric, and the per-arch rules in mcore_qwen.py / mcore_llama.py / mcore_deepseek.py / mcore_gptoss.py line up with each model's HF norm path.
  • HybridModel registrationDMRegistry._get_registered_nn_class filters by issubclass(...) and forward is forward, and MambaModel is inserted into SUPPORTED_MODELS first, so MambaModel instances dispatch to MambaModel and pure HybridModel instances dispatch to HybridModel regardless of forward-method override. The _HYBRID_DIVISORS dedup is a clean DRY refactor.
  • _DynamicTEQKVLayerNormColumnParallelLinear.in_features — fixes the missing dynamic attribute parallel to _DynamicTEColumnParallelLinear (line 184 in the same file already does this); the TE forward-time inp_shape[-1] == in_features assertion now holds when hidden_size is pruned.
  • _qkv_merging / _gated_mlp_merging _extra_state preservation — preserving rather than blanking is the right call; during HF import the value is initial defaults so this is functionally a no-op today, but it's safer if TE versions later populate it pre-load.
  • megatron_prefill .contiguous() — correct one-liner; the broadcast asserts contiguity and SP padding to a TP multiple breaks the unsliced stride.
  • Calibration packing — clean implementation. The 2x oversample + chunked-tokenize budget is reasonable, the per-source-ratio preservation is a real improvement over the previous "first source wins" behavior, and the missing-eos / under-fill warnings give actionable user guidance. The +5.8 MMLU pts on Qwen3-8B prune is a meaningful signal.

Suggestions posted inline

  1. _get_hybrid_pattern_key should also check the value is non-None, not just hasattr — minor robustness.
  2. _FakeTokenizer.__call__ now pads to max_length instead of "longest in batch", which diverges from HF's padding=True semantics. Test-only, no prod impact.
  3. CHANGELOG 0.44 date moved backward from 2026-05-182026-05-14; worth confirming intentional.
  4. _build_packed_input_ids short-source under-fill discards trailing tokens past take * max_sample_length — not a bug, but flagged for future maintainers.

LGTM modulo the suggestions above.

Comment thread modelopt/torch/prune/plugins/mcore_minitron.py Outdated
Comment thread tests/unit/torch/utils/test_dataset_utils.py
Comment thread modelopt/torch/utils/dataset_utils.py
- _get_hybrid_pattern_key: check `getattr(model, attr, None) is not None`
  instead of just `hasattr`, since HybridModel subclasses could
  initialize the attribute to None and either call site (string-format
  / iterate) would then break.
- _FakeTokenizer: match HF padding=True semantic (pad to longest in
  batch, optionally bounded by max_length) instead of always padding to
  max_length. Keeps the test fake faithful to the production tokenizer
  surface.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/claude review

Previous review suggestions have been addressed in commits 3af90dbad6 and 0f43f3119e:

  • megatron_importer.py: _extra_state is now preserved from the existing module state dict (not blanked to None) at both call sites, matching the "preserve" wording in the comment and protecting TE FP8 bookkeeping on TE versions that populate _extra_state at init.
  • dataset_utils.py: dropped the pack=False FutureWarning (default flip is not finalized); renamed chunk_sizetokenize_batch_size in _build_packed_input_ids for clarity; added a worked example to the docstring explaining how num_samples interacts with the internal 2x raw-sample draw.
  • tests/.../test_dataset_utils.py: _FakeTokenizer.__call__ now matches HF's padding=True semantic (pad to longest in batch, optionally bounded by max_length) instead of always padding to max_length.
  • mcore_minitron.py::_get_hybrid_pattern_key: now uses getattr(model, attr, None) is not None instead of hasattr, so a HybridModel subclass that initializes the attribute to None doesn't get a key that downstream code can't use.
  • nas/plugins/megatron.py: dropped the import-time assert MambaModel.forward is HybridModel.forwardDMRegistry._get_registered_nn_class already filters by forward is forward, so MambaModel instances dispatch correctly regardless of inheritance.
  • CHANGELOG: dropped the "with a warning" hint about an upcoming default flip.
  • PR description updated to remove the stale "assert at import" claim.

Please re-evaluate the latest commit (0f43f3119e).

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude review summary

Findings: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 2.

The PR is a tightly scoped set of pruning fixes plus one well-defined feature, each independently motivated and end-to-end validated (Qwen3-8B MMLU 0.486 → 0.544 with pack=True, plus the fused-TE-spec import fix taking post-prune Qwen3 off chance accuracy). The Minitron-side bug fixes (HybridModel registration via insertion-order DMRegistry filtering, in_features dynamic attribute on the QKV linear, .contiguous() on the prefill logits slice) are surgical and the rationale comments are unusually clear. Test changes correctly parametrize over pack and the _FakeTokenizer extensions match HF semantics.

Most impactful finding

Import/export key asymmetry for fused TE spec (unified_export_megatron.py). The PR registers fused_input_layernorm / fused_pre_mlp_layernorm in import mappings and the importer raises KeyError if they're missing. The exporter's _get_fused_norm_weight falls back to fused_norm (not in any of these arches' export dicts either) and silently returns (None, None). A Qwen3 (or other arch) trained with --export-default-te-spec and exported via this path will silently lose input_layernorm.weight / post_attention_layernorm.weight. Pre-existing on the export side, but this PR makes the import path fully functional, so users will now reach the export silent-drop for the first time. Recommend mirroring the import mapping additions on the export dicts (or matching the importer's loud KeyError).

Other

  • _get_hybrid_pattern_key swaps isinstance(model, MambaModel) for an attribute-truthiness check; behaviorally fine today, but the cost (silent acceptance of falsy non-None values) is worth a sharper comment than "avoids fragile isinstance checks."
  • pack=True + apply_chat_template=True will produce <eos><eos> at every doc boundary because chat-templated text already ends in EOS and the packer appends another. Likely harmless for amax statistics, but worth documenting.

Risk

Low. The Minitron-side fixes are loud-by-design (fused-norm KeyError, contiguous broadcast) and the calibration packing path is fully gated behind pack=False default. The one IMPORTANT finding is surfaced silently only at export time and is pre-existing for these arches.

Comment thread modelopt/torch/export/unified_export_megatron.py Outdated
Comment thread modelopt/torch/utils/dataset_utils.py
Comment thread modelopt/torch/prune/plugins/mcore_minitron.py Outdated
- Export mappings: add fused_input_layernorm / fused_pre_mlp_layernorm
  to mcore_qwen (qwen3, qwen25), mcore_llama (llama, llama4),
  mcore_deepseek, mcore_gptoss export dicts so fused-TE-spec models
  round-trip cleanly. Llama4 and GptOss only get fused_input_layernorm
  (MoE-only on MLP side, fused_pre_mlp_layernorm path is unreachable).
- dataset_utils: docstring notes (a) chat-template + pack=True can
  double the EOS at doc boundaries (harmless for calibration stats but
  worth flagging), (b) tokenize_batch_size formula's intent and floor.
- mcore_minitron: tighten _get_hybrid_pattern_key to also reject empty
  patterns via Python truthiness (rejects None, "", [] in one check).

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97
Copy link
Copy Markdown
Collaborator Author

/claude review

Round 11 addressed in commit 2a46dd1e9b:

  • Export-side fused-norm parity (IMPORTANT): added fused_input_layernorm / fused_pre_mlp_layernorm to the export mappings for Qwen3, Qwen2.5, Llama, Llama4, DeepSeek, GptOss — mirroring the import side, so fused-TE-spec models now round-trip cleanly through HF export instead of silently dropping input_layernorm.weight / post_attention_layernorm.weight. Llama4 and GptOss only get fused_input_layernorm (MoE-only MLP path is unreachable on those archs).
  • dataset_utils._build_packed_input_ids docstring + comment: noted that apply_chat_template=True + pack=True can produce <eos><eos> at document boundaries (harmless for calibration stats) and that the tokenize_batch_size = max(8, num_sample // 4) floor exists so small calibrations still hit the Rust-batched tokenizer.
  • _get_hybrid_pattern_key: now if getattr(model, attr, None): so empty string/list patterns are rejected alongside None.

Request for this review round: please do a single consolidated pass against 2a46dd1e9b covering everything in this PR — not just inline-comment-and-re-fire on each round. Earlier rounds have already produced 40+ inline comments; if there are remaining concerns I'd rather see them all together in one summary so we can close them out as a batch instead of iterating piecemeal.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary — overall low risk, LGTM with three minor SUGGESTION-level notes posted inline.

Findings

  • CRITICAL: 0
  • IMPORTANT: 0
  • SUGGESTION: 3 (pack/include_labels interaction, explicit padding=False on tokenizer call, attribute-presence vs isinstance for hybrid detection)

Trace coverage

I traced each of the four threads end-to-end:

  1. Calibration packing_build_packed_input_ids correctly preserves per-source ratio, early-exits the 2x oversample, and falls back gracefully when eos_token_id is missing. The non-pack code path is unchanged so existing call sites are unaffected. Test parametrization keeps both branches covered. Docstrings address the BOS-absence and apply_chat_template <eos><eos> edge cases honestly.

  2. HybridModel registration — Insertion-order argument for DMRegistry._get_registered_nn_class is sound: GPTModel → MambaModel (if HAS_MAMBA) → HybridModel (if HAS_HYBRID). For a MambaModel instance, issubclass(MambaModel, HybridModel) returning True doesn't matter because the iteration hits MambaModel first. For a plain HybridModel instance, MambaModel is correctly skipped (it's the child, so issubclass(HybridModel, MambaModel) is False). _HYBRID_DIVISORS extraction is a clean refactor.

  3. Importer fused-norm dispatch — Per-context-key + legacy fused_norm fallback is implemented symmetrically on import (megatron_importer.py) and export (unified_export_megatron.py). The KeyError-instead-of-warn change is appropriate — the branch only fires when the model uses fused TELayerNormColumnParallelLinear, so a missing rule would otherwise ship a chance-accuracy checkpoint silently. Mamba export still uses the default primary_key="fused_norm", preserving Nemotron-H behavior. The new _extra_state preservation in _gated_mlp_merging and the change from None to module_state_dict.get("_extra_state") in _qkv_merging correctly avoid blowing away TE FP8 bookkeeping.

  4. Misc.contiguous() fix on the megatron_prefill logits slice is correct (broadcast asserts contiguity, SP padding breaks the stride). The in_features dynamic attribute on _DynamicTEQKVLayerNormColumnParallelLinear is the right fix for TE's forward-time inp_shape[-1] == in_features assertion under hidden_size pruning. warn_rank_0 stacklevel auto-bump is back-compatible (no in-tree caller passes stacklevel). mmlu_lower_bound 0.68→0.75 is justified by the fused-norm import fix.

Backward compatibility: pack defaults to False, all importer/exporter rule additions are net-new keys, and the mmlu_dataset arg has a back-compat default. No modelopt_state schema changes.

Risk level: low. The fused-norm + HybridModel changes are corrective fixes to paths that were producing broken checkpoints, validated by the PR author's MMLU numbers. Suggest landing.

Comment thread modelopt/torch/utils/dataset_utils.py Outdated
Comment thread modelopt/torch/prune/plugins/mcore_minitron.py Outdated
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review focused on the operator's question about attention-mask correctness under sequential packing, plus the design-review gate.

Attention mask in packed mode (get_dataset_dataloader(pack=True)):

  • _build_packed_input_ids returns only input_ids; the dataloader sets attention_mask = torch.ones_like(input_ids) (dataset_utils.py around the if pack: branch).
  • This is consistent with the "no padding in packed chunks" contract (every position is a real token), but it is a naive pack: no block-diagonal / per-document reset mask. Attention crosses document boundaries (separated only by eos_token_id as a token, not as a mask boundary). For amax/sensitivity calibration this matches industry practice and the docstring is honest about the BOS/eos caveats — but a reader expecting reset-on-eos masking will be surprised, and the docstring doesn't say the mask is unconditionally all-ones.

Test coverage of the mask:

  • Exactly one assertion exercises the packed attention mask: in TestGetDatasetDataloaderBlending.test_single_jsonl, the pack=True branch asserts batches[0]["input_ids"].shape == (2, 16) and (batches[0]["attention_mask"] == 1).all(). That covers the "all-ones, no padding" contract.
  • The blending and HF-tiny tests assert chunk counts but never inspect the mask. A regression that, e.g., zeroed out positions after eos_token_id would not be caught outside test_single_jsonl.

Re-review status of prior comments: All previously-flagged critical items (HybridModel pruning pattern via _get_hybrid_pattern_key, n_chunks==0 ValueError + warn_rank_0 on shortfall, KeyError on missing fused-norm rule, cross-arch fused-norm coverage in Llama/DeepSeek/GptOss/llama4/qwen25, importer/exporter symmetry via _get_fused_norm_weight(primary_key=...), _extra_state preservation, in_features dynamic attr, .contiguous() before pp broadcast) are addressed in the current diff. Author-declined items (import-time MambaModel.forward is HybridModel.forward assert, hard-fail on missing eos_token_id, setdefault for stacklevel) have explicit, defensible rationale on the PR.

Why nudge instead of approve:

  1. Design-review gate fired (9 dirs). The architectural addition is just pack=True on an existing function — implementation lives in one helper, no new subsystem/registry/DSL — so the protocol is satisfied. But cross-cutting MCore changes still warrant a human signoff.
  2. Attention-mask semantics under packing should ideally be one of: (a) explicitly documented as "all-ones, no per-doc reset" in the pack docstring, or (b) an opt-in for boundary-aware masks if/when a downstream consumer needs it. Today's behavior is fine for calibration but undocumented.
  3. Pruning-side fixes (HybridModel, in_features, .contiguous) are integration-validated only via consumer Megatron-LM PR #4807; no in-repo unit tests for the pruning paths.

Comment thread modelopt/torch/utils/dataset_utils.py
Comment thread tests/unit/torch/utils/test_dataset_utils.py
Don't trust tokenizer subclass __call__ defaults — a custom wrapper that
flips the HF default could silently inject pad tokens into the packed
stream. Be explicit.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Docstring: note that attention_mask is unconditionally all-ones —
  attention crosses document boundaries (the eos separator is a token,
  not a mask boundary). Prevents future readers from assuming pack
  implies per-doc mask reset.
- test_list_of_jsonl_blends: assert (attention_mask == 1).all() and
  shape parity so a future boundary-aware masking change is forced to
  update the contract rather than silently passing.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Split #1501 so this PR keeps only the pack=True calibration dataloader
change. The HybridModel pruning, fused-TE-spec import/export, and
related fixes are now in #1518 (also targeting main).

This commit reverts to main the files that belong to #1518:
  - modelopt/torch/export/plugins/{mcore_deepseek,mcore_gptoss,mcore_llama,mcore_qwen,megatron_importer,unified_export_megatron}.py
  - modelopt/torch/nas/plugins/megatron.py
  - modelopt/torch/prune/plugins/mcore_minitron.py
  - modelopt/torch/utils/logging.py
  - modelopt/torch/utils/plugins/{megatron_generate,megatron_mmlu}.py
  - tools/launcher/examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml

CHANGELOG: drop the Bug Fixes entry and the 0.44 date correction
(those go with #1518). Keep the pack=True New Features entry.

History is preserved — earlier commits with the dropped changes remain
on this branch's log, this commit just rolls the working state back to
"pack only".

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 changed the title fix(prune): Minitron HybridModel + Qwen3 fused-TE import + calibration data packing feat(utils): pack=True calibration mode for get_dataset_dataloader May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants