feat: rules-based MTP export for quantized models #1494
Conversation
Replace the hacky _get_mtp_state_dict that copied BF16 weights from the HF pretrained model with a proper rules-based export that handles quantized MTP weights (NVFP4, FP8) through the existing export-rules system. Supports both repeated MTP (Nemotron nested HybridStack) and non-repeated MTP (DeepSeek style). Uses backbone→mtp prefix replacement to reuse the decoder-layer export methods for MTP inner layers, mirroring the import side's is_mtp=True behavior.

Signed-off-by: Ye Yu <yey@nvidia.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
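The backbone→mtp prefix replacement can be sketched roughly as follows. This is a hedged illustration, not the actual ModelOpt export-rules API: the function name build_mtp_rules, the rule-table shape, and the key strings are all assumptions made for the example.

```python
# Illustrative sketch: derive MTP export rules from decoder-layer rules by
# rewriting the "backbone." key prefix to "mtp.", so the same export
# handlers are reused for the MTP inner layers.

def build_mtp_rules(decoder_rules):
    """Map backbone.* rule keys to mtp.* keys, keeping the same handlers."""
    mtp_rules = {}
    for pattern, handler in decoder_rules.items():
        if pattern.startswith("backbone."):
            mtp_rules["mtp." + pattern[len("backbone."):]] = handler
    return mtp_rules

decoder_rules = {"backbone.layers.self_attention.linear_qkv": "export_qkv"}
print(build_mtp_rules(decoder_rules))
# {'mtp.layers.self_attention.linear_qkv': 'export_qkv'}
```

Because the handlers themselves are unchanged, any quantization-aware export logic (NVFP4, FP8) written for decoder layers applies to the MTP layers for free.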
Codecov Report
❌ Patch coverage is

@@            Coverage Diff            @@
##             main    #1494     +/-  ##
==========================================
- Coverage   75.69%   71.22%   -4.47%
==========================================
  Files         467      479      +12
  Lines       50334    57875    +7541
==========================================
+ Hits        38099    41221    +3122
- Misses      12235    16654    +4419
The modelopt_gpt_hybrid_builder creates a HybridModel (not a MambaModel) when --export-model-type is MambaModel/HybridModel. Since MambaModel inherits from HybridModel, the isinstance check needs to include HybridModel directly.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
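A minimal sketch of the inheritance issue, using stand-in classes rather than the real ones: because MambaModel subclasses HybridModel, checking only for MambaModel rejects a plain HybridModel instance, while checking for both accepts either.

```python
# Stand-in classes mirroring the PR's hierarchy (not the real implementations):
# MambaModel subclasses HybridModel, so isinstance(model, MambaModel) is False
# for a plain HybridModel even though it should be handled.

class HybridModel:
    pass

class MambaModel(HybridModel):
    pass

model = HybridModel()
print(isinstance(model, MambaModel))                 # False: old check misses it
print(isinstance(model, (MambaModel, HybridModel)))  # True: fixed check accepts both
```

Note the asymmetry: a MambaModel instance passes either check, so the fix only widens acceptance without changing behavior for MambaModel.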
Summary
- Replaced the hacky _get_mtp_state_dict() that copied BF16 weights from the HF pretrained model with a proper rules-based export that handles quantized MTP weights (NVFP4, FP8)
- Uses backbone→mtp in rule prefixes, mirroring the import side's is_mtp=True behavior

Context
The Nemotron-3.5-Nano NVFP4 QAD pipeline needs to export quantized MTP weights. The old hack simply copied BF16 weights from the original HF checkpoint, ignoring any quantization applied to the MTP layers.
Verified against the Nemotron-3.5-Nano HF checkpoint — all 270 MTP weight keys match the expected naming convention.
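A check in the spirit of that verification can be sketched as below. The regex and helper name are illustrative assumptions, not the actual Nemotron key schema or the script used for the PR.

```python
import re

# Hypothetical sanity check: confirm every exported MTP state-dict key
# follows an expected naming convention. The "mtp." prefix pattern here is
# an illustrative assumption, not the real Nemotron key schema.
MTP_KEY_RE = re.compile(r"^mtp\.")

def nonconforming_keys(keys):
    """Return the keys that do not match the expected MTP prefix."""
    return [k for k in keys if not MTP_KEY_RE.match(k)]

keys = ["mtp.layers.0.mixer.weight", "mtp.norm.weight"]
print(nonconforming_keys(keys))  # [] -> all keys conform
```

An empty result for all exported keys would correspond to the "all MTP weight keys match the expected naming convention" outcome described above.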
Test plan
🤖 Generated with Claude Code