Fridah/layerwise mse fix #1504
Draft
Fridah-nv wants to merge 8 commits into
Conversation
Registers nemotron-{sft-instruction-following-chat-v2, science-v1,
competitive-programming-v1, sft-agentic-v2, math-v2, sft-swe-v2,
sft-multilingual-v1} so that hf_ptq.py's --dataset flag (which enumerates
get_supported_datasets() automatically) can select these for PTQ
calibration. Splits with heterogeneous parquet schemas, which crash
streaming with a CastError mid-iteration, are excluded per inline comments.
Also adds a parametrized smoke test that skips when HF_TOKEN is unset,
since the nvidia/Nemotron-* datasets are gated.
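A minimal sketch of how such a gated smoke test could look. The dataset names and get_supported_datasets() come from the description above; the module path modelopt.torch.utils.dataset_utils, the test name, and the assertion body are assumptions, not the PR's actual test.

```python
import os

import pytest

# Names expanded from the nemotron-{...} list in the PR description.
NEMOTRON_DATASETS = [
    "nemotron-sft-instruction-following-chat-v2",
    "nemotron-science-v1",
    "nemotron-competitive-programming-v1",
    "nemotron-sft-agentic-v2",
    "nemotron-math-v2",
    "nemotron-sft-swe-v2",
    "nemotron-sft-multilingual-v1",
]


@pytest.mark.skipif("HF_TOKEN" not in os.environ, reason="nvidia/Nemotron-* datasets are gated")
@pytest.mark.parametrize("dataset_name", NEMOTRON_DATASETS)
def test_nemotron_dataset_is_registered(dataset_name):
    # Assumed import path; hf_ptq.py enumerates this registry for --dataset,
    # so registration alone makes the names selectable for PTQ calibration.
    from modelopt.torch.utils.dataset_utils import get_supported_datasets

    assert dataset_name in get_supported_datasets()
```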
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings over from origin/fridah/glm5.1-tmp:

- nvfp4_tensor.py: FP8 underflow guard before the float8_e4m3fn cast (clamp min=2**-9). Kept main's clamp(max=448) too, so we have both guards.
- layerwise_calib.py: CPU offload of captured inputs, lazy device-move on replay, resume detection when all layers are already complete, a separate quantizer_amaxes.pt per layer for fast amax-only restore, and a new restore_weights=False parameter on full_restore (skips reloading 2+ TB of unchanged expert weights at export time).
- tensor_quantizer.py: NVFP4 static-quantizer dispatch in forward, for per-expert quantizers whose _amax is set by MSE after max_calibrate (experts not routed during max_calibrate stay as plain TensorQuantizer and need this dispatch). Kept the is_nvfp4_static property — main uses it in model_calib.py, conversion.py, calib_utils.py, and core_utils.py.
- model_calib.py: MO_DEBUG_MAX_LAYERS env var hatch to limit layerwise calibration to the first N layers (smoke-testing only).

Does NOT bring over the moe_utils.py changes from c273ddb — those added an over-aggressive _min_valid_amax=2e-3 invalidity threshold plus a clamp(min=2e-3) on the fallback path, which floored the effective per-block weight scale at 2e-3/6 ~= 3.3e-4 and produced the cliff seen in glm-5.1-nvfp4-MSE-expert-only-7ds-0509. Main's existing moe_utils.py (post #1340, #1421) handles uncalibrated experts gently via None / torch.all(_amax == 0) without any magnitude floor.

Does NOT bring over the per-expert MSE discovery hunks from cfe4a4a — main's #1421 evolved that further and supersedes the glm5.1-tmp version.

unified_export_hf.py and moe_utils.py are intentionally kept at main: main already has both the _disable_use_cache and _sanitize_generation_config helpers; pulling glm5.1-tmp's older versions of these files would have regressed those features.
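For orientation, a minimal sketch of what the double-sided scale guard described for nvfp4_tensor.py amounts to. The clamp bounds come from the commit text; the function name and surrounding shape are illustrative assumptions.

```python
import torch


def _clamp_scale_for_e4m3(per_block_scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: guard both ends of float8_e4m3fn's range.

    2**-9 is e4m3fn's smallest positive subnormal, so anything below it
    would cast to zero (and a zero block scale later divides out to
    inf/NaN); 448 is the format's maximum, which main already clamped.
    """
    return per_block_scale.clamp(min=2**-9, max=448.0).to(torch.float8_e4m3fn)
```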
…path

Adds _safe_cpu_amax helper + null-then-deepcopy pattern to _export_fused_experts to avoid cudaErrorIllegalAddress when the per-expert _amax came back from a layerwise checkpoint as a CUDA tensor with a non-zero storage offset / corrupt storage. Pre-extracts amax to CPU with an explicit synchronize() before any torch.all() / deepcopy() touches it.

This is a strict subset of c273ddb's moe_utils changes — the cliff-creating _min_valid_amax=2e-3 invalidity threshold + clamp(min=2e-3) on the fallback are deliberately NOT brought over. Uncalibrated experts still fall back to weight_slice.abs().amax() without any magnitude floor, matching main's existing semantics.

model_quant: ensure output_dir exists before writing .quant_summary.txt. Fixes a FileNotFoundError when print_quant_summary runs before the export step creates the directory (FIXES Fix 4).
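A hedged sketch of that extraction pattern. Only the helper name _safe_cpu_amax and the synchronize-before-touch ordering come from the message; the body below is an assumption.

```python
import torch


def _safe_cpu_amax(quantizer):
    """Pull a quantizer's _amax onto the CPU before anything else reads it.

    A CUDA tensor restored from a layerwise checkpoint with a non-zero
    storage offset (or corrupt storage) can raise cudaErrorIllegalAddress
    inside torch.all() / deepcopy(); synchronizing first and handing back a
    contiguous CPU copy keeps those callers off the original storage.
    """
    amax = getattr(quantizer, "_amax", None)
    if amax is None:
        return None
    if amax.is_cuda:
        torch.cuda.synchronize(amax.device)
        amax = amax.detach().cpu()
    return amax.contiguous()
```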
…mismatch with _global_amax)
…e export

The big-model run hit a cuda:0 vs cpu device mismatch in get_weights_scaling_factor_from_quantizer (per_block_scale * 448 / per_block_scale_max). Root cause: the big model is large enough that device_map='sequential' offloads some params to CPU, so _amax and global_amax can land on different devices after deepcopy + injection. Pin both to weight_slice.device right before calling _export_quantized_weight. No magnitude clamp.
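Roughly what that pin looks like. The attribute names come from the commits above; the helper name and shape are assumptions.

```python
def _pin_amax_to_weight_device(weight_quantizer, weight_slice):
    """Hypothetical helper: co-locate amax buffers with the weight slice.

    With device_map='sequential' some params are offloaded to CPU, so after
    deepcopy + injection _amax and _global_amax may sit on a different
    device than the weight. This is a pure device move; values are
    untouched and no magnitude clamp is applied.
    """
    target = weight_slice.device
    for attr in ("_amax", "_global_amax"):
        value = getattr(weight_quantizer, attr, None)
        if value is not None:
            setattr(weight_quantizer, attr, value.to(target))
```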
…ntized_modules
Reverted the safe-CPU-amax / global_amax-sync / device-pinning patches in
moe_utils.py — those were working around a symptom of touching the
per-expert quantizers of layers that were never visited by the layerwise
loop (their _amax is unset). When MO_DEBUG_MAX_LAYERS=N is set, simply skip
_export_fused_experts for any *.layers.{>=N}.* module. Layers 0..N-1 all
have _bootstrap_uncalibrated_weight_quantizers + MSE-applied amaxes, so the
existing main moe_utils.py code path works. A sketch of the skip follows.
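A sketch of that skip predicate (removed again in the commit message that follows). The env var, the layer pattern, and the bootstrap rationale come from the text above; the regex and function name are illustrative.

```python
import os
import re


def _skip_for_debug_max_layers(module_name: str) -> bool:
    """True for *.layers.{>=N}.* modules when MO_DEBUG_MAX_LAYERS=N is set.

    Layers 0..N-1 were bootstrapped and got MSE-applied amaxes, so the
    existing export path is safe for them; later layers were never visited
    by the layerwise loop and must not reach _export_fused_experts.
    """
    limit = os.environ.get("MO_DEBUG_MAX_LAYERS")
    if limit is None:
        return False
    match = re.search(r"\.layers\.(\d+)\.", module_name)
    return match is not None and int(match.group(1)) >= int(limit)
```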
The env-var-gated early-break (model_calib.layerwise_calibrate) and export skip (unified_export_hf._process_quantized_modules) were only needed to bound wall-clock during the cliff-fix smoke test. The bug fix itself is purely about not bringing over glm5.1-tmp's clamps in moe_utils.py — which we already don't. Removing the debug hatches keeps the branch a clean superset of main's production behavior.
What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information