Fridah/layerwise mse fix #1504
Draft
Fridah-nv wants to merge 8 commits into
Conversation
Registers nemotron-{sft-instruction-following-chat-v2, science-v1,
competitive-programming-v1, sft-agentic-v2, math-v2, sft-swe-v2,
sft-multilingual-v1} so that hf_ptq.py's --dataset flag (which enumerates
get_supported_datasets() automatically) can select these for PTQ
calibration. Splits with heterogeneous parquet schemas, which crash
streaming with a CastError mid-iteration, are excluded per inline comments.
Also adds a parametrized smoke test that skips when HF_TOKEN is unset,
since the nvidia/Nemotron-* datasets are gated.
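A minimal sketch of how such a gated smoke test could look. The dataset names and get_supported_datasets() come from the description above; the module path modelopt.torch.utils.dataset_utils, the test name, and the assertion body are assumptions, not the PR's actual test.

```python
import os

import pytest

# Names expanded from the nemotron-{...} list in the PR description.
NEMOTRON_DATASETS = [
    "nemotron-sft-instruction-following-chat-v2",
    "nemotron-science-v1",
    "nemotron-competitive-programming-v1",
    "nemotron-sft-agentic-v2",
    "nemotron-math-v2",
    "nemotron-sft-swe-v2",
    "nemotron-sft-multilingual-v1",
]


@pytest.mark.skipif("HF_TOKEN" not in os.environ, reason="nvidia/Nemotron-* datasets are gated")
@pytest.mark.parametrize("dataset_name", NEMOTRON_DATASETS)
def test_nemotron_dataset_is_registered(dataset_name):
    # Assumed import path; hf_ptq.py enumerates this registry for --dataset,
    # so registration alone makes the names selectable for PTQ calibration.
    from modelopt.torch.utils.dataset_utils import get_supported_datasets

    assert dataset_name in get_supported_datasets()
```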
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings over from origin/fridah/glm5.1-tmp:

- nvfp4_tensor.py: FP8 underflow guard before the float8_e4m3fn cast (clamp min=2**-9). Kept main's clamp(max=448) too, so we have both guards.
- layerwise_calib.py: CPU offload of captured inputs, lazy device-move on replay, resume detection when all layers are already complete, a separate quantizer_amaxes.pt per layer for fast amax-only restore, and a new restore_weights=False parameter on full_restore (skips reloading 2+ TB of unchanged expert weights at export time).
- tensor_quantizer.py: NVFP4 static-quantizer dispatch in forward, for per-expert quantizers whose _amax is set by MSE after max_calibrate (experts not routed during max_calibrate stay as plain TensorQuantizer and need this dispatch). Kept the is_nvfp4_static property — main uses it in model_calib.py, conversion.py, calib_utils.py, and core_utils.py.
- model_calib.py: MO_DEBUG_MAX_LAYERS env var hatch to limit layerwise calibration to the first N layers (smoke-testing only).

Does NOT bring over the moe_utils.py changes from c273ddb — those added an over-aggressive _min_valid_amax=2e-3 invalidity threshold plus a clamp(min=2e-3) on the fallback path, which floored the effective per-block weight scale at 2e-3/6 ~= 3.3e-4 and produced the cliff seen in glm-5.1-nvfp4-MSE-expert-only-7ds-0509. Main's existing moe_utils.py (post #1340, #1421) handles uncalibrated experts gently via None / torch.all(_amax == 0) without any magnitude floor.

Does NOT bring over the per-expert MSE discovery hunks from cfe4a4a — main's #1421 evolved that further and supersedes the glm5.1-tmp version.

unified_export_hf.py and moe_utils.py are intentionally kept at main: main already has both the _disable_use_cache and _sanitize_generation_config helpers; pulling glm5.1-tmp's older versions of these files would have regressed those features.
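For orientation, a minimal sketch of what the double-sided scale guard described for nvfp4_tensor.py amounts to. The clamp bounds come from the commit text; the function name and surrounding shape are illustrative assumptions.

```python
import torch


def _clamp_scale_for_e4m3(per_block_scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: guard both ends of float8_e4m3fn's range.

    2**-9 is e4m3fn's smallest positive subnormal, so anything below it
    would cast to zero (and a zero block scale later divides out to
    inf/NaN); 448 is the format's maximum, which main already clamped.
    """
    return per_block_scale.clamp(min=2**-9, max=448.0).to(torch.float8_e4m3fn)
```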
…path

Adds _safe_cpu_amax helper + null-then-deepcopy pattern to _export_fused_experts to avoid cudaErrorIllegalAddress when the per-expert _amax came back from a layerwise checkpoint as a CUDA tensor with a non-zero storage offset / corrupt storage. Pre-extracts amax to CPU with an explicit synchronize() before any torch.all() / deepcopy() touches it.

This is a strict subset of c273ddb's moe_utils changes — the cliff-creating _min_valid_amax=2e-3 invalidity threshold + clamp(min=2e-3) on the fallback are deliberately NOT brought over. Uncalibrated experts still fall back to weight_slice.abs().amax() without any magnitude floor, matching main's existing semantics.

model_quant: ensure output_dir exists before writing .quant_summary.txt. Fixes a FileNotFoundError when print_quant_summary runs before the export step creates the directory (FIXES Fix 4).
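A hedged sketch of that extraction pattern. Only the helper name _safe_cpu_amax and the synchronize-before-touch ordering come from the message; the body below is an assumption.

```python
import torch


def _safe_cpu_amax(quantizer):
    """Pull a quantizer's _amax onto the CPU before anything else reads it.

    A CUDA tensor restored from a layerwise checkpoint with a non-zero
    storage offset (or corrupt storage) can raise cudaErrorIllegalAddress
    inside torch.all() / deepcopy(); synchronizing first and handing back a
    contiguous CPU copy keeps those callers off the original storage.
    """
    amax = getattr(quantizer, "_amax", None)
    if amax is None:
        return None
    if amax.is_cuda:
        torch.cuda.synchronize(amax.device)
        amax = amax.detach().cpu()
    return amax.contiguous()
```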
…mismatch with _global_amax)
…e export

The big-model run hit a cuda:0 vs cpu device mismatch in get_weights_scaling_factor_from_quantizer (per_block_scale * 448 / per_block_scale_max). Root cause: the big model is large enough that device_map='sequential' offloads some params to CPU, so _amax and global_amax can land on different devices after deepcopy + injection. Pin both to weight_slice.device right before calling _export_quantized_weight. No magnitude clamp.
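Roughly what that pin looks like. The attribute names come from the commits above; the helper name and shape are assumptions.

```python
def _pin_amax_to_weight_device(weight_quantizer, weight_slice):
    """Hypothetical helper: co-locate amax buffers with the weight slice.

    With device_map='sequential' some params are offloaded to CPU, so after
    deepcopy + injection _amax and _global_amax may sit on a different
    device than the weight. This is a pure device move; values are
    untouched and no magnitude clamp is applied.
    """
    target = weight_slice.device
    for attr in ("_amax", "_global_amax"):
        value = getattr(weight_quantizer, attr, None)
        if value is not None:
            setattr(weight_quantizer, attr, value.to(target))
```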
…ntized_modules
Reverted the safe-CPU-amax / global_amax-sync / device-pinning patches in
moe_utils.py — those were working around a symptom of touching the
per-expert quantizers of layers that were never visited by the layerwise
loop (their _amax is unset). When MO_DEBUG_MAX_LAYERS=N is set, simply skip
_export_fused_experts for any *.layers.{>=N}.* module. Layers 0..N-1 all
have _bootstrap_uncalibrated_weight_quantizers + MSE-applied amaxes, so the
existing main moe_utils.py code path works. A sketch of the skip follows.
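A sketch of that skip predicate (removed again in the commit message that follows). The env var, the layer pattern, and the bootstrap rationale come from the text above; the regex and function name are illustrative.

```python
import os
import re


def _skip_for_debug_max_layers(module_name: str) -> bool:
    """True for *.layers.{>=N}.* modules when MO_DEBUG_MAX_LAYERS=N is set.

    Layers 0..N-1 were bootstrapped and got MSE-applied amaxes, so the
    existing export path is safe for them; later layers were never visited
    by the layerwise loop and must not reach _export_fused_experts.
    """
    limit = os.environ.get("MO_DEBUG_MAX_LAYERS")
    if limit is None:
        return False
    match = re.search(r"\.layers\.(\d+)\.", module_name)
    return match is not None and int(match.group(1)) >= int(limit)
```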
The env-var-gated early-break (model_calib.layerwise_calibrate) and export skip (unified_export_hf._process_quantized_modules) were only needed to bound wall-clock during the cliff-fix smoke test. The bug fix itself is purely about not bringing over glm5.1-tmp's clamps in moe_utils.py — which we already don't. Removing the debug hatches keeps the branch a clean superset of main's production behavior.
What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information