Add active-MoE AutoQuant cost accounting by meenchen · Pull Request #1497 · NVIDIA/Model-Optimizer

meenchen · 2026-05-14T22:31:10Z

What does this PR do?

• Type of change: new feature

Adds an active_moe cost model for auto_quantize effective-bits search. This lets AutoQuant account for routed MoE expert weights by active decode weight traffic instead of total checkpoint weight size, using active_moe_expert_ratio = num_experts_per_tok / num_experts.

The default behavior is unchanged: cost_model="weight" still counts all quantizable weights equally.
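
To make the difference between the two cost models concrete, here is a minimal sketch of how an effective-bits figure could change when routed expert weights are scaled by the active ratio. It is not the ModelOpt implementation: the function names and the `(num_params, bits, is_routed_expert)` layer tuples are hypothetical, and it assumes the same cost weight is applied to both the numerator and the denominator of the average.

```python
def expert_ratio(num_experts_per_tok: int, num_experts: int) -> float:
    """Active ratio as described in the PR: top-k experts over total experts."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    return min(num_experts_per_tok / num_experts, 1.0)


def effective_bits(layers, routed_ratio: float = 1.0) -> float:
    """Cost-weighted average bits per weight (hypothetical helper).

    layers: iterable of (num_params, bits, is_routed_expert) tuples.
    routed_ratio=1.0 counts every weight equally, like cost_model="weight";
    a smaller ratio down-weights routed experts, like cost_model="active_moe".
    """
    weighted_bits = 0.0
    weighted_params = 0.0
    for num_params, bits, is_routed_expert in layers:
        cost_weight = routed_ratio if is_routed_expert else 1.0
        weighted_bits += cost_weight * num_params * bits
        weighted_params += cost_weight * num_params
    return weighted_bits / weighted_params


# Toy model: one dense layer kept at 8 bits, routed experts at 4 bits.
toy_layers = [(1_000, 8, False), (64_000, 4, True)]
weight_cost = effective_bits(toy_layers)
active_cost = effective_bits(toy_layers, routed_ratio=expert_ratio(2, 64))
```

Under these assumptions the dense layer dominates the active-MoE average, so the same recipe reports a higher effective-bits value than it does under the plain weight count.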

Usage

import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 5.0},
    quantization_formats=[
        mtq.NVFP4_DEFAULT_CFG,
        mtq.FP8_DEFAULT_CFG,
    ],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
    cost_model="active_moe",
    # Optional. If omitted, ModelOpt tries to infer this from model.config.
    active_moe_expert_ratio=2 / 64,
)

The HF PTQ example also exposes:

--auto_quantize_cost_model active_moe
--auto_quantize_active_moe_expert_ratio 0.03125
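
The walkthrough for this PR says the CLI checks that the ratio lies in `(0.0, 1.0]` and is only given together with the active-MoE cost model. The following is a rough sketch of what such post-parse validation might look like with `argparse`. Only the two flag names and the `weight` / `active_moe` values come from the PR; `build_parser`, `parse_args`, and the error messages are hypothetical, and the real `hf_ptq.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the two AutoQuant cost flags")
    parser.add_argument(
        "--auto_quantize_cost_model",
        choices=["weight", "active_moe"],
        default="weight",  # the PR states that the default behavior is unchanged
    )
    parser.add_argument(
        "--auto_quantize_active_moe_expert_ratio",
        type=float,
        default=None,  # None means "not provided on the command line"
    )
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    ratio = args.auto_quantize_active_moe_expert_ratio
    if ratio is not None:
        # A ratio only makes sense for the active-MoE cost model.
        if args.auto_quantize_cost_model != "active_moe":
            parser.error("expert ratio requires --auto_quantize_cost_model active_moe")
        # Accept the half-open interval (0.0, 1.0].
        if not 0.0 < ratio <= 1.0:
            parser.error("expert ratio must be in (0.0, 1.0]")
    return args
```

`parser.error` prints a usage message and exits with status 2, so an invalid combination fails before any model is loaded.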

Testing

python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'active_moe or quant_recipe_hparam_cost_weight'
python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'not data_parallel_auto_quantize'

Results:

4 passed
58 passed, 1 deselected

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Added active-MoE cost model option for auto-quantization with configurable expert ratio; API and CLI accept cost_model and active_moe_expert_ratio
- Unified auto-quantize supports new quant format w4a16_nvfp4
Bug Fixes
- Ensure labels are moved to the logits device for base models without an lm_head
- CLI enforces valid expert-ratio range and requires active-MoE mode when a ratio is provided
Tests
- Added unit tests for active-MoE behavior, cost-weighting, ratio handling, and search budget selection

copy-pr-bot · 2026-05-14T22:31:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-14T22:31:23Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds an active-MoE cost model option to auto-quantization: detects routed MoE modules, applies per-module cost weighting using an expert-activity ratio, threads cost_model and active_moe_expert_ratio through the searcher and API/CLI, and adds unit tests covering behavior and searcher selection.

Changes

Active-MoE Cost Model Support

| Layer / File(s) | Summary |
| --- | --- |
| MoE cost model foundation `modelopt/torch/quantization/algorithms.py` | Introduces `_is_routed_moe_module_name()` and `_get_active_moe_cost_weight()` utilities for MoE detection and scaling. Extends `QuantRecipeHparam` with `cost_weight` parameter for per-module cost scaling and updates `get_cost()` to accept optional cost weight override. Adds `cost_model` and `active_moe_expert_ratio` to search configuration defaults and validation. |
| Searcher cost computation and integration `modelopt/torch/quantization/algorithms.py` | Updates hparam insertion to compute per-group `cost_weight` from routed MoE modules and pass it into `QuantRecipeHparam`. Extends candidate stats initialization to track both constraint costs and active costs with `cost_weight` recorded. Modifies `before_search` to validate and set cost model fields, and `run_search` to branch weight-size computation based on `cost_model` using new helpers `_get_total_weight_size_from_named_modules()` and `_get_search_lower_bounds()`. Updates LP lower-bound retry logic and best-recipe resolution to prefer persisted cost denominator. |
| User API and CLI integration `modelopt/torch/quantization/model_quant.py`, `examples/llm_ptq/hf_ptq.py` | Extends `auto_quantize()` with `cost_model` and `active_moe_expert_ratio` parameters, adds internal helpers to infer ratio from model config attributes, and validates inputs. Adds CLI arguments `--auto_quantize_cost_model` and `--auto_quantize_active_moe_expert_ratio` with post-parse validation ensuring ratio is in `(0.0, 1.0]` and only set when `cost_model` is `"active_moe"`. Parameters propagate through to searcher configuration. |
| Tests for active-MoE cost model `tests/unit/torch/quantization/test_autoquant.py` | Adds `_AutoQuantMoeModel` fixture with routed expert and shared expert submodules. Validates `QuantRecipeHparam.get_cost()` scaling with `cost_weight` across recipes. Tests `auto_quantize()` with `cost_model="active_moe"` verifying expert/shared-expert cost-weight assignments (0.25 and 1.0 respectively) and active-cost tracking in search history. Verifies `AutoQuantizeGradientSearcher` selects budget-lower-bound recipes under MoE cost scenarios. |
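
The table above mentions name-based detection of routed MoE modules and a per-module cost weight (0.25 for routed experts and 1.0 for shared experts in the tests). As an illustration only, a name-based rule could look like the sketch below. The pattern, the `shared_expert` exclusion, and both function names are assumptions made for this example; they are not the bodies of `_is_routed_moe_module_name()` or `_get_active_moe_cost_weight()`, which are not shown on this page.

```python
import re

# Hypothetical rule: a path component named exactly "experts" marks routed experts.
_ROUTED_EXPERTS_COMPONENT = re.compile(r"(^|\.)experts(\.|$)")


def looks_like_routed_expert(module_name: str) -> bool:
    """Guess from the module name whether it belongs to a routed expert."""
    if "shared_expert" in module_name:
        # Shared experts run for every token, so they are not treated as routed.
        return False
    return bool(_ROUTED_EXPERTS_COMPONENT.search(module_name))


def cost_weight_for(module_name: str, active_moe_expert_ratio: float) -> float:
    """Routed experts are scaled by the active ratio; everything else counts fully."""
    if looks_like_routed_expert(module_name):
        return active_moe_expert_ratio
    return 1.0


weights = {
    name: cost_weight_for(name, 0.25)
    for name in (
        "model.layers.0.mlp.experts.3.up_proj",
        "model.layers.0.mlp.shared_expert.up_proj",
        "model.layers.0.self_attn.q_proj",
    )
}
```

Matching on names is fragile across architectures, which is presumably why an explicit `active_moe_expert_ratio` argument is also exposed.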

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

ajrasane
cjluo-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.72% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding active-MoE cost accounting to the AutoQuant system, which is the central feature across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns detected. All modified files pass checks: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code, no eval/exec, no nosec comments, no unsafe dependencies. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

github-actions · 2026-05-14T22:35:19Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1497/
Built to branch `gh-pages` at 2026-05-18 19:38 UTC. Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/model_quant.py`:
- Around line 300-315: _infer_active_moe_expert_ratio currently calls
_get_first_numeric_config_attr twice which can pick values from two different
config objects; instead iterate the same configs (use _iter_model_configs) and
for each config check both attribute groups (_ACTIVE_MOE_TOP_K_ATTRS and
_ACTIVE_MOE_NUM_EXPERTS_ATTRS) on that single config object, ensure both are
numeric and num_experts > 0, then return min(num_active_experts / num_experts,
1.0); if no single config contains both numeric values return None.
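
The fix suggested in this comment can be sketched as follows. This is a hedged illustration, not the code in `model_quant.py`: the attribute tuples stand in for `_ACTIVE_MOE_TOP_K_ATTRS` and `_ACTIVE_MOE_NUM_EXPERTS_ATTRS` (whose real contents are not shown here), the function names are hypothetical, and the caller is assumed to pass an iterable of config objects such as the one `_iter_model_configs` would yield.

```python
import numbers
from types import SimpleNamespace

# Assumed attribute names; the real constants in the PR may list different ones.
TOP_K_ATTRS = ("num_experts_per_tok",)
NUM_EXPERTS_ATTRS = ("num_experts", "num_local_experts", "n_routed_experts")


def first_numeric_attr(config, names):
    """Return the first attribute in `names` that holds a real number, else None."""
    for name in names:
        value = getattr(config, name, None)
        if isinstance(value, numbers.Real) and not isinstance(value, bool):
            return value
    return None


def infer_expert_ratio(configs):
    """Read top-k and expert count from the SAME config object, or return None."""
    for config in configs:
        num_active = first_numeric_attr(config, TOP_K_ATTRS)
        num_experts = first_numeric_attr(config, NUM_EXPERTS_ATTRS)
        if num_active is not None and num_experts is not None and num_experts > 0:
            return min(num_active / num_experts, 1.0)
    return None


# The outer config only has top-k; the nested text config has both values.
outer = SimpleNamespace(num_experts_per_tok=2)
text = SimpleNamespace(num_experts_per_tok=8, num_experts=64)
ratio = infer_expert_ratio([outer, text])
```

Pairing values per config avoids mixing a top-k from one config with an expert count from another, which is the failure mode the comment describes.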

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5b320520-fd7c-4c67-b182-efe01e721d39

📥 Commits

Reviewing files that changed from the base of the PR and between e27f76f and 9eb1ee0.

📒 Files selected for processing (4)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
tests/unit/torch/quantization/test_autoquant.py

codecov · 2026-05-15T23:04:17Z

Codecov Report

❌ Patch coverage is 95.48872% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.99%. Comparing base (f5650bd) to head (6f791d1).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/model_quant.py | 93.22% | 4 Missing ⚠️ |
| modelopt/torch/quantization/algorithms.py | 97.29% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1497      +/-   ##
==========================================
+ Coverage   76.95%   76.99%   +0.04%     
==========================================
  Files         474      474              
  Lines       51503    51625     +122     
==========================================
+ Hits        39632    39749     +117     
- Misses      11871    11876       +5

| Flag | Coverage Δ |
| --- | --- |
| unit | `52.72% <95.48%> (+0.09%)` ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

meenchen requested review from a team as code owners May 14, 2026 22:31

meenchen requested a review from Edwardf0t1 May 14, 2026 22:31

coderabbitai Bot reviewed May 14, 2026

Comment thread modelopt/torch/quantization/model_quant.py Outdated

realAsma reviewed May 15, 2026

Comment thread modelopt/torch/quantization/model_quant.py Outdated

meenchen force-pushed the weimingc/autoquant_edge branch 4 times, most recently from b721f1d to f681009 Compare May 15, 2026 22:50

Add active-MoE AutoQuant cost accounting

6f791d1

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

meenchen force-pushed the weimingc/autoquant_edge branch from f681009 to 6f791d1 Compare May 18, 2026 19:34

meenchen wants to merge 1 commit into main from weimingc/autoquant_edge

2 participants
