Add active-MoE AutoQuant cost accounting#1497

Open
meenchen wants to merge 1 commit into main from weimingc/autoquant_edge
@meenchen meenchen commented May 14, 2026

What does this PR do?

• Type of change: new feature

Adds an active_moe cost model for the auto_quantize effective-bits search. It lets AutoQuant cost routed MoE expert weights by the weight traffic active during decode rather than by total checkpoint weight size, scaling them by active_moe_expert_ratio = num_experts_per_tok / num_experts.

The default behavior is unchanged: cost_model="weight" still counts all quantizable weights equally.
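
To make the difference between the two cost models concrete, here is a small worked sketch. It assumes effective bits is a cost-weighted average of per-module bit widths, which is an illustrative definition and not necessarily the exact formula ModelOpt uses. The `Module` dataclass, the `effective_bits` helper, and the parameter counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    num_params: int      # number of quantizable weight elements
    bits: float          # bits per weight under the chosen format (illustrative)
    routed_expert: bool  # True for routed MoE expert weights


def effective_bits(modules, cost_model="weight", active_moe_expert_ratio=1.0):
    """Cost-weighted average bit width (assumed definition, for illustration only)."""
    weighted_bits = 0.0
    weighted_params = 0.0
    for m in modules:
        # Only routed experts are discounted, and only under "active_moe".
        discounted = cost_model == "active_moe" and m.routed_expert
        w = active_moe_expert_ratio if discounted else 1.0
        weighted_bits += w * m.num_params * m.bits
        weighted_params += w * m.num_params
    return weighted_bits / weighted_params


modules = [
    Module("attention", 1_000, 8.0, routed_expert=False),
    Module("experts", 64_000, 4.0, routed_expert=True),
]
by_weight = effective_bits(modules, "weight")
by_active = effective_bits(modules, "active_moe", active_moe_expert_ratio=2 / 64)
```

With these made-up numbers, the 4-bit experts dominate under `"weight"` and pull the average close to 4 bits, while under `"active_moe"` they count for only 2/64 of their size, so the 8-bit attention weights carry much more of the budget.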

Usage

import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 5.0},
    quantization_formats=[
        mtq.NVFP4_DEFAULT_CFG,
        mtq.FP8_DEFAULT_CFG,
    ],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
    cost_model="active_moe",
    # Optional. If omitted, ModelOpt tries to infer this from model.config.
    active_moe_expert_ratio=2 / 64,
)

The HF PTQ example also exposes:

--auto_quantize_cost_model active_moe
--auto_quantize_active_moe_expert_ratio 0.03125
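
The PR summary describes the CLI as requiring the ratio to lie in (0.0, 1.0] and to be given only together with the active_moe cost model. A minimal sketch of such post-parse validation follows. It is not the code in hf_ptq.py: the `parse_args` wrapper, the `choices` list, and the error messages are assumptions, and only the two flag names come from the PR.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--auto_quantize_cost_model",
        choices=["weight", "active_moe"],  # assumed set of choices
        default="weight",
    )
    parser.add_argument(
        "--auto_quantize_active_moe_expert_ratio", type=float, default=None
    )
    args = parser.parse_args(argv)

    # Post-parse validation: the ratio needs active_moe and must be in (0.0, 1.0].
    ratio = args.auto_quantize_active_moe_expert_ratio
    if ratio is not None:
        if args.auto_quantize_cost_model != "active_moe":
            parser.error("an expert ratio requires --auto_quantize_cost_model active_moe")
        if not 0.0 < ratio <= 1.0:
            parser.error("the expert ratio must be in (0.0, 1.0]")
    return args
```

`parser.error` prints a usage message and exits with a non-zero status, so an invalid combination fails before any model is loaded.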

Testing

python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'active_moe or quant_recipe_hparam_cost_weight'
python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'not data_parallel_auto_quantize'

Results:

  • 4 passed
  • 58 passed, 1 deselected

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added active-MoE cost model option for auto-quantization with configurable expert ratio; API and CLI accept cost_model and active_moe_expert_ratio
    • Unified auto-quantize supports new quant format w4a16_nvfp4
  • Bug Fixes

    • Ensure labels are moved to the logits device for base models without an lm_head
    • CLI enforces valid expert-ratio range and requires active-MoE mode when a ratio is provided
  • Tests

    • Added unit tests for active-MoE behavior, cost-weighting, ratio handling, and search budget selection

@meenchen meenchen requested review from a team as code owners May 14, 2026 22:31
@meenchen meenchen requested a review from Edwardf0t1 May 14, 2026 22:31
copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented May 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds an active-MoE cost model option to auto-quantization: detects routed MoE modules, applies per-module cost weighting using an expert-activity ratio, threads cost_model and active_moe_expert_ratio through the searcher and API/CLI, and adds unit tests covering behavior and searcher selection.

Changes

Active-MoE Cost Model Support

| Layer / File(s) | Summary |
|---|---|
| MoE cost model foundation (modelopt/torch/quantization/algorithms.py) | Introduces _is_routed_moe_module_name() and _get_active_moe_cost_weight() utilities for MoE detection and scaling. Extends QuantRecipeHparam with cost_weight parameter for per-module cost scaling and updates get_cost() to accept optional cost weight override. Adds cost_model and active_moe_expert_ratio to search configuration defaults and validation. |
| Searcher cost computation and integration (modelopt/torch/quantization/algorithms.py) | Updates hparam insertion to compute per-group cost_weight from routed MoE modules and pass it into QuantRecipeHparam. Extends candidate stats initialization to track both constraint costs and active costs with cost_weight recorded. Modifies before_search to validate and set cost model fields, and run_search to branch weight-size computation based on cost_model using new helpers _get_total_weight_size_from_named_modules() and _get_search_lower_bounds(). Updates LP lower-bound retry logic and best-recipe resolution to prefer persisted cost denominator. |
| User API and CLI integration (modelopt/torch/quantization/model_quant.py, examples/llm_ptq/hf_ptq.py) | Extends auto_quantize() with cost_model and active_moe_expert_ratio parameters, adds internal helpers to infer ratio from model config attributes, and validates inputs. Adds CLI arguments --auto_quantize_cost_model and --auto_quantize_active_moe_expert_ratio with post-parse validation ensuring ratio is in (0.0, 1.0] and only set when cost_model is "active_moe". Parameters propagate through to searcher configuration. |
| Tests for active-MoE cost model (tests/unit/torch/quantization/test_autoquant.py) | Adds _AutoQuantMoeModel fixture with routed expert and shared expert submodules. Validates QuantRecipeHparam.get_cost() scaling with cost_weight across recipes. Tests auto_quantize() with cost_model="active_moe" verifying expert/shared-expert cost-weight assignments (0.25 and 1.0 respectively) and active-cost tracking in search history. Verifies AutoQuantizeGradientSearcher selects budget-lower-bound recipes under MoE cost scenarios. |
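
The per-module cost weighting summarized above can be sketched as follows. This is not the PR's `_is_routed_moe_module_name()` or `_get_active_moe_cost_weight()`; the helper names, the regular expression, and the module naming convention (routed experts under `experts.<index>`, shared experts containing `shared_expert`) are assumptions chosen for illustration. The 0.25 ratio mirrors the value mentioned for the unit tests.

```python
import re

# Hypothetical naming convention: routed experts live under "experts.<index>",
# while shared experts have "shared_expert" somewhere in their name.
_ROUTED_EXPERT_PATTERN = re.compile(r"(^|\.)experts\.\d+(\.|$)")


def is_routed_expert_name(name: str) -> bool:
    """Return True if the module name looks like a routed (not shared) expert."""
    return "shared_expert" not in name and bool(_ROUTED_EXPERT_PATTERN.search(name))


def cost_weight(name: str, cost_model: str, active_moe_expert_ratio: float) -> float:
    """Discount routed experts under "active_moe"; everything else costs 1.0."""
    if cost_model == "active_moe" and is_routed_expert_name(name):
        return active_moe_expert_ratio
    return 1.0


names = [
    "layers.0.mlp.experts.3.up_proj",
    "layers.0.mlp.shared_expert.up_proj",
    "layers.0.self_attn.q_proj",
]
weights = {n: cost_weight(n, "active_moe", 0.25) for n in names}
```

Shared experts keep a weight of 1.0 because they run for every token, so their full weight traffic is incurred on each decode step.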

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • ajrasane
  • cjluo-nv
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.72% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding active-MoE cost accounting to the AutoQuant system, which is the central feature across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns detected. All modified files pass checks: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code, no eval/exec, no nosec comments, no unsafe dependencies. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot commented May 14, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1497/

Built to branch gh-pages at 2026-05-18 19:38 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/model_quant.py`:
- Around line 300-315: _infer_active_moe_expert_ratio currently calls
_get_first_numeric_config_attr twice which can pick values from two different
config objects; instead iterate the same configs (use _iter_model_configs) and
for each config check both attribute groups (_ACTIVE_MOE_TOP_K_ATTRS and
_ACTIVE_MOE_NUM_EXPERTS_ATTRS) on that single config object, ensure both are
numeric and num_experts > 0, then return min(num_active_experts / num_experts,
1.0); if no single config contains both numeric values return None.
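
The fix requested in this comment could look roughly like the sketch below, which reads both values from the same config object before computing the ratio. It is not the code in model_quant.py: `iter_configs`, `_first_numeric`, and the attribute tuples are hypothetical stand-ins for `_iter_model_configs`, `_get_first_numeric_config_attr`, `_ACTIVE_MOE_TOP_K_ATTRS`, and `_ACTIVE_MOE_NUM_EXPERTS_ATTRS`, whose real contents are not shown here. Only `num_experts_per_tok` and `num_experts` appear in the PR description; the other attribute names are assumptions.

```python
from types import SimpleNamespace

# Candidate attribute names (partly assumed, see the note above).
TOP_K_ATTRS = ("num_experts_per_tok", "top_k")
NUM_EXPERTS_ATTRS = ("num_experts", "num_local_experts", "n_routed_experts")


def _first_numeric(config, attrs):
    """Return the first numeric attribute found on this one config, else None."""
    for attr in attrs:
        value = getattr(config, attr, None)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    return None


def iter_configs(model):
    """Yield model.config and a nested text_config if present (hypothetical layout)."""
    config = getattr(model, "config", None)
    if config is not None:
        yield config
        text_config = getattr(config, "text_config", None)
        if text_config is not None:
            yield text_config


def infer_active_moe_expert_ratio(model):
    for config in iter_configs(model):
        # Both values must come from the same config object.
        top_k = _first_numeric(config, TOP_K_ATTRS)
        num_experts = _first_numeric(config, NUM_EXPERTS_ATTRS)
        if top_k is not None and num_experts is not None and num_experts > 0:
            return min(top_k / num_experts, 1.0)
    return None


moe = SimpleNamespace(config=SimpleNamespace(num_experts_per_tok=2, num_experts=64))
ratio = infer_active_moe_expert_ratio(moe)
```

Checking both attribute groups per config avoids pairing a top-k value from one config with an expert count from another, which is the mismatch the comment warns about.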
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5b320520-fd7c-4c67-b182-efe01e721d39

📥 Commits

Reviewing files that changed from the base of the PR and between e27f76f and 9eb1ee0.

📒 Files selected for processing (4)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/algorithms.py
  • modelopt/torch/quantization/model_quant.py
  • tests/unit/torch/quantization/test_autoquant.py

Comment thread modelopt/torch/quantization/model_quant.py Outdated
Comment thread modelopt/torch/quantization/model_quant.py Outdated
@meenchen meenchen force-pushed the weimingc/autoquant_edge branch 4 times, most recently from b721f1d to f681009 May 15, 2026 22:50
codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 95.48872% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.99%. Comparing base (f5650bd) to head (6f791d1).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/torch/quantization/model_quant.py | 93.22% | 4 Missing ⚠️ |
| modelopt/torch/quantization/algorithms.py | 97.29% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1497      +/-   ##
==========================================
+ Coverage   76.95%   76.99%   +0.04%     
==========================================
  Files         474      474              
  Lines       51503    51625     +122     
==========================================
+ Hits        39632    39749     +117     
- Misses      11871    11876       +5     
| Flag | Coverage Δ |
|---|---|
| unit | 52.72% <95.48%> (+0.09%) ⬆️ |


Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/autoquant_edge branch from f681009 to 6f791d1 May 18, 2026 19:34
2 participants