
Move model-specific PTQ overrides from llm_ptq to YAML recipes #1506

Draft
shengliangxu wants to merge 2 commits into shengliangx/all-yaml-configs from shengliangx/model-specific-ptq-recipes

Conversation

@shengliangxu
Collaborator

What does this PR do?

Type of change: new feature

Replaces the hardcoded model-type branches in examples/llm_ptq/ with opt-in declarative recipes under modelopt_recipes/huggingface/<model_type>/ptq/. Users who want the model-specific tweaks now pass --recipe huggingface/<model_type>/ptq/<recipe>; users on the plain --qformat path get the generic numerics.

What moved out of Python (examples/llm_ptq/example_utils.py::build_quant_cfg and examples/llm_ptq/hf_ptq.py::mono_quantize):

  • gemma / mpt w4a8_awq → awq_lite with alpha_step=1 (coarser alpha search to avoid TRT-LLM overflow); see the sketch after this list.
  • gemma int8_sq → SmoothQuant alpha=0.5 (default 1.0 regresses Gemma 7B).
  • phi4mm → disable *speech*, *audio*, *image*, *vision* (quantize only the language model).
  • Nemotron VL → disable *vision*, *image*, *radio*, *visual*, *encoder*, *model_encoder* (quantize only the decoder).
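
For illustration, a minimal sketch of what the gemma AWQ recipe could encode. The schema and key names (quant_cfg, algorithm, kv_cache) are assumptions about the recipe format, not the actual file contents; only alpha_step=1, the awq_lite method, and the FP8 KV-cache cast come from this PR's description:

# gemma/ptq/w4a8_awq-kv_fp8_cast.yaml (hypothetical sketch; the real schema may differ)
quant_cfg: W4A8_AWQ_BETA_CFG   # assumed name for the base w4a8_awq config
algorithm:
  method: awq_lite             # awq_lite search, per this PR
  alpha_step: 1                # coarser alpha search to avoid TRT-LLM overflow
kv_cache: kv_fp8_cast          # assumed slot for the FP8 KV-cache cast unit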

What stayed in Python:

  • MTP dynamic layer exclusion in hf_ptq.py (depends on runtime-detected layer indices).
  • is_nemotron_vl(full_model) detection itself, which still drives the VLM calibration loop and the post-quantize full_model update — only the quant_cfg tweak it triggered was migrated.

Recipe layout (modelopt_recipes/huggingface/):

gemma/ptq/{w4a8_awq,int8_sq}-kv_fp8_cast.yaml
mpt/ptq/w4a8_awq-kv_fp8_cast.yaml
phi4mm/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml
nemotron_vl/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml

All recipes ship with FP8 KV-cache cast (kv_fp8_cast). For phi4mm and nemotron_vl, disabled_quantizers.yaml is a merged unit that includes the standard default_disabled_quantizers exclusions plus the model-specific ones, so each recipe imports a single disabled-quantizer slot instead of layering two. Each ptq/ folder has a README.md describing exactly what is model-specific.
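
To make the merged unit concrete, here is a hedged sketch of what nemotron_vl/ptq/disabled_quantizers.yaml might contain. The top-level key and the default exclusion are assumptions; the model-specific wildcard patterns are the ones listed above:

# nemotron_vl/ptq/disabled_quantizers.yaml (hypothetical sketch; the real schema may differ)
disabled_quantizers:
  # standard default_disabled_quantizers exclusions, inlined into the merged unit
  - "*lm_head*"        # assumed example of a default exclusion
  # model-specific: quantize only the decoder
  - "*vision*"
  - "*image*"
  - "*radio*"
  - "*visual*"
  - "*encoder*"
  - "*model_encoder*"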

Usage

# Gemma W4A8 AWQ with the Gemma-specific algorithm tuning + FP8 KV cache:
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path google/gemma-7b \
  --recipe huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast \
  --export_path ./out

# Nemotron VL with vision branches excluded automatically:
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path nvidia/<nemotron-vl-model> \
  --recipe huggingface/nemotron_vl/ptq/nvfp4-kv_fp8_cast \
  --export_path ./out

Testing

  • Pre-commit recipe validator (tools/precommit/check_modelopt_recipes.py) loads every new recipe via load_recipe() — passes for all 7 new YAMLs.
  • yamlfmt + markdownlint + bandit + license-insertion hooks all pass.
  • No tests reference the removed build_quant_cfg(qformat, ..., model_type, ...) signature; the only call sites (hf_ptq.py, multinode_ptq.py) were updated to the new 2/3-arg form.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ❌ — users who relied on automatic model-specific quant_cfg tweaks via --qformat (gemma/mpt AWQ, gemma SmoothQuant, phi4mm exclusions, Nemotron VL exclusions) now need to pass --recipe huggingface/<model_type>/ptq/<recipe> to get them. The flag itself is unchanged; only the implicit behavior was removed.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌ — relies on the existing pre-commit recipe validator that loads each new YAML.
  • Did you update Changelog?: ❌ — please flag if needed.
  • Did you get Claude approval on this PR?: ❌

Additional Information

Merging into shengliangx/all-yaml-configs. Built on top of fc2fd4ad3 ("set paths stright"); that commit only moves Step3.5-Flash into huggingface/step3p5/ and adds the huggingface/README.md, so the migration commit is the substantive change.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Replace the hardcoded model-type branches in examples/llm_ptq (gemma/mpt
AWQ alpha tuning, gemma SmoothQuant alpha, phi4mm exclusions, Nemotron VL
exclusions) with opt-in declarative recipes under
modelopt_recipes/huggingface/<model_type>/ptq/. Users select them with
--recipe huggingface/<model_type>/ptq/<recipe>.

- Per-model recipes ship with FP8 KV-cache cast (kv_fp8_cast) and the
  algorithm/numerics each model needs.
- phi4mm and nemotron_vl each include a merged disabled_quantizers.yaml
  unit so recipes import a single disabled-quantizer slot instead of
  layering default + model-specific exclusions.
- Each ptq/ folder has a README describing what is model-specific.
- Drop now-unused qformat/model_type parameters from build_quant_cfg and
  the Nemotron VL append block in mono_quantize.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a59d9525-2b9a-4143-8fa9-daafc096cac3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.
