Move model-specific PTQ overrides from llm_ptq to YAML recipes #1506
Draft

shengliangxu wants to merge 2 commits into shengliangx/all-yaml-configs
Conversation
Replace the hardcoded model-type branches in examples/llm_ptq (gemma/mpt AWQ alpha tuning, gemma SmoothQuant alpha, phi4mm exclusions, Nemotron VL exclusions) with opt-in declarative recipes under modelopt_recipes/huggingface/<model_type>/ptq/. Users select them with --recipe huggingface/<model_type>/ptq/<recipe>.

- Per-model recipes ship with FP8 KV-cache cast (kv_fp8_cast) and the algorithm/numerics each model needs.
- phi4mm and nemotron_vl each include a merged disabled_quantizers.yaml unit so recipes import a single disabled-quantizer slot instead of layering default + model-specific exclusions.
- Each ptq/ folder has a README describing what is model-specific.
- Drop now-unused qformat/model_type parameters from build_quant_cfg and the Nemotron VL append block in mono_quantize.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
What does this PR do?
Type of change: new feature
Replaces the hardcoded model-type branches in `examples/llm_ptq/` with opt-in declarative recipes under `modelopt_recipes/huggingface/<model_type>/ptq/`. Users who want the model-specific tweaks now pass `--recipe huggingface/<model_type>/ptq/<recipe>`; users on the plain `--qformat` path get the generic numerics.

What moved out of Python (`examples/llm_ptq/example_utils.py::build_quant_cfg` and `examples/llm_ptq/hf_ptq.py::mono_quantize`), with the two algorithm overrides sketched in code below the list:

- gemma/mpt: `w4a8_awq` → `awq_lite` with `alpha_step=1` (coarser search to avoid TRT-LLM overflow).
- gemma: `int8_sq` → SmoothQuant `alpha=0.5` (the default `1.0` regresses Gemma 7B).
- phi4mm: disable quantizers matching `*speech*`, `*audio*`, `*image*`, `*vision*` (quantize only the language model).
- nemotron_vl: disable quantizers matching `*vision*`, `*image*`, `*radio*`, `*visual*`, `*encoder*`, `*model_encoder*` (quantize only the decoder).
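For concreteness, the two algorithm overrides amount to the following tweaks on ModelOpt config dicts. This is a minimal sketch assuming the stock `modelopt.torch.quantization` config names; the deleted branch code itself is not reproduced here, and the recipes now encode the same settings declaratively:

```python
# Sketch of the algorithm overrides the deleted llm_ptq branches applied
# (now carried by the gemma/mpt recipes). Config-dict names assume the
# stock modelopt.torch.quantization API.
import copy

import modelopt.torch.quantization as mtq

# gemma/mpt + w4a8_awq: coarser awq_lite alpha search to avoid TRT-LLM overflow.
w4a8_cfg = copy.deepcopy(mtq.W4A8_AWQ_BETA_CFG)
w4a8_cfg["algorithm"] = {"method": "awq_lite", "alpha_step": 1}

# gemma + int8_sq: SmoothQuant alpha pinned to 0.5 (the default 1.0 regresses Gemma 7B).
int8_sq_cfg = copy.deepcopy(mtq.INT8_SMOOTHQUANT_CFG)
int8_sq_cfg["algorithm"] = {"method": "smoothquant", "alpha": 0.5}
```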
What stayed in Python:

- `hf_ptq.py` logic that depends on runtime-detected layer indices.
- The `is_nemotron_vl(full_model)` detection itself, which still drives the VLM calibration loop and the post-quantize `full_model` update; only the quant_cfg tweak it triggered was migrated.

Recipe layout (`modelopt_recipes/huggingface/`):
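The layout listing itself did not survive extraction; below is a plausible sketch of the tree, with the per-model folders taken from the models named above (exact recipe file names are not given in this description and are omitted):

```
modelopt_recipes/huggingface/
├── gemma/ptq/         # awq_lite alpha_step + SmoothQuant alpha recipes, README.md
├── mpt/ptq/           # awq_lite alpha_step recipe, README.md
├── phi4mm/ptq/        # recipes + merged disabled_quantizers.yaml, README.md
└── nemotron_vl/ptq/   # recipes + merged disabled_quantizers.yaml, README.md
```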
All recipes ship with FP8 KV-cache cast (`kv_fp8_cast`). For phi4mm and nemotron_vl, `disabled_quantizers.yaml` is a merged unit that includes the standard `default_disabled_quantizers` exclusions plus the model-specific ones, so each recipe imports a single disabled-quantizer slot instead of layering two. Each `ptq/` folder has a `README.md` describing exactly what is model-specific.
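In ModelOpt quant_cfg terms, the model-specific part of each merged unit boils down to wildcard pattern entries with `{"enable": False}`. A sketch assuming the usual pattern-key convention; the recipe's unit/slot names and the contents of `default_disabled_quantizers` are not reproduced here:

```python
# Sketch: the model-specific exclusions the merged disabled_quantizers.yaml
# units carry, expressed as ModelOpt quant_cfg wildcard entries. Patterns are
# matched against module names; {"enable": False} disables those quantizers.
PHI4MM_EXCLUDED = ("*speech*", "*audio*", "*image*", "*vision*")
NEMOTRON_VL_EXCLUDED = (
    "*vision*", "*image*", "*radio*", "*visual*", "*encoder*", "*model_encoder*",
)

def disable_quantizers(cfg: dict, patterns: tuple[str, ...]) -> dict:
    """Disable every quantizer whose module name matches one of the patterns."""
    for pattern in patterns:
        cfg["quant_cfg"][pattern] = {"enable": False}
    return cfg
```

The YAML units remain the source of truth; this only makes their semantics concrete.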
Usage

Pass the recipe path on top of the usual `hf_ptq.py` invocation, e.g. `--qformat int8_sq --recipe huggingface/gemma/ptq/<recipe>`; each `ptq/` folder's `README.md` describes what the recipe changes.

Testing
- The pre-commit recipe check (`tools/precommit/check_modelopt_recipes.py`) loads every new recipe via `load_recipe()` (sketched below); passes for all 7 new YAMLs.
- The `yamlfmt` + `markdownlint` + `bandit` + license-insertion hooks all pass.
- The now-unused `qformat`/`model_type` parameters were dropped from the `build_quant_cfg(qformat, ..., model_type, ...)` signature; the only call sites (`hf_ptq.py`, `multinode_ptq.py`) were updated to the new 2/3-arg form.
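The check presumably amounts to a loop like the following. This is a sketch: only the script path and the `load_recipe()` name appear in this PR, so the import location, signature, and glob root are assumptions:

```python
# Sketch of tools/precommit/check_modelopt_recipes.py's core loop: every
# recipe YAML under modelopt_recipes/ must load cleanly via load_recipe().
# The import path and signature below are assumptions, not the PR's code.
from pathlib import Path

from modelopt_recipes import load_recipe  # assumed import location

def check_all_recipes(root: str = "modelopt_recipes") -> None:
    for path in sorted(Path(root).rglob("*.yaml")):
        load_recipe(str(path))  # raises on a malformed or unloadable recipe
```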
Before your PR is "Ready for review"

- Backward-compatibility note: users who relied on the implicit model-specific behavior of `--qformat` (gemma/mpt AWQ, gemma SmoothQuant, phi4mm exclusions, Nemotron VL exclusions) now need to pass `--recipe huggingface/<model_type>/ptq/<recipe>` to get them. The flag itself is unchanged; only the implicit behavior was removed.
- `CONTRIBUTING.md`: N/A

Additional Information
Merging into `shengliangx/all-yaml-configs`. Built on top of `fc2fd4ad3` ("set paths stright"); that commit only moves Step3.5-Flash into `huggingface/step3p5/` and adds the `huggingface/README.md`, so the migration commit is the substantive change.