Add DATASET_COMBOS for grouped calibration datasets#1508
Conversation
Lets a single `--dataset` token fan out to several entries in `SUPPORTED_DATASET_CONFIG`, with that entry's `num_samples` split evenly across the members. Initial combos:

- `default`: cnn_dailymail + nemotron-post-training-dataset-v2. `hf_ptq.py` now uses this when `--dataset` is omitted, replacing a hardcoded two-element list. The `--calib_size` default is bumped 512 -> 1024 so the per-dataset sample count after the even split matches the prior behavior.
- `nemotron-sft-mix`: the seven nvidia/Nemotron-* SFT datasets registered in #1498.

`get_supported_datasets()` now lists combo names so they appear in `--dataset` `--help`. Expansion happens inside `get_dataset_dataloader` after list normalization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
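The fan-out described above can be sketched as follows. This is a hypothetical illustration, not the actual modelopt implementation: the `expand_combos` helper and the parallel name/count lists are assumptions; only the `DATASET_COMBOS` mapping and the even-split behavior come from the PR description.

```python
# Hypothetical sketch of the combo fan-out. Only the "default" combo's members
# are known from the PR; the nemotron-sft-mix members are omitted here.
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}


def expand_combos(datasets: list[str], num_samples: list[int]):
    """Replace each combo token with its members, splitting samples evenly."""
    expanded_names: list[str] = []
    expanded_counts: list[int] = []
    for name, count in zip(datasets, num_samples):
        members = DATASET_COMBOS.get(name)
        if members is None:
            # Plain dataset entries pass through unchanged.
            expanded_names.append(name)
            expanded_counts.append(count)
            continue
        base, remainder = divmod(count, len(members))
        for i, member in enumerate(members):
            expanded_names.append(member)
            # Earlier members absorb the remainder so the total is preserved.
            expanded_counts.append(base + (1 if i < remainder else 0))
    return expanded_names, expanded_counts
```

With `--calib_size 1024`, the `default` combo would expand to 512 samples for each of its two members, matching the prior hardcoded behavior.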
📝 Walkthrough: This PR introduces a dataset combo mechanism that allows a single `--dataset` token to expand into multiple calibration datasets.

Changes: Dataset Combos Expansion
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (1)
198-212: ⚡ Quick win: Add import-time validation for combo integrity.
Please validate combo definitions early (non-empty members, no name collisions with `SUPPORTED_DATASET_CONFIG`, and all members registered). Without this, typos can silently fall through to auto-detected HF loading and produce hard-to-debug behavior.

🔧 Proposed guardrails:
```diff
 DATASET_COMBOS: dict[str, list[str]] = {
 @@
 }
+
+_dataset_keys = set(SUPPORTED_DATASET_CONFIG)
+_combo_keys = set(DATASET_COMBOS)
+overlap = _dataset_keys & _combo_keys
+if overlap:
+    raise ValueError(f"DATASET_COMBOS keys must not overlap dataset keys: {sorted(overlap)}")
+
+for combo_name, members in DATASET_COMBOS.items():
+    if not members:
+        raise ValueError(f"DATASET_COMBOS['{combo_name}'] must contain at least one dataset.")
+    unknown = [m for m in members if m not in _dataset_keys]
+    if unknown:
+        raise ValueError(
+            f"DATASET_COMBOS['{combo_name}'] contains unknown datasets: {unknown}"
+        )
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/dataset_utils.py` around lines 198-212, DATASET_COMBOS needs import-time validation to avoid silent typos and collisions: add a guard that runs when the module loads to (1) ensure each combo in DATASET_COMBOS has a non-empty list of members, (2) ensure no combo name conflicts with SUPPORTED_DATASET_CONFIG keys, and (3) ensure every member string in each combo is a registered dataset (exists in SUPPORTED_DATASET_CONFIG); raise a clear ValueError listing offending combo names/members if any checks fail so callers of get_dataset_dataloader receive immediate, actionable errors.

examples/llm_ptq/hf_ptq.py (1)
536-537: ⚡ Quick win: Avoid hardcoding combo members in the warning text.
This message can drift from the registry over time. Prefer referencing only the combo name (or deriving members from the registry) so runtime messaging stays accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/hf_ptq.py` around lines 536 - 537, Replace the hardcoded combo-members text in the warning string ("No dataset specified. Defaulting to the 'default' combo (cnn_dailymail + nemotron-post-training-dataset-v2).") so it no longer lists combo members directly; either simplify the message to only reference the combo name (e.g., "No dataset specified. Defaulting to the 'default' combo.") or dynamically derive members from the dataset registry (e.g., lookup the 'default' combo via the registry API and interpolate its members) and use that value in the log call where the original string appears.
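A minimal sketch of the suggested fix, deriving the warning text from the registry instead of hardcoding member names. The `default_dataset_warning` helper is hypothetical, and `DATASET_COMBOS` is assumed to be importable from `modelopt.torch.utils.dataset_utils`:

```python
# Assumed shape of the registry; in the real code this would be imported from
# modelopt.torch.utils.dataset_utils rather than redefined here.
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}


def default_dataset_warning(combo_name: str = "default") -> str:
    """Build the warning message from the registry so it cannot drift."""
    members = DATASET_COMBOS.get(combo_name, [])
    detail = f" ({' + '.join(members)})" if members else ""
    return f"No dataset specified. Defaulting to the '{combo_name}' combo{detail}."
```

If the combo's membership changes later, the warning text follows automatically without a second edit site.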
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4939a8b8-763d-4295-bcd5-ab868e512468
📒 Files selected for processing (3)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/utils/dataset_utils.py
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1508      +/-   ##
==========================================
- Coverage   76.93%   76.46%   -0.47%
==========================================
  Files         474      474
  Lines       51506    51519      +13
==========================================
- Hits        39625    39395     -230
- Misses      11881    12124     +243
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary
- Adds `DATASET_COMBOS` to `modelopt.torch.utils.dataset_utils`: single `--dataset` tokens that fan out to several entries in `SUPPORTED_DATASET_CONFIG`. The per-entry `num_samples` is split evenly across the members inside `get_dataset_dataloader`.
- `default` → `cnn_dailymail` + `nemotron-post-training-dataset-v2`. Replaces the hardcoded two-element fallback list in `hf_ptq.py` when `--dataset` is omitted.
- `nemotron-sft-mix` → the seven `nvidia/Nemotron-*` SFT datasets registered in "Add 7 nvidia/Nemotron-* calibration datasets to SUPPORTED_DATASET_CONFIG" #1498.
- `get_supported_datasets()` now appends combo names so they show up in `--dataset` help.
- `hf_ptq.py`'s default `--calib_size` bumped from `512` to `1024` so the `default` combo's even split preserves the previous total sample count (was 512 per-dataset × 2 datasets = 1024; now 1024 split → 512 per-dataset × 2).

Test plan

- `python -c "from modelopt.torch.utils.dataset_utils import DATASET_COMBOS, get_supported_datasets; assert 'default' in get_supported_datasets() and 'nemotron-sft-mix' in get_supported_datasets()"`
- `hf_ptq.py` with no `--dataset` flag still calibrates on cnn_dailymail + nemotron-post-training-dataset-v2 with the same total sample count as before.
- `--dataset nemotron-sft-mix --calib_size 1024` allocates ~147/147/146/146/146/146/146 across the seven Nemotron datasets.
- `--dataset cnn_dailymail,default --calib_size 256,1024` composes correctly: 256 from cnn_dailymail (as a plain entry) plus 512+512 from the combo expansion.

🤖 Generated with Claude Code
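The allocation arithmetic in the test plan can be checked with a small sketch. The `even_split` helper is hypothetical; it only mirrors the even-split-with-remainder behavior described above (earlier members absorb the remainder):

```python
def even_split(total: int, parts: int) -> list[int]:
    """Split `total` samples across `parts` datasets as evenly as possible."""
    base, rem = divmod(total, parts)
    # The first `rem` parts each get one extra sample so the sum equals `total`.
    return [base + 1 if i < rem else base for i in range(parts)]


print(even_split(1024, 7))  # [147, 147, 146, 146, 146, 146, 146]
print(even_split(1024, 2))  # [512, 512]
```

This reproduces both figures from the test plan: 1024 across the seven Nemotron datasets yields 147/147/146/146/146/146/146, and 1024 across the two-member `default` combo yields 512 each.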
Summary by CodeRabbit
New Features
- New grouped calibration dataset combos: `default`, `nemotron-sft-mix`

Updates