
Add DATASET_COMBOS for grouped calibration datasets#1508

Open
cjluo-nv wants to merge 1 commit into main from feat/dataset-combos

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented May 17, 2026

Summary

  • Add DATASET_COMBOS to modelopt.torch.utils.dataset_utils — single --dataset tokens that fan out to several entries in SUPPORTED_DATASET_CONFIG. The per-entry num_samples is split evenly across the members inside get_dataset_dataloader.
  • Two initial combos: ``default`` (cnn_dailymail + nemotron-post-training-dataset-v2) and ``nemotron-sft-mix`` (the seven nvidia/Nemotron-* SFT datasets).
  • get_supported_datasets() now appends combo names so they show up in --dataset help.
  • hf_ptq.py's default --calib_size bumped from 512 to 1024 so the default combo's even split preserves the previous per-dataset sample counts (previously 512 samples for each of the 2 hardcoded datasets, 1024 total; now 1024 split evenly → 512 per dataset).
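A minimal sketch of the fan-out described above. DATASET_COMBOS, SUPPORTED_DATASET_CONFIG, and the even-split-with-divmod behavior are from the PR; the expand_combo helper and the trimmed registry contents are illustrative assumptions, not the actual implementation:

```python
# Stand-in for the real registry in modelopt/torch/utils/dataset_utils.py.
SUPPORTED_DATASET_CONFIG = {
    "cnn_dailymail": {},
    "nemotron-post-training-dataset-v2": {},
}

# Combo tokens that fan out to several registered datasets.
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}

def expand_combo(name: str, num_samples: int) -> list[tuple[str, int]]:
    """Expand a combo token into (dataset, samples) pairs; plain names pass through.

    num_samples is split evenly, with the remainder assigned to earlier members.
    """
    members = DATASET_COMBOS.get(name)
    if members is None:
        return [(name, num_samples)]
    base, rem = divmod(num_samples, len(members))
    return [(m, base + (1 if i < rem else 0)) for i, m in enumerate(members)]

print(expand_combo("default", 1024))
# [('cnn_dailymail', 512), ('nemotron-post-training-dataset-v2', 512)]
```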

Test plan

  • python -c "from modelopt.torch.utils.dataset_utils import DATASET_COMBOS, get_supported_datasets; assert 'default' in get_supported_datasets() and 'nemotron-sft-mix' in get_supported_datasets()"
  • hf_ptq.py with no --dataset flag still calibrates on cnn_dailymail + nemotron-post-training-dataset-v2 with the same total sample count as before.
  • --dataset nemotron-sft-mix --calib_size 1024 allocates ~147/147/146/146/146/146/146 across the seven Nemotron datasets.
  • --dataset cnn_dailymail,default --calib_size 256,1024 composes correctly: 256 from cnn_dailymail (as a plain entry) plus 512+512 from the combo expansion.
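The split arithmetic in the last two checks can be verified directly (a quick sanity check of the divmod distribution, not code from the PR):

```python
# Even split with the remainder assigned to earlier members, as the PR
# describes get_dataset_dataloader doing via divmod.
def even_split(total: int, n: int) -> list[int]:
    base, rem = divmod(total, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

# 1024 samples over the seven Nemotron SFT datasets:
print(even_split(1024, 7))  # [147, 147, 146, 146, 146, 146, 146]
assert sum(even_split(1024, 7)) == 1024

# Composed request: 256 plain cnn_dailymail samples plus the default
# combo's 1024 split two ways.
print([256, *even_split(1024, 2)])  # [256, 512, 512]
```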

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Dataset combo mechanism: single dataset tokens now expand into multiple registered datasets with automatic sample count splitting across members
    • Predefined dataset combos available: default, nemotron-sft-mix
  • Updates

    • Increased default calibration size from 512 to 1024 for consistency

Review Change Stack

Lets a single --dataset token fan out to several entries in
SUPPORTED_DATASET_CONFIG, with that entry's num_samples split evenly
across the members. Initial combos:

- ``default``: cnn_dailymail + nemotron-post-training-dataset-v2.
  hf_ptq.py now uses this when --dataset is omitted, replacing a
  hardcoded two-element list. --calib_size default bumped 512 -> 1024
  so the per-dataset sample count after the even split matches the
  prior behavior.
- ``nemotron-sft-mix``: the seven nvidia/Nemotron-* SFT datasets
  registered in #1498.

get_supported_datasets() now lists combo names so they appear in
--dataset --help. Expansion happens inside get_dataset_dataloader
after list normalization.
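The help integration described above can be sketched as follows. The function name and the append behavior are from the PR description; the trimmed registry contents are illustrative assumptions:

```python
# Stand-in registries; the real ones live in modelopt/torch/utils/dataset_utils.py.
SUPPORTED_DATASET_CONFIG = {
    "cnn_dailymail": {},
    "nemotron-post-training-dataset-v2": {},
}
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}

def get_supported_datasets() -> list[str]:
    # Individual dataset keys followed by the appended combo names, so
    # combo tokens show up in --dataset help alongside plain datasets.
    return list(SUPPORTED_DATASET_CONFIG) + list(DATASET_COMBOS)

print(get_supported_datasets())
# ['cnn_dailymail', 'nemotron-post-training-dataset-v2', 'default']
```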

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners May 17, 2026 17:57
@cjluo-nv cjluo-nv requested a review from meenchen May 17, 2026 17:57
@coderabbitai
Contributor

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

This PR introduces a dataset combo mechanism that allows single --dataset tokens to transparently expand into multiple registered datasets with automatic sample count distribution. The feature is integrated into hf_ptq.py as the new default, with a coordinated calibration size increase to maintain consistent sample counts.

Changes

Dataset Combos Expansion

Layer / File(s) — Summary

  • Combo registry and dataloader expansion (modelopt/torch/utils/dataset_utils.py): DATASET_COMBOS registry maps combo names (e.g., default, nemotron-sft-mix) to member dataset lists. get_dataset_dataloader detects combo entries and distributes num_samples evenly across members, with the remainder assigned to earlier members via divmod.
  • Combo discovery in supported datasets (modelopt/torch/utils/dataset_utils.py): get_supported_datasets() returns the union of individual dataset keys and combo names, making combos discoverable via the help system.
  • hf_ptq.py default dataset and calibration size (examples/llm_ptq/hf_ptq.py): default args.dataset fallback changes to ["default"] (from ["cnn_dailymail", "nemotron-post-training-dataset-v2"]); --calib_size default increases from 512 to 1024 to maintain a consistent total sample count across expanded combos.
  • Changelog documentation (CHANGELOG.rst): version 0.45 entry documents the DATASET_COMBOS feature, initial combo definitions, help integration, and the calibration size adjustment rationale.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • claude

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title clearly and concisely describes the main change, introducing DATASET_COMBOS for grouped calibration datasets, which aligns with the core feature added across the modified files.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns — ✅ Passed: all 6 security anti-patterns checked. No violations: no unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, # nosec, or new dependencies. DATASET_COMBOS is static.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (1)

198-212: ⚡ Quick win

Add import-time validation for combo integrity.

Please validate combo definitions early (non-empty members, no name collisions with SUPPORTED_DATASET_CONFIG, and all members registered). Without this, typos can silently fall through to auto-detected HF loading and produce hard-to-debug behavior.

🔧 Proposed guardrails
 DATASET_COMBOS: dict[str, list[str]] = {
@@
 }
+
+_dataset_keys = set(SUPPORTED_DATASET_CONFIG)
+_combo_keys = set(DATASET_COMBOS)
+overlap = _dataset_keys & _combo_keys
+if overlap:
+    raise ValueError(f"DATASET_COMBOS keys must not overlap dataset keys: {sorted(overlap)}")
+
+for combo_name, members in DATASET_COMBOS.items():
+    if not members:
+        raise ValueError(f"DATASET_COMBOS['{combo_name}'] must contain at least one dataset.")
+    unknown = [m for m in members if m not in _dataset_keys]
+    if unknown:
+        raise ValueError(
+            f"DATASET_COMBOS['{combo_name}'] contains unknown datasets: {unknown}"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/dataset_utils.py` around lines 198 - 212, DATASET_COMBOS
needs import-time validation to avoid silent typos and collisions: add a guard
that runs when the module loads to (1) ensure each combo in DATASET_COMBOS has a
non-empty list of members, (2) ensure no combo name conflicts with
SUPPORTED_DATASET_CONFIG keys, and (3) ensure every member string in each combo
is a registered dataset (exists in SUPPORTED_DATASET_CONFIG); raise a clear
ValueError listing offending combo names/members if any checks fail so callers
of get_dataset_dataloader receive immediate, actionable errors.
examples/llm_ptq/hf_ptq.py (1)

536-537: ⚡ Quick win

Avoid hardcoding combo members in the warning text.

This message can drift from the registry over time. Prefer referencing only the combo name (or deriving members from the registry) so runtime messaging stays accurate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 536 - 537, Replace the hardcoded
combo-members text in the warning string ("No dataset specified. Defaulting to
the 'default' combo (cnn_dailymail + nemotron-post-training-dataset-v2).") so it
no longer lists combo members directly; either simplify the message to only
reference the combo name (e.g., "No dataset specified. Defaulting to the
'default' combo.") or dynamically derive members from the dataset registry
(e.g., lookup the 'default' combo via the registry API and interpolate its
members) and use that value in the log call where the original string appears.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4939a8b8-763d-4295-bcd5-ab868e512468

📥 Commits

Reviewing files that changed from the base of the PR and between 7038dec and df6040d.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/utils/dataset_utils.py

@github-actions
Contributor

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1508/

Built to branch gh-pages at 2026-05-17 18:01 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@cjluo-nv cjluo-nv requested review from Fridah-nv and realAsma May 17, 2026 18:02
@codecov

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.46%. Comparing base (7038dec) to head (df6040d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1508      +/-   ##
==========================================
- Coverage   76.93%   76.46%   -0.47%     
==========================================
  Files         474      474              
  Lines       51506    51519      +13     
==========================================
- Hits        39625    39395     -230     
- Misses      11881    12124     +243     
Flag Coverage Δ
examples 31.77% <100.00%> (-9.06%) ⬇️
gpu 59.74% <64.28%> (-0.59%) ⬇️
unit 52.64% <57.14%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.