Add DATASET_COMBOS for grouped calibration datasets#1508
Conversation
Lets a single `--dataset` token fan out to several entries in `SUPPORTED_DATASET_CONFIG`, with that entry's `num_samples` split evenly across the members. Initial combos:

- `default`: cnn_dailymail + nemotron-post-training-dataset-v2. `hf_ptq.py` now uses this when `--dataset` is omitted, replacing a hardcoded two-element list. The `--calib_size` default is bumped 512 -> 1024 so the per-dataset sample count after the even split matches the prior behavior.
- `nemotron-sft-mix`: the seven nvidia/Nemotron-* SFT datasets registered in #1498.

`get_supported_datasets()` now lists combo names so they appear in `--dataset` `--help`. Expansion happens inside `get_dataset_dataloader` after list normalization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
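The fan-out described above can be sketched as follows. This is a hypothetical illustration, not the actual modelopt implementation: the `expand_combos` helper and the parallel name/count lists are assumptions; only the `DATASET_COMBOS` mapping and the even-split behavior come from the PR description.

```python
# Hypothetical sketch of the combo fan-out. Only the "default" combo's members
# are known from the PR; the nemotron-sft-mix members are omitted here.
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}


def expand_combos(datasets: list[str], num_samples: list[int]):
    """Replace each combo token with its members, splitting samples evenly."""
    expanded_names: list[str] = []
    expanded_counts: list[int] = []
    for name, count in zip(datasets, num_samples):
        members = DATASET_COMBOS.get(name)
        if members is None:
            # Plain dataset entries pass through unchanged.
            expanded_names.append(name)
            expanded_counts.append(count)
            continue
        base, remainder = divmod(count, len(members))
        for i, member in enumerate(members):
            expanded_names.append(member)
            # Earlier members absorb the remainder so the total is preserved.
            expanded_counts.append(base + (1 if i < remainder else 0))
    return expanded_names, expanded_counts
```

With `--calib_size 1024`, the `default` combo would expand to 512 samples for each of its two members, matching the prior hardcoded behavior.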
📝 Walkthrough: This PR introduces a dataset combo mechanism that allows a single `--dataset` token to expand into multiple calibration datasets.

Changes: Dataset Combos Expansion
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (1)
198-212: ⚡ Quick win: Add import-time validation for combo integrity.
Please validate combo definitions early (non-empty members, no name collisions with `SUPPORTED_DATASET_CONFIG`, and all members registered). Without this, typos can silently fall through to auto-detected HF loading and produce hard-to-debug behavior.

🔧 Proposed guardrails:
```diff
 DATASET_COMBOS: dict[str, list[str]] = {
 @@
 }
+
+_dataset_keys = set(SUPPORTED_DATASET_CONFIG)
+_combo_keys = set(DATASET_COMBOS)
+overlap = _dataset_keys & _combo_keys
+if overlap:
+    raise ValueError(f"DATASET_COMBOS keys must not overlap dataset keys: {sorted(overlap)}")
+
+for combo_name, members in DATASET_COMBOS.items():
+    if not members:
+        raise ValueError(f"DATASET_COMBOS['{combo_name}'] must contain at least one dataset.")
+    unknown = [m for m in members if m not in _dataset_keys]
+    if unknown:
+        raise ValueError(
+            f"DATASET_COMBOS['{combo_name}'] contains unknown datasets: {unknown}"
+        )
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/dataset_utils.py` around lines 198-212, DATASET_COMBOS needs import-time validation to avoid silent typos and collisions: add a guard that runs when the module loads to (1) ensure each combo in DATASET_COMBOS has a non-empty list of members, (2) ensure no combo name conflicts with SUPPORTED_DATASET_CONFIG keys, and (3) ensure every member string in each combo is a registered dataset (exists in SUPPORTED_DATASET_CONFIG); raise a clear ValueError listing offending combo names/members if any checks fail so callers of get_dataset_dataloader receive immediate, actionable errors.

examples/llm_ptq/hf_ptq.py (1)
536-537: ⚡ Quick win: Avoid hardcoding combo members in the warning text.
This message can drift from the registry over time. Prefer referencing only the combo name (or deriving members from the registry) so runtime messaging stays accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/hf_ptq.py` around lines 536 - 537, Replace the hardcoded combo-members text in the warning string ("No dataset specified. Defaulting to the 'default' combo (cnn_dailymail + nemotron-post-training-dataset-v2).") so it no longer lists combo members directly; either simplify the message to only reference the combo name (e.g., "No dataset specified. Defaulting to the 'default' combo.") or dynamically derive members from the dataset registry (e.g., lookup the 'default' combo via the registry API and interpolate its members) and use that value in the log call where the original string appears.
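A minimal sketch of the suggested fix, deriving the warning text from the registry instead of hardcoding member names. The `default_dataset_warning` helper is hypothetical, and `DATASET_COMBOS` is assumed to be importable from `modelopt.torch.utils.dataset_utils`:

```python
# Assumed shape of the registry; in the real code this would be imported from
# modelopt.torch.utils.dataset_utils rather than redefined here.
DATASET_COMBOS = {
    "default": ["cnn_dailymail", "nemotron-post-training-dataset-v2"],
}


def default_dataset_warning(combo_name: str = "default") -> str:
    """Build the warning message from the registry so it cannot drift."""
    members = DATASET_COMBOS.get(combo_name, [])
    detail = f" ({' + '.join(members)})" if members else ""
    return f"No dataset specified. Defaulting to the '{combo_name}' combo{detail}."
```

If the combo's membership changes later, the warning text follows automatically without a second edit site.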
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4939a8b8-763d-4295-bcd5-ab868e512468
📒 Files selected for processing (3)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/utils/dataset_utils.py
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1508      +/-   ##
==========================================
- Coverage   76.93%   76.46%   -0.47%
==========================================
  Files         474      474
  Lines       51506    51519      +13
==========================================
- Hits        39625    39395     -230
- Misses      11881    12124     +243
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Summary
- Adds `DATASET_COMBOS` to `modelopt.torch.utils.dataset_utils`: single `--dataset` tokens that fan out to several entries in `SUPPORTED_DATASET_CONFIG`. The per-entry `num_samples` is split evenly across the members inside `get_dataset_dataloader`.
- `default` → `cnn_dailymail` + `nemotron-post-training-dataset-v2`. Replaces the hardcoded two-element fallback list in `hf_ptq.py` when `--dataset` is omitted.
- `nemotron-sft-mix` → the seven `nvidia/Nemotron-*` SFT datasets registered in "Add 7 nvidia/Nemotron-* calibration datasets to SUPPORTED_DATASET_CONFIG" #1498.
- `get_supported_datasets()` now appends combo names so they show up in `--dataset` help.
- `hf_ptq.py`'s default `--calib_size` bumped from `512` to `1024` so the `default` combo's even split preserves the previous total sample count (was 512 per-dataset × 2 datasets = 1024; now 1024 split → 512 per-dataset × 2).

Test plan

- `python -c "from modelopt.torch.utils.dataset_utils import DATASET_COMBOS, get_supported_datasets; assert 'default' in get_supported_datasets() and 'nemotron-sft-mix' in get_supported_datasets()"`
- `hf_ptq.py` with no `--dataset` flag still calibrates on cnn_dailymail + nemotron-post-training-dataset-v2 with the same total sample count as before.
- `--dataset nemotron-sft-mix --calib_size 1024` allocates ~147/147/146/146/146/146/146 across the seven Nemotron datasets.
- `--dataset cnn_dailymail,default --calib_size 256,1024` composes correctly: 256 from cnn_dailymail (as a plain entry) plus 512+512 from the combo expansion.

🤖 Generated with Claude Code
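The allocation arithmetic in the test plan can be checked with a small sketch. The `even_split` helper is hypothetical; it only mirrors the even-split-with-remainder behavior described above (earlier members absorb the remainder):

```python
def even_split(total: int, parts: int) -> list[int]:
    """Split `total` samples across `parts` datasets as evenly as possible."""
    base, rem = divmod(total, parts)
    # The first `rem` parts each get one extra sample so the sum equals `total`.
    return [base + 1 if i < rem else base for i in range(parts)]


print(even_split(1024, 7))  # [147, 147, 146, 146, 146, 146, 146]
print(even_split(1024, 2))  # [512, 512]
```

This reproduces both figures from the test plan: 1024 across the seven Nemotron datasets yields 147/147/146/146/146/146/146, and 1024 across the two-member `default` combo yields 512 each.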
Summary by CodeRabbit
New Features
- New grouped calibration dataset combos: `default`, `nemotron-sft-mix`

Updates