Skip to content

FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789

Open
romanlutz wants to merge 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/dataset-config-test-audit
Open

FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789
romanlutz wants to merge 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/dataset-config-test-audit

Conversation

@romanlutz
Copy link
Copy Markdown
Contributor

@romanlutz romanlutz commented May 23, 2026

Why

tests/end_to_end/test_all_datasets.py parametrizes over every registered SeedDatasetProvider but only ever calls provider_cls() with no arguments. Loaders whose non-default constructor arguments select between distinct upstream artifacts — a different HF split, a different JSONL URL, a different instr-resp field on the same row, a different HF config — therefore only have their default config exercised against real upstream data. A schema drift in a non-default variant (CSV column rename, missing split, image directory rename) goes unnoticed until a user hits it in production. Unit tests catch the wiring; they do not catch upstream drift.

What

Apply the OR-Bench-style aliasing pattern across the 8 affected loaders on main: each distinct upstream variant becomes its own registered SeedDatasetProvider subclass with the relevant constructor default pinned and a unique dataset_name. The registry-driven e2e parametrization then picks up every new subclass automatically — no test-runner refactor needed.

Per-loader breakdown (25 new registered subclasses):

Loader New siblings Distinct upstream selector
_AyaRedteamingDataset +7 (Hindi, French, Spanish, Arabic, Russian, Serbian, Tagalog) per-language JSONL URL
_BabelscapeAlertDataset +1 (_BabelscapeAlertOriginalDataset) HF config (alert vs default alert_adversarial)
_CategoricalHarmfulQADataset +2 (Chinese, Vietnamese) HF split
_HiXSTestDataset (gated) +1 (English) column read (Hindi vs English prompt)
_MSTSDataset +10 (German, Russian, Chinese, Hindi, Spanish, Italian, French, Korean, Arabic, Farsi) HF split
_SaladBenchDataset +2 (attackEnhanced, defenseEnhanced) HF split
_VLGuardDataset (gated) +2 (SAFE_UNSAFES, SAFE_SAFES) instr-resp field read on the same metadata zip

Loaders left unchanged because the default already fans out across every upstream artifact: EquityMedQA (subset_name="all" iterates 11 HF configs), MedSafetyBench (same pattern), ComicJailbreak (default fetches all 5 templates), VLSU / HarmBenchMultimodal / VisualLeakBench (all params are post-fetch filters), etc.

Notes

  • Backward compatible at the class level. Existing kwarg-based selection on parent loaders is untouched. No deprecations on the Python API.
  • E2E CI cost. The image-heavy MSTS (10 new variants) and VLGuard (2 new variants) additions could blow up the daily e2e download volume. To bound this, tests/end_to_end/test_all_datasets.py grew a small _CAPPED_PROVIDERS set that applies max_examples=6 in e2e only. Production defaults are unchanged.
  • Aya legibility. The original _AyaRedteamingDataset inlined a giant Literal[...] for harm_categories. Restating that union eight times for each subclass would have been unreadable, so it was factored into module-level AyaLanguage / AyaHarmCategory type aliases that both the parent and all 7 subclasses use.
  • Unit-test coverage. Each new subclass has smoke coverage in the matching tests/unit/datasets/ file: dataset_name, correct pinned config/split/subset forwarded to the parent fetch method, kwarg forwarding for max_examples and token.
  • Docs notebook. doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name list cell was surgically updated in place so the docs site picks up the new variants without a full re-execution introducing environmental noise (timestamps, UUIDs, Windows paths) into the diff.
  • Gated loaders. _HiXSTest* and _VLGuard* variants inherit the parent's HUGGINGFACE_TOKEN plumbing and will skip in CI without the secret, same behavior as today.

Out of scope

FigStep is on a separate branch (romanlutz/figstep-dataset-loader) and is not yet on main. It is the motivating example for this audit but the refactor is handled in that PR, where the same pattern can be applied more cleanly (clean OR-Bench-style abstract base + 3 concrete subclasses with no deprecation surface).

Naming follow-up

After adding sibling subclasses, the parent classes' generic dataset_names (aya_redteaming, salad_bench, babelscape_alert, categorical_harmful_qa, hixstest, vlguard) sat awkwardly next to their newly-added language/split-specific siblings — there was no way to tell from the registry which variant the generic name referred to without reading source. Each parent's dataset_name is now renamed to spell out the variant it actually fetches by default:

Parent (was) Parent (now) Pinned default
aya_redteaming aya_redteaming_english language="English"
salad_bench salad_bench_base split="base"
babelscape_alert babelscape_alert_adversarial category="alert_adversarial"
categorical_harmful_qa categorical_harmful_qa_english language="en"
hixstest hixstest_hindi reads Hindi column
vlguard vlguard_unsafes subset=UNSAFES

MSTS is intentionally not renamed: its parent's default (size="large") is the all-11-languages aggregate, semantically distinct from any per-language sibling, so msts as the aggregate name remains correct.

Note: This changes the public dataset_name strings for these six loaders. PyRIT is pre-1.0; any external code calling SeedDatasetProvider.get_by_name("aya_redteaming") (etc.) will need to switch to the new explicit name.

Verification

  • uv run pytest tests/unit/datasets/ -q — 576 tests pass
  • uv run ruff check pyrit/ tests/ — All checks passed
  • uv run ruff format --check pyrit/ tests/ — All files already formatted
  • Registry sanity check: SeedDatasetProvider.get_all_providers() returns 91 providers (was 66), all 25 new subclasses register with unique dataset_names.

…tifacts

The daily end-to-end matrix in tests/end_to_end/test_all_datasets.py
only ever calls provider_cls() with no arguments. Loaders whose
constructor parameters select between distinct upstream files (a
different HuggingFace split, a different JSONL URL, a different
instr-resp field on the same row, a different HF config) therefore
only have their default config exercised against real upstream data.
A schema drift in a non-default variant (CSV column rename, missing
split, image directory rename, etc.) goes unnoticed until a user
hits it in production.

This change applies the OR-Bench-style aliasing pattern across the
8 affected loaders on main: each distinct upstream variant becomes
its own registered SeedDatasetProvider subclass with the relevant
constructor default pinned and a unique dataset_name. Because
test_all_datasets.py already parametrizes over the registry, every
new sibling subclass picks up an additional e2e test case
automatically — no test runner refactor needed.

Loaders touched (25 new registered subclasses):

- AyaRedteaming: 7 language siblings (Hindi, French, Spanish, Arabic,
  Russian, Serbian, Tagalog) — each is a distinct JSONL URL.
- BabelscapeAlert: 1 sibling for the original 'alert' HF config
  (default is 'alert_adversarial').
- CategoricalHarmfulQA: 2 language siblings (Chinese, Vietnamese) —
  each is a distinct HF split.
- HiXSTest (gated): 1 sibling for the English column.
- MSTS: 10 language siblings — each is a distinct HF split.
- SaladBench: 2 split siblings (attackEnhanced, defenseEnhanced).
- VLGuard (gated): 2 subset siblings (SAFE_UNSAFES, SAFE_SAFES) —
  each reads a different instr-resp field from the same metadata zip.

Operational notes:

- Existing kwarg-based selection on parent loaders is untouched, so
  no downstream user code breaks.
- For the multimodal additions (10 MSTS variants, 2 VLGuard variants),
  test_all_datasets.py grew a small _CAPPED_PROVIDERS set that applies
  max_examples=6 in e2e only. Production defaults are unchanged.
- Aya's giant inline Literal types were factored into module-level
  AyaLanguage / AyaHarmCategory aliases to keep the new subclasses
  legible without restating the union eight times.
- Each new subclass has unit-test smoke coverage in the matching
  tests/unit/datasets/ file (dataset_name + correct pinned
  config/split/subset forwarding to the parent fetch method).
- doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name
  list was updated in place so the docs site picks up the new
  variants without rerunning the whole notebook on a Windows box.

Out of scope: FigStep is on a separate branch and is handled in
that PR; its refactor is the same pattern but cleaner because
FigStep is not yet on main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz romanlutz changed the title Add sibling loaders for dataset variants pulling distinct upstream artifacts FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts May 23, 2026
After adding per-variant sibling subclasses, the parent classes' generic dataset_names (aya_redteaming, salad_bench, babelscape_alert, categorical_harmful_qa, hixstest, vlguard) sat awkwardly next to their newly-added language/split-specific siblings, leaving users unable to tell which variant the generic name refers to without reading source.

Rename each parent's dataset_name to spell out the variant it actually fetches by default:

- aya_redteaming -> aya_redteaming_english (default language='English')

- salad_bench -> salad_bench_base (default split='base')

- babelscape_alert -> babelscape_alert_adversarial (default category='alert_adversarial')

- categorical_harmful_qa -> categorical_harmful_qa_english (default language='en')

- hixstest -> hixstest_hindi (default reads Hindi column)

- vlguard -> vlguard_unsafes (default subset=UNSAFES)

MSTS is intentionally not renamed: its parent's default is size='large' = all 11 languages aggregated, which is semantically distinct from any per-language sibling, so 'msts' as the aggregate name remains correct.

Updated affected unit-test dataset_name assertions and regenerated the docs notebook dataset list cell. No e2e/integration test references to the old names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants