FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts by romanlutz · Pull Request #1789 · microsoft/PyRIT

romanlutz · 2026-05-23T02:48:55Z

Why

tests/end_to_end/test_all_datasets.py parametrizes over every registered SeedDatasetProvider but only ever calls provider_cls() with no arguments. Loaders whose non-default constructor arguments select between distinct upstream artifacts — a different HF split, a different JSONL URL, a different instr-resp field on the same row, a different HF config — therefore only have their default config exercised against real upstream data. A schema drift in a non-default variant (CSV column rename, missing split, image directory rename) goes unnoticed until a user hits it in production. Unit tests catch the wiring; they do not catch upstream drift.

What

Apply the OR-Bench-style aliasing pattern across the 8 affected loaders on main: each distinct upstream variant becomes its own registered SeedDatasetProvider subclass with the relevant constructor default pinned and a unique dataset_name. The registry-driven e2e parametrization then picks up every new subclass automatically — no test-runner refactor needed.

Per-loader breakdown (25 new registered subclasses):

Loader	New siblings	Distinct upstream selector
`_AyaRedteamingDataset`	+7 (Hindi, French, Spanish, Arabic, Russian, Serbian, Tagalog)	per-language JSONL URL
`_BabelscapeAlertDataset`	+1 (`_BabelscapeAlertOriginalDataset`)	HF config (`alert` vs default `alert_adversarial`)
`_CategoricalHarmfulQADataset`	+2 (Chinese, Vietnamese)	HF split
`_HiXSTestDataset` (gated)	+1 (English)	column read (Hindi vs English prompt)
`_MSTSDataset`	+10 (German, Russian, Chinese, Hindi, Spanish, Italian, French, Korean, Arabic, Farsi)	HF split
`_SaladBenchDataset`	+2 (`attackEnhanced`, `defenseEnhanced`)	HF split
`_VLGuardDataset` (gated)	+2 (`SAFE_UNSAFES`, `SAFE_SAFES`)	`instr-resp` field read on the same metadata zip

Loaders left unchanged because the default already fans out across every upstream artifact: EquityMedQA (subset_name="all" iterates 11 HF configs), MedSafetyBench (same pattern), ComicJailbreak (default fetches all 5 templates), VLSU / HarmBenchMultimodal / VisualLeakBench (all params are post-fetch filters), etc.

Notes

Backward compatible at the class level. Existing kwarg-based selection on parent loaders is untouched. No deprecations on the Python API.
E2E CI cost. The image-heavy MSTS (10 new variants) and VLGuard (2 new variants) additions could blow up the daily e2e download volume. To bound this, tests/end_to_end/test_all_datasets.py grew a small _CAPPED_PROVIDERS set that applies max_examples=6 in e2e only. Production defaults are unchanged.
Aya legibility. The original _AyaRedteamingDataset inlined a giant Literal[...] for harm_categories. Restating that union eight times for each subclass would have been unreadable, so it was factored into module-level AyaLanguage / AyaHarmCategory type aliases that both the parent and all 7 subclasses use.
Unit-test coverage. Each new subclass has smoke coverage in the matching tests/unit/datasets/ file: dataset_name, correct pinned config/split/subset forwarded to the parent fetch method, kwarg forwarding for max_examples and token.
Docs notebook. doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name list cell was surgically updated in place so the docs site picks up the new variants without a full re-execution introducing environmental noise (timestamps, UUIDs, Windows paths) into the diff.
Gated loaders. _HiXSTest* and _VLGuard* variants inherit the parent's HUGGINGFACE_TOKEN plumbing and will skip in CI without the secret, same behavior as today.

Out of scope

FigStep is on a separate branch (romanlutz/figstep-dataset-loader) and is not yet on main. It is the motivating example for this audit but the refactor is handled in that PR, where the same pattern can be applied more cleanly (clean OR-Bench-style abstract base + 3 concrete subclasses with no deprecation surface).

Naming follow-up

After adding sibling subclasses, the parent classes' generic dataset_names (aya_redteaming, salad_bench, babelscape_alert, categorical_harmful_qa, hixstest, vlguard) sat awkwardly next to their newly-added language/split-specific siblings — there was no way to tell from the registry which variant the generic name referred to without reading source. Each parent's dataset_name is now renamed to spell out the variant it actually fetches by default:

Parent (was)	Parent (now)	Pinned default
`aya_redteaming`	`aya_redteaming_english`	`language="English"`
`salad_bench`	`salad_bench_base`	`split="base"`
`babelscape_alert`	`babelscape_alert_adversarial`	`category="alert_adversarial"`
`categorical_harmful_qa`	`categorical_harmful_qa_english`	`language="en"`
`hixstest`	`hixstest_hindi`	reads Hindi column
`vlguard`	`vlguard_unsafes`	`subset=UNSAFES`

MSTS is intentionally not renamed: its parent's default (size="large") is the all-11-languages aggregate, semantically distinct from any per-language sibling, so msts as the aggregate name remains correct.

Note: This changes the public dataset_name strings for these six loaders. PyRIT is pre-1.0; any external code calling SeedDatasetProvider.get_by_name("aya_redteaming") (etc.) will need to switch to the new explicit name.

Verification

uv run pytest tests/unit/datasets/ -q — 576 tests pass
uv run ruff check pyrit/ tests/ — All checks passed
uv run ruff format --check pyrit/ tests/ — All files already formatted
Registry sanity check: SeedDatasetProvider.get_all_providers() returns 91 providers (was 66), all 25 new subclasses register with unique dataset_names.

…tifacts The daily end-to-end matrix in tests/end_to_end/test_all_datasets.py only ever calls provider_cls() with no arguments. Loaders whose constructor parameters select between distinct upstream files (a different HuggingFace split, a different JSONL URL, a different instr-resp field on the same row, a different HF config) therefore only have their default config exercised against real upstream data. A schema drift in a non-default variant (CSV column rename, missing split, image directory rename, etc.) goes unnoticed until a user hits it in production. This change applies the OR-Bench-style aliasing pattern across the 8 affected loaders on main: each distinct upstream variant becomes its own registered SeedDatasetProvider subclass with the relevant constructor default pinned and a unique dataset_name. Because test_all_datasets.py already parametrizes over the registry, every new sibling subclass picks up an additional e2e test case automatically — no test runner refactor needed. Loaders touched (25 new registered subclasses): - AyaRedteaming: 7 language siblings (Hindi, French, Spanish, Arabic, Russian, Serbian, Tagalog) — each is a distinct JSONL URL. - BabelscapeAlert: 1 sibling for the original 'alert' HF config (default is 'alert_adversarial'). - CategoricalHarmfulQA: 2 language siblings (Chinese, Vietnamese) — each is a distinct HF split. - HiXSTest (gated): 1 sibling for the English column. - MSTS: 10 language siblings — each is a distinct HF split. - SaladBench: 2 split siblings (attackEnhanced, defenseEnhanced). - VLGuard (gated): 2 subset siblings (SAFE_UNSAFES, SAFE_SAFES) — each reads a different instr-resp field from the same metadata zip. Operational notes: - Existing kwarg-based selection on parent loaders is untouched, so no downstream user code breaks. - For the multimodal additions (10 MSTS variants, 2 VLGuard variants), test_all_datasets.py grew a small _CAPPED_PROVIDERS set that applies max_examples=6 in e2e only. Production defaults are unchanged. - Aya's giant inline Literal types were factored into module-level AyaLanguage / AyaHarmCategory aliases to keep the new subclasses legible without restating the union eight times. - Each new subclass has unit-test smoke coverage in the matching tests/unit/datasets/ file (dataset_name + correct pinned config/split/subset forwarding to the parent fetch method). - doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name list was updated in place so the docs site picks up the new variants without rerunning the whole notebook on a Windows box. Out of scope: FigStep is on a separate branch and is handled in that PR; its refactor is the same pattern but cleaner because FigStep is not yet on main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After adding per-variant sibling subclasses, the parent classes' generic dataset_names (aya_redteaming, salad_bench, babelscape_alert, categorical_harmful_qa, hixstest, vlguard) sat awkwardly next to their newly-added language/split-specific siblings, leaving users unable to tell which variant the generic name refers to without reading source. Rename each parent's dataset_name to spell out the variant it actually fetches by default: - aya_redteaming -> aya_redteaming_english (default language='English') - salad_bench -> salad_bench_base (default split='base') - babelscape_alert -> babelscape_alert_adversarial (default category='alert_adversarial') - categorical_harmful_qa -> categorical_harmful_qa_english (default language='en') - hixstest -> hixstest_hindi (default reads Hindi column) - vlguard -> vlguard_unsafes (default subset=UNSAFES) MSTS is intentionally not renamed: its parent's default is size='large' = all 11 languages aggregated, which is semantically distinct from any per-language sibling, so 'msts' as the aggregate name remains correct. Updated affected unit-test dataset_name assertions and regenerated the docs notebook dataset list cell. No e2e/integration test references to the old names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz changed the title ~~Add sibling loaders for dataset variants pulling distinct upstream artifacts~~ FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789

FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789
romanlutz wants to merge 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/dataset-config-test-audit

romanlutz commented May 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

romanlutz commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Notes

Out of scope

Naming follow-up

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

romanlutz commented May 23, 2026 •

edited

Loading