FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789
Open
romanlutz wants to merge 2 commits into
Open
FEAT: Add sibling loaders for dataset variants pulling distinct upstream artifacts#1789romanlutz wants to merge 2 commits into
romanlutz wants to merge 2 commits into
Conversation
…tifacts The daily end-to-end matrix in tests/end_to_end/test_all_datasets.py only ever calls provider_cls() with no arguments. Loaders whose constructor parameters select between distinct upstream files (a different HuggingFace split, a different JSONL URL, a different instr-resp field on the same row, a different HF config) therefore only have their default config exercised against real upstream data. A schema drift in a non-default variant (CSV column rename, missing split, image directory rename, etc.) goes unnoticed until a user hits it in production. This change applies the OR-Bench-style aliasing pattern across the 8 affected loaders on main: each distinct upstream variant becomes its own registered SeedDatasetProvider subclass with the relevant constructor default pinned and a unique dataset_name. Because test_all_datasets.py already parametrizes over the registry, every new sibling subclass picks up an additional e2e test case automatically — no test runner refactor needed. Loaders touched (25 new registered subclasses): - AyaRedteaming: 7 language siblings (Hindi, French, Spanish, Arabic, Russian, Serbian, Tagalog) — each is a distinct JSONL URL. - BabelscapeAlert: 1 sibling for the original 'alert' HF config (default is 'alert_adversarial'). - CategoricalHarmfulQA: 2 language siblings (Chinese, Vietnamese) — each is a distinct HF split. - HiXSTest (gated): 1 sibling for the English column. - MSTS: 10 language siblings — each is a distinct HF split. - SaladBench: 2 split siblings (attackEnhanced, defenseEnhanced). - VLGuard (gated): 2 subset siblings (SAFE_UNSAFES, SAFE_SAFES) — each reads a different instr-resp field from the same metadata zip. Operational notes: - Existing kwarg-based selection on parent loaders is untouched, so no downstream user code breaks. - For the multimodal additions (10 MSTS variants, 2 VLGuard variants), test_all_datasets.py grew a small _CAPPED_PROVIDERS set that applies max_examples=6 in e2e only. Production defaults are unchanged. - Aya's giant inline Literal types were factored into module-level AyaLanguage / AyaHarmCategory aliases to keep the new subclasses legible without restating the union eight times. - Each new subclass has unit-test smoke coverage in the matching tests/unit/datasets/ file (dataset_name + correct pinned config/split/subset forwarding to the parent fetch method). - doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name list was updated in place so the docs site picks up the new variants without rerunning the whole notebook on a Windows box. Out of scope: FigStep is on a separate branch and is handled in that PR; its refactor is the same pattern but cleaner because FigStep is not yet on main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After adding per-variant sibling subclasses, the parent classes' generic dataset_names (aya_redteaming, salad_bench, babelscape_alert, categorical_harmful_qa, hixstest, vlguard) sat awkwardly next to their newly-added language/split-specific siblings, leaving users unable to tell which variant the generic name refers to without reading source. Rename each parent's dataset_name to spell out the variant it actually fetches by default: - aya_redteaming -> aya_redteaming_english (default language='English') - salad_bench -> salad_bench_base (default split='base') - babelscape_alert -> babelscape_alert_adversarial (default category='alert_adversarial') - categorical_harmful_qa -> categorical_harmful_qa_english (default language='en') - hixstest -> hixstest_hindi (default reads Hindi column) - vlguard -> vlguard_unsafes (default subset=UNSAFES) MSTS is intentionally not renamed: its parent's default is size='large' = all 11 languages aggregated, which is semantically distinct from any per-language sibling, so 'msts' as the aggregate name remains correct. Updated affected unit-test dataset_name assertions and regenerated the docs notebook dataset list cell. No e2e/integration test references to the old names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
tests/end_to_end/test_all_datasets.pyparametrizes over every registeredSeedDatasetProviderbut only ever callsprovider_cls()with no arguments. Loaders whose non-default constructor arguments select between distinct upstream artifacts — a different HF split, a different JSONL URL, a differentinstr-respfield on the same row, a different HF config — therefore only have their default config exercised against real upstream data. A schema drift in a non-default variant (CSV column rename, missing split, image directory rename) goes unnoticed until a user hits it in production. Unit tests catch the wiring; they do not catch upstream drift.What
Apply the OR-Bench-style aliasing pattern across the 8 affected loaders on
main: each distinct upstream variant becomes its own registeredSeedDatasetProvidersubclass with the relevant constructor default pinned and a uniquedataset_name. The registry-driven e2e parametrization then picks up every new subclass automatically — no test-runner refactor needed.Per-loader breakdown (25 new registered subclasses):
_AyaRedteamingDataset_BabelscapeAlertDataset_BabelscapeAlertOriginalDataset)alertvs defaultalert_adversarial)_CategoricalHarmfulQADataset_HiXSTestDataset(gated)_MSTSDataset_SaladBenchDatasetattackEnhanced,defenseEnhanced)_VLGuardDataset(gated)SAFE_UNSAFES,SAFE_SAFES)instr-respfield read on the same metadata zipLoaders left unchanged because the default already fans out across every upstream artifact: EquityMedQA (
subset_name="all"iterates 11 HF configs), MedSafetyBench (same pattern), ComicJailbreak (default fetches all 5 templates), VLSU / HarmBenchMultimodal / VisualLeakBench (all params are post-fetch filters), etc.Notes
tests/end_to_end/test_all_datasets.pygrew a small_CAPPED_PROVIDERSset that appliesmax_examples=6in e2e only. Production defaults are unchanged._AyaRedteamingDatasetinlined a giantLiteral[...]forharm_categories. Restating that union eight times for each subclass would have been unreadable, so it was factored into module-levelAyaLanguage/AyaHarmCategorytype aliases that both the parent and all 7 subclasses use.tests/unit/datasets/file:dataset_name, correct pinnedconfig/split/subsetforwarded to the parent fetch method, kwarg forwarding formax_examplesandtoken.doc/code/datasets/1_loading_datasets.ipynb's rendered dataset-name list cell was surgically updated in place so the docs site picks up the new variants without a full re-execution introducing environmental noise (timestamps, UUIDs, Windows paths) into the diff._HiXSTest*and_VLGuard*variants inherit the parent'sHUGGINGFACE_TOKENplumbing and will skip in CI without the secret, same behavior as today.Out of scope
FigStep is on a separate branch (
romanlutz/figstep-dataset-loader) and is not yet onmain. It is the motivating example for this audit but the refactor is handled in that PR, where the same pattern can be applied more cleanly (clean OR-Bench-style abstract base + 3 concrete subclasses with no deprecation surface).Naming follow-up
After adding sibling subclasses, the parent classes' generic
dataset_names (aya_redteaming,salad_bench,babelscape_alert,categorical_harmful_qa,hixstest,vlguard) sat awkwardly next to their newly-added language/split-specific siblings — there was no way to tell from the registry which variant the generic name referred to without reading source. Each parent'sdataset_nameis now renamed to spell out the variant it actually fetches by default:aya_redteamingaya_redteaming_englishlanguage="English"salad_benchsalad_bench_basesplit="base"babelscape_alertbabelscape_alert_adversarialcategory="alert_adversarial"categorical_harmful_qacategorical_harmful_qa_englishlanguage="en"hixstesthixstest_hindivlguardvlguard_unsafessubset=UNSAFESMSTS is intentionally not renamed: its parent's default (
size="large") is the all-11-languages aggregate, semantically distinct from any per-language sibling, somstsas the aggregate name remains correct.Note: This changes the public
dataset_namestrings for these six loaders. PyRIT is pre-1.0; any external code callingSeedDatasetProvider.get_by_name("aya_redteaming")(etc.) will need to switch to the new explicit name.Verification
uv run pytest tests/unit/datasets/ -q— 576 tests passuv run ruff check pyrit/ tests/— All checks passeduv run ruff format --check pyrit/ tests/— All files already formattedSeedDatasetProvider.get_all_providers()returns 91 providers (was 66), all 25 new subclasses register with uniquedataset_names.