MAINT: remove unused per-loader max_examples knobs #1788

Open
romanlutz wants to merge 1 commit into microsoft:main from romanlutz:romanlutz/audit-loader-max-examples

@romanlutz (Contributor)

What

Removes the per-loader max_examples / max_prompts row-limit parameters from five remote seed dataset loaders that have no callers anywhere in pyrit/, tests/, or doc/:

  • comic_jailbreak_dataset.py: max_examples
  • msts_dataset.py: max_examples
  • promptintel_dataset.py: max_prompts
  • visual_leak_bench_dataset.py: max_examples
  • vlguard_dataset.py: max_examples

Why

DatasetConfiguration.max_dataset_size (in pyrit/scenario/core/dataset_configuration.py) already provides group-aware random sampling at the scenario layer. The per-loader knobs were a redundant, footgun-prone second sampling primitive with different semantics: they kept a deterministic prefix of the rows, whereas max_dataset_size draws a random sample.
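To illustrate the semantic difference, here is a minimal sketch. The function names, the `group` key, and the whole-group sampling strategy are assumptions made for illustration; this is not PyRIT's actual implementation.

```python
import random
from collections import defaultdict


def prefix_limit(rows, max_examples):
    """Hypothetical per-loader behaviour: keep the first N rows, always the same ones."""
    return rows[:max_examples]


def group_aware_sample(rows, max_size, seed=None):
    """Hypothetical scenario-layer behaviour: randomly pick whole groups.

    Assumption: each row carries a "group" key and groups are kept intact.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row["group"]].append(row)
    keys = list(groups)
    chosen = rng.sample(keys, min(max_size, len(keys)))
    return [row for key in chosen for row in groups[key]]


# Six rows in three groups of two: a, a, b, b, c, c
rows = [{"group": g, "id": i} for i, g in enumerate("aabbcc")]

# The prefix cut is deterministic and can split a group (here only half of "b").
print(prefix_limit(rows, 3))

# The group-aware sample returns two complete groups, chosen at random.
print(group_aware_sample(rows, 2, seed=0))
```

Under these assumptions, the prefix cut always returns the same leading rows and can truncate a group mid-way, which is the kind of surprise that makes two coexisting limits easy to misuse.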

They were also unreachable from the standard load path: SeedDatasetProvider.fetch_datasets_async instantiates every loader with bare provider_class(). The MOSSBench loader was already stripped of the same param in 364fe3db for the same reason; this PR brings the rest in line.
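Why a constructor parameter is unreachable from such a load path can be sketched as follows. `HypotheticalLoader` and `fetch_all` are stand-ins invented for this example and do not mirror the real `SeedDatasetProvider` code.

```python
class HypotheticalLoader:
    """Stand-in for a remote dataset loader with an optional row limit."""

    def __init__(self, max_examples=None):
        self.max_examples = max_examples


def fetch_all(provider_classes):
    # Every loader is created with a bare call, so no caller can pass
    # max_examples through this path; it always keeps its default.
    return [provider_class() for provider_class in provider_classes]


loaders = fetch_all([HypotheticalLoader])
print(loaders[0].max_examples)  # the default, None
```

With no arguments forwarded, the parameter can only be set by code that constructs the loader directly.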

What is NOT changed

  • vlsu_multimodal_dataset.py retains max_examples — it is the only loader with an actual non-self caller (tests/end_to_end/test_all_datasets.py:92 passes max_examples=6 because the remote image hosting is slow/flaky).

Verification

  • uv run pytest on the five touched dataset test files → 120 passed
  • Pre-commit (ruff format, ruff check, ty) → all green
  • Diff: 10 files, +1/−164

Drops the per-loader max_examples/max_prompts row-limit parameters from five
remote seed dataset loaders that have no callers anywhere in pyrit/, tests/,
or doc/: comic_jailbreak, msts, promptintel, visual_leak_bench, vlguard.

DatasetConfiguration.max_dataset_size already provides group-aware random
sampling at the scenario layer, so the per-loader knobs were a redundant
(and deterministic-prefix, footgun-prone) second sampling primitive. They
were also unreachable from SeedDatasetProvider.fetch_datasets_async, which
instantiates each loader with no args.

vlsu_multimodal is intentionally left alone because its max_examples has the
sole non-self caller anywhere (tests/end_to_end/test_all_datasets.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>