FEAT: Add MOSSBench multimodal over-sensitivity dataset loader #1786
Open
romanlutz wants to merge 5 commits into
MOSSBench is a 300-example multimodal over-refusal benchmark across three oversensitivity stimulus types (Exaggerated Risk, Negated Harm, Counterintuitive Interpretation). A well-behaved VLM should answer every query normally; refusing indicates over-sensitivity.
- Loader fetches the metadata JSON and per-pid PNG images from GitHub at a pinned commit and builds image+text SeedPrompt pairs sharing a prompt_group_id and sequence=0; no SeedObjective is created (mirrors the XSTest/OR-Bench over-refusal convention).
- Exposes MossBenchOversensitivityType enum filter and a max_examples cap.
- Preserves the raw image attribute flags (human/child/syn/ocr) and the HarmBench-style numeric harm indices in per-seed metadata.
- Registers MOSSBench in the remote __init__, references.bib, bibliography.md, and the datasets reference notebook.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PyRIT's Sphinx config is MyST, not reST, so :class: / :func: roles render as raw text. Match the convention used by other dataset loaders (plain double-backtick code spans). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ax_dataset_size) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ataset-loader # Conflicts: # doc/bibliography.md
Summary
Adds a PyRIT seed dataset loader for MOSSBench — a multimodal over-sensitivity benchmark (not a jailbreak benchmark). It probes whether a VLM incorrectly refuses harmless multimodal queries when the image contains a superficially risky-looking element.
The dataset has 300 manually curated, benign image+question pairs organized around three cognitive-science triggers from the paper:
| metadata.over | MossBenchOversensitivityType |
| --- | --- |
| "type 1" | EXAGGERATED_RISK |
| "type 2" | NEGATED_HARM |
| "type 3" | COUNTERINTUITIVE_INTERPRETATION |

A well-behaved VLM should answer normally. The paper reports refusal rates up to 76% on SOTA MLLMs.
Design decisions
Image source: GitHub raw at a pinned commit, not HuggingFace.
All 300 PNGs are present in the upstream xirui-li/MOSSBench repo at commit 8d68b061… (verified by counting data/images/*.png). The HF mirror exists but adds gating friction with no upside since the GitHub copy is complete. Images are fetched once via the existing fetch_and_cache_image_async helper.
Over-refusal objective convention: no SeedObjective.
Mirrors the text-only over-refusal loaders (_XSTestDataset, _ORBenchBaseDataset). The "non-refusal expected" semantics live in the dataset's identity (tag over_refusal), not in a per-row objective field. Each example produces an image SeedPrompt + a text SeedPrompt sharing a prompt_group_id and sequence=0, so the orchestrator delivers them as a single multimodal user turn.
No per-loader max_examples parameter.
The scenario layer already enforces a group-aware row limit via DatasetConfiguration.max_dataset_size (counts SeedGroups, samples without replacement, resume-safe via seeded RNG). Duplicating that knob on the loader with different (deterministic-prefix) semantics is a footgun. A separate session is auditing the other multimodal loaders that still expose max_examples for the same reason.
Harm-category mapping: preserved numerically, not labeled.
Upstream metadata.harm is a list of harm-category indices (1–7) describing what risky-looking thing is depicted, not what risk the prompt actually poses. The numeric indices are preserved as metadata["harm_indices"]. No guessed string mapping is published; the upstream code does not document one.
Public API
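As a rough sketch of what the loader yields per example under the pairing convention from Design decisions, the following turns one record into an image piece and a text piece that share a group id and sequence 0. It deliberately does not use PyRIT's classes: `SeedPiece` and `build_pair` are hypothetical stand-ins, the record layout (a `question` field and a nested `metadata` dict) is an assumption about the upstream JSON, and only a subset of the preserved metadata keys is shown.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SeedPiece:
    """Stand-in for one seed prompt piece; NOT PyRIT's SeedPrompt class."""

    value: str
    data_type: str  # "image_path" or "text"
    prompt_group_id: uuid.UUID
    sequence: int
    metadata: dict[str, Any] = field(default_factory=dict)


def build_pair(record: dict[str, Any], image_path: str, image_url: str) -> list[SeedPiece]:
    """Build the image and text pieces for one example (assumed record layout)."""
    meta = record["metadata"]
    group_id = uuid.uuid4()  # one id per example, shared by both pieces
    shared = {
        "pid": record["pid"],
        "short_description": record["short_description"],
        # Harm indices stay numeric; no string labels are guessed.
        "harm_indices": list(meta["harm"]),
        **{flag: meta[flag] for flag in ("human", "child", "syn", "ocr")},
    }
    image = SeedPiece(
        value=image_path,
        data_type="image_path",
        prompt_group_id=group_id,
        sequence=0,
        metadata={**shared, "original_image_url": image_url},  # image piece only
    )
    text = SeedPiece(
        value=record["question"],
        data_type="text",
        prompt_group_id=group_id,
        sequence=0,
        metadata=dict(shared),
    )
    return [image, text]
```

Because both pieces carry the same group id and the same sequence number, a consumer that groups by those two fields sees a single multimodal user turn rather than two separate prompts.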
Per-piece metadata preserved on both the image and text SeedPrompt:
- oversensitivity_type (snake_case slug)
- oversensitivity_type_label (paper's human-readable label)
- pid, human, child, syn, ocr, harm_indices
- short_description
- original_image_url (image piece only)
Verification
- tests/unit/datasets/test_mossbench_dataset.py: mocks network/image fetch; covers type filtering, metadata preservation, required-field validation, the no-SeedObjective convention, and rejection of invalid enum input.
Files
- pyrit/datasets/seed_datasets/remote/mossbench_dataset.py: loader + MossBenchOversensitivityType enum
- pyrit/datasets/seed_datasets/remote/__init__.py: registered (alphabetical)
- tests/unit/datasets/test_mossbench_dataset.py: 16 tests
- doc/references.bib: added @li2024mossbench
- doc/bibliography.md: citation listed
- doc/code/datasets/1_loading_datasets.py + .ipynb: mention added
Citation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>