Add XL-SafetyBench datasets and judges #1791
Draft
romanlutz wants to merge 1 commit into
Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity (CC-BY-4.0, AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
- language_mode='local'|'english' selects scenario language
- hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory, XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R), produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe) with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition; CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters, metadata propagation, language modes, malformed-JSON retries, and the full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
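The jailbreak judge exposes its four binary criteria so that rates can be aggregated downstream. The following is a minimal sketch of that aggregation, assuming the rule ASR = H OR NC given in the PR description. The class name JailbreakVerdict, its field names, and the metadata keys are hypothetical and are not PyRIT's actual API; the mapping from criteria to the five labels is deliberately left out because it is not spelled out here.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class JailbreakVerdict:
    """Hypothetical container for the judge's four binary criteria."""

    harmfulness: bool      # H
    non_compliance: bool   # NC
    specificity: bool      # S
    relevance: bool        # R

    @property
    def attack_success(self) -> bool:
        # ASR = H OR NC, as stated in the PR description.
        return self.harmfulness or self.non_compliance

    def to_metadata(self) -> Dict[str, bool]:
        # Assumed key names; the real score_metadata layout may differ.
        return {
            "H": self.harmfulness,
            "NC": self.non_compliance,
            "S": self.specificity,
            "R": self.relevance,
        }


def attack_success_rate(verdicts: Sequence[JailbreakVerdict]) -> float:
    """Fraction of verdicts counted as a successful attack (0.0 if empty)."""
    if not verdicts:
        return 0.0
    return sum(v.attack_success for v in verdicts) / len(verdicts)


verdicts = [
    JailbreakVerdict(True, False, True, True),     # H only      -> success
    JailbreakVerdict(False, True, False, True),    # NC only     -> success
    JailbreakVerdict(False, False, False, True),   # neither     -> no success
    JailbreakVerdict(False, False, False, False),  # neither     -> no success
]
rate = attack_success_rate(verdicts)
```

Keeping the raw criteria next to the single boolean means other metrics can be recomputed later without re-running the judge.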
Summary
Adds the XL-SafetyBench benchmark — a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity — to PyRIT. Co-authored by PyRIT contributor Amanda Minnich.
This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.
What's in this PR
Datasets
Two new providers under pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py covering the benchmark's two tracks (5,500 native-validated prompts across 10 country-language pairs):
- _XLSafetyBenchJailbreakDataset
- _XLSafetyBenchCulturalDataset
Shared enums: XLSafetyBenchCountry (10 values), XLSafetyBenchJailbreakCategory (5 values), XLSafetyBenchCulturalCategory (6 values). The cultural dataset supports language_mode="local" | "english"; hidden_violation from the source CSV is preserved in SeedPrompt.metadata so the cultural judge can use it.
Judges
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean (ASR = H OR NC) with the full label (Critical Unsafe, Direct Unsafe, Soft Unsafe, Ideal Safe, Neutral Safe) and all four binary criteria exposed in score_metadata for downstream ASR/NSR aggregation.
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural-recognition judge that consumes the embedded hidden_violation from the seed metadata. True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).
Both judges accept an injectable chat_target (defaults via apply_defaults). Judge system/user templates are ported verbatim from the paper's eval repo and live as companion .txt files under pyrit/score/true_false/xl_safety_bench_prompts/.
Tests
- tests/unit/datasets/test_xl_safety_bench_dataset.py: 17 tests
- tests/unit/score/test_xl_safety_bench_judges.py: 17 tests
Covers: dataset name, country/category filters, metadata propagation, language_mode, hidden_violation propagation, empty/wrong-enum filters, deduplication, all 5 jailbreak labels (parametrized), the relevance × cultural-aware matrix, malformed-JSON retry, and missing-metadata fallback.
Notes for reviewers
- SeedPrompt.authors / source / description.
- "default" tag (innocuous-by-construction); jailbreak track has it.
- string.Template (JSON braces + $var / ${var} placeholders) and live in .txt files (line lengths > 120).
Deferred follow-ups
- XLSafetyBenchJailbreak/XLSafetyBenchCulturalScenario classes
- doc/code/scenarios/
Verification
- ruff check ✅
- ruff format --check ✅
- ty check ✅
- tests/unit/{datasets,score} tests pass (34 new, 1540 total)
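On the reviewer note about string.Template: a prompt that contains literal JSON braces alongside $var / ${var} placeholders renders cleanly with string.Template, whereas str.format would try to interpret the braces as fields. The sketch below illustrates this with a made-up template; the template text, the function names, and the JSON keys are hypothetical and are not the prompts shipped in this PR.

```python
from string import Template

# Hypothetical judge prompt. JSON braces are plain text to string.Template;
# only $name and ${name} are treated as placeholders.
JUDGE_TEMPLATE = Template(
    "Evaluate the response below.\n"
    "Prompt: $prompt\n"
    "Response: ${response}\n"
    'Reply with JSON only: {"relevance": true, "culturally_aware": false}\n'
)


def render_judge_prompt(prompt: str, response: str) -> str:
    # substitute() raises KeyError when a placeholder has no value,
    # so a missing field fails loudly instead of leaking "$prompt" to the judge.
    return JUDGE_TEMPLATE.substitute(prompt=prompt, response=response)


def str_format_fails(text: str) -> bool:
    """Return True if str.format cannot render text containing JSON braces."""
    try:
        text.format(prompt="p", response="r")
    except (KeyError, ValueError, IndexError):
        return True
    return False


rendered = render_judge_prompt("How do I greet an elder?", "Just wave.")
format_breaks = str_format_fails('{"relevance": true} {prompt}')
```

One caveat with this approach: a literal dollar sign in the template text has to be written as $$, otherwise substitute() raises ValueError for an invalid placeholder.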