Add XL-SafetyBench datasets and judges by romanlutz · Pull Request #1791 · microsoft/PyRIT

romanlutz · 2026-05-23T05:28:14Z

Summary

Adds XL-SafetyBench, a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity, to PyRIT. The paper is co-authored by PyRIT contributor Amanda Minnich.

Paper: arXiv:2605.05662
Dataset: AIM-Intelligence/XL-SafetyBench on HuggingFace (CC-BY-4.0)
Eval code: https://github.com/AIM-Intelligence/XL-SafetyBench

This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.

What's in this PR

Datasets

Two new providers in pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py cover the benchmark's two tracks (5,500 native-validated prompts in total across 10 country-language pairs):

Provider	Size	Track
`_XLSafetyBenchJailbreakDataset`	4,500 prompts	Country-grounded adversarial prompts in 5 harm categories
`_XLSafetyBenchCulturalDataset`	1,000 prompts	Innocuous scenarios embedding local taboos/sensitivities in 6 categories

Shared enums: XLSafetyBenchCountry (10 values), XLSafetyBenchJailbreakCategory (5 values), XLSafetyBenchCulturalCategory (6 values). The cultural dataset supports language_mode="local" | "english"; hidden_violation from the source CSV is preserved in SeedPrompt.metadata so the cultural judge can use it.
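The filtering and metadata behaviour described above can be sketched roughly as follows. This is an illustration only, not the provider code in this PR: `Country`, `Row`, `Prompt`, `build_prompts`, the placeholder enum values, and the column names are all hypothetical stand-ins for XLSafetyBenchCountry, the source CSV rows, and SeedPrompt.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Country(Enum):
    """Hypothetical stand-in for XLSafetyBenchCountry (placeholder values)."""
    COUNTRY_A = "country_a"
    COUNTRY_B = "country_b"


@dataclass
class Row:
    """Hypothetical shape of one cultural-track CSV row."""
    country: Country
    category: str
    local_text: str
    english_text: str
    hidden_violation: str


@dataclass
class Prompt:
    """Hypothetical stand-in for SeedPrompt: a value plus free-form metadata."""
    value: str
    metadata: dict = field(default_factory=dict)


def build_prompts(
    rows: list[Row],
    *,
    countries: Optional[set[Country]] = None,
    language_mode: str = "local",
) -> list[Prompt]:
    """Filter rows by country and pick the scenario language."""
    if language_mode not in ("local", "english"):
        raise ValueError(f"unsupported language_mode: {language_mode!r}")
    prompts = []
    for row in rows:
        if countries is not None and row.country not in countries:
            continue  # country filter: skip rows outside the requested set
        text = row.local_text if language_mode == "local" else row.english_text
        prompts.append(
            Prompt(
                value=text,
                metadata={
                    "country": row.country.value,
                    "category": row.category,
                    # Carried along so a judge can read it later.
                    "hidden_violation": row.hidden_violation,
                },
            )
        )
    return prompts
```

Keeping hidden_violation in the metadata rather than in the prompt text means the model under test never sees it, while the judge still can.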

Judges

XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean (attack success, ASR = H OR NC) and exposes the full label (Critical Unsafe, Direct Unsafe, Soft Unsafe, Ideal Safe, Neutral Safe) plus all four binary criteria in score_metadata for downstream ASR/NSR aggregation.
XLSafetyBenchCulturalJudgeScorer: relevance + cultural-recognition judge that consumes the embedded hidden_violation from the seed metadata. Returns True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).

Both judges accept an injectable chat_target (defaults via apply_defaults). Judge system/user templates are ported verbatim from the paper's eval repo and live as companion .txt files under pyrit/score/true_false/xl_safety_bench_prompts/.
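As a rough illustration of the jailbreak judge's boolean (ASR = H OR NC) and of how the per-response criteria could be aggregated downstream, consider the sketch below. `JailbreakCriteria`, `is_attack_success`, and `attack_success_rate` are hypothetical names, not the scorer's API. The mapping from criteria to the five labels is left out on purpose because it is not spelled out in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JailbreakCriteria:
    """Hypothetical container for the four binary judge criteria."""
    harmfulness: bool
    non_compliance: bool
    specificity: bool
    relevance: bool


def is_attack_success(criteria: JailbreakCriteria) -> bool:
    """Per-response boolean as stated in the PR: ASR = H OR NC."""
    return criteria.harmfulness or criteria.non_compliance


def attack_success_rate(results: list[JailbreakCriteria]) -> float:
    """Fraction of responses counted as attack successes (0.0 if empty)."""
    if not results:
        return 0.0
    return sum(is_attack_success(c) for c in results) / len(results)
```

Because all four criteria stay available alongside the boolean, other aggregates can be computed from the same records without re-running the judge.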

Tests

tests/unit/datasets/test_xl_safety_bench_dataset.py — 17 tests
tests/unit/score/test_xl_safety_bench_judges.py — 17 tests

Covers: dataset name, country/category filters, metadata propagation, language_mode, hidden_violation propagation, empty/wrong-enum filters, deduplication, all 5 jailbreak labels (parametrized), the relevance × cultural-aware matrix, malformed-JSON retry, and missing-metadata fallback.
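For readers unfamiliar with the malformed-JSON retry case, the general pattern looks something like the sketch below. It is a generic illustration and not how the scorers in this PR implement retries. `parse_with_retry` and `ask_judge` are hypothetical names, and the attempt limit is an arbitrary assumption.

```python
import json
from typing import Callable


def parse_with_retry(ask_judge: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the judge until it returns a parseable JSON object.

    ask_judge is a hypothetical zero-argument callable returning the raw reply.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = ask_judge()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # malformed reply: ask again
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("judge reply was valid JSON but not an object")
    raise ValueError(
        f"no valid JSON object after {max_attempts} attempts"
    ) from last_error
```

A unit test for this shape can feed a scripted sequence of replies (one malformed, then one valid) and assert that the second reply is the one returned.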

Notes for reviewers

License is CC-BY-4.0 — attribution is preserved in SeedPrompt.authors/source/description.
Cultural dataset intentionally omits the "default" tag (innocuous-by-construction); jailbreak track has it.
Verbatim judge prompts use string.Template (JSON braces + $var/${var} placeholders) and live in .txt files (line lengths > 120).
Paper carries a content warning — same precedent as Aya / ALERT / JBB.
@amandajean119 — please pull in for review since you're a paper co-author.
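On the string.Template note above: a small sketch of why `$var` / `${var}` placeholders coexist with literal JSON braces, whereas `str.format` would try to interpret those braces as fields. The template text here is made up for illustration and is not one of the ported judge prompts. `JUDGE_TEMPLATE`, `render`, and the JSON keys are hypothetical.

```python
from string import Template

# Hypothetical template text; the real prompts live in the companion .txt files.
JUDGE_TEMPLATE = Template(
    "Response to evaluate: ${response}\n"
    "Hidden sensitivity: $hidden_violation\n"
    'Reply with JSON only: {"relevance": true, "cultural_awareness": false}'
)


def render(response: str, hidden_violation: str) -> str:
    """Fill the placeholders; literal JSON braces pass through untouched."""
    return JUDGE_TEMPLATE.substitute(
        response=response,
        hidden_violation=hidden_violation,
    )
```

With `str.format`, the same text would need every literal brace doubled, which would mean editing prompts that are meant to stay verbatim.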

Deferred follow-ups

XLSafetyBenchJailbreak / XLSafetyBenchCultural Scenario classes
Example notebook under doc/code/scenarios/

Verification

ruff check ✅
ruff format --check ✅
ty check ✅
All new + existing tests/unit/{datasets,score} tests pass (34 new, 1540 total)

Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity (CC-BY-4.0, AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
- language_mode='local'|'english' selects scenario language
- hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory, XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R), produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe) with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition; CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters, metadata propagation, language modes, malformed-JSON retries, and the full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz wants to merge 1 commit into microsoft:main from romanlutz:romanlutz/explore-benchmark-integration
