Add XL-SafetyBench datasets and judges#1791

Draft
romanlutz wants to merge 1 commit into
microsoft:mainfrom
romanlutz:romanlutz/explore-benchmark-integration


Summary

Adds the XL-SafetyBench benchmark, a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity, to PyRIT. The benchmark is co-authored by PyRIT contributor Amanda Minnich.

This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.

What's in this PR

Datasets

Two new providers in pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py cover the benchmark's two tracks (5,500 native-validated prompts in total across 10 country-language pairs):

| Provider | Size | Track |
| --- | --- | --- |
| _XLSafetyBenchJailbreakDataset | 4,500 prompts | Country-grounded adversarial prompts in 5 harm categories |
| _XLSafetyBenchCulturalDataset | 1,000 prompts | Innocuous scenarios embedding local taboos/sensitivities in 6 categories |

Shared enums: XLSafetyBenchCountry (10 values), XLSafetyBenchJailbreakCategory (5 values), XLSafetyBenchCulturalCategory (6 values). The cultural dataset supports language_mode="local" | "english"; hidden_violation from the source CSV is preserved in SeedPrompt.metadata so the cultural judge can use it.
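
As a rough illustration of the language_mode and hidden_violation behavior described above, the sketch below maps one CSV row to a prompt object. It is not the code in this PR: SeedPromptSketch, row_to_seed, and the column names scenario_local and scenario_english are hypothetical stand-ins; only hidden_violation and the two language_mode values come from the description.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class SeedPromptSketch:
    """Hypothetical stand-in for a seed prompt; not PyRIT's SeedPrompt."""

    value: str
    metadata: dict[str, str] = field(default_factory=dict)


def row_to_seed(
    row: dict[str, str],
    language_mode: Literal["local", "english"] = "local",
) -> SeedPromptSketch:
    """Pick the scenario text by language and carry hidden_violation along."""
    if language_mode not in ("local", "english"):
        raise ValueError(f"Unsupported language_mode: {language_mode!r}")
    # Column names are assumptions made for this sketch.
    text_column = "scenario_local" if language_mode == "local" else "scenario_english"
    return SeedPromptSketch(
        value=row[text_column],
        # Preserved so a downstream cultural judge can read it.
        metadata={"hidden_violation": row["hidden_violation"]},
    )


example_row = {
    "scenario_local": "(scenario text in the local language)",
    "scenario_english": "(scenario text in English)",
    "hidden_violation": "(description of the embedded sensitivity)",
}
seed = row_to_seed(example_row, language_mode="english")
```

Keeping hidden_violation in metadata rather than in the prompt text means the target model never sees it, while the judge still can.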

Judges

  • XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean for attack success (ASR = H OR NC). The full label (Critical Unsafe, Direct Unsafe, Soft Unsafe, Ideal Safe, Neutral Safe) and all four binary criteria are exposed in score_metadata for downstream ASR/NSR aggregation.
  • XLSafetyBenchCulturalJudgeScorer: relevance + cultural-recognition judge that consumes the embedded hidden_violation from the seed metadata. Returns True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).
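
The ASR rule stated for the jailbreak judge (ASR = H OR NC) could be aggregated downstream along these lines. This is a minimal sketch, not the scorer's implementation: the dictionary keys harmfulness and non_compliance and both function names are assumptions, and the actual layout of score_metadata may differ.

```python
def is_attack_success(criteria: dict[str, bool]) -> bool:
    """Per-response rule from the description: ASR = H OR NC."""
    # Key names are hypothetical; adapt them to the real score_metadata.
    return criteria["harmfulness"] or criteria["non_compliance"]


def attack_success_rate(all_criteria: list[dict[str, bool]]) -> float:
    """Fraction of responses counted as successful attacks."""
    if not all_criteria:
        return 0.0
    successes = sum(is_attack_success(c) for c in all_criteria)
    return successes / len(all_criteria)


judged = [
    {"harmfulness": True, "non_compliance": False},
    {"harmfulness": False, "non_compliance": True},
    {"harmfulness": False, "non_compliance": False},
]
rate = attack_success_rate(judged)  # two of the three responses count
```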

Both judges accept an injectable chat_target (defaults via apply_defaults). Judge system/user templates are ported verbatim from the paper's eval repo and live as companion .txt files under pyrit/score/true_false/xl_safety_bench_prompts/.

Tests

  • tests/unit/datasets/test_xl_safety_bench_dataset.py — 17 tests
  • tests/unit/score/test_xl_safety_bench_judges.py — 17 tests

Covers: dataset name, country/category filters, metadata propagation, language_mode, hidden_violation propagation, empty/wrong-enum filters, deduplication, all 5 jailbreak labels (parametrized), the relevance × cultural-aware matrix, malformed-JSON retry, and missing-metadata fallback.

Notes for reviewers

  • License is CC-BY-4.0 — attribution is preserved in SeedPrompt.authors/source/description.
  • The cultural dataset intentionally omits the "default" tag because its prompts are innocuous by construction; the jailbreak dataset includes it.
  • Verbatim judge prompts use string.Template, so literal JSON braces can sit next to $var/${var} placeholders without escaping. They live in .txt files because their line lengths exceed 120 characters.
  • The paper carries a content warning, following the same precedent as Aya / ALERT / JBB.
  • @amandajean119 — please pull in for review since you're a paper co-author.
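
The note on string.Template can be seen in a small standalone example. The template text here is invented for illustration and is not one of the paper's judge prompts; the point is only that $-style placeholders leave JSON braces untouched, whereas str.format tries to interpret them as fields.

```python
import json
from string import Template

# Illustrative template only; NOT one of the ported judge prompts.
JUDGE_TEMPLATE_TEXT = (
    "Prompt: $prompt\n"
    "Response: ${response}\n"
    'Reply with JSON only: {"harmful": true, "relevant": false}\n'
)

# string.Template replaces only $name / ${name}; braces pass through as-is.
rendered = Template(JUDGE_TEMPLATE_TEXT).substitute(
    prompt="(user prompt)",
    response="(model response)",
)

# str.format treats the JSON braces as replacement fields and fails.
try:
    JUDGE_TEMPLATE_TEXT.format(prompt="(user prompt)", response="(model response)")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

# The JSON example embedded in the rendered prompt is still valid JSON.
json_part = json.loads(rendered[rendered.index("{"):])
```

Two details matter when editing such templates: substitute raises KeyError if a placeholder has no value, and a literal dollar sign must be written as $$.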

Deferred follow-ups

  • XLSafetyBenchJailbreak / XLSafetyBenchCultural Scenario classes
  • Example notebook under doc/code/scenarios/

Verification

  • ruff check
  • ruff format --check
  • ty check
  • All new + existing tests/unit/{datasets,score} tests pass (34 new, 1540 total)

Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of
XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural
benchmark for LLM safety and cultural sensitivity (CC-BY-4.0,
AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
  - language_mode='local'|'english' selects scenario language
  - hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory,
  XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R),
  produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe)
  with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition;
  CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored
  as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters,
metadata propagation, language modes, malformed-JSON retries, and the
full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>