Align evaluator metric mapping for standardized single-metric outputs #46900
Merged
m7md7sien merged 2 commits into mohessie/standardize_output_schema on May 15, 2026
Conversation
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix evaluator metric mappings for output schema" to "Align evaluator metric mapping for standardized single-metric outputs" on May 14, 2026
m7md7sien approved these changes on May 14, 2026
Merged commit 51faed1 into mohessie/standardize_output_schema. 4 checks passed.
m7md7sien added a commit that referenced this pull request on May 15, 2026:
* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (#46336)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971)
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1)
* Fix tool call accuracy test for skipped output schema (#46356)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22)
* Standradize Output Scheme
* Add explicit _KEY_PREFIX/_RESULT_KEY
* add missing evaluators to init
* Align evaluator unit tests with new unified output schema
* Update recordings tag to solve e2e tests
* Run formatting
* Align evaluator unit tests with unified output schema and refresh recordings
* Restore legacy `_result` and bare evaluator-name keys for backward compat
* resolve conflict
* Refresh azure-ai-evaluation test recordings for standardized evaluator output schema
* Update multimodal test assertion for new schema and refresh recordings tag
* Remove unused label assignment in navigation efficiency (remove assignment of match_result to additional_properties_metrics['label'])
* update _return_not_applicable_result
* Return "not_applicable" instead of "pass"
* update evaluators
* Fix error
* Add results back
* undo unrelated change
* undo key_prefix change
* Revert `_evaluate.py` changes from #46436 on `mohessie/standardize_output_schema` (#46835): restore the file from main
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8462065c-c6cf-473a-9421-84eaf0a44b5b)
* update tool_selection prompty
* Fix evaluation unit tests: replace `_KEY_PREFIX` with `_RESULT_KEY` across 7 test files and align test expectations (#46852)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/b75cef24-3217-4d44-a0ad-51d690e90035)
* reformatting
* Fix rouge KeyError and inject _passed key in base evaluator. Two fixes for failing e2e tests on the standardize_output_schema PR:
  1. `_rouge.py`: `*_result` keys were used to index the `binary_results` dict, but `_get_binary_result()` returns `*_passed` keys. Fixes 6 `test_math_evaluator_rouge_score` tests that failed with `KeyError`.
  2. `_base_eval.py`: `_real_call` post-processing now auto-injects `*_passed` boolean keys (alongside `*_result` and `*_threshold`) when only `*_score` is present. Fixes 6 multimodal content-safety tests expecting 103 output columns, including the new `_passed` fields.
* Fix key errors
* update test records
* Update recordings
* Fix result key assignment in base prompt evaluation
* Change 'reasoning' to 'reason' in evaluation prompt
* Update _document_retrieval.py
* Update task instruction from 'reasoning' to 'reason'
* update records
* Add ndcg_score to document retrieval results
* Align evaluator metric mapping for standardized single-metric outputs (#46900): align evaluator metric mappings with single-metric output schema
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7)

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
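The `_base_eval.py` fix described in the commit list above (auto-injecting `*_passed` keys when only a `*_score` key is present) can be sketched roughly as follows. This is an illustration, not the SDK's actual `_real_call` post-processing; `inject_passed_keys` is a hypothetical helper and the single shared `threshold` argument is an assumption:

```python
# Illustrative sketch (NOT the SDK's actual _real_call post-processing):
# when a result row has "<metric>_score" but no companion keys, inject
# "<metric>_passed", "<metric>_result", and "<metric>_threshold".
EVALUATION_PASS_FAIL_MAPPING = {True: "pass", False: "fail"}

def inject_passed_keys(result: dict, threshold: float) -> dict:
    out = dict(result)  # don't mutate the caller's dict
    for key, value in result.items():
        if not key.endswith("_score"):
            continue
        base = key[: -len("_score")]
        if f"{base}_passed" not in out:
            passed = value >= threshold
            out[f"{base}_passed"] = passed
            out[f"{base}_result"] = EVALUATION_PASS_FAIL_MAPPING[passed]
            out[f"{base}_threshold"] = threshold
    return out

row = inject_passed_keys({"rouge_score": 0.82}, threshold=0.5)
# row now also carries rouge_passed / rouge_result / rouge_threshold
```

A pure function that copies the input keeps the sketch easy to test; the real fix lives inside the base evaluator's call path.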
`DocumentRetrievalEvaluator` and `RougeScoreEvaluator` were updated to emit standardized single-primary-metric outputs, but `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` still pointed to legacy multi-metric lists. This caused the AOAI conversion paths to treat metrics that are no longer emitted as primary/expected.

Constants alignment (`_constants.py`):

- `document_retrieval`: `["document_retrieval"]`
- `rouge_score`: `["rouge"]`
- `EVAL_CLASS_NAME_MAP` unchanged (`DocumentRetrievalEvaluator` -> `document_retrieval`, `RougeScoreEvaluator` -> `rouge_score`)

Changelog update (
`CHANGELOG.md`): `*_properties`.

Warning: firewall rules blocked the agent from connecting to one or more addresses:

- pypi.org (dns block; hit while pip was installing build requirements such as `setuptools>=40.8.0`)
- scanning-api.github.com (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Context
PR #46436 (Standardize Output Schema for Evaluators, branch `mohessie/standardize_output_schema`) updated several evaluator classes, including `DocumentRetrievalEvaluator` and `RougeScoreEvaluator`, to emit the unified output schema (a single primary metric plus `_score`/`_passed`/`_result`/`_reason`/`_status`/`_threshold`/`_properties`).

However, the `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` dictionary in `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py` still references the old multi-metric lists for these two evaluators. This is the static lookup that powers the AOAI result conversion in `_evaluate.py` (`_extract_metrics_from_evaluator_name`, `_extract_from_evaluator_base`, `_is_primary_metric`, `_calculate_aoai_evaluation_summary`, `_get_metric_from_criteria`), so it must be kept in lockstep with what the evaluators actually emit. Otherwise:

- `_is_primary_metric` keeps treating the old (no-longer-emitted) first list entry (`"xdcg@3"` / `"rouge_f1_score"`) as the primary metric, which makes `_calculate_aoai_evaluation_summary`'s `result_counts.passed/failed` drop to 0 for these evaluators.
- `_add_error_summaries` (via `_extract_from_evaluator_base`) advertises 9 (`document_retrieval`) / 3 (`rouge_score`) metrics that are no longer emitted at the top level, producing spurious error-status entries in `_evaluation_results_list`.
- `_get_metric_from_criteria` will try prefix-matching the new `document_retrieval_*` / `rouge_*` column suffixes against the old metric names and only match by accident (falling through to step 4).

New emitted top-level keys (already in this branch)
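Both evaluators' dict literals in this section follow the same suffix pattern. As a minimal sketch of that pattern (`build_output` is a hypothetical helper, not an SDK function, and the argument values below are made up):

```python
# Hypothetical helper showing the unified single-primary-metric shape:
# the bare key plus _score/_passed/_result/_reason/_status/_threshold/
# _properties companions. Not part of the SDK.
EVALUATION_PASS_FAIL_MAPPING = {True: "pass", False: "fail"}

def build_output(key, score, passed, threshold, properties):
    return {
        key: score,
        f"{key}_score": score,
        f"{key}_passed": passed,
        f"{key}_result": EVALUATION_PASS_FAIL_MAPPING[passed],
        f"{key}_reason": None,
        f"{key}_status": "completed",
        f"{key}_threshold": threshold,
        f"{key}_properties": properties,  # legacy sub-metrics live here
    }

out = build_output("rouge", 0.82, True, 0.5, {"rouge_precision": 0.8})
```

The concrete dicts below are instances of this shape with `key` set to `"document_retrieval"` and `"rouge"`.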
`DocumentRetrievalEvaluator` (see `_document_retrieval.py`):

```python
{
    "document_retrieval": ndcg_score,
    "document_retrieval_score": ndcg_score,
    "document_retrieval_passed": ndcg_passed,
    "document_retrieval_result": EVALUATION_PASS_FAIL_MAPPING[ndcg_passed],
    "document_retrieval_reason": None,
    "document_retrieval_status": "completed",
    "document_retrieval_threshold": self._threshold,
    "document_retrieval_properties": metrics,  # the old 9 sub-metrics live here now
}
```

`RougeScoreEvaluator` (see `_rouge.py`):

```python
{
    "rouge": rouge_f1_score,
    "rouge_score": rouge_f1_score,
    "rouge_passed": is_passed,
    "rouge_result": EVALUATION_PASS_FAIL_MAPPING[is_passed],
    "rouge_reason": None,
    "rouge_status": "completed",
    "rouge_threshold": self._threshold["f1_score"],
    "rouge_properties": {
        "rouge_precision": ...,
        "rouge_recall": ...,
        "rouge_f1_score": ...,
        # plus per-sub-metric results/passed/thresholds
    },
}
```

Required change
Update `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` in `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py` so that the entries match the new single-primary-metric schema.

Notes:

- The `EVAL_CLASS_NAME_MAP` entries `"DocumentRetrievalEvaluator": "document_retrieval"` and `"RougeScoreEvaluator": "rouge_score"` are already correct and must be left as-is. The lookup chain is class name -> evaluator key (here) -> metric list, so the metric list is the only piece that needs to change.
- The comment on `_EvaluatorMetricMapping` says the mapping is "based on assets.json"; once the corresponding `Azure/azureml-assets` `document_retrieval` spec.yaml and `rouge_score` spec.yaml output schemas are updated to a single metric, both sides will be aligned.

Acceptance criteria
- `EVALUATOR_NAME_METRICS_MAPPINGS["document_retrieval"]` equals `["document_retrieval"]`.
- `EVALUATOR_NAME_METRICS_MAPPINGS["rouge_score"]` equals `["rouge"]`.
- No other entries in `EVALUATOR_NAME_METRICS_MAPPINGS` or `EVAL_CLASS_NAME_MAP` are changed.
- Unit tests covering `_get_metric_from_criteria`, `_is_primary_metric`, and `_calculate_aoai_evaluation_summary` in `sdk/evaluation/azure-ai-evaluation/tests/unittests...` pass.

This pull request was created from Copilot chat.
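The mapping change and acceptance criteria can be condensed into a quick self-check sketch. This is an illustration, not the SDK's actual code: only the two affected entries are shown, and `is_primary_metric` is a hypothetical stand-in for `_is_primary_metric` (which, per the description above, treats the first list entry as primary):

```python
# Only the two entries touched by this PR; the real
# _EvaluatorMetricMapping in _constants.py holds many more evaluators.
EVAL_CLASS_NAME_MAP = {
    "DocumentRetrievalEvaluator": "document_retrieval",
    "RougeScoreEvaluator": "rouge_score",
}
EVALUATOR_NAME_METRICS_MAPPINGS = {
    "document_retrieval": ["document_retrieval"],
    "rouge_score": ["rouge"],
}

def metrics_for_class(cls_name: str) -> list:
    # Lookup chain: class name -> evaluator key -> metric list.
    return EVALUATOR_NAME_METRICS_MAPPINGS[EVAL_CLASS_NAME_MAP[cls_name]]

def is_primary_metric(metric_list: list, metric: str) -> bool:
    # Hypothetical stand-in: the first list entry counts as primary.
    return bool(metric_list) and metric_list[0] == metric

# With the updated lists the emitted single metrics are primary again;
# against the legacy first entries they were not, which zeroed the
# passed/failed counts in the AOAI summary.
assert metrics_for_class("RougeScoreEvaluator") == ["rouge"]
assert is_primary_metric(metrics_for_class("RougeScoreEvaluator"), "rouge")
assert not is_primary_metric(["rouge_f1_score", "rouge_precision"], "rouge")
```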