
Align evaluator metric mapping for standardized single-metric outputs#46900

Merged
m7md7sien merged 2 commits into mohessie/standardize_output_schema from copilot/fix-evaluator-metric-mappings on May 15, 2026
Conversation

Contributor

Copilot AI commented May 14, 2026

DocumentRetrievalEvaluator and RougeScoreEvaluator were updated to emit standardized single-primary-metric outputs, but _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS still pointed to the legacy multi-metric lists. As a result, the AOAI conversion paths treated metrics that are no longer emitted as primary/expected.

  • Constants alignment (_constants.py)

    • Updated:
      • document_retrieval: ["document_retrieval"]
      • rouge_score: ["rouge"]
    • Left all other evaluator mappings unchanged.
    • Left EVAL_CLASS_NAME_MAP unchanged (DocumentRetrievalEvaluator -> document_retrieval, RougeScoreEvaluator -> rouge_score).
  • Changelog update (CHANGELOG.md)

    • Added an unreleased breaking-change note that these evaluators now report single primary metrics, with prior sub-metrics represented under *_properties.
# before
"document_retrieval": ["xdcg@3", "ndcg@3", "fidelity", ...],
"rouge_score": ["rouge_f1_score", "rouge_precision", "rouge_recall"],

# after
"document_retrieval": ["document_retrieval"],
"rouge_score": ["rouge"],


Original prompt

Context

PR #46436 (Standardize Output Schema for Evaluators, branch mohessie/standardize_output_schema) updated several evaluator classes — including DocumentRetrievalEvaluator and RougeScoreEvaluator — to emit the unified output schema (a single primary metric plus _score/_passed/_result/_reason/_status/_threshold/_properties).

However, the _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS dictionary in sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py still references the old multi-metric lists for these two evaluators. This is the static lookup that powers the AOAI result conversion in _evaluate.py (_extract_metrics_from_evaluator_name, _extract_from_evaluator_base, _is_primary_metric, _calculate_aoai_evaluation_summary, _get_metric_from_criteria), so it must be kept in lockstep with what the evaluators actually emit. Otherwise:

  • _is_primary_metric keeps treating the old (no-longer-emitted) first list entry ("xdcg@3" / "rouge_f1_score") as the primary metric, which makes _calculate_aoai_evaluation_summary.result_counts.passed/failed drop to 0 for these evaluators.
  • _add_error_summaries (via _extract_from_evaluator_base) advertises 9 (doc_retrieval) / 3 (rouge_score) metrics that are no longer emitted at the top level, producing spurious error-status entries in _evaluation_results_list.
  • _get_metric_from_criteria will try prefix-matching the new document_retrieval_* / rouge_* column suffixes against the old metric names and only match by accident (falling through to step 4).
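The first bullet can be sketched in a few lines. This is an illustrative stand-in, not the SDK's actual code: only the described behavior (the first list entry is treated as the primary metric) is taken from the issue text, and the function body is an assumption.

```python
# Illustrative sketch only: a minimal stand-in for _is_primary_metric,
# assuming the first entry of the metric list is treated as primary.
EVALUATOR_NAME_METRICS_MAPPINGS = {
    # Stale, pre-fix mapping (trimmed for brevity).
    "document_retrieval": ["xdcg@3", "ndcg@3", "fidelity"],
}

def is_primary_metric(evaluator_name: str, metric: str) -> bool:
    metrics = EVALUATOR_NAME_METRICS_MAPPINGS.get(evaluator_name, [])
    return bool(metrics) and metric == metrics[0]

# The evaluator now emits "document_retrieval" as its only top-level
# metric, so the stale mapping never flags it as primary:
print(is_primary_metric("document_retrieval", "document_retrieval"))  # False

# After the fix in this PR the bare evaluator-name metric is primary again:
EVALUATOR_NAME_METRICS_MAPPINGS["document_retrieval"] = ["document_retrieval"]
print(is_primary_metric("document_retrieval", "document_retrieval"))  # True
```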

New emitted top-level keys (already in this branch)

DocumentRetrievalEvaluator (see _document_retrieval.py)

{
    "document_retrieval": ndcg_score,
    "document_retrieval_score": ndcg_score,
    "document_retrieval_passed": ndcg_passed,
    "document_retrieval_result": EVALUATION_PASS_FAIL_MAPPING[ndcg_passed],
    "document_retrieval_reason": None,
    "document_retrieval_status": "completed",
    "document_retrieval_threshold": self._threshold,
    "document_retrieval_properties": metrics,  # the old 9 sub-metrics live here now
}
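Given these emitted keys, the pass/fail-count collapse described above can be shown with a toy counter. The function name and flat row shape here are assumptions for illustration only, not the SDK's _calculate_aoai_evaluation_summary implementation:

```python
def summarize(rows, primary_metric):
    # Count rows whose "<primary_metric>_result" column is pass/fail.
    passed = sum(1 for r in rows if r.get(f"{primary_metric}_result") == "pass")
    failed = sum(1 for r in rows if r.get(f"{primary_metric}_result") == "fail")
    return {"passed": passed, "failed": failed}

rows = [
    {"document_retrieval_result": "pass"},
    {"document_retrieval_result": "fail"},
    {"document_retrieval_result": "pass"},
]

# With the stale primary metric nothing matches, so counts collapse to 0:
print(summarize(rows, "xdcg@3"))              # {'passed': 0, 'failed': 0}
# With the corrected primary metric the counts are meaningful again:
print(summarize(rows, "document_retrieval"))  # {'passed': 2, 'failed': 1}
```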

RougeScoreEvaluator (see _rouge.py)

{
    "rouge": rouge_f1_score,
    "rouge_score": rouge_f1_score,
    "rouge_passed": is_passed,
    "rouge_result": EVALUATION_PASS_FAIL_MAPPING[is_passed],
    "rouge_reason": None,
    "rouge_status": "completed",
    "rouge_threshold": self._threshold["f1_score"],
    "rouge_properties": {
        "rouge_precision": ...,
        "rouge_recall": ...,
        "rouge_f1_score": ...,
        # plus per-sub-metric results/passed/thresholds
    },
}
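Consumers that previously read the three top-level ROUGE columns can still reach them under rouge_properties. A minimal sketch with made-up values:

```python
# Example row shaped like the RougeScoreEvaluator output above
# (values are invented for illustration).
row = {
    "rouge": 0.62,
    "rouge_threshold": 0.5,
    "rouge_properties": {
        "rouge_precision": 0.70,
        "rouge_recall": 0.55,
        "rouge_f1_score": 0.62,
    },
}

# The single primary metric stays at the top level...
primary = row["rouge"]
# ...while the legacy sub-metrics are nested one level down:
precision = row["rouge_properties"]["rouge_precision"]
print(primary, precision)  # 0.62 0.7
```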

Required change

Update _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS in sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py so that the entries match the new single-primary-metric schema:

# Before
"document_retrieval": [
    "xdcg@3",
    "ndcg@3",
    "fidelity",
    "top1_relevance",
    "top3_max_relevance",
    "holes",
    "holes_ratio",
    "total_retrieved_documents",
    "total_ground_truth_documents",
],
...
"rouge_score": ["rouge_f1_score", "rouge_precision", "rouge_recall"],

# After
"document_retrieval": ["document_retrieval"],
...
"rouge_score": ["rouge"],

Notes:

  • The EVAL_CLASS_NAME_MAP entries for "DocumentRetrievalEvaluator": "document_retrieval" and "RougeScoreEvaluator": "rouge_score" are already correct and must be left as-is. The lookup chain is class name → evaluator key (here) → metric list, so the metric list is the only piece that needs to change.
  • Keep the comment at the top of _EvaluatorMetricMapping that says the mapping is "based on assets.json" — once the corresponding Azure/azureml-assets document_retrieval spec.yaml and rouge_score spec.yaml output schemas are updated to a single metric, both sides will be aligned.
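The lookup chain from the first note can be sketched end to end. Both dicts below are trimmed to the two evaluators this PR touches; a real caller would import them from _constants.py:

```python
# Class name -> evaluator key (unchanged by this PR).
EVAL_CLASS_NAME_MAP = {
    "DocumentRetrievalEvaluator": "document_retrieval",
    "RougeScoreEvaluator": "rouge_score",
}

# Evaluator key -> metric list (the only piece this PR changes).
EVALUATOR_NAME_METRICS_MAPPINGS = {
    "document_retrieval": ["document_retrieval"],
    "rouge_score": ["rouge"],
}

def metrics_for_class(class_name: str) -> list:
    key = EVAL_CLASS_NAME_MAP[class_name]
    return EVALUATOR_NAME_METRICS_MAPPINGS[key]

print(metrics_for_class("RougeScoreEvaluator"))  # ['rouge']
```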

Acceptance criteria

  1. EVALUATOR_NAME_METRICS_MAPPINGS["document_retrieval"] equals ["document_retrieval"].
  2. EVALUATOR_NAME_METRICS_MAPPINGS["rouge_score"] equals ["rouge"].
  3. No other entries in EVALUATOR_NAME_METRICS_MAPPINGS or EVAL_CLASS_NAME_MAP are changed.
  4. Existing unit tests for _get_metric_from_criteria, _is_primary_metric, and _calculate_aoai_evaluation_summary in `sdk/evaluation/azure-ai-evaluation/tests/unittests...
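Criteria 1 and 2 could be checked with assertions along these lines; the dict literal is a stand-in snapshot, and a real test would import EVALUATOR_NAME_METRICS_MAPPINGS from _constants.py instead:

```python
# Stand-in snapshot of the expected post-fix mapping entries.
EVALUATOR_NAME_METRICS_MAPPINGS = {
    "document_retrieval": ["document_retrieval"],
    "rouge_score": ["rouge"],
}

def test_single_metric_mappings():
    assert EVALUATOR_NAME_METRICS_MAPPINGS["document_retrieval"] == ["document_retrieval"]
    assert EVALUATOR_NAME_METRICS_MAPPINGS["rouge_score"] == ["rouge"]

test_single_metric_mappings()
```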

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Fix evaluator metric mappings for output schema Align evaluator metric mapping for standardized single-metric outputs May 14, 2026
Copilot AI requested a review from m7md7sien May 14, 2026 20:39
@m7md7sien m7md7sien marked this pull request as ready for review May 15, 2026 00:23
@m7md7sien m7md7sien requested a review from a team as a code owner May 15, 2026 00:23
@m7md7sien m7md7sien merged commit 51faed1 into mohessie/standardize_output_schema May 15, 2026
4 checks passed
@m7md7sien m7md7sien deleted the copilot/fix-evaluator-metric-mappings branch May 15, 2026 00:23
m7md7sien added a commit that referenced this pull request May 15, 2026
* Update Tool Call Accuracy to output unified format

* Update tests

* reformatting

* Refactor not applicable result method calls

* Fix test assertions for new unified output format and apply black formatting (#46336)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix tool call accuracy test for skipped output schema (#46356)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Standradize Output Scheme

* Add explicit _KEY_PREFIX/_RESULT_KEY

* add missing evaluators to init

* Align evaluator unit tests with new unified output schema

* Update recordings tag to solve e2e tests

* Run formatting

* Align evaluator unit tests with unified output schema and refresh recordings

* Restore legacy `_result` and bare evaluator-name keys for backward compat

* resolve conflict

* Refresh azure-ai-evaluation test recordings for standardized evaluator output schema

* Update multimodal test assertion for new schema and refresh recordings tag

* Remove unused label assignment in navigation efficiency

Remove assignment of match_result to additional_properties_metrics['label']

* update _return_not_applicable_result

* Return "not_applicable" instead of "pass"

* update evaluators

* Fix error

* Add results back

* undo unrelated change

* undo key_prefix change

* Revert `_evaluate.py` changes from #46436 on `mohessie/standardize_output_schema` (#46835)

* Initial plan

* Revert _evaluate.py changes from PR 46436 by restoring file from main

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8462065c-c6cf-473a-9421-84eaf0a44b5b

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* update tool_selection prompty

* Fix evaluation unit tests: replace `_KEY_PREFIX` with `_RESULT_KEY` across 7 test files (#46852)

* Initial plan

* Fix evaluation unit test failures: replace _KEY_PREFIX with _RESULT_KEY and align test expectations

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/b75cef24-3217-4d44-a0ad-51d690e90035

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* reformatting

* Fix rouge KeyError and inject _passed key in base evaluator

Two fixes for failing e2e tests on standardize_output_schema PR:

1. _rouge.py: '*_result' keys were used to index binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes 6 test_math_evaluator_rouge_score tests that failed with KeyError.

2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present. Fixes 6 multimodal content-safety tests expecting 103 output columns including new '_passed' fields.

* Fix key errors

* update test records

* Update recordings

* Fix result key assignment in base prompt evaluation

* Change 'reasoning' to 'reason' in evaluation prompt

* Update _document_retrieval.py

* Update task instruction from 'reasoning' to 'reason'

* update records

* Add ndcg_score to document retrieval results

* Align evaluator metric mapping for standardized single-metric outputs (#46900)

* Initial plan

* Align evaluator metric mappings with single-metric output schema

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
