Align evaluator metric mapping for standardized single-metric outputs #46900
Merged
m7md7sien merged 2 commits into mohessie/standardize_output_schema on May 15, 2026
Conversation
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix evaluator metric mappings for output schema" to "Align evaluator metric mapping for standardized single-metric outputs" on May 14, 2026
m7md7sien approved these changes on May 14, 2026
Merged commit 51faed1 into mohessie/standardize_output_schema. 4 checks passed.
m7md7sien added a commit that referenced this pull request on May 15, 2026:
* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (#46336)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971)
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1)
* Fix tool call accuracy test for skipped output schema (#46356)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22)
* Standradize Output Scheme
* Add explicit _KEY_PREFIX/_RESULT_KEY
* add missing evaluators to init
* Align evaluator unit tests with new unified output schema
* Update recordings tag to solve e2e tests
* Run formatting
* Align evaluator unit tests with unified output schema and refresh recordings
* Restore legacy `_result` and bare evaluator-name keys for backward compat
* resolve conflict
* Refresh azure-ai-evaluation test recordings for standardized evaluator output schema
* Update multimodal test assertion for new schema and refresh recordings tag
* Remove unused label assignment in navigation efficiency (remove assignment of match_result to additional_properties_metrics['label'])
* update _return_not_applicable_result
* Return "not_applicable" instead of "pass"
* update evaluators
* Fix error
* Add results back
* undo unrelated change
* undo key_prefix change
* Revert `_evaluate.py` changes from #46436 on `mohessie/standardize_output_schema` (#46835): restore the file from main
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8462065c-c6cf-473a-9421-84eaf0a44b5b)
* update tool_selection prompty
* Fix evaluation unit tests: replace `_KEY_PREFIX` with `_RESULT_KEY` across 7 test files and align test expectations (#46852)
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/b75cef24-3217-4d44-a0ad-51d690e90035)
* reformatting
* Fix rouge KeyError and inject _passed key in base evaluator. Two fixes for failing e2e tests on the standardize_output_schema PR:
  1. `_rouge.py`: `*_result` keys were used to index the `binary_results` dict, but `_get_binary_result()` returns `*_passed` keys. Fixes 6 `test_math_evaluator_rouge_score` tests that failed with `KeyError`.
  2. `_base_eval.py`: `_real_call` post-processing now auto-injects `*_passed` boolean keys (alongside `*_result` and `*_threshold`) when only `*_score` is present. Fixes 6 multimodal content-safety tests expecting 103 output columns, including the new `_passed` fields.
* Fix key errors
* update test records
* Update recordings
* Fix result key assignment in base prompt evaluation
* Change 'reasoning' to 'reason' in evaluation prompt
* Update _document_retrieval.py
* Update task instruction from 'reasoning' to 'reason'
* update records
* Add ndcg_score to document retrieval results
* Align evaluator metric mapping for standardized single-metric outputs (#46900): align evaluator metric mappings with single-metric output schema
  (Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7)

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
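The `_base_eval.py` fix described in the commit list above (auto-injecting `*_passed` keys when only a `*_score` key is present) can be sketched roughly as follows. This is an illustration, not the SDK's actual `_real_call` post-processing; `inject_passed_keys` is a hypothetical helper and the single shared `threshold` argument is an assumption:

```python
# Illustrative sketch (NOT the SDK's actual _real_call post-processing):
# when a result row has "<metric>_score" but no companion keys, inject
# "<metric>_passed", "<metric>_result", and "<metric>_threshold".
EVALUATION_PASS_FAIL_MAPPING = {True: "pass", False: "fail"}

def inject_passed_keys(result: dict, threshold: float) -> dict:
    out = dict(result)  # don't mutate the caller's dict
    for key, value in result.items():
        if not key.endswith("_score"):
            continue
        base = key[: -len("_score")]
        if f"{base}_passed" not in out:
            passed = value >= threshold
            out[f"{base}_passed"] = passed
            out[f"{base}_result"] = EVALUATION_PASS_FAIL_MAPPING[passed]
            out[f"{base}_threshold"] = threshold
    return out

row = inject_passed_keys({"rouge_score": 0.82}, threshold=0.5)
# row now also carries rouge_passed / rouge_result / rouge_threshold
```

A pure function that copies the input keeps the sketch easy to test; the real fix lives inside the base evaluator's call path.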
`DocumentRetrievalEvaluator` and `RougeScoreEvaluator` were updated to emit standardized single-primary-metric outputs, but `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` still pointed to legacy multi-metric lists. This caused the AOAI conversion paths to treat metrics that are no longer emitted as primary/expected.

Constants alignment (`_constants.py`):

- `document_retrieval`: `["document_retrieval"]`
- `rouge_score`: `["rouge"]`
- `EVAL_CLASS_NAME_MAP` unchanged (`DocumentRetrievalEvaluator` -> `document_retrieval`, `RougeScoreEvaluator` -> `rouge_score`)

Changelog update (
`CHANGELOG.md`): `*_properties`.

Warning: firewall rules blocked the agent from connecting to one or more addresses:

- pypi.org (dns block; hit while pip was installing build requirements such as `setuptools>=40.8.0`)
- scanning-api.github.com (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Context
PR #46436 (Standardize Output Schema for Evaluators, branch `mohessie/standardize_output_schema`) updated several evaluator classes, including `DocumentRetrievalEvaluator` and `RougeScoreEvaluator`, to emit the unified output schema (a single primary metric plus `_score`/`_passed`/`_result`/`_reason`/`_status`/`_threshold`/`_properties`).

However, the `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` dictionary in `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py` still references the old multi-metric lists for these two evaluators. This is the static lookup that powers the AOAI result conversion in `_evaluate.py` (`_extract_metrics_from_evaluator_name`, `_extract_from_evaluator_base`, `_is_primary_metric`, `_calculate_aoai_evaluation_summary`, `_get_metric_from_criteria`), so it must be kept in lockstep with what the evaluators actually emit. Otherwise:

- `_is_primary_metric` keeps treating the old (no-longer-emitted) first list entry (`"xdcg@3"` / `"rouge_f1_score"`) as the primary metric, which makes `_calculate_aoai_evaluation_summary`'s `result_counts.passed/failed` drop to 0 for these evaluators.
- `_add_error_summaries` (via `_extract_from_evaluator_base`) advertises 9 (`document_retrieval`) / 3 (`rouge_score`) metrics that are no longer emitted at the top level, producing spurious error-status entries in `_evaluation_results_list`.
- `_get_metric_from_criteria` will try prefix-matching the new `document_retrieval_*` / `rouge_*` column suffixes against the old metric names and only match by accident (falling through to step 4).

New emitted top-level keys (already in this branch)
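Both evaluators' dict literals in this section follow the same suffix pattern. As a minimal sketch of that pattern (`build_output` is a hypothetical helper, not an SDK function, and the argument values below are made up):

```python
# Hypothetical helper showing the unified single-primary-metric shape:
# the bare key plus _score/_passed/_result/_reason/_status/_threshold/
# _properties companions. Not part of the SDK.
EVALUATION_PASS_FAIL_MAPPING = {True: "pass", False: "fail"}

def build_output(key, score, passed, threshold, properties):
    return {
        key: score,
        f"{key}_score": score,
        f"{key}_passed": passed,
        f"{key}_result": EVALUATION_PASS_FAIL_MAPPING[passed],
        f"{key}_reason": None,
        f"{key}_status": "completed",
        f"{key}_threshold": threshold,
        f"{key}_properties": properties,  # legacy sub-metrics live here
    }

out = build_output("rouge", 0.82, True, 0.5, {"rouge_precision": 0.8})
```

The concrete dicts below are instances of this shape with `key` set to `"document_retrieval"` and `"rouge"`.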
`DocumentRetrievalEvaluator` (see `_document_retrieval.py`):

```python
{
    "document_retrieval": ndcg_score,
    "document_retrieval_score": ndcg_score,
    "document_retrieval_passed": ndcg_passed,
    "document_retrieval_result": EVALUATION_PASS_FAIL_MAPPING[ndcg_passed],
    "document_retrieval_reason": None,
    "document_retrieval_status": "completed",
    "document_retrieval_threshold": self._threshold,
    "document_retrieval_properties": metrics,  # the old 9 sub-metrics live here now
}
```

`RougeScoreEvaluator` (see `_rouge.py`):

```python
{
    "rouge": rouge_f1_score,
    "rouge_score": rouge_f1_score,
    "rouge_passed": is_passed,
    "rouge_result": EVALUATION_PASS_FAIL_MAPPING[is_passed],
    "rouge_reason": None,
    "rouge_status": "completed",
    "rouge_threshold": self._threshold["f1_score"],
    "rouge_properties": {
        "rouge_precision": ...,
        "rouge_recall": ...,
        "rouge_f1_score": ...,
        # plus per-sub-metric results/passed/thresholds
    },
}
```

Required change
Update `_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS` in `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py` so that the entries match the new single-primary-metric schema.

Notes:

- The `EVAL_CLASS_NAME_MAP` entries `"DocumentRetrievalEvaluator": "document_retrieval"` and `"RougeScoreEvaluator": "rouge_score"` are already correct and must be left as-is. The lookup chain is class name -> evaluator key (here) -> metric list, so the metric list is the only piece that needs to change.
- The comment on `_EvaluatorMetricMapping` says the mapping is "based on assets.json"; once the corresponding `Azure/azureml-assets` `document_retrieval` spec.yaml and `rouge_score` spec.yaml output schemas are updated to a single metric, both sides will be aligned.

Acceptance criteria
- `EVALUATOR_NAME_METRICS_MAPPINGS["document_retrieval"]` equals `["document_retrieval"]`.
- `EVALUATOR_NAME_METRICS_MAPPINGS["rouge_score"]` equals `["rouge"]`.
- No other entries in `EVALUATOR_NAME_METRICS_MAPPINGS` or `EVAL_CLASS_NAME_MAP` are changed.
- Unit tests covering `_get_metric_from_criteria`, `_is_primary_metric`, and `_calculate_aoai_evaluation_summary` in `sdk/evaluation/azure-ai-evaluation/tests/unittests...` pass.

This pull request was created from Copilot chat.
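The mapping change and acceptance criteria can be condensed into a quick self-check sketch. This is an illustration, not the SDK's actual code: only the two affected entries are shown, and `is_primary_metric` is a hypothetical stand-in for `_is_primary_metric` (which, per the description above, treats the first list entry as primary):

```python
# Only the two entries touched by this PR; the real
# _EvaluatorMetricMapping in _constants.py holds many more evaluators.
EVAL_CLASS_NAME_MAP = {
    "DocumentRetrievalEvaluator": "document_retrieval",
    "RougeScoreEvaluator": "rouge_score",
}
EVALUATOR_NAME_METRICS_MAPPINGS = {
    "document_retrieval": ["document_retrieval"],
    "rouge_score": ["rouge"],
}

def metrics_for_class(cls_name: str) -> list:
    # Lookup chain: class name -> evaluator key -> metric list.
    return EVALUATOR_NAME_METRICS_MAPPINGS[EVAL_CLASS_NAME_MAP[cls_name]]

def is_primary_metric(metric_list: list, metric: str) -> bool:
    # Hypothetical stand-in: the first list entry counts as primary.
    return bool(metric_list) and metric_list[0] == metric

# With the updated lists the emitted single metrics are primary again;
# against the legacy first entries they were not, which zeroed the
# passed/failed counts in the AOAI summary.
assert metrics_for_class("RougeScoreEvaluator") == ["rouge"]
assert is_primary_metric(metrics_for_class("RougeScoreEvaluator"), "rouge")
assert not is_primary_metric(["rouge_f1_score", "rouge_precision"], "rouge")
```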