Standardize Output Schema for Evaluators #46436
Conversation
…matting (#46336)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

…ed properties handling (#46355)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…/standardize_output_schema
…ndardize_output_schema' of https://github.com/Azure/azure-sdk-for-python into mohessie/standardize_output_schema
Remove assignment of match_result to additional_properties_metrics['label']
Two fixes for failing e2e tests on the standardize_output_schema PR:
1. _rouge.py: '*_result' keys were used to index the binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes 6 test_math_evaluator_rouge_score tests that failed with KeyError.
2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present; a sketch of this post-processing follows the commit list. Fixes 6 multimodal content-safety tests expecting 103 output columns, including the new '_passed' fields.
…#46900)
* Initial plan
* Align evaluator metric mappings with single-metric output schema
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
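For illustration, here is a minimal sketch of the `*_passed` injection described in the fix above (the helper name, threshold handling, and pass condition are assumptions, not the actual `_base_eval.py` code):

```python
import math


def _inject_pass_fail(metrics: dict, threshold: float = 3.0) -> dict:
    """Hypothetical sketch: synthesize '<metric>_passed', '<metric>_result',
    and '<metric>_threshold' for any '<metric>_score' that lacks them."""
    out = dict(metrics)
    for key, score in metrics.items():
        if not key.endswith("_score") or not isinstance(score, (int, float)):
            continue
        base = key[: -len("_score")]
        # Skip synthesis when the evaluator already emitted these keys.
        if f"{base}_result" in out and f"{base}_threshold" in out:
            continue
        passed = not math.isnan(score) and score >= threshold
        out.setdefault(f"{base}_passed", passed)
        out.setdefault(f"{base}_result", "pass" if passed else "fail")
        out.setdefault(f"{base}_threshold", threshold)
    return out
```

A call like `_inject_pass_fail({"fluency_score": 4.0})` would then yield the `_passed`, `_result`, and `_threshold` companions alongside the score.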
Pull request overview
This PR standardizes the output schema across all evaluators in azure-ai-evaluation. Each evaluator now emits a consistent set of keys (<metric>, <metric>_score, <metric>_passed, <metric>_result, <metric>_reason, <metric>_status, <metric>_threshold, <metric>_properties) instead of a mix of bare keys (e.g., fluency), gpt_<metric> keys, and flat token/sample fields. Prompty files are updated to return JSON (with status: completed|skipped, score, reason, properties), and the base prompty evaluator parses this shape uniformly. Tests are updated accordingly; CHANGELOG mentions only the mapping change for document_retrieval/rouge_score.
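For a single metric such as fluency, the uniform key set described above would look roughly like this (values and the exact `properties` contents are illustrative assumptions, not output from a real run):

```python
# Illustrative shape of the standardized per-metric output (fluency shown).
result = {
    "fluency": 4.0,                 # bare key retained as the primary value
    "fluency_score": 4.0,
    "fluency_passed": True,
    "fluency_result": "pass",
    "fluency_reason": "The response reads naturally with no grammatical errors.",
    "fluency_status": "completed",  # or "skipped"
    "fluency_threshold": 3.0,
    "fluency_properties": {         # token/sample metadata folded in here
        "prompt_tokens": 512,
        "completion_tokens": 48,
    },
}
```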
Changes:
- New uniform output schema (score/passed/result/reason/status/threshold/properties) across all evaluators; token/sample metadata folded into `properties`.
- Prompty files migrated to `json_object` response_format and a shared `{score, reason, properties, status}` contract supporting `skipped` evaluations (see the sketch after this list).
- `EVALUATOR_NAME_METRICS_MAPPINGS` collapses `document_retrieval` and `rouge_score` to single primary metrics; sub-metrics moved to `<metric>_properties`.
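Under that contract, a migrated prompty would return JSON along these lines, and the base prompty evaluator would parse it uniformly (field values and the parsing snippet are a sketch, not the actual `_base_prompty_eval.py` code):

```python
import json
import math

# Example payload under the {score, reason, properties, status} contract
# (values are illustrative).
llm_output = """
{
    "score": 4,
    "reason": "The response is coherent and well structured.",
    "properties": {"notes": "illustrative"},
    "status": "completed"
}
"""

parsed = json.loads(llm_output)
# Assumed handling: a "skipped" status maps to a not-applicable (NaN) score.
score = math.nan if parsed["status"] == "skipped" else float(parsed["score"])
```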
Reviewed changes
Copilot reviewed 53 out of 53 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| CHANGELOG.md | Adds breaking-changes note (incomplete — only covers mapping change). |
| assets.json | Bumps assets tag. |
| _constants.py | Collapses doc_retrieval/rouge metric mappings to single primary metric. |
| _common/_base_prompty_eval.py | Replaces XML parsing with JSON parsing; adds _get_token_metadata; removes _not_applicable_result. |
| _common/_base_eval.py | Skips synthesizing _result/_threshold when already present in evaluator output. |
| _evaluators/_bleu, _gleu, _meteor, _rouge, _f1_score | Updated to new schema with _score/_passed/_status/_properties and pass-fail booleans. |
| _evaluators/_document_retrieval | Wraps per-metric details under document_retrieval_properties; uses primary NDCG as top-level score. |
| _evaluators/_groundedness (incl. prompties) | JSON output; _real_call uses shared not-applicable helper. |
| _evaluators/_relevance, _coherence, _fluency, _similarity, _retrieval (prompties + _relevance.py) | JSON score/reason/status outputs; new unified schema. |
| _evaluators/_response_completeness (incl. prompty) | JSON output; drops legacy XML parser branch. |
| _evaluators/_intent_resolution (incl. prompty) | JSON output; new schema. |
| _evaluators/_task_adherence (incl. prompty) | Replaces flagged boolean with 0/1 score; new schema. |
| _evaluators/_task_completion (incl. prompty) | Replaces success with numeric score; new schema; adds skipped status. |
| _evaluators/_task_navigation_efficiency | Adds bare task_navigation_efficiency key alongside _score. |
| _evaluators/_tool_call_accuracy (incl. prompty) | Fixes self.threshold → self._threshold; raises max_tokens. |
| _evaluators/_tool_call_success, _tool_input_accuracy, _tool_output_utilization, _tool_selection (incl. prompties) | Migrate to JSON output schema with skipped support and unified keys. |
| tests/unittests/* and tests/e2etests/* | Updated to new schema; mass-evaluate e2e tests loosened with TODO placeholders. |
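As a concrete illustration of the `_constants.py`, `_rouge`, and `_document_retrieval` rows above: collapsing to a single primary metric means one top-level score, with the former sub-metrics folded into `<metric>_properties`. The sub-metric names below are assumptions based on the descriptions, not verified output:

```python
# Hypothetical collapsed rouge_score output: one primary metric at the top
# level, per-variant details moved into rouge_score_properties.
rouge_result = {
    "rouge_score": 0.47,
    "rouge_score_passed": False,
    "rouge_score_result": "fail",
    "rouge_score_threshold": 0.5,
    "rouge_score_properties": {
        "rouge_precision": 0.52,
        "rouge_recall": 0.43,
        "rouge_f1_score": 0.47,
    },
}
```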
Comments suppressed due to low confidence (2)
sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_mass_evaluate.py:288
Same loose assertion / unresolved TODO as in `test_evaluate_singleton_inputs`: the previous exact `len(metrics.keys()) == 88` was replaced with `>= 70`. This will hide schema regressions where extra unintended columns/metrics appear. Please verify against a live run and restore an exact equality assertion.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py:205
`_get_binary_result` returns `"unknown"` for NaN scores, but here `<metric>_passed` is computed as `score_result == "pass"`, which means an "unknown" outcome is reported as `False`. Same concern as the base class: consider returning `None` for `<metric>_passed` when the result is "unknown" to avoid misleading downstream consumers. This pattern is replicated in `_relevance.py`, `_intent_resolution.py`, `_task_adherence.py`, `_task_completion.py`, `_tool_input_accuracy.py`, `_tool_selection.py`, `_tool_call_success.py`, and `_tool_output_utilization.py`.
score_result = self._get_binary_result(score)
llm_properties.update(self._get_token_metadata(result if isinstance(result, dict) else {}))
return {
    self._result_key: score,
    f"{self._result_key}_score": score,
    f"{self._result_key}_passed": score_result == "pass",
    f"{self._result_key}_result": score_result,
Description
Standardize Output Schema for Evaluators
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines