Standardize Output Schema for Evaluators #46436
Conversation
…matting (#46336)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

…ed properties handling (#46355)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…/standardize_output_schema
…ndardize_output_schema' of https://github.com/Azure/azure-sdk-for-python into mohessie/standardize_output_schema
Remove assignment of match_result to additional_properties_metrics['label']
Two fixes for failing e2e tests on the standardize_output_schema PR:
1. _rouge.py: '*_result' keys were used to index the binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes 6 test_math_evaluator_rouge_score tests that failed with KeyError.
2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present; a sketch of this post-processing follows the commit list. Fixes 6 multimodal content-safety tests expecting 103 output columns, including the new '_passed' fields.
…#46900)
* Initial plan
* Align evaluator metric mappings with single-metric output schema
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
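For illustration, here is a minimal sketch of the `*_passed` injection described in the fix above (the helper name, threshold handling, and pass condition are assumptions, not the actual `_base_eval.py` code):

```python
import math


def _inject_pass_fail(metrics: dict, threshold: float = 3.0) -> dict:
    """Hypothetical sketch: synthesize '<metric>_passed', '<metric>_result',
    and '<metric>_threshold' for any '<metric>_score' that lacks them."""
    out = dict(metrics)
    for key, score in metrics.items():
        if not key.endswith("_score") or not isinstance(score, (int, float)):
            continue
        base = key[: -len("_score")]
        # Skip synthesis when the evaluator already emitted these keys.
        if f"{base}_result" in out and f"{base}_threshold" in out:
            continue
        passed = not math.isnan(score) and score >= threshold
        out.setdefault(f"{base}_passed", passed)
        out.setdefault(f"{base}_result", "pass" if passed else "fail")
        out.setdefault(f"{base}_threshold", threshold)
    return out
```

A call like `_inject_pass_fail({"fluency_score": 4.0})` would then yield the `_passed`, `_result`, and `_threshold` companions alongside the score.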
Pull request overview
This PR standardizes the output schema across all evaluators in azure-ai-evaluation. Each evaluator now emits a consistent set of keys (<metric>, <metric>_score, <metric>_passed, <metric>_result, <metric>_reason, <metric>_status, <metric>_threshold, <metric>_properties) instead of a mix of bare keys (e.g., fluency), gpt_<metric> keys, and flat token/sample fields. Prompty files are updated to return JSON (with status: completed|skipped, score, reason, properties), and the base prompty evaluator parses this shape uniformly. Tests are updated accordingly; CHANGELOG mentions only the mapping change for document_retrieval/rouge_score.
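For a single metric such as fluency, the uniform key set described above would look roughly like this (values and the exact `properties` contents are illustrative assumptions, not output from a real run):

```python
# Illustrative shape of the standardized per-metric output (fluency shown).
result = {
    "fluency": 4.0,                 # bare key retained as the primary value
    "fluency_score": 4.0,
    "fluency_passed": True,
    "fluency_result": "pass",
    "fluency_reason": "The response reads naturally with no grammatical errors.",
    "fluency_status": "completed",  # or "skipped"
    "fluency_threshold": 3.0,
    "fluency_properties": {         # token/sample metadata folded in here
        "prompt_tokens": 512,
        "completion_tokens": 48,
    },
}
```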
Changes:
- New uniform output schema (score/passed/result/reason/status/threshold/properties) across all evaluators; token/sample metadata folded into `properties`.
- Prompty files migrated to `json_object` response_format and a shared `{score, reason, properties, status}` contract supporting `skipped` evaluations (see the sketch after this list).
- `EVALUATOR_NAME_METRICS_MAPPINGS` collapses `document_retrieval` and `rouge_score` to single primary metrics; sub-metrics moved to `<metric>_properties`.
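Under that contract, a migrated prompty would return JSON along these lines, and the base prompty evaluator would parse it uniformly (field values and the parsing snippet are a sketch, not the actual `_base_prompty_eval.py` code):

```python
import json
import math

# Example payload under the {score, reason, properties, status} contract
# (values are illustrative).
llm_output = """
{
    "score": 4,
    "reason": "The response is coherent and well structured.",
    "properties": {"notes": "illustrative"},
    "status": "completed"
}
"""

parsed = json.loads(llm_output)
# Assumed handling: a "skipped" status maps to a not-applicable (NaN) score.
score = math.nan if parsed["status"] == "skipped" else float(parsed["score"])
```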
Reviewed changes
Copilot reviewed 53 out of 53 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| CHANGELOG.md | Adds breaking-changes note (incomplete — only covers mapping change). |
| assets.json | Bumps assets tag. |
| _constants.py | Collapses doc_retrieval/rouge metric mappings to single primary metric. |
| _common/_base_prompty_eval.py | Replaces XML parsing with JSON parsing; adds _get_token_metadata; removes _not_applicable_result. |
| _common/_base_eval.py | Skips synthesizing _result/_threshold when already present in evaluator output. |
| _evaluators/_bleu, _gleu, _meteor, _rouge, _f1_score | Updated to new schema with _score/_passed/_status/_properties and pass-fail booleans. |
| _evaluators/_document_retrieval | Wraps per-metric details under document_retrieval_properties; uses primary NDCG as top-level score. |
| _evaluators/_groundedness (incl. prompties) | JSON output; _real_call uses shared not-applicable helper. |
| _evaluators/_relevance, _coherence, _fluency, _similarity, _retrieval (prompties + _relevance.py) | JSON score/reason/status outputs; new unified schema. |
| _evaluators/_response_completeness (incl. prompty) | JSON output; drops legacy XML parser branch. |
| _evaluators/_intent_resolution (incl. prompty) | JSON output; new schema. |
| _evaluators/_task_adherence (incl. prompty) | Replaces flagged boolean with 0/1 score; new schema. |
| _evaluators/_task_completion (incl. prompty) | Replaces success with numeric score; new schema; adds skipped status. |
| _evaluators/_task_navigation_efficiency | Adds bare task_navigation_efficiency key alongside _score. |
| _evaluators/_tool_call_accuracy (incl. prompty) | Fixes self.threshold → self._threshold; raises max_tokens. |
| _evaluators/_tool_call_success, _tool_input_accuracy, _tool_output_utilization, _tool_selection (incl. prompties) | Migrate to JSON output schema with skipped support and unified keys. |
| tests/unittests/* and tests/e2etests/* | Updated to new schema; mass-evaluate e2e tests loosened with TODO placeholders. |
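As a concrete illustration of the `_constants.py`, `_rouge`, and `_document_retrieval` rows above: collapsing to a single primary metric means one top-level score, with the former sub-metrics folded into `<metric>_properties`. The sub-metric names below are assumptions based on the descriptions, not verified output:

```python
# Hypothetical collapsed rouge_score output: one primary metric at the top
# level, per-variant details moved into rouge_score_properties.
rouge_result = {
    "rouge_score": 0.47,
    "rouge_score_passed": False,
    "rouge_score_result": "fail",
    "rouge_score_threshold": 0.5,
    "rouge_score_properties": {
        "rouge_precision": 0.52,
        "rouge_recall": 0.43,
        "rouge_f1_score": 0.47,
    },
}
```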
Comments suppressed due to low confidence (2)
sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_mass_evaluate.py:288
Same loose assertion / unresolved TODO as in `test_evaluate_singleton_inputs`: the previous exact `len(metrics.keys()) == 88` was replaced with `>= 70`. This will hide schema regressions where extra unintended columns/metrics appear. Please verify against a live run and restore an exact equality assertion.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py:205
`_get_binary_result` returns `"unknown"` for NaN scores, but here `<metric>_passed` is computed as `score_result == "pass"`, which means an "unknown" outcome is reported as `False`. Same concern as the base class: consider returning `None` for `<metric>_passed` when the result is "unknown" to avoid misleading downstream consumers. This pattern is replicated in `_relevance.py`, `_intent_resolution.py`, `_task_adherence.py`, `_task_completion.py`, `_tool_input_accuracy.py`, `_tool_selection.py`, `_tool_call_success.py`, and `_tool_output_utilization.py`.
score_result = self._get_binary_result(score)
llm_properties.update(self._get_token_metadata(result if isinstance(result, dict) else {}))
return {
    self._result_key: score,
    f"{self._result_key}_score": score,
    f"{self._result_key}_passed": score_result == "pass",
    f"{self._result_key}_result": score_result,
Description
Standardize Output Schema for Evaluators
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines