
Standardize Output Schema for Evaluators #46436

Merged
m7md7sien merged 54 commits into main from mohessie/standardize_output_schema
May 15, 2026

Conversation

@m7md7sien
Contributor

@m7md7sien m7md7sien commented Apr 21, 2026

Description

Standardize Output Schema for Evaluators

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

m7md7sien and others added 13 commits April 15, 2026 00:35
…matting (#46336)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…ed properties handling (#46355)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
@github-actions github-actions Bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Apr 21, 2026
m7md7sien added 12 commits May 13, 2026 04:51
Two fixes for failing e2e tests on the standardize_output_schema PR:

1. _rouge.py: '*_result' keys were used to index the binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes six test_math_evaluator_rouge_score tests that failed with KeyError.

2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present; see the sketch below. Fixes six multimodal content-safety tests expecting 103 output columns, including the new '_passed' fields.
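
A minimal sketch of that auto-injection step, assuming a flat per-row dict; the helper name and default threshold here are invented, and the pass/fail comparison direction can differ per evaluator (content-safety metrics treat lower scores as better):

    # Hypothetical sketch, not the actual _base_eval.py code: inject
    # '<metric>_passed' (plus '<metric>_result'/'<metric>_threshold') when an
    # evaluator emitted only '<metric>_score'.
    def inject_pass_fail(row: dict, threshold: float = 3.0) -> dict:
        for key in list(row):  # copy keys; we mutate row while iterating
            if not key.endswith("_score"):
                continue
            metric = key[: -len("_score")]
            if f"{metric}_passed" in row:
                continue  # evaluator already produced the new schema; leave it alone
            passed = row[key] is not None and row[key] >= threshold
            row[f"{metric}_passed"] = passed
            row.setdefault(f"{metric}_result", "pass" if passed else "fail")
            row.setdefault(f"{metric}_threshold", threshold)
        return row

    print(inject_pass_fail({"coherence_score": 4.0}))
    # {'coherence_score': 4.0, 'coherence_passed': True,
    #  'coherence_result': 'pass', 'coherence_threshold': 3.0}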
…#46900)

* Initial plan

* Align evaluator metric mappings with single-metric output schema

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
@m7md7sien m7md7sien marked this pull request as ready for review May 15, 2026 01:05
@m7md7sien m7md7sien requested a review from a team as a code owner May 15, 2026 01:05
Copilot AI review requested due to automatic review settings May 15, 2026 01:05
Contributor

Copilot AI left a comment


Pull request overview

This PR standardizes the output schema across all evaluators in azure-ai-evaluation. Each evaluator now emits a consistent set of keys (<metric>, <metric>_score, <metric>_passed, <metric>_result, <metric>_reason, <metric>_status, <metric>_threshold, <metric>_properties) instead of a mix of bare keys (e.g., fluency), gpt_<metric> keys, and flat token/sample fields. Prompty files are updated to return JSON (with status: completed|skipped, score, reason, properties), and the base prompty evaluator parses this shape uniformly. Tests are updated accordingly; CHANGELOG mentions only the mapping change for document_retrieval/rouge_score.
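
For illustration, the standardized shape for a single metric might look like this (metric name and values are invented; this is a sketch of the contract described above, not output captured from the SDK):

    # Illustrative example of the unified per-metric keys; values are made up.
    {
        "fluency": 4.0,  # bare key mirrors the _score value
        "fluency_score": 4.0,
        "fluency_passed": True,
        "fluency_result": "pass",
        "fluency_reason": "The response reads naturally and is grammatically sound.",
        "fluency_status": "completed",
        "fluency_threshold": 3.0,
        "fluency_properties": {"prompt_tokens": 512, "completion_tokens": 48},
    }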

Changes:

  • New uniform output schema (score/passed/result/reason/status/threshold/properties) across all evaluators; token/sample metadata folded into properties.
  • Prompty files migrated to json_object response_format and a shared {score, reason, properties, status} contract supporting skipped evaluations; a sketch of that JSON shape follows this list.
  • EVALUATOR_NAME_METRICS_MAPPINGS collapses document_retrieval and rouge_score to single primary metrics; sub-metrics moved to <metric>_properties.
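
A hedged sketch of the JSON a prompty returns under that contract; field values are invented. A completed evaluation:

    {
        "score": 4,
        "reason": "The answer addresses the question directly.",
        "properties": {},
        "status": "completed"
    }

and a skipped one:

    {
        "score": null,
        "reason": "Conversation contains no tool calls to evaluate.",
        "properties": {},
        "status": "skipped"
    }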

Reviewed changes

Copilot reviewed 53 out of 53 changed files in this pull request and generated 7 comments.

Show a summary per file
File: Description
CHANGELOG.md: Adds a breaking-changes note (incomplete; only covers the mapping change).
assets.json: Bumps the assets tag.
_constants.py: Collapses doc_retrieval/rouge metric mappings to a single primary metric each.
_common/_base_prompty_eval.py: Replaces XML parsing with JSON parsing; adds _get_token_metadata; removes _not_applicable_result.
_common/_base_eval.py: Skips synthesizing _result/_threshold when they are already present in the evaluator output.
_evaluators/_bleu, _gleu, _meteor, _rouge, _f1_score: Updated to the new schema with _score/_passed/_status/_properties and pass-fail booleans.
_evaluators/_document_retrieval: Wraps per-metric details under document_retrieval_properties; uses primary NDCG as the top-level score.
_evaluators/_groundedness (incl. prompties): JSON output; _real_call uses the shared not-applicable helper.
_evaluators/_relevance, _coherence, _fluency, _similarity, _retrieval (prompties + _relevance.py): JSON score/reason/status outputs; new unified schema.
_evaluators/_response_completeness (incl. prompty): JSON output; drops the legacy XML parser branch.
_evaluators/_intent_resolution (incl. prompty): JSON output; new schema.
_evaluators/_task_adherence (incl. prompty): Replaces the flagged boolean with a 0/1 score; new schema.
_evaluators/_task_completion (incl. prompty): Replaces success with a numeric score; new schema; adds skipped status.
_evaluators/_task_navigation_efficiency: Adds a bare task_navigation_efficiency key alongside _score.
_evaluators/_tool_call_accuracy (incl. prompty): Fixes self.threshold → self._threshold; raises max_tokens.
_evaluators/_tool_call_success, _tool_input_accuracy, _tool_output_utilization, _tool_selection (incl. prompties): Migrate to the JSON output schema with skipped support and unified keys.
tests/unittests/* and tests/e2etests/*: Updated to the new schema; mass-evaluate e2e tests loosened with TODO placeholders.
Comments suppressed due to low confidence (2)

sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_mass_evaluate.py:288

  • Same loose assertion / unresolved TODO as in test_evaluate_singleton_inputs: the previous exact len(metrics.keys()) == 88 was replaced with >= 70. This will hide schema regressions where extra, unintended columns/metrics appear. Please verify against a live run and restore an exact equality assertion.

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_response_completeness/_response_completeness.py:205

  • _get_binary_result returns "unknown" for NaN scores, but here <metric>_passed is computed as score_result == "pass", which means an "unknown" outcome is reported as False. Same concern as the base class: consider returning None for <metric>_passed when the result is "unknown" to avoid misleading downstream consumers. This pattern is replicated in _relevance.py, _intent_resolution.py, _task_adherence.py, _task_completion.py, _tool_input_accuracy.py, _tool_selection.py, _tool_call_success.py, and _tool_output_utilization.py.
            score_result = self._get_binary_result(score)

            llm_properties.update(self._get_token_metadata(result if isinstance(result, dict) else {}))

            return {
                self._result_key: score,
                f"{self._result_key}_score": score,
                f"{self._result_key}_passed": score_result == "pass",
                f"{self._result_key}_result": score_result,

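A minimal, self-contained sketch of the change the reviewer suggests, with an invented helper name standing in for the inline expression above:

    # Hypothetical sketch: report None instead of False when the score was
    # unclassifiable, so "unknown" is not conflated with a genuine failure.
    def passed_from_binary_result(score_result: str):
        if score_result == "unknown":  # e.g., _get_binary_result saw a NaN score
            return None
        return score_result == "pass"

    assert passed_from_binary_result("pass") is True
    assert passed_from_binary_result("fail") is False
    assert passed_from_binary_result("unknown") is None
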
Comment thread: sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@m7md7sien m7md7sien merged commit b06269f into main May 15, 2026
25 checks passed
@m7md7sien m7md7sien deleted the mohessie/standardize_output_schema branch May 15, 2026 01:59

Labels

Evaluation Issues related to the client library for Azure AI Evaluation


4 participants