feat: Add judge evaluation support to agent graphs by jsonbailey · Pull Request #188 · launchdarkly/python-server-sdk-ai

jsonbailey · 2026-05-18T20:54:15Z

Summary

Implements AIRUNNER 2.1.3 and GRAPH 1.3.1: AgentGraphRunner populates AgentGraphRunnerResult.eval_requests with per-node input/output pairs (data only). ManagedAgentGraph consumes those requests to fire judge evaluations as a single background asyncio.Task surfaced on ManagedGraphResult.evaluations.
LangGraph runner: emits one EvalRequest per node activation that isn't a functional-tool-loop step (handoff-only responses still emit). Per-run isolation: list is local to run() and flows through the closure (no ContextVar).
OpenAI runner: extracts EvalRequests from result.new_items post-run, pairing each agent's final MessageOutputItem with the prompt that triggered the activation (user input for root, source agent's last message for downstream nodes).
ManagedGraphResult.evaluations is now always an asyncio.Task[List[JudgeResult]]; empty eval_requests resolves immediately to [].
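A minimal sketch of the "single background task" shape described above, not the SDK's actual implementation. `start_evaluations` and the `judge` callable are hypothetical names introduced for illustration, and `JudgeResult` here is a stand-in whose fields are assumed; only `EvalRequest(node_key, input, output)` follows the field list given in this PR.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class EvalRequest:
    # Field names as listed in the PR: node_key, input, output.
    node_key: str
    input: str
    output: str


@dataclass
class JudgeResult:
    # Hypothetical stand-in; the real JudgeResult fields are not shown in this PR.
    node_key: str
    score: float


# Hypothetical judge signature: one request in, one result out.
JudgeFn = Callable[[EvalRequest], Awaitable[JudgeResult]]


def start_evaluations(
    eval_requests: List[EvalRequest], judge: JudgeFn
) -> "asyncio.Task[List[JudgeResult]]":
    """Always return a Task; an empty request list resolves to []."""

    async def _run_all() -> List[JudgeResult]:
        if not eval_requests:
            return []
        # Run every judge call concurrently inside the one background task.
        return list(await asyncio.gather(*(judge(r) for r in eval_requests)))

    # Must be called from within a running event loop.
    return asyncio.create_task(_run_all())
```

Because the caller always receives a Task, it can `await result.evaluations` unconditionally, without first checking whether any judges were configured.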

Context

Jira: AIC-2267.
Re-implements PR #142 ("feat: Add judge evaluation support to agent graphs", merged then reverted) without the in-runner evaluator dispatch or the ContextVar-based task accumulator. PR #156 ("feat: Migrate LangGraph runner to AgentGraphRunnerResult; clean up legacy shape detection") removed the in-runner judge dispatch from LangGraph but never wired the replacement; this PR restores it.

Test plan

make test clean across server-ai (209), server-ai-langchain (96), server-ai-openai (53).
make lint clean across the three packages.
E2e against hello-python-ai/features/create_agent_graph: one $ld:ai:judge:accuracy event fired (metricValue 0.2), attributed to travel-agent-summarizer, matching the printed JudgeResult 1:1. No duplicates. No judge events on nodes without judges.

Implement spec AIRUNNER 2.1.3 and GRAPH 1.3.1. The agent graph runner now captures per-node input/output pairs on AgentGraphRunnerResult.eval_requests without dispatching any judges itself. ManagedAgentGraph consumes those requests to fire judge evaluations as a single background asyncio Task surfaced on ManagedGraphResult.evaluations.

- Add EvalRequest dataclass (node_key, input, output).
- AgentGraphRunnerResult.eval_requests is populated for nodes whose AIAgentConfig has a judge_configuration with at least one judge.
- ManagedGraphResult.evaluations is now always an asyncio Task; when no eval_requests exist it resolves immediately to an empty list.
- LangGraph runner emits one EvalRequest per node activation that is not a functional-tool-loop step. Responses whose only tool calls are handoff tools still emit. Per-run isolation: the eval_requests list is built locally in run() and passed through make_node_fn so concurrent calls do not share state.
- OpenAI runner extracts eval_requests from result.new_items, pairing each agent's final message with the prompt that triggered the activation (user input for the root, source agent's last message for downstream nodes via HandoffOutputItem).

Re-implements PR #142 (merged then reverted) without the in-runner evaluator dispatch or the ContextVar-based task accumulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
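The per-run isolation described in the commit message can be sketched as follows. This is an illustration under assumptions, not the LangGraph runner's code: the signatures of `run` and `make_node_fn`, the fixed two-node chain, and the string-based node output are all hypothetical. The point shown is only that a list created inside `run()` and captured by the node closures is never shared between concurrent calls.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class EvalRequest:
    node_key: str
    input: str
    output: str


def make_node_fn(
    node_key: str, eval_requests: List[EvalRequest]
) -> Callable[[str], Awaitable[str]]:
    """Build a node function that records into the list it was handed."""

    async def node_fn(text: str) -> str:
        await asyncio.sleep(0)  # yield so concurrent runs interleave
        output = f"{node_key}:{text}"  # placeholder for a real model call
        eval_requests.append(EvalRequest(node_key, text, output))
        return output

    return node_fn


async def run(user_input: str) -> List[EvalRequest]:
    # The list is local to this call; no module-level state, no ContextVar.
    eval_requests: List[EvalRequest] = []
    text = user_input
    for key in ("root", "summarizer"):  # hypothetical two-node graph
        text = await make_node_fn(key, eval_requests)(text)
    return eval_requests
```

Running `asyncio.gather(run("a"), run("b"))` returns two separate lists, each holding only the requests from its own run.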

jsonbailey wants to merge 1 commit into main from jb/aic-2267/graph-eval-requests
