
feat: Add judge evaluation support to agent graphs#188

Draft
jsonbailey wants to merge 1 commit into main from jb/aic-2267/graph-eval-requests


@jsonbailey
Contributor

Summary

  • Implements AIRUNNER 2.1.3 and GRAPH 1.3.1: AgentGraphRunner populates AgentGraphRunnerResult.eval_requests with per-node input/output pairs (data only). ManagedAgentGraph consumes those requests to fire judge evaluations as a single background asyncio.Task surfaced on ManagedGraphResult.evaluations.
  • LangGraph runner: emits one EvalRequest per node activation that isn't a functional-tool-loop step (handoff-only responses still emit). Per-run isolation: list is local to run() and flows through the closure (no ContextVar).
  • OpenAI runner: extracts EvalRequests from result.new_items post-run, pairing each agent's final MessageOutputItem with the prompt that triggered the activation (user input for root, source agent's last message for downstream nodes).
  • ManagedGraphResult.evaluations is now always an asyncio.Task[List[JudgeResult]]; empty eval_requests resolves immediately to [].
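For reviewers, a minimal sketch of the managed-layer dispatch described above, not the code in this PR. `EvalRequest` carries the three fields named in the commit message; the `JudgeResult` shape, the `Judge` callable type, the `judges` mapping, and the `start_evaluations` helper are hypothetical stand-ins for illustration only.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class EvalRequest:
    node_key: str
    input: str
    output: str


@dataclass
class JudgeResult:
    # Hypothetical stand-in; the real JudgeResult has its own fields.
    node_key: str
    score: float


# Hypothetical judge signature: one request in, one result out.
Judge = Callable[[EvalRequest], Awaitable[JudgeResult]]


async def _run_judges(
    requests: List[EvalRequest], judges: Dict[str, List[Judge]]
) -> List[JudgeResult]:
    # One coroutine per (request, judge) pair; nodes without judges add none.
    pending = [
        judge(req) for req in requests for judge in judges.get(req.node_key, [])
    ]
    # With nothing pending, gather() yields an empty list.
    return list(await asyncio.gather(*pending))


def start_evaluations(
    requests: List[EvalRequest], judges: Dict[str, List[Judge]]
) -> "asyncio.Task[List[JudgeResult]]":
    # Always returns a Task, even for an empty request list, so callers
    # can await the result uniformly. Requires a running event loop.
    return asyncio.create_task(_run_judges(requests, judges))
```

Returning a Task in both the empty and non-empty cases means callers never need to branch on `None` before awaiting evaluations.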

Context

Test plan

  • make test clean across server-ai (209), server-ai-langchain (96), server-ai-openai (53).
  • make lint clean across the three packages.
  • E2e against hello-python-ai/features/create_agent_graph: one $ld:ai:judge:accuracy event fired (metricValue 0.2), attributed to travel-agent-summarizer, matching the printed JudgeResult 1:1. No duplicates. No judge events on nodes without judges.

Implement spec AIRUNNER 2.1.3 and GRAPH 1.3.1. The agent graph runner
now captures per-node input/output pairs on
AgentGraphRunnerResult.eval_requests without dispatching any judges
itself. ManagedAgentGraph consumes those requests to fire judge
evaluations as a single background asyncio Task surfaced on
ManagedGraphResult.evaluations.

- Add EvalRequest dataclass (node_key, input, output).
- AgentGraphRunnerResult.eval_requests is populated for nodes whose
  AIAgentConfig has a judge_configuration with at least one judge.
- ManagedGraphResult.evaluations is now always an asyncio Task; when
  no eval_requests exist it resolves immediately to an empty list.
- LangGraph runner emits one EvalRequest per node activation that is
  not a functional-tool-loop step. Responses whose only tool calls
  are handoff tools still emit. Per-run isolation: the eval_requests
  list is built locally in run() and passed through make_node_fn so
  concurrent calls do not share state.
- OpenAI runner extracts eval_requests from result.new_items, pairing
  each agent's final message with the prompt that triggered the
  activation (user input for the root, source agent's last message
  for downstream nodes via HandoffOutputItem).
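
The per-run isolation described for the LangGraph runner can be illustrated with a small sketch. This is not the runner's code: the `make_node_fn` signature shown here, the node keys, and the placeholder node body are assumptions. The only point is that the list is created inside `run()` and captured by closure, so concurrent calls cannot share it.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class EvalRequest:
    node_key: str
    input: str
    output: str


def make_node_fn(
    node_key: str, eval_requests: List[EvalRequest]
) -> Callable[[str], Awaitable[str]]:
    # Hypothetical signature: the node closes over the caller's list.
    async def node_fn(text: str) -> str:
        await asyncio.sleep(0)  # yield so concurrent runs can interleave
        output = f"{node_key}:{text}"  # placeholder for the model call
        eval_requests.append(EvalRequest(node_key, text, output))
        return output

    return node_fn


async def run(user_input: str) -> List[EvalRequest]:
    # Built locally per call: no module-level or ContextVar state.
    eval_requests: List[EvalRequest] = []
    nodes = [make_node_fn(key, eval_requests) for key in ("root", "summarizer")]
    text = user_input
    for node in nodes:
        text = await node(text)
    return eval_requests
```

Running `asyncio.gather(run("x"), run("y"))` against this sketch gives each call its own two-entry list, even though the node functions interleave.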

Re-implements PR #142 (merged then reverted) without the in-runner
evaluator dispatch or the ContextVar-based task accumulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>