feat: Add judge evaluation support to agent graphs #188
Draft
jsonbailey wants to merge 1 commit into
Implement spec AIRUNNER 2.1.3 and GRAPH 1.3.1.

The agent graph runner now captures per-node input/output pairs on AgentGraphRunnerResult.eval_requests without dispatching any judges itself. ManagedAgentGraph consumes those requests to fire judge evaluations as a single background asyncio Task surfaced on ManagedGraphResult.evaluations.

- Add EvalRequest dataclass (node_key, input, output).
- AgentGraphRunnerResult.eval_requests is populated for nodes whose AIAgentConfig has a judge_configuration with at least one judge.
- ManagedGraphResult.evaluations is now always an asyncio Task; when no eval_requests exist it resolves immediately to an empty list.
- LangGraph runner emits one EvalRequest per node activation that is not a functional-tool-loop step. Responses whose only tool calls are handoff tools still emit. Per-run isolation: the eval_requests list is built locally in run() and passed through make_node_fn so concurrent calls do not share state.
- OpenAI runner extracts eval_requests from result.new_items, pairing each agent's final message with the prompt that triggered the activation (user input for the root, source agent's last message for downstream nodes via HandoffOutputItem).

Re-implements PR #142 (merged then reverted) without the in-runner evaluator dispatch or the ContextVar-based task accumulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
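For reviewers, here is a minimal sketch of the data-only EvalRequest shape and the "evaluations is always a Task" contract described above. It is not the code in this PR: the field types, the JudgeResult stand-in, and the helpers _judge_one, _judge_all, and start_evaluations are hypothetical names chosen for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRequest:
    """Per-node input/output pair; str field types are an assumption."""
    node_key: str
    input: str
    output: str


@dataclass
class JudgeResult:
    """Hypothetical stand-in for the SDK's judge result type."""
    node_key: str
    score: float


async def _judge_one(request: EvalRequest) -> JudgeResult:
    # Placeholder judge: a real judge would call a model here.
    await asyncio.sleep(0)
    score = 1.0 if request.output else 0.0
    return JudgeResult(node_key=request.node_key, score=score)


async def _judge_all(requests: List[EvalRequest]) -> List[JudgeResult]:
    # gather() preserves request order; with no requests it yields [].
    return list(await asyncio.gather(*(_judge_one(r) for r in requests)))


def start_evaluations(requests: List[EvalRequest]) -> "asyncio.Task[List[JudgeResult]]":
    """Return one background Task, even when there is nothing to judge.

    Must be called while an event loop is running.
    """
    return asyncio.create_task(_judge_all(requests))
```

Because a Task is returned in both cases, a caller can await the evaluations unconditionally instead of first checking for None or an empty list.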
Summary
- `AgentGraphRunner` populates `AgentGraphRunnerResult.eval_requests` with per-node input/output pairs (data only).
- `ManagedAgentGraph` consumes those requests to fire judge evaluations as a single background `asyncio.Task` surfaced on `ManagedGraphResult.evaluations`.
- LangGraph: emits one `EvalRequest` per node activation that isn't a functional-tool-loop step (handoff-only responses still emit). Per-run isolation: the list is local to `run()` and flows through the closure (no ContextVar).
- OpenAI: extracts `EvalRequest`s from `result.new_items` post-run, pairing each agent's final `MessageOutputItem` with the prompt that triggered the activation (user input for root, source agent's last message for downstream nodes).
- `ManagedGraphResult.evaluations` is now always an `asyncio.Task[List[JudgeResult]]`; empty `eval_requests` resolves immediately to `[]`.
Context
Re-implements PR #142 (merged then reverted) without the in-runner evaluator dispatch or the ContextVar-based task accumulator. PR #156 (feat: Migrate LangGraph runner to AgentGraphRunnerResult; clean up legacy shape detection) removed the in-runner judge dispatch from LangGraph but never wired the replacement; that's what this restores.
Test plan
- `make test` clean across `server-ai` (209), `server-ai-langchain` (96), `server-ai-openai` (53).
- `make lint` clean across the three packages.
- `hello-python-ai/features/create_agent_graph`: one `$ld:ai:judge:accuracy` event fired (metricValue 0.2), attributed to `travel-agent-summarizer`, matching the printed `JudgeResult` 1:1. No duplicates. No judge events on nodes without judges.
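The per-run isolation noted in the summary (the request list is local to `run()` and reaches each node through a closure, with no ContextVar) can be sketched roughly as follows. This is an illustration under assumptions, not the runner's code: the `make_node_fn` signature, the tuple shape, and the node names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# (node_key, input, output); a plain tuple stands in for EvalRequest here.
Pair = Tuple[str, str, str]


def make_node_fn(node_key: str, sink: List[Pair]) -> Callable[[str], Awaitable[str]]:
    """Build a node function that records its input/output into `sink`."""
    async def node_fn(text: str) -> str:
        await asyncio.sleep(0)  # yield so concurrent runs interleave
        output = f"{node_key}({text})"
        sink.append((node_key, text, output))
        return output
    return node_fn


async def run(user_input: str) -> List[Pair]:
    # The list is created per call, so concurrent runs never share it.
    eval_requests: List[Pair] = []
    nodes = [make_node_fn(key, eval_requests) for key in ("root", "summarizer")]
    text = user_input
    for node in nodes:
        text = await node(text)
    return eval_requests


async def main() -> List[List[Pair]]:
    # Two overlapping runs; each returns only its own pairs.
    return list(await asyncio.gather(run("a"), run("b")))
```

Passing the list explicitly keeps ownership visible at the call site: each `run()` call returns only the pairs recorded by its own node functions, even when several runs interleave on the same event loop.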