feat(bench): router-backed loop executor - stateful research through the real kernel #188
…un env passthrough research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines. experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it. Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
…the real kernel Research benches run through runLoop + createDynamicDriver (multi-round, analyst-steered) instead of the one-shot RAG pool. A router-backed LoopSandboxClient serves each research shot off-sandbox, so the kernel drives full rounds with search working and no sandbox dependency (the in-box egress allowlist, ops-board 976, is irrelevant to research). Extract runResearchShot into a shared research-shot module so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical retrieve-answer body. Verified: finsearch smoke runs 2 real rounds with the analyst reshaping round-1 prompts and blind held unsteered (clean equal-compute control).
✅ No Blockers
tangletools
left a comment
✅ Approved — 4 non-blocking findings — b6cf2d98
Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T23:28:27Z · immutable trace
# Conflicts: # bench/src/research-gate.mts
tangletools
left a comment
✅ Refreshed approval after new commits — 8c2b7acc
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-07T12:29:11Z
What
Make the research benches run through the real stateful kernel (runLoop + createDynamicDriver): multi-round, analyst-steered, off-sandbox, instead of the one-shot RAG pool.
- router-executor.ts: a router-backed LoopSandboxClient, the "router" cost-dial the one-flow header already names ("backend = the injected LoopSandboxClient (router / local-bridge / sandbox)"). Each streamPrompt = one research shot, off-sandbox. The kernel never branches on backend kind, so this drops in and the full loop (rounds + steering) runs with search working and no sandbox; the in-box egress allowlist (ops-board #976) is irrelevant to research.
- research-shot.ts: extracts the retrieve→answer body (runResearchShot) into a shared primitive, so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical body.
- research-loop.mts: the stateful runner, runExperiment with blind / analyst-steered / aggressive arms over ROUNDS.
Why
The provider leaderboard was one-shot RAG (K=1, no rounds, no resume) — it silently abandoned the stateful loop that's proven on EOPS/commit0. Research is retrieval, not in-box code execution, so it never needed a box; routing the executor through the router lets the same kernel drive it. This is the kernel's own interface (the anticipated LeafExecutor), not a shim.
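To make the "kernel never branches on backend kind" point concrete, here is a minimal sketch of an injected client whose streamPrompt call is one off-sandbox research shot. It is an illustration under stated assumptions, not the code in this PR: the real LoopSandboxClient and runResearchShot signatures are not shown here, so every name ending in Sketch, plus ShotResult, SearchFn and ChatFn, is hypothetical.

```typescript
// Hypothetical shapes: stand-ins for the kernel's real interface, which this
// description does not spell out.
interface ShotResult {
  answer: string;
  searches: number; // number of search calls made during the shot
}

interface LoopSandboxClientSketch {
  streamPrompt(prompt: string): Promise<ShotResult>;
}

// Injected dependencies (assumed): a search call and a chat call over the router.
type SearchFn = (query: string) => Promise<string[]>;
type ChatFn = (prompt: string, context: string[]) => Promise<string>;

// Shared retrieve -> answer body, so a flat worker and a loop score the same code.
async function runResearchShotSketch(
  prompt: string,
  search: SearchFn,
  chat: ChatFn,
): Promise<ShotResult> {
  const docs = await search(prompt);
  const answer = await chat(prompt, docs);
  return { answer, searches: 1 };
}

// Router-backed client: each streamPrompt call runs one research shot, no sandbox.
function createRouterClientSketch(
  search: SearchFn,
  chat: ChatFn,
): LoopSandboxClientSketch {
  return {
    streamPrompt: (prompt: string) => runResearchShotSketch(prompt, search, chat),
  };
}

// The loop depends only on the interface, so any backend can be injected.
async function runRoundsSketch(
  client: LoopSandboxClientSketch,
  prompt: string,
  rounds: number,
): Promise<ShotResult[]> {
  const results: ShotResult[] = [];
  for (let round = 0; round < rounds; round++) {
    results.push(await client.streamPrompt(prompt));
  }
  return results;
}
```

Because the loop takes the client as a parameter, swapping the router-backed client for a sandbox-backed one would not change runRoundsSketch at all, which is the property the description relies on.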
Verification
searches=1/round confirms router search firing off-sandbox. The steering-vs-blind delta on research is now the experiment to run at real n; this PR delivers the mechanism.
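The blind-versus-steered comparison can be sketched as below, as a rough illustration of an equal-compute control. All names (Arm, runArmSketch, the shot and steer callbacks) are hypothetical, the real runner is asynchronous and goes through runExperiment, and the aggressive arm is left out because this description does not define it.

```typescript
// Hypothetical sketch: both arms run the same number of rounds; only the
// steered arm lets an analyst callback reshape the next round's prompt.
type Arm = "blind" | "steered";

interface RoundTrace {
  round: number;
  prompt: string;
  answer: string;
}

function runArmSketch(
  arm: Arm,
  basePrompt: string,
  rounds: number,
  shot: (prompt: string) => string, // stand-in for one research shot
  steer: (prompt: string, lastAnswer: string) => string, // stand-in analyst
): RoundTrace[] {
  const trace: RoundTrace[] = [];
  let prompt = basePrompt;
  for (let round = 0; round < rounds; round++) {
    const answer = shot(prompt);
    trace.push({ round, prompt, answer });
    // The blind arm keeps the original prompt, so it spends the same number
    // of shots without any steering signal.
    if (arm === "steered") {
      prompt = steer(prompt, answer);
    }
  }
  return trace;
}
```

Holding the round count fixed across arms means any score difference is attributable to steering rather than to extra shots.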
Stacking
Stacked on #186 (base feat/research-leaderboard); it refactors that PR's research-gate.mts onto the shared research-shot. Merge #186 first.