feat(bench): router-backed loop executor — stateful research through the real kernel by drewstone · Pull Request #188 · tangle-network/agent-runtime

drewstone · 2026-06-06T23:16:17Z

What

Make the research benches run through the real stateful kernel (runLoop + createDynamicDriver) — multi-round, analyst-steered, off-sandbox — instead of the one-shot RAG pool.

router-executor.ts — a router-backed LoopSandboxClient, the "router" cost-dial the one-flow header already names ("backend = the injected LoopSandboxClient (router / local-bridge / sandbox)"). Each streamPrompt = one research shot, off-sandbox. The kernel never branches on backend kind, so this drops in and the full loop (rounds + steering) runs with search working and no sandbox — the in-box egress allowlist (ops-board #976) is irrelevant to research.
research-shot.ts — extracts the retrieve→answer body (runResearchShot) into a shared primitive, so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical body.
research-loop.mts — the stateful runner: runExperiment with blind / analyst-steered / aggressive arms over ROUNDS.
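
The shape described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in router-executor.ts: the `ShotResult`, `LoopEvent`, and `LoopBox` types, the `createRouterBox` helper, and the `"result"` event type are assumptions made for the example. Only the `router-research-N` id pattern and the `finalText` / `success` / `searches` fields are taken from the review on this PR.

```typescript
// Hypothetical types for illustration; the real LoopSandboxClient lives in the runtime.
interface ShotResult {
  finalText: string;
  success: boolean;
  searches: number;
}

// One research shot (retrieve, then answer), injected so the box stays backend-agnostic.
type ResearchShot = (prompt: string, taskId: string) => Promise<ShotResult>;

interface LoopEvent {
  type: "result"; // assumed event type
  data: ShotResult;
}

interface LoopBox {
  id: string;
  streamPrompt(message: string): AsyncGenerator<LoopEvent>;
  delete(): Promise<void>;
}

// Each streamPrompt call runs exactly one shot off-sandbox and emits one event.
function createRouterBox(index: number, shot: ResearchShot): LoopBox {
  const id = `router-research-${index}`;
  return {
    id,
    async *streamPrompt(message: string): AsyncGenerator<LoopEvent> {
      const data = await shot(message, id);
      yield { type: "result", data };
    },
    async delete(): Promise<void> {
      // Nothing to tear down: no sandbox was ever created.
    },
  };
}

// Usage with a stubbed shot (no network involved).
const stubShot: ResearchShot = async (prompt) => ({
  finalText: `answer to: ${prompt}`,
  success: true,
  searches: 1,
});

// Drain one streamPrompt call into an array of events.
async function collect(box: LoopBox, message: string): Promise<LoopEvent[]> {
  const events: LoopEvent[] = [];
  for await (const ev of box.streamPrompt(message)) events.push(ev);
  return events;
}
```

Because the caller only ever sees `id`, `streamPrompt`, and `delete`, a loop written against this shape would not need to know whether a sandbox, a local bridge, or the router sits behind it.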

Why

The provider leaderboard was one-shot RAG (K=1, no rounds, no resume) — it silently abandoned the stateful loop that's proven on EOPS/commit0. Research is retrieval, not in-box code execution, so it never needed a box; routing the executor through the router lets the same kernel drive it. This is the kernel's own interface (the anticipated LeafExecutor), not a shim.

Verification

All four files tsc-clean (full bench tsc: 0 errors).
SimpleQA smoke (real kernel, 3 arms × rounds): search-backed answers resolve; the loop adaptively stops at round 0 on success — correct depth behavior, and it keeps blind and the steered arms compute-matched.
finsearch smoke (decisive) — all records run 2 real rounds (round 0 fails → loop continues). The analyst arm reshapes round-1 prompts (312→533 chars; findings spliced in), aggressive-push reshapes (312→452, 641→781), and blind stays byte-identical across rounds — the clean unsteered equal-compute control. searches=1/round confirms router search firing off-sandbox.
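
The control property reported here (blind byte-identical across rounds, steered arms reshaped) is the kind of thing a small check over recorded prompts could assert. A sketch, assuming a hypothetical `RoundRecord` shape and helper names; the prompts are synthetic strings sized to match the character counts quoted above, not real bench output.

```typescript
// Hypothetical record of the prompt each arm saw in each round.
interface RoundRecord {
  arm: string;
  round: number;
  prompt: string;
}

// Prompt lengths for one arm, ordered by round.
function promptLengths(records: RoundRecord[], arm: string): number[] {
  return records
    .filter((r) => r.arm === arm)
    .sort((a, b) => a.round - b.round)
    .map((r) => r.prompt.length);
}

// True when every round of the arm saw exactly the same prompt text.
function isUnsteered(records: RoundRecord[], arm: string): boolean {
  const prompts = records.filter((r) => r.arm === arm).map((r) => r.prompt);
  return prompts.every((p) => p === prompts[0]);
}

// Synthetic data: a 312-char base prompt, with 221 chars of "findings"
// spliced into the analyst arm's round-1 prompt (312 + 221 = 533).
const base = "q".repeat(312);
const records: RoundRecord[] = [
  { arm: "blind", round: 0, prompt: base },
  { arm: "blind", round: 1, prompt: base },
  { arm: "analyst", round: 0, prompt: base },
  { arm: "analyst", round: 1, prompt: base + "f".repeat(221) },
];

console.log(isUnsteered(records, "blind"));     // true
console.log(isUnsteered(records, "analyst"));   // false
console.log(promptLengths(records, "analyst")); // [ 312, 533 ]
```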

The steering-vs-blind delta on research is now the experiment to run at real n; this PR delivers the mechanism.

Stacking

Stacked on #186 (base feat/research-leaderboard) — it refactors that PR's research-gate.mts onto the shared research-shot. Merge #186 first.

…un env passthrough

research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.

experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.

Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.

…the real kernel

Research benches run through runLoop + createDynamicDriver (multi-round, analyst-steered) instead of the one-shot RAG pool. A router-backed LoopSandboxClient serves each research shot off-sandbox, so the kernel drives full rounds with search working and no sandbox dependency (the in-box egress allowlist, ops-board 976, is irrelevant to research).

Extract runResearchShot into a shared research-shot module so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical retrieve-answer body.

Verified: finsearch smoke runs 2 real rounds with the analyst reshaping round-1 prompts and blind held unsteered (clean equal-compute control).

tangletools · 2026-06-06T23:28:29Z

✅ No Blockers — `b6cf2d98`

Readiness 83/100 · Confidence 65/100 · 4 findings (4 low)

deepseek: Correctness 83 · Security 83 · Testing 83 · Architecture 83

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟡 LOW No tests for the 4 changed bench files — bench/src/research-shot.ts

None of the 4 files in this shot (router-executor.ts, research-shot.ts, research-gate.mts, research-loop.mts) have dedicated tests. The closest coverage is experiment.test.mts which exercises runExperiment with a mock LoopSandboxClient — this indirectly covers the kernel path but doesn't test runResearchShot's search/answer logic, the gate's Phase 1/Phase 2 pipeline, or the router-executor's event shape. Risk: behavioral regressions in runResearchShot (the shared primitive) would only be caught by manual bench runs.
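
One way to close this gap without network access would be to inject the search and answer calls and test the retrieve-then-answer body against stubs. The sketch below uses a hypothetical stand-in, `retrieveThenAnswer`, with assumed `SearchFn` / `AnswerFn` types and an assumed success rule; it does not reproduce the real `runResearchShot` signature or its scoring logic.

```typescript
// Hypothetical result shape; field names follow the ones mentioned in the review.
interface Shot {
  taskId: string;
  finalText: string;
  success: boolean;
  searches: number;
}

// Injected dependencies (assumptions): a search call and an answer call.
type SearchFn = (query: string) => Promise<string[]>;
type AnswerFn = (question: string, context: string[]) => Promise<string>;

// Stand-in for the shared retrieve-then-answer body.
async function retrieveThenAnswer(
  question: string,
  taskId: string,
  search: SearchFn,
  answer: AnswerFn,
): Promise<Shot> {
  const snippets = await search(question);
  const finalText = await answer(question, snippets);
  // Assumed success rule for the sketch: any non-empty answer counts.
  return { taskId, finalText, success: finalText.trim().length > 0, searches: 1 };
}

// Stubs record what they were asked and return canned data.
const searchCalls: string[] = [];
const stubSearch: SearchFn = async (query) => {
  searchCalls.push(query);
  return ["Paris is the capital of France."];
};
const stubAnswer: AnswerFn = async (_question, context) =>
  context.length > 0 ? "Paris" : "";

// A test body: run one shot against the stubs and return it for assertions.
async function runStubbedShot(): Promise<Shot> {
  return retrieveThenAnswer("capital of France?", "task-1", stubSearch, stubAnswer);
}
```

A test written this way would pin the ordering (search before answer), the pass-through of `taskId`, and the emitted fields, which are the behaviors the finding says are currently covered only by manual bench runs.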

🟡 LOW Token usage from routerChatWithUsage dropped; kernel/lab sees zero usage — bench/src/research-shot.ts

runResearchShot (line 112) destructures only content from routerChatWithUsage(), dropping the usage and costUsd fields. The Shot type (line 28) has no usage fields. router-executor.ts lines 32-34 emit only finalText, success, searches — no done event or data.usage. The kernel's extractLlmCallEvent (src/runtime/sandbox-events.ts:46) looks for type === 'result' events with `data.usa

🟡 LOW Shot.taskId set to box ID (router-research-N) rather than benchmark task ID in kernel path — bench/src/router-executor.ts

Line 31: runResearchShot(message, id, 0, cfg) uses the sandbox instance ID (router-research-0, etc.) as taskId. For the research-loop.mts kernel path this is harmless — the kernel tracks tasks by benchmark instanceId, not the Shot's internal taskId. But any debug logging or trace that reads Shot.taskId will see box IDs instead of benchmark task IDs, making it harder to correlate shots to tasks during troubleshooting. The research-gate.mts path correctly passes u.task.id (the benchmark task ID).

🟡 LOW as unknown as SandboxInstance cast skips type-checking on missing fields — bench/src/router-executor.ts

Line 38: as unknown as SandboxInstance suppresses TS checking. The returned object has only id, streamPrompt, delete but SandboxInstance from @tangle-network/sandbox likely requires status, events(), refresh(), sendCommand(), etc. The existing bench test mock (experiment.test.mts line 17) uses Promise<any> for the same reason — an established pattern. The kernel only calls id, streamPrompt, delete on the box, so this is safe TODAY. Risk: if the kernel or runExperiment ever calls status or refresh()
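
A narrower alternative to the double cast, sketched under assumptions: declare the subset of the box the kernel actually uses and let the compiler check it. `SandboxInstance` below is a local stand-in (the real type comes from @tangle-network/sandbox, and the review itself only guesses at its members), and `KernelBox`, `makeRouterBox`, and `boxMembers` are hypothetical names.

```typescript
// Local stand-in for the SDK type; the member list is an assumption.
interface SandboxInstance {
  id: string;
  status: string;
  streamPrompt(message: string): AsyncIterable<unknown>;
  delete(): Promise<void>;
  refresh(): Promise<void>;
}

// Only the members the review says the kernel calls today.
type KernelBox = Pick<SandboxInstance, "id" | "streamPrompt" | "delete">;

// The returned object is checked against KernelBox; no cast is needed.
function makeRouterBox(id: string): KernelBox {
  return {
    id,
    async *streamPrompt(message: string) {
      yield { finalText: message, success: true, searches: 0 };
    },
    async delete() {
      // No sandbox to tear down.
    },
  };
}

// A consumer typed against KernelBox cannot reach for status or refresh():
// doing so would be a compile error rather than a runtime surprise.
function boxMembers(box: KernelBox): string[] {
  return Object.keys(box).sort();
}

console.log(boxMembers(makeRouterBox("router-research-0")));
// [ 'delete', 'id', 'streamPrompt' ]
```

This only helps if the kernel's own parameter type can be narrowed the same way; where it demands the full `SandboxInstance`, a cast at the boundary would still be required.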

_{tangletools · 2026-06-06T23:28:27Z · trace}

tangletools

✅ Approved — 4 non-blocking findings — `b6cf2d98`

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

_{tangletools · 2026-06-06T23:28:27Z · immutable trace}

The base branch was changed.

# Conflicts: # bench/src/research-gate.mts

tangletools

✅ Refreshed approval after new commits — `8c2b7acc`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-07T12:29:11Z}

drewstone added 2 commits June 6, 2026 16:36

tangletools previously approved these changes Jun 6, 2026

drewstone changed the base branch from feat/research-leaderboard to main June 7, 2026 12:25

Merge main into research-loop-executor (resolve squash-orphan of #186)

8c2b7ac

# Conflicts: # bench/src/research-gate.mts

tangletools approved these changes Jun 7, 2026

drewstone merged commit d8e1032 into main Jun 7, 2026
1 check passed

drewstone deleted the feat/research-loop-executor branch June 7, 2026 12:30

drewstone merged 3 commits into main from feat/research-loop-executor
