feat(bench): router-backed loop executor — stateful research through the real kernel #188

Merged
drewstone merged 3 commits into main from feat/research-loop-executor on Jun 7, 2026

@drewstone
Contributor

What

Make the research benches run through the real stateful kernel (runLoop + createDynamicDriver) — multi-round, analyst-steered, off-sandbox — instead of the one-shot RAG pool.

  • router-executor.ts — a router-backed LoopSandboxClient, the "router" cost-dial the one-flow header already names ("backend = the injected LoopSandboxClient (router / local-bridge / sandbox)"). Each streamPrompt = one research shot, off-sandbox. The kernel never branches on backend kind, so this drops in and the full loop (rounds + steering) runs with search working and no sandbox — the in-box egress allowlist (ops-board #976) is irrelevant to research.
  • research-shot.ts — extracts the retrieve→answer body (runResearchShot) into a shared primitive, so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical body.
  • research-loop.mts — the stateful runner: runExperiment with blind / analyst-steered / aggressive arms over ROUNDS.
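A minimal sketch of what a router-backed client of this shape might look like. Every type here (`ShotConfig`, `ShotResult`, `ShotEvent`, `RouterBox`) and the `create` method name are hypothetical stand-ins, not the real `LoopSandboxClient` or `SandboxInstance` definitions; the shot function is injected so the sketch runs without a router or network.

```typescript
// Hypothetical shapes for illustration only; the real LoopSandboxClient and
// SandboxInstance types are defined by the kernel and its sandbox package.
interface ShotConfig { model: string; provider: string }
interface ShotResult { finalText: string; success: boolean; searches: number }
type ShotEvent = { type: "result"; data: ShotResult };

// One retrieve-then-answer shot. Injected so the sketch needs no network.
type ShotFn = (
  prompt: string,
  taskId: string,
  round: number,
  cfg: ShotConfig,
) => Promise<ShotResult>;

interface RouterBox {
  id: string;
  streamPrompt(message: string): AsyncIterable<ShotEvent>;
  delete(): Promise<void>;
}

// Builds a client whose "boxes" are plain objects: no sandbox is provisioned,
// and each streamPrompt call runs exactly one research shot.
function createRouterClient(cfg: ShotConfig, shot: ShotFn) {
  let next = 0;
  return {
    async create(): Promise<RouterBox> {
      const id = `router-research-${next++}`;
      return {
        id,
        async *streamPrompt(message: string) {
          // The box id doubles as the shot's task id in this sketch.
          const data = await shot(message, id, 0, cfg);
          yield { type: "result" as const, data };
        },
        async delete() {
          // Nothing to tear down when running off-sandbox.
        },
      };
    },
  };
}
```

Because the caller only ever touches `id`, `streamPrompt`, and `delete`, a loop written against that surface cannot tell this object from a sandbox-backed one, which is the property the description relies on.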

Why

The provider leaderboard was one-shot RAG (K=1, no rounds, no resume) — it silently abandoned the stateful loop that's proven on EOPS/commit0. Research is retrieval, not in-box code execution, so it never needed a box; routing the executor through the router lets the same kernel drive it. This is the kernel's own interface (the anticipated LeafExecutor), not a shim.

Verification

  • All four files tsc-clean (full bench tsc: 0 errors).
  • SimpleQA smoke (real kernel, 3 arms × rounds): search-backed answers resolve; the loop adaptively stops at round 0 on success — correct depth behavior, and it keeps blind and the steered arms compute-matched.
  • finsearch smoke (decisive) — all records run 2 real rounds (round 0 fails → loop continues). The analyst arm reshapes round-1 prompts (312→533 chars; findings spliced in), aggressive-push reshapes (312→452, 641→781), and blind stays byte-identical across rounds — the clean unsteered equal-compute control. searches=1/round confirms router search firing off-sandbox.
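The equal-compute control claim (blind prompts unchanged, steered prompts reshaped) can be checked mechanically. The helper below is a hypothetical sketch, not part of the bench; it assumes the per-round prompts for one arm are available as an array of strings, and the example prompts are synthetic rather than taken from the smoke run.

```typescript
// Hypothetical helper: given the prompts one arm sent per round, report
// whether they are identical across rounds and how the length changed.
interface PromptDrift { identical: boolean; lengths: number[] }

function promptDrift(promptsByRound: string[]): PromptDrift {
  const first = promptsByRound[0];
  return {
    // Strict string equality implies identical encoded bytes.
    identical: promptsByRound.every((p) => p === first),
    lengths: promptsByRound.map((p) => p.length),
  };
}

// Synthetic prompts for illustration only.
const blind = promptDrift(["Find the figure.", "Find the figure."]);
const steered = promptDrift([
  "Find the figure.",
  "Find the figure. Prior finding: none matched.",
]);
console.log(blind.identical, steered.identical, steered.lengths);
```

A blind arm should report `identical: true`, while a steered arm should report `identical: false` with growing lengths once findings are spliced in.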

The steering-vs-blind delta on research is now the experiment to run at real n; this PR delivers the mechanism.

Stacking

Stacked on #186 (base feat/research-leaderboard) — it refactors that PR's research-gate.mts onto the shared research-shot. Merge #186 first.

drewstone added 2 commits June 6, 2026 16:36
…un env passthrough

research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.

experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.

Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
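As an illustration of the env passthrough described in this commit, here is a minimal sketch of layering caller-supplied variables onto a base `OPENAI_*` environment. `buildBoxEnv` is a hypothetical name, and both the merge order (caller wins on collisions) and the dropping of undefined values are assumptions, not confirmed behavior of `sandboxAgentRun`.

```typescript
// Hypothetical sketch: merge optional passthrough variables onto the base
// environment. Assumes caller entries override base entries and that
// undefined values are skipped rather than forwarded.
function buildBoxEnv(
  base: Record<string, string>,
  passthrough: Record<string, string | undefined> = {},
): Record<string, string> {
  const env: Record<string, string> = { ...base };
  for (const [key, value] of Object.entries(passthrough)) {
    if (value !== undefined) env[key] = value;
  }
  return env;
}

// Placeholder values; the provider name here is illustrative only.
const boxEnv = buildBoxEnv(
  { OPENAI_API_KEY: "placeholder-key" },
  { TANGLE_SEARCH_DEFAULT_PROVIDER: "exa", EXA_API_KEY: undefined },
);
console.log(Object.keys(boxEnv));
```

Skipping undefined entries keeps an unset key on the host from reaching the box as an empty variable.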
…the real kernel

Research benches run through runLoop + createDynamicDriver (multi-round, analyst-steered) instead of the one-shot RAG pool. A router-backed LoopSandboxClient serves each research shot off-sandbox, so the kernel drives full rounds with search working and no sandbox dependency (the in-box egress allowlist, ops-board 976, is irrelevant to research). Extract runResearchShot into a shared research-shot module so the flat RAG worker (research-gate) and the kernel loop (research-loop) score the identical retrieve-answer body. Verified: finsearch smoke runs 2 real rounds with the analyst reshaping round-1 prompts and blind held unsteered (clean equal-compute control).
@tangletools
Contributor

✅ No Blockers — b6cf2d98

Readiness 83/100 · Confidence 65/100 · 4 findings (4 low)

deepseek: Correctness 83 · Security 83 · Testing 83 · Architecture 83

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟡 LOW No tests for the 4 changed bench files — bench/src/research-shot.ts

None of the 4 files in this shot (router-executor.ts, research-shot.ts, research-gate.mts, research-loop.mts) have dedicated tests. The closest coverage is experiment.test.mts which exercises runExperiment with a mock LoopSandboxClient — this indirectly covers the kernel path but doesn't test runResearchShot's search/answer logic, the gate's Phase 1/Phase 2 pipeline, or the router-executor's event shape. Risk: behavioral regressions in runResearchShot (the shared primitive) would only be caught by manual bench runs.

🟡 LOW Token usage from routerChatWithUsage dropped; kernel/lab sees zero usage — bench/src/research-shot.ts

runResearchShot (line 112) destructures only content from routerChatWithUsage(), dropping the usage and costUsd fields. The Shot type (line 28) has no usage fields. router-executor.ts line 32-34 emits only finalText, success, searches — no done event or data.usage. The kernel's extractLlmCallEvent (src/runtime/sandbox-events.ts:46) looks for type === 'result' events with `data.usa

🟡 LOW Shot.taskId set to box ID (router-research-N) rather than benchmark task ID in kernel path — bench/src/router-executor.ts

Line 31: runResearchShot(message, id, 0, cfg) uses the sandbox instance ID (router-research-0, etc.) as taskId. For the research-loop.mts kernel path this is harmless — the kernel tracks tasks by benchmark instanceId, not the Shot's internal taskId. But any debug logging or trace that reads Shot.taskId will see box IDs instead of benchmark task IDs, making it harder to correlate shots to tasks during troubleshooting. The research-gate.mts path correctly passes u.task.id (the benchmark task ID).

🟡 LOW as unknown as SandboxInstance cast skips type-checking on missing fields — bench/src/router-executor.ts

Line 38: as unknown as SandboxInstance suppresses TS checking. The returned object has only id, streamPrompt, delete but SandboxInstance from @tangle-network/sandbox likely requires status, events(), refresh(), sendCommand(), etc. The existing bench test mock (experiment.test.mts line 17) uses Promise<any> for the same reason — an established pattern. The kernel only calls id, streamPrompt, delete on the box, so this is safe TODAY. Risk: if the kernel or runExperiment ever calls status or refresh()


tangletools · 2026-06-06T23:28:27Z · trace

tangletools previously approved these changes Jun 6, 2026
Contributor

tangletools left a comment


✅ Approved — 4 non-blocking findings — b6cf2d98

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T23:28:27Z · immutable trace

drewstone changed the base branch from feat/research-leaderboard to main June 7, 2026 12:25
drewstone dismissed tangletools’s stale review June 7, 2026 12:25

The base branch was changed.

Contributor

tangletools left a comment


✅ Refreshed approval after new commits — 8c2b7acc

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-07T12:29:11Z

drewstone merged commit d8e1032 into main Jun 7, 2026
1 check passed
drewstone deleted the feat/research-loop-executor branch June 7, 2026 12:30

2 participants