Eval harness: bench-runner + context-sync + bulk-indexer (+CI) by adithyn7 · Pull Request #132 · TransformerOptimus/SuperCoder

adithyn7 · 2026-06-05T13:26:08Z

What

Adds the SuperCoder side of the L1/L2 eval harness — the atomic unit that runs the agent against one task — plus the CI/release plumbing to ship it.

New crates

crates/bench-runner — headless single-task runner: reads a task spec (--task-file/stdin), runs the agent, emits a structured JSON result (patch, turns, tokens, tool calls). Ships as a static x86_64-musl binary so it runs toolchain-free inside SWE-bench task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent → grep/glob (OFF). Patch = git diff <base_commit> (tracked files only, so build artifacts can't pollute it).
crates/context-sync — the index-streaming client extracted byte-for-byte from the desktop app's context_watcher (streamer.rs + ignore_filter.rs) into a lib, so the offline indexer can reuse it without the Tauri/webkit deps. The 45 unit tests moved with it. Desktop now imports from context_sync.
crates/bulk-indexer — offline native host tool that indexes one repo checkout into the context-engine via context_sync::Streamer::full_sync (not musl; never enters a container).
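The ON/OFF switch described for bench-runner can be sketched as below. This is a minimal illustration, not the crate's code: the tool names come from the description above, while `SearchPath`, `select_path`, and `search_tools` are hypothetical. Treating an empty URL as OFF, and listing only the context-engine tools on the ON path, are both assumptions.

```rust
/// Which search path the agent runs on. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum SearchPath {
    /// --context-engine-url was supplied: query the context engine.
    On { engine_url: String },
    /// Flag absent: fall back to local text search.
    Off,
}

/// Pick the path from the optional --context-engine-url value.
/// Assumption: a blank value is treated the same as an absent flag.
fn select_path(context_engine_url: Option<&str>) -> SearchPath {
    match context_engine_url.map(str::trim) {
        Some(url) if !url.is_empty() => SearchPath::On {
            engine_url: url.to_string(),
        },
        _ => SearchPath::Off,
    }
}

/// Search tools exposed on each path (names taken from the PR text).
fn search_tools(path: &SearchPath) -> &'static [&'static str] {
    match path {
        SearchPath::On { .. } => &["codebase_search", "codebase_graph"],
        SearchPath::Off => &["grep", "glob"],
    }
}

fn main() {
    // The URL below is a placeholder, not a real endpoint.
    let on = select_path(Some("http://localhost:8080"));
    let off = select_path(None);
    println!("{:?} -> {:?}", on, search_tools(&on));
    println!("{:?} -> {:?}", off, search_tools(&off));
}
```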

reqwest → rustls (R1)

All reqwest deps (agent, git-ops, context-sync, desktop) switch to default-features=false + rustls-tls (no native-tls/OpenSSL). Feature unification across the single workspace lockfile means they must agree, or musl static linking breaks — guard comments mark each line.

CI / release

ci.yml: new bench-runner-musl job builds -p bench-runner for musl (native rustup + musl-tools) and asserts the binary is statically linked + self-contained — this is the only thing that catches a native-tls regression (cargo check can't). Also adds cargo test --workspace.
release.yml: new bench-runner job uploads bench-runner-x86_64-unknown-linux-musl as a release asset.

Verified locally

context-sync 45 tests green; desktop test suite green; bench-runner ON + OFF verified end-to-end inside a clean python:3.11 container with the static musl binary (agent uses codebase_search on the ON path; patch applies and pytest passes).
Native-musl CI build path de-risked in a clean ubuntu:22.04 (ring/rustls link; binary runs standalone).
go build/vet ./services/... and the desktop frontend build both pass (untouched areas).

Use default-features=false + rustls-tls (drop native-tls/OpenSSL) so the workspace links cleanly for the static x86_64-unknown-linux-musl bench-runner. Workspace feature unification means every reqwest dep must agree, so the desktop crate follows in the next commit. Guard comments mark each line.

Move the index-streaming client out of the desktop app into crates/context-sync so the offline bulk-indexer can reuse it without the Tauri/webkit deps. Byte-for-byte move (45 tests come along); desktop context_watcher now imports from context_sync. Desktop reqwest also moves to rustls-tls here (R1).

New crate: runs the SuperCoder agent on one task spec and emits a structured JSON result (patch, turns, tokens, tool calls). Static musl x86_64 binary for toolchain-free task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent falls back to grep/glob. Patch = git diff <base_commit> (tracked files only).
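A rough sketch of how the patch could be captured with `git diff <base_commit>`. The helper names `git_diff_args` and `capture_patch` are hypothetical, and the real crate may pass extra flags or handle errors differently. Plain `git diff <commit>` compares the working tree against that commit and skips untracked files, which is what keeps untracked build artifacts out of the patch.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Arguments for `git diff <base_commit>` (hypothetical helper).
fn git_diff_args(base_commit: &str) -> Vec<String> {
    vec!["diff".to_string(), base_commit.to_string()]
}

/// Run the diff inside `repo` and return the patch text.
/// Untracked files never appear in this output.
fn capture_patch(repo: &Path, base_commit: &str) -> io::Result<String> {
    let output = Command::new("git")
        .current_dir(repo)
        .args(git_diff_args(base_commit))
        .output()?;
    if !output.status.success() {
        // Surface git's own error message to the caller.
        let msg = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, msg));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() {
    // Usage: <program> <repo-dir> <base-commit>
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: <program> <repo-dir> <base-commit>");
        return;
    }
    match capture_patch(Path::new(&args[1]), &args[2]) {
        Ok(patch) => print!("{}", patch),
        Err(err) => eprintln!("git diff failed: {}", err),
    }
}
```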

New native (non-musl) host tool: indexes one repo checkout into the context-engine under a given key via context_sync::Streamer::full_sync. Generic and manifest-agnostic; the Python eval harness drives it per instance. Appends /api/v1 to the engine URL to match the streaming index routes.
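The `/api/v1` suffixing might look roughly like this. `api_base` is a hypothetical name; trimming trailing slashes and leaving an already-suffixed URL unchanged are assumptions, not behavior confirmed by the commit.

```rust
/// Build the base URL for the streaming index routes (hypothetical helper).
/// Assumptions: trailing slashes are dropped, and a URL that already ends
/// in /api/v1 is returned as-is rather than suffixed twice.
fn api_base(engine_url: &str) -> String {
    let trimmed = engine_url.trim_end_matches('/');
    if trimmed.ends_with("/api/v1") {
        trimmed.to_string()
    } else {
        format!("{}/api/v1", trimmed)
    }
}

fn main() {
    // Placeholder host names, used only to show the three input shapes.
    for url in ["http://engine:8080", "http://engine:8080/", "http://engine:8080/api/v1"] {
        println!("{} -> {}", url, api_base(url));
    }
}
```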

CI: add a bench-runner-musl job that builds -p bench-runner for x86_64-unknown-linux-musl (native rustup + musl-tools) and asserts it is statically linked and self-contained — guards the rustls-only invariant that cargo check can't catch. Also run cargo test --workspace. Release: add a bench-runner job that uploads the static binary as a release asset.

Rust defaults x86_64-unknown-linux-musl to static-PIE, which kept a PT_INTERP to /lib/ld-musl and segfaulted at startup on the native CI runner (local emulation masked it). relocation-model=static yields a classic non-PIE fully static binary that runs on hosts without a musl loader.
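How the CI job asserts static linking is not shown here. One way to check for the PT_INTERP segment mentioned above is to walk the ELF program headers directly. The sketch below handles only little-endian ELF64 images (which covers x86_64) and uses the standard ELF64 header offsets; `read_le` and `has_interp` are hypothetical helpers, not code from this PR.

```rust
use std::fs;

/// Program header type for the dynamic loader path (from the ELF spec).
const PT_INTERP: u64 = 3;

/// Read `len` (at most 8) little-endian bytes at `off` as an integer.
fn read_le(image: &[u8], off: usize, len: usize) -> Option<u64> {
    let bytes = image.get(off..off.checked_add(len)?)?;
    // Fold from the last byte down so byte 0 ends up least significant.
    Some(bytes.iter().rev().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)))
}

/// Some(true) if the image has a PT_INTERP segment, Some(false) if not,
/// None if it is not a well-formed little-endian ELF64 image.
fn has_interp(image: &[u8]) -> Option<bool> {
    // e_ident: magic, EI_CLASS = 2 (64-bit), EI_DATA = 1 (little-endian).
    if !image.starts_with(b"\x7fELF") || image.get(4) != Some(&2) || image.get(5) != Some(&1) {
        return None;
    }
    let phoff = read_le(image, 32, 8)?; // e_phoff
    let phentsize = read_le(image, 54, 2)?; // e_phentsize
    let phnum = read_le(image, 56, 2)?; // e_phnum
    for i in 0..phnum {
        let off = phoff.checked_add(i * phentsize)? as usize;
        // p_type is the first 4 bytes of each program header.
        if read_le(image, off, 4)? == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}

fn main() {
    // Usage: <program> <path-to-binary>
    let path = match std::env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: <program> <path-to-binary>");
            return;
        }
    };
    match fs::read(&path).ok().and_then(|image| has_interp(&image)) {
        Some(true) => println!("{}: has PT_INTERP (needs a loader)", path),
        Some(false) => println!("{}: no PT_INTERP", path),
        None => println!("{}: unreadable or not a little-endian ELF64 file", path),
    }
}
```

A binary built with relocation-model=static would be expected to report no PT_INTERP under this check.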

A session's status is idle on creation and only becomes active while a turn is running (set on run start, cleared on run end; see agent_create_session). The test wrongly expected a brand-new session to be active. Assert create -> idle, then idle -> active -> idle drives active_session_for_folder.
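The lifecycle this test fix relies on can be modeled with a small state machine. The sketch is illustrative only: `Sessions`, `create`, `run_start`, and `run_end` are hypothetical stand-ins for the real session store, and only the name `active_session_for_folder` comes from the commit message.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SessionStatus {
    Idle,
    Active,
}

struct Session {
    folder: String,
    status: SessionStatus,
}

/// In-memory stand-in for the session store (hypothetical).
struct Sessions {
    by_id: HashMap<u32, Session>,
    next_id: u32,
}

impl Sessions {
    fn new() -> Sessions {
        Sessions { by_id: HashMap::new(), next_id: 1 }
    }

    /// A new session starts idle, never active.
    fn create(&mut self, folder: &str) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.by_id.insert(id, Session { folder: folder.to_string(), status: SessionStatus::Idle });
        id
    }

    /// Set when a turn starts running.
    fn run_start(&mut self, id: u32) {
        if let Some(s) = self.by_id.get_mut(&id) {
            s.status = SessionStatus::Active;
        }
    }

    /// Cleared when the turn ends.
    fn run_end(&mut self, id: u32) {
        if let Some(s) = self.by_id.get_mut(&id) {
            s.status = SessionStatus::Idle;
        }
    }

    /// Id of a session in `folder` that is currently running a turn, if any.
    fn active_session_for_folder(&self, folder: &str) -> Option<u32> {
        for (id, s) in &self.by_id {
            if s.folder.as_str() == folder && s.status == SessionStatus::Active {
                return Some(*id);
            }
        }
        None
    }
}

fn main() {
    let mut sessions = Sessions::new();
    let id = sessions.create("/repo");
    println!("after create: {:?}", sessions.active_session_for_folder("/repo"));
    sessions.run_start(id);
    println!("during run:   {:?}", sessions.active_session_for_folder("/repo"));
    sessions.run_end(id);
    println!("after run:    {:?}", sessions.active_session_for_folder("/repo"));
}
```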

adithyn7 added 5 commits June 5, 2026 18:49

adithyn7 added the feature (New feature) and minor (Minor version bump) labels Jun 5, 2026

adithyn7 added 2 commits June 5, 2026 19:03

adithyn7 merged commit 37c0195 into main Jun 5, 2026
4 checks passed

adithyn7 merged 7 commits into main from feat/eval-harness-bench-runner