Eval harness: bench-runner + context-sync + bulk-indexer (+CI)#132

Merged
adithyn7 merged 7 commits into main from feat/eval-harness-bench-runner
Jun 5, 2026

adithyn7 commented Jun 5, 2026

What

Adds the SuperCoder side of the L1/L2 eval harness — the atomic unit that runs the agent against one task — plus the CI/release plumbing to ship it.

New crates

  • crates/bench-runner — headless single-task runner: reads a task spec (--task-file/stdin), runs the agent, emits a structured JSON result (patch, turns, tokens, tool calls). Ships as a static x86_64-musl binary so it runs toolchain-free inside SWE-bench task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent → grep/glob (OFF). Patch = git diff <base_commit> (tracked files only, so build artifacts can't pollute it).
  • crates/context-sync — the index-streaming client extracted byte-for-byte from the desktop app's context_watcher (streamer.rs + ignore_filter.rs) into a lib, so the offline indexer can reuse it without the Tauri/webkit deps. The 45 unit tests moved with it. Desktop now imports from context_sync.
  • crates/bulk-indexer — offline native host tool that indexes one repo checkout into the context-engine via context_sync::Streamer::full_sync (not musl; never enters a container).
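
The ON/OFF split and the patch rule described for bench-runner can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the `SearchTools` type, `from_flag`, and `patch_command` are hypothetical names, and whether grep/glob stay available on the ON path is not stated in this PR.

```rust
use std::process::Command;

/// Which search tools the agent is offered. Tool names come from the PR text;
/// the type itself is a hypothetical illustration.
#[derive(Debug, PartialEq)]
enum SearchTools {
    /// ON path: `--context-engine-url` was supplied.
    ContextEngine { url: String },
    /// OFF path: no URL was supplied.
    GrepGlob,
}

impl SearchTools {
    /// Maps the optional flag value to the ON or OFF path.
    fn from_flag(context_engine_url: Option<&str>) -> Self {
        match context_engine_url {
            Some(url) => SearchTools::ContextEngine { url: url.to_string() },
            None => SearchTools::GrepGlob,
        }
    }

    /// Names of the tools exposed on this path.
    fn tool_names(&self) -> [&'static str; 2] {
        match self {
            SearchTools::ContextEngine { .. } => ["codebase_search", "codebase_graph"],
            SearchTools::GrepGlob => ["grep", "glob"],
        }
    }
}

/// Builds `git diff <base_commit>` for the task checkout. `git diff` against a
/// commit only reports tracked files, so untracked build artifacts stay out.
fn patch_command(repo_dir: &str, base_commit: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.current_dir(repo_dir).arg("diff").arg(base_commit);
    cmd
}
```

Calling `patch_command(dir, base).output()` would then capture the diff text that goes into the JSON result's patch field.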

reqwest → rustls (R1)

All reqwest deps (agent, git-ops, context-sync, desktop) switch to default-features=false + rustls-tls (no native-tls/OpenSSL). Feature unification across the single workspace lockfile means they must agree, or musl static linking breaks — guard comments mark each line.
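
For reference, a dependency line of the shape described above might look like this in each crate's Cargo.toml. The version number is a placeholder (the PR does not state one), and the comment wording is only a guess at what the guard comments say.

```toml
[dependencies]
# GUARD (R1): keep rustls-only. Enabling default features pulls in native-tls/OpenSSL
# through workspace feature unification and breaks the static musl link.
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls"] }
```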

CI / release

  • ci.yml: new bench-runner-musl job builds -p bench-runner for musl (native rustup + musl-tools) and asserts the binary is statically linked + self-contained — this is the only thing that catches a native-tls regression (cargo check can't). Also adds cargo test --workspace.
  • release.yml: new bench-runner job uploads bench-runner-x86_64-unknown-linux-musl as a release asset.
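
The PR does not show how the CI job asserts that the binary is statically linked. One possible check, sketched here under that assumption, is to confirm that the ELF image carries no `PT_INTERP` program header, meaning it requests no dynamic loader. The function names are hypothetical and the parser handles only little-endian ELF64, which covers x86_64.

```rust
/// Program-header type that names the dynamic loader (ELF spec value).
const PT_INTERP: u64 = 3;

/// Reads `len` (at most 8) little-endian bytes at offset `at`, if in bounds.
fn read_le(buf: &[u8], at: usize, len: usize) -> Option<u64> {
    let bytes = buf.get(at..at.checked_add(len)?)?;
    Some(bytes.iter().rev().fold(0u64, |acc, b| (acc << 8) | u64::from(*b)))
}

/// Returns Some(true) if a little-endian ELF64 image has a PT_INTERP segment,
/// Some(false) if it has none, and None if the bytes are not such an image.
fn has_pt_interp(elf: &[u8]) -> Option<bool> {
    // e_ident: magic, EI_CLASS = 2 (64-bit), EI_DATA = 1 (little-endian).
    if elf.len() < 64 || !elf.starts_with(b"\x7fELF") || elf[4] != 2 || elf[5] != 1 {
        return None;
    }
    let e_phoff = read_le(elf, 32, 8)? as usize; // program header table offset
    let e_phentsize = read_le(elf, 54, 2)? as usize; // size of one entry
    let e_phnum = read_le(elf, 56, 2)? as usize; // number of entries
    for i in 0..e_phnum {
        let start = e_phoff.checked_add(i.checked_mul(e_phentsize)?)?;
        // p_type is the first 4 bytes of each program header.
        if read_le(elf, start, 4)? == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}
```

A CI step could read the built binary with `std::fs::read` and fail the job when `has_pt_interp` returns anything other than `Some(false)`.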

Verified locally

  • context-sync 45 tests green; desktop test suite green; bench-runner ON + OFF verified end-to-end inside a clean python:3.11 container with the static musl binary (agent uses codebase_search on the ON path; patch applies and pytest passes).
  • Native-musl CI build path de-risked in a clean ubuntu:22.04 (ring/rustls link; binary runs standalone).
  • go build/vet ./services/... and the desktop frontend build both pass (untouched areas).

adithyn7 added 5 commits June 5, 2026 18:49
  • Use default-features=false + rustls-tls (drop native-tls/OpenSSL) so the workspace links cleanly for the static x86_64-unknown-linux-musl bench-runner. Workspace feature unification means every reqwest dep must agree, so the desktop crate follows in the next commit. Guard comments mark each line.
  • Move the index-streaming client out of the desktop app into crates/context-sync so the offline bulk-indexer can reuse it without the Tauri/webkit deps. Byte-for-byte move (45 tests come along); desktop context_watcher now imports from context_sync. Desktop reqwest also moves to rustls-tls here (R1).
  • New crate: runs the SuperCoder agent on one task spec and emits a structured JSON result (patch, turns, tokens, tool calls). Static musl x86_64 binary for toolchain-free task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent falls back to grep/glob. Patch = git diff <base_commit> (tracked files only).
  • New native (non-musl) host tool: indexes one repo checkout into the context-engine under a given key via context_sync::Streamer::full_sync. Generic and manifest-agnostic; the Python eval harness drives it per instance. Appends /api/v1 to the engine URL to match the streaming index routes.
  • CI: add a bench-runner-musl job that builds -p bench-runner for x86_64-unknown-linux-musl (native rustup + musl-tools) and asserts it is statically linked and self-contained, which guards the rustls-only invariant that cargo check can't catch. Also run cargo test --workspace. Release: add a bench-runner job that uploads the static binary as a release asset.
adithyn7 added the labels feature (New feature) and minor (Minor version bump) on Jun 5, 2026
adithyn7 added 2 commits June 5, 2026 19:03
  • Rust defaults x86_64-unknown-linux-musl to static-PIE, which kept a PT_INTERP to /lib/ld-musl and segfaulted at startup on the native CI runner (local emulation masked it). relocation-model=static yields a classic non-PIE fully static binary that runs on hosts without a musl loader.
  • A session's status is idle on creation and only becomes active while a turn is running (set on run start, cleared on run end; see agent_create_session). The test wrongly expected a brand-new session to be active. Assert create -> idle, then idle -> active -> idle drives active_session_for_folder.
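
The session lifecycle that the corrected test asserts can be pictured with a small model. Everything here is a hypothetical sketch: the `Store` type, its fields, and the return type of `active_session_for_folder` are assumptions for illustration, not the desktop crate's real definitions.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Idle,
    Active,
}

struct Session {
    folder: String,
    status: Status,
}

/// Hypothetical in-memory session store.
#[derive(Default)]
struct Store {
    sessions: HashMap<u32, Session>,
    next_id: u32,
}

impl Store {
    /// A new session starts idle; it is not active until a turn runs.
    fn create_session(&mut self, folder: &str) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        let session = Session { folder: folder.to_string(), status: Status::Idle };
        self.sessions.insert(id, session);
        id
    }

    /// Set to Active on run start and back to Idle on run end.
    fn set_status(&mut self, id: u32, status: Status) {
        if let Some(session) = self.sessions.get_mut(&id) {
            session.status = status;
        }
    }

    /// Id of a session in `folder` that is currently running a turn, if any.
    fn active_session_for_folder(&self, folder: &str) -> Option<u32> {
        self.sessions
            .iter()
            .find(|(_, s)| s.folder == folder && s.status == Status::Active)
            .map(|(id, _)| *id)
    }
}
```

Under this model, a freshly created session is invisible to `active_session_for_folder` until its status is set to active, which matches the create -> idle, idle -> active -> idle sequence in the commit message.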
adithyn7 merged commit 37c0195 into main on Jun 5, 2026
4 checks passed