Eval harness: bench-runner + context-sync + bulk-indexer (+CI) #132
Merged
Use default-features=false + rustls-tls (drop native-tls/OpenSSL) so the workspace links cleanly for the static x86_64-unknown-linux-musl bench-runner. Workspace feature unification means every reqwest dep must agree, so the desktop crate follows in the next commit. Guard comments mark each line.
Move the index-streaming client out of the desktop app into crates/context-sync so the offline bulk-indexer can reuse it without the Tauri/webkit deps. Byte-for-byte move (45 tests come along); desktop context_watcher now imports from context_sync. Desktop reqwest also moves to rustls-tls here (R1).
New crate: runs the SuperCoder agent on one task spec and emits a structured JSON result (patch, turns, tokens, tool calls). Static musl x86_64 binary for toolchain-free task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent falls back to grep/glob. Patch = git diff <base_commit> (tracked files only).
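The ON/OFF switch and the patch capture described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crate's actual code: `SearchBackend`, `select_backend`, `patch_args`, and `capture_patch` are hypothetical names, and whether the ON path also keeps grep/glob available is not stated in the commit message.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Which code-search tools the agent gets (hypothetical type, not from the PR).
#[derive(Debug, PartialEq)]
enum SearchBackend {
    /// ON path: codebase_search / codebase_graph against the engine at `url`.
    ContextEngine { url: String },
    /// OFF path: plain grep / glob.
    GrepGlob,
}

/// Picks the backend from the optional --context-engine-url value.
fn select_backend(context_engine_url: Option<&str>) -> SearchBackend {
    match context_engine_url {
        Some(url) => SearchBackend::ContextEngine { url: url.to_string() },
        None => SearchBackend::GrepGlob,
    }
}

/// Arguments for `git diff <base_commit>`. Plain `git diff` against a commit
/// only reports tracked files, so untracked build artifacts stay out.
fn patch_args(base_commit: &str) -> Vec<String> {
    vec!["diff".to_string(), base_commit.to_string()]
}

/// Runs git in `repo_dir` and returns the diff text (assumes git is on PATH).
fn capture_patch(repo_dir: &Path, base_commit: &str) -> io::Result<String> {
    let out = Command::new("git")
        .current_dir(repo_dir)
        .args(patch_args(base_commit))
        .output()?;
    if !out.status.success() {
        let msg = String::from_utf8_lossy(&out.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, msg));
    }
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

Keeping the argument list in its own function makes the "tracked files only" behaviour easy to check without a real repository.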
New native (non-musl) host tool: indexes one repo checkout into the context-engine under a given key via context_sync::Streamer::full_sync. Generic and manifest-agnostic; the Python eval harness drives it per instance. Appends /api/v1 to the engine URL to match the streaming index routes.
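The URL handling mentioned above might look like this. It is a sketch under assumptions: `index_api_base` is a hypothetical helper name, and trimming a trailing slash before appending is a guess at sensible behaviour rather than something the commit message confirms.

```rust
/// Builds the base URL for the streaming index routes by appending /api/v1.
/// Trailing slashes on the input are dropped first (an assumption) so the
/// result never contains a double slash.
fn index_api_base(engine_url: &str) -> String {
    format!("{}/api/v1", engine_url.trim_end_matches('/'))
}
```

With this shape, `http://localhost:8080` and `http://localhost:8080/` both map to `http://localhost:8080/api/v1`.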
CI: add a bench-runner-musl job that builds -p bench-runner for x86_64-unknown-linux-musl (native rustup + musl-tools) and asserts it is statically linked and self-contained — guards the rustls-only invariant that cargo check can't catch. Also run cargo test --workspace. Release: add a bench-runner job that uploads the static binary as a release asset.
Rust defaults x86_64-unknown-linux-musl to static-PIE, which kept a PT_INTERP to /lib/ld-musl and segfaulted at startup on the native CI runner (local emulation masked it). relocation-model=static yields a classic non-PIE fully static binary that runs on hosts without a musl loader.
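One way to check for the PT_INTERP problem described above is to walk the ELF program headers directly. The sketch below is illustrative only and is not the assertion the CI job actually runs (the PR does not show it). It handles little-endian ELF64 images, which covers x86_64, and `has_pt_interp` and `read_le` are hypothetical names.

```rust
/// Program header type for the interpreter (dynamic loader) path.
const PT_INTERP: u64 = 3;

/// Reads a little-endian unsigned integer of `len` bytes (len <= 8) at `off`.
fn read_le(buf: &[u8], off: usize, len: usize) -> Option<u64> {
    let end = off.checked_add(len)?;
    let bytes = buf.get(off..end)?;
    // Fold from the most significant (last) byte down to the first.
    Some(bytes.iter().rev().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}

/// Returns Some(true) if a little-endian ELF64 image has a PT_INTERP segment,
/// Some(false) if it has none, and None if the bytes are not such an image.
fn has_pt_interp(elf: &[u8]) -> Option<bool> {
    // e_ident: magic, EI_CLASS == 2 (64-bit), EI_DATA == 1 (little-endian).
    if !elf.starts_with(b"\x7fELF") || *elf.get(4)? != 2 || *elf.get(5)? != 1 {
        return None;
    }
    let phoff = read_le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = read_le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = read_le(elf, 0x38, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let entry = phoff.checked_add(i.checked_mul(phentsize)?)?;
        // p_type is the first 4 bytes of each program header.
        if read_le(elf, entry, 4)? == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}
```

A binary built with relocation-model=static should yield `Some(false)` when its bytes (for example from `std::fs::read`) are passed in. A stricter check could also reject PT_DYNAMIC segments; that is left out here.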
A session's status is idle on creation and only becomes active while a turn is running (set on run start, cleared on run end; see agent_create_session). The test wrongly expected a brand-new session to be active. Assert create -> idle, then idle -> active -> idle drives active_session_for_folder.
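The status lifecycle the corrected test asserts can be modelled as a small state machine. This is a toy model under assumptions, not the application's code: only the name `active_session_for_folder` comes from the commit message, while `Sessions`, `create`, `run_start`, `run_end`, and `status` are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Idle,
    Active,
}

struct Session {
    folder: String,
    status: Status,
}

/// In-memory session registry (hypothetical stand-in for the real store).
#[derive(Default)]
struct Sessions {
    by_id: HashMap<u32, Session>,
    next_id: u32,
}

impl Sessions {
    /// A new session starts out idle, never active.
    fn create(&mut self, folder: &str) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        let session = Session { folder: folder.to_string(), status: Status::Idle };
        self.by_id.insert(id, session);
        id
    }

    /// Marks the session active for the duration of a turn.
    fn run_start(&mut self, id: u32) {
        if let Some(s) = self.by_id.get_mut(&id) {
            s.status = Status::Active;
        }
    }

    /// Clears the active flag when the turn finishes.
    fn run_end(&mut self, id: u32) {
        if let Some(s) = self.by_id.get_mut(&id) {
            s.status = Status::Idle;
        }
    }

    fn status(&self, id: u32) -> Option<Status> {
        self.by_id.get(&id).map(|s| s.status)
    }

    /// Returns the id of a session in `folder` that is currently running a turn.
    fn active_session_for_folder(&self, folder: &str) -> Option<u32> {
        self.by_id
            .iter()
            .find(|(_, s)| s.folder == folder && s.status == Status::Active)
            .map(|(id, _)| *id)
    }
}
```

In this model the lookup returns nothing right after `create`, returns the session id between `run_start` and `run_end`, and returns nothing again afterwards, which is the create -> idle, idle -> active -> idle sequence the test now checks.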
What
Adds the SuperCoder side of the L1/L2 eval harness — the atomic unit that runs the agent against one task — plus the CI/release plumbing to ship it.
New crates
- crates/bench-runner: headless single-task runner. Reads a task spec (--task-file/stdin), runs the agent, and emits a structured JSON result (patch, turns, tokens, tool calls). Ships as a static x86_64-musl binary so it runs toolchain-free inside SWE-bench task containers. --context-engine-url enables codebase_search/codebase_graph (ON path); absent → grep/glob (OFF). Patch = git diff <base_commit> (tracked files only, so build artifacts can't pollute it).
- crates/context-sync: the index-streaming client extracted byte-for-byte from the desktop app's context_watcher (streamer.rs + ignore_filter.rs) into a lib, so the offline indexer can reuse it without the Tauri/webkit deps. The 45 unit tests moved with it. Desktop now imports from context_sync.
- crates/bulk-indexer: offline native host tool that indexes one repo checkout into the context-engine via context_sync::Streamer::full_sync (not musl; never enters a container).
reqwest → rustls (R1)
All reqwest deps (agent, git-ops, context-sync, desktop) switch to default-features=false + rustls-tls (no native-tls/OpenSSL). Feature unification across the single workspace lockfile means they must agree, or musl static linking breaks. Guard comments mark each line.
CI / release
- ci.yml: new bench-runner-musl job builds -p bench-runner for musl (native rustup + musl-tools) and asserts the binary is statically linked + self-contained. This is the only thing that catches a native-tls regression (cargo check can't). Also adds cargo test --workspace.
- release.yml: new bench-runner job uploads bench-runner-x86_64-unknown-linux-musl as a release asset.
Verified locally
- context-sync 45 tests green; desktop test suite green; bench-runner ON + OFF verified end-to-end inside a clean python:3.11 container with the static musl binary (agent uses codebase_search on the ON path; patch applies and pytest passes).
- ubuntu:22.04 (ring/rustls link; binary runs standalone).
- go build/vet ./services/... and the desktop frontend build both pass (untouched areas).