fix: tool resilience & crash-durable persistence (5 fixes)#34
Merged
Conversation
shell ran cmd.Run() on a plain exec.Command with no timeout and no context, so a stuck command (network read that never returns, interactive prompt, infinite loop) wedged the agent forever — and Ctrl-C could not recover because the loop's drain blocks on the tool goroutine. The sibling parallel_shell already had a timeout; plain shell, the most-used tool, did not. - SetContext ties execution to the agent context (the loop already calls it on context-aware tools), so Ctrl-C / turn timeout kills the command now. - exec.CommandContext + a generous 30m backstop timeout bounds genuinely stuck commands for unattended runs (serve, telegram) with no human to interrupt. WaitDelay guarantees Run() returns even if the killed process leaves children holding the pipes. - Cancellation/timeout surface as clear errors, not opaque "signal: killed". Tests: a sleeping command now returns promptly via both the timeout and context-cancellation paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only delegate_tasks honored the loop's SetContext, so a turn timeout or Ctrl-C could not interrupt the other long-running tools — their HTTP requests and subprocesses ran to completion regardless, and the loop's drain blocked on them. Add a small race-safe ctxTool embeddable (SetContext + toolCtx) and wire it through the tools that do unbounded network or subprocess work: - http_batch, browser, web_search → http.NewRequestWithContext - vision (ffprobe/ffmpeg/llama-mtmd-cli), transcribe (ffmpeg/whisper) → exec.CommandContext - shell now uses the shared embed too (replacing its bare ctx field). The mutex in ctxTool matters: when the LLM emits two calls to the same tool in one turn, the loop sets the context on the shared instance from parallel goroutines — an unsynchronised field would race. Tests: ctxTool default/set/concurrent behavior, and http_batch returning a cancellation error when its context is already cancelled. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
SimpleCall did a single http.Do with no retry, so any transient 429/5xx or network blip aborted the best-effort secondary features that use it (skill matching, memory summaries, episode extraction, session titles), while the main agent Call retried. Extract the retry loop into postChatWithRetry and route both through it. Test: SimpleCall now succeeds after two 429s (3 attempts). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The retry loop used a fixed 1/2/4s exponential backoff and ignored the server's Retry-After header, so a real rate limit (Retry-After: 20-60s) burned all three retries in ~7s and failed the turn even though the server said exactly when to come back. Parse Retry-After (integer seconds or HTTP-date) on retryable statuses and use it for the next wait, capped at 60s so a pathological value can't wedge a turn (ctx still breaks the wait). Tests: parseRetryAfter unit cases (seconds, blank, garbage, zero, cap) and a 429+Retry-After call that retries and succeeds. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Session and memory writes did WriteFile(tmp)+Rename — atomic against torn reads, but NOT durable: without an fsync a power loss / kernel crash can land the rename while the data is still in the page cache, leaving an empty or truncated file and silently losing the latest conversation turn or extracted memory. session.go also used a fixed "<target>.tmp" name, so two concurrent saves of the same target could clobber each other's temp file. Add internal/fsatomic.WriteFile (unique temp → fsync data → rename → fsync dir) and route the irreplaceable writes through it: session save + index, episode index + summaries, and the facts store. Tests: fsatomic content/perm/overwrite, no temp litter, and concurrent same-target writers never producing a torn file. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
odek | 7242f51 | Commit Preview URL Branch Preview URL |
Jun 12 2026, 04:29 AM |
vprotocol: in sandbox mode the timeout/ctx kills the host-side docker exec client (unblocking the agent) but Docker does not forward the signal to the in-container process, which lingers until container teardown. Document the sharp edge so it isn't a future debugging mystery. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI caught it: the two shell timeout tests passed locally but hung ~5s on CI. exec.CommandContext's default Cancel SIGKILLs only the `sh` leader. On a shell that forks the command (`sh -c "sleep 30"` → child sleep, or any pipeline), the child survives and holds the output pipe, so Run() blocks until WaitDelay (5s) — over the test's deadline, and a real 30m-timeout command would leave a lingering process. Run the command in its own process group (Setpgid) and override Cancel to SIGKILL the whole group (negative pid). The repo is Unix-only and already uses syscall.Kill directly. WaitDelay drops to 3s as a backstop. Tests now use a forking pipeline (`sleep 30 | cat`) so they reproduce the CI failure locally; both return in <0.2s with the group kill. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Five focused, independently-committed fixes for correctness, recoverability, and durability — the top items from a direct audit. Each commit ships with regression tests; full suite green under
go test ./... -race, cleango vet.1. Shell can no longer hang the agent forever (
5977e21)shellrancmd.Run()on a plainexec.Commandwith no timeout/context, so a stuck command (network read, interactive prompt, infinite loop) wedged the agent and Ctrl-C couldn't recover. Now usesexec.CommandContext(viaSetContext, which the loop already calls) + a generous 30m backstop timeout +WaitDelay. Cancellation/timeout surface as clear errors instead of opaque "signal: killed".2. Tools are now cancellable as a class (
aade314)Only
delegate_taskshonored the loop'sSetContext, so a turn timeout / Ctrl-C couldn't interrupt the other long-running tools. A small race-safectxToolembed threads the agent context intohttp_batch,browser,web_search(→NewRequestWithContext),vision,transcribe(→CommandContext), andshell. The mutex matters: the loop sets the context on a shared tool instance from parallel goroutines when the LLM emits two calls to the same tool.3.
SimpleCallgets the main loop's retry (dabc1cf)SimpleCalldid a singlehttp.Dowith no retry, so a transient 429/5xx aborted the best-effort secondary features (skill match, memory, episodes, titles). ExtractedpostChatWithRetry; both paths now share it.4. Honor
Retry-Afteron rate limits (1ae5ef5)The retry loop used fixed 1/2/4s backoff and ignored
Retry-After, burning all three retries in ~7s under a real rate limit. Now parsesRetry-After(seconds or HTTP-date), capped at 60s (ctx still breaks the wait).5. fsync before rename — crash-durable persistence (
05d8e29)Session/memory writes did
WriteFile(tmp)+Rename— atomic but not durable; a power loss could land the rename with unflushed data and silently lose the latest turn/memory.session.goalso used a fixed<target>.tmpname (concurrent-save clobber). Newinternal/fsatomic.WriteFile(unique temp → fsync data → rename → fsync dir) routes the irreplaceable writes: session save + index, episode index + summaries, facts store.Notes
timeoutfield is settable but not yet wired to config (kept un-plumbed to avoid a config detour — easy to add ashell_timeout_secondsknob later).🤖 Generated with Claude Code