Skip to content

fix: tool resilience & crash-durable persistence (5 fixes)#34

Merged
jkyberneees merged 7 commits into
mainfrom
fix/tool-resilience-and-durability
Jun 12, 2026
Merged

fix: tool resilience & crash-durable persistence (5 fixes)#34
jkyberneees merged 7 commits into
mainfrom
fix/tool-resilience-and-durability

Conversation

@jkyberneees

Copy link
Copy Markdown
Contributor

Five focused, independently-committed fixes for correctness, recoverability, and durability — the top items from a direct audit. Each commit ships with regression tests; full suite green under go test ./... -race, clean go vet.

1. Shell can no longer hang the agent forever (5977e21)

shell ran cmd.Run() on a plain exec.Command with no timeout/context, so a stuck command (network read, interactive prompt, infinite loop) wedged the agent and Ctrl-C couldn't recover. Now uses exec.CommandContext (via SetContext, which the loop already calls) + a generous 30m backstop timeout + WaitDelay. Cancellation/timeout surface as clear errors instead of opaque "signal: killed".

2. Tools are now cancellable as a class (aade314)

Only delegate_tasks honored the loop's SetContext, so a turn timeout / Ctrl-C couldn't interrupt the other long-running tools. A small race-safe ctxTool embed threads the agent context into http_batch, browser, web_search (→ NewRequestWithContext), vision, transcribe (→ CommandContext), and shell. The mutex matters: the loop sets the context on a shared tool instance from parallel goroutines when the LLM emits two calls to the same tool.

3. SimpleCall gets the main loop's retry (dabc1cf)

SimpleCall did a single http.Do with no retry, so a transient 429/5xx aborted the best-effort secondary features (skill match, memory, episodes, titles). Extracted postChatWithRetry; both paths now share it.

4. Honor Retry-After on rate limits (1ae5ef5)

The retry loop used fixed 1/2/4s backoff and ignored Retry-After, burning all three retries in ~7s under a real rate limit. Now parses Retry-After (seconds or HTTP-date), capped at 60s (ctx still breaks the wait).

5. fsync before rename — crash-durable persistence (05d8e29)

Session/memory writes did WriteFile(tmp)+Rename — atomic but not durable; a power loss could land the rename with unflushed data and silently lose the latest turn/memory. session.go also used a fixed <target>.tmp name (concurrent-save clobber). New internal/fsatomic.WriteFile (unique temp → fsync data → rename → fsync dir) routes the irreplaceable writes: session save + index, episode index + summaries, facts store.

Notes

🤖 Generated with Claude Code

jkyberneees and others added 5 commits June 12, 2026 06:04
shell ran cmd.Run() on a plain exec.Command with no timeout and no context,
so a stuck command (network read that never returns, interactive prompt,
infinite loop) wedged the agent forever — and Ctrl-C could not recover
because the loop's drain blocks on the tool goroutine. The sibling
parallel_shell already had a timeout; plain shell, the most-used tool, did
not.

- SetContext ties execution to the agent context (the loop already calls it
  on context-aware tools), so Ctrl-C / turn timeout kills the command now.
- exec.CommandContext + a generous 30m backstop timeout bounds genuinely
  stuck commands for unattended runs (serve, telegram) with no human to
  interrupt. WaitDelay guarantees Run() returns even if the killed process
  leaves children holding the pipes.
- Cancellation/timeout surface as clear errors, not opaque "signal: killed".

Tests: a sleeping command now returns promptly via both the timeout and
context-cancellation paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only delegate_tasks honored the loop's SetContext, so a turn timeout or
Ctrl-C could not interrupt the other long-running tools — their HTTP
requests and subprocesses ran to completion regardless, and the loop's
drain blocked on them.

Add a small race-safe ctxTool embeddable (SetContext + toolCtx) and wire it
through the tools that do unbounded network or subprocess work:
- http_batch, browser, web_search → http.NewRequestWithContext
- vision (ffprobe/ffmpeg/llama-mtmd-cli), transcribe (ffmpeg/whisper) →
  exec.CommandContext
- shell now uses the shared embed too (replacing its bare ctx field).

The mutex in ctxTool matters: when the LLM emits two calls to the same tool
in one turn, the loop sets the context on the shared instance from parallel
goroutines — an unsynchronised field would race.

Tests: ctxTool default/set/concurrent behavior, and http_batch returning a
cancellation error when its context is already cancelled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
SimpleCall did a single http.Do with no retry, so any transient 429/5xx or
network blip aborted the best-effort secondary features that use it (skill
matching, memory summaries, episode extraction, session titles), while the
main agent Call retried. Extract the retry loop into postChatWithRetry and
route both through it.

Test: SimpleCall now succeeds after two 429s (3 attempts).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The retry loop used a fixed 1/2/4s exponential backoff and ignored the
server's Retry-After header, so a real rate limit (Retry-After: 20-60s)
burned all three retries in ~7s and failed the turn even though the server
said exactly when to come back. Parse Retry-After (integer seconds or
HTTP-date) on retryable statuses and use it for the next wait, capped at
60s so a pathological value can't wedge a turn (ctx still breaks the wait).

Tests: parseRetryAfter unit cases (seconds, blank, garbage, zero, cap) and
a 429+Retry-After call that retries and succeeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Session and memory writes did WriteFile(tmp)+Rename — atomic against torn
reads, but NOT durable: without an fsync a power loss / kernel crash can
land the rename while the data is still in the page cache, leaving an empty
or truncated file and silently losing the latest conversation turn or
extracted memory. session.go also used a fixed "<target>.tmp" name, so two
concurrent saves of the same target could clobber each other's temp file.

Add internal/fsatomic.WriteFile (unique temp → fsync data → rename → fsync
dir) and route the irreplaceable writes through it: session save + index,
episode index + summaries, and the facts store.

Tests: fsatomic content/perm/overwrite, no temp litter, and concurrent
same-target writers never producing a torn file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 12, 2026

Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
odek 7242f51 Commit Preview URL

Branch Preview URL
Jun 12 2026, 04:29 AM

jkyberneees and others added 2 commits June 12, 2026 06:23
vprotocol: in sandbox mode the timeout/ctx kills the host-side docker exec
client (unblocking the agent) but Docker does not forward the signal to the
in-container process, which lingers until container teardown. Document the
sharp edge so it isn't a future debugging mystery.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI caught it: the two shell timeout tests passed locally but hung ~5s on CI.
exec.CommandContext's default Cancel SIGKILLs only the `sh` leader. On a
shell that forks the command (`sh -c "sleep 30"` → child sleep, or any
pipeline), the child survives and holds the output pipe, so Run() blocks
until WaitDelay (5s) — over the test's deadline, and a real 30m-timeout
command would leave a lingering process.

Run the command in its own process group (Setpgid) and override Cancel to
SIGKILL the whole group (negative pid). The repo is Unix-only and already
uses syscall.Kill directly. WaitDelay drops to 3s as a backstop.

Tests now use a forking pipeline (`sleep 30 | cat`) so they reproduce the
CI failure locally; both return in <0.2s with the group kill.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jkyberneees jkyberneees merged commit 4dc1487 into main Jun 12, 2026
7 checks passed
@jkyberneees jkyberneees deleted the fix/tool-resilience-and-durability branch June 12, 2026 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant