fix: tool resilience & crash-durable persistence (5 fixes) by jkyberneees · Pull Request #34 · BackendStack21/odek

jkyberneees · 2026-06-12T04:21:44Z

Five focused, independently committed fixes for correctness, recoverability, and durability: the top items from a direct audit. Each commit ships with regression tests; the full suite is green under `go test ./... -race`, and `go vet` is clean.

1. Shell can no longer hang the agent forever (`5977e21`)

shell ran cmd.Run() on a plain exec.Command with no timeout/context, so a stuck command (network read, interactive prompt, infinite loop) wedged the agent and Ctrl-C couldn't recover. Now uses exec.CommandContext (via SetContext, which the loop already calls) + a generous 30m backstop timeout + WaitDelay. Cancellation/timeout surface as clear errors instead of opaque "signal: killed".

2. Tools are now cancellable as a class (`aade314`)

Only delegate_tasks honored the loop's SetContext, so a turn timeout / Ctrl-C couldn't interrupt the other long-running tools. A small race-safe ctxTool embed threads the agent context into http_batch, browser, web_search (→ NewRequestWithContext), vision, transcribe (→ CommandContext), and shell. The mutex matters: the loop sets the context on a shared tool instance from parallel goroutines when the LLM emits two calls to the same tool.

3. `SimpleCall` gets the main loop's retry (`dabc1cf`)

SimpleCall did a single http.Do with no retry, so a transient 429/5xx aborted the best-effort secondary features (skill match, memory, episodes, titles). Extracted postChatWithRetry; both paths now share it.

4. Honor `Retry-After` on rate limits (`1ae5ef5`)

The retry loop used fixed 1/2/4s backoff and ignored Retry-After, burning all three retries in ~7s under a real rate limit. Now parses Retry-After (seconds or HTTP-date), capped at 60s (ctx still breaks the wait).

5. fsync before rename — crash-durable persistence (`05d8e29`)

Session/memory writes did WriteFile(tmp)+Rename — atomic but not durable; a power loss could land the rename with unflushed data and silently lose the latest turn/memory. session.go also used a fixed <target>.tmp name (concurrent-save clobber). New internal/fsatomic.WriteFile (unique temp → fsync data → rename → fsync dir) routes the irreplaceable writes: session save + index, episode index + summaries, facts store.

Notes

- Fix 5 deliberately leaves the rebuildable vector-index caches on plain rename (not worth the churn).
- Fix 1's 30m backstop is a constant; the timeout field is settable but not yet wired to config (kept un-plumbed to avoid a config detour; a shell_timeout_seconds knob is easy to add later).

🤖 Generated with Claude Code

shell ran cmd.Run() on a plain exec.Command with no timeout and no context, so a stuck command (network read that never returns, interactive prompt, infinite loop) wedged the agent forever, and Ctrl-C could not recover because the loop's drain blocks on the tool goroutine. The sibling parallel_shell already had a timeout; plain shell, the most-used tool, did not.

- SetContext ties execution to the agent context (the loop already calls it on context-aware tools), so Ctrl-C / turn timeout kills the command now.
- exec.CommandContext + a generous 30m backstop timeout bounds genuinely stuck commands for unattended runs (serve, telegram) with no human to interrupt. WaitDelay guarantees Run() returns even if the killed process leaves children holding the pipes.
- Cancellation/timeout surface as clear errors, not opaque "signal: killed".

Tests: a sleeping command now returns promptly via both the timeout and context-cancellation paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Only delegate_tasks honored the loop's SetContext, so a turn timeout or Ctrl-C could not interrupt the other long-running tools: their HTTP requests and subprocesses ran to completion regardless, and the loop's drain blocked on them.

Add a small race-safe ctxTool embeddable (SetContext + toolCtx) and wire it through the tools that do unbounded network or subprocess work:

- http_batch, browser, web_search → http.NewRequestWithContext
- vision (ffprobe/ffmpeg/llama-mtmd-cli), transcribe (ffmpeg/whisper) → exec.CommandContext
- shell now uses the shared embed too (replacing its bare ctx field).

The mutex in ctxTool matters: when the LLM emits two calls to the same tool in one turn, the loop sets the context on the shared instance from parallel goroutines, and an unsynchronised field would race.

Tests: ctxTool default/set/concurrent behavior, and http_batch returning a cancellation error when its context is already cancelled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

SimpleCall did a single http.Do with no retry, so any transient 429/5xx or network blip aborted the best-effort secondary features that use it (skill matching, memory summaries, episode extraction, session titles), while the main agent Call retried. Extract the retry loop into postChatWithRetry and route both through it.

Test: SimpleCall now succeeds after two 429s (3 attempts).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The retry loop used a fixed 1/2/4s exponential backoff and ignored the server's Retry-After header, so a real rate limit (Retry-After: 20-60s) burned all three retries in ~7s and failed the turn even though the server said exactly when to come back.

Parse Retry-After (integer seconds or HTTP-date) on retryable statuses and use it for the next wait, capped at 60s so a pathological value can't wedge a turn (ctx still breaks the wait).

Tests: parseRetryAfter unit cases (seconds, blank, garbage, zero, cap) and a 429+Retry-After call that retries and succeeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Session and memory writes did WriteFile(tmp)+Rename, which is atomic against torn reads but NOT durable: without an fsync, a power loss / kernel crash can land the rename while the data is still in the page cache, leaving an empty or truncated file and silently losing the latest conversation turn or extracted memory. session.go also used a fixed "<target>.tmp" name, so two concurrent saves of the same target could clobber each other's temp file.

Add internal/fsatomic.WriteFile (unique temp → fsync data → rename → fsync dir) and route the irreplaceable writes through it: session save + index, episode index + summaries, and the facts store.

Tests: fsatomic content/perm/overwrite, no temp litter, and concurrent same-target writers never producing a torn file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-06-12T04:21:50Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! View logs | odek | `7242f51` | Commit Preview URL, Branch Preview URL | Jun 12 2026, 04:29 AM |

vprotocol: in sandbox mode the timeout/ctx kills the host-side docker exec client (unblocking the agent) but Docker does not forward the signal to the in-container process, which lingers until container teardown. Document the sharp edge so it isn't a future debugging mystery.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

CI caught it: the two shell timeout tests passed locally but hung ~5s on CI. exec.CommandContext's default Cancel SIGKILLs only the `sh` leader. On a shell that forks the command (`sh -c "sleep 30"` → child sleep, or any pipeline), the child survives and holds the output pipe, so Run() blocks until WaitDelay (5s), which is over the test's deadline, and a real 30m-timeout command would leave a lingering process.

Run the command in its own process group (Setpgid) and override Cancel to SIGKILL the whole group (negative pid). The repo is Unix-only and already uses syscall.Kill directly. WaitDelay drops to 3s as a backstop.

Tests now use a forking pipeline (`sleep 30 | cat`) so they reproduce the CI failure locally; both return in <0.2s with the group kill.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jkyberneees and others added 5 commits June 12, 2026 06:04

jkyberneees and others added 2 commits June 12, 2026 06:23

jkyberneees merged commit 4dc1487 into main Jun 12, 2026
7 checks passed

jkyberneees deleted the fix/tool-resilience-and-durability branch June 12, 2026 08:07

jkyberneees merged 7 commits into main from fix/tool-resilience-and-durability
