test(evm): cap child_process.exec in lib.js to surface stalled commands#3526
test(evm): cap child_process.exec in lib.js to surface stalled commands#3526wen-coding wants to merge 4 commits into
Conversation
The Autobahn EVM integration matrix has hit multiple 30-minute job timeouts where hardhat went silent between test files with no error. The orphan-process listing at cleanup consistently shows a docker exec (running `seid tx evm send` or similar) still alive — `child_process.exec` has no timeout by default, so a stalled CLI invocation eats the entire job budget instead of surfacing as a clear error. Add a default 60s timeout to execCommand with a SIGKILL kill signal. On expiry, throw an error containing the offending command so the next occurrence is a 60s actionable error instead of a 30-minute mystery. Override via EXEC_TIMEOUT_MS for tests that legitimately need longer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryLow Risk Overview On timeout it rejects with a message that includes the command (truncated past 200 chars). Reviewed by Cursor Bugbot for commit 22e2a83. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3526 +/- ##
==========================================
- Coverage 59.12% 58.30% -0.83%
==========================================
Files 2213 2140 -73
Lines 182774 174338 -8436
==========================================
- Hits 108072 101649 -6423
+ Misses 64985 63666 -1319
+ Partials 9717 9023 -694
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Per cursor bugbot: error.signal is set on any signal-terminated process (external OOM kill, runner cleanup, etc.), not just timeout kills. error.killed is the precise Node-set-the-killer signal. Narrowing the condition prevents non-timeout signal deaths from being mis-reported as "execCommand timed out". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c1920c4. Configure here.
Per cursor bugbot: Node sets error.killed=true for two cases — the timeout kill we want to attribute, and a maxBuffer overflow which also has error.code = 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER'. Gate the timeout branch on !error.code so a buffer overflow falls through to its original error instead of being mis-reported as a timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
execCommandin contracts/test/lib.js wrapschild_process.execwith no timeout. The Autobahn EVM integration matrix has hit multiple 30‑minute job timeouts where the hardhat process went silent at a test boundary, with the orphan‑process listing at cleanup consistently showing a stalleddocker exec ... seid ...child still alive. The job then consumed its entire wall‑clock budget instead of surfacing a useful error.This caps
execCommandwith a configurable timeout (default 60s,SIGKILL). On expiry, the rejection includes the offending command so the next occurrence is a 60s actionable error instead of a 30‑minute mystery. Override viaEXEC_TIMEOUT_MSfor tests that legitimately need longer.Observed hang pattern (7 occurrences in 3 days)
Each ran for exactly ~30 minutes (job timeout). Investigation of one of the artifacts (26648905384): validators were perfectly healthy (continuing to produce blocks) the whole time, no test‑sender activity reached them during the silence — i.e., the hardhat process was alive but never sent another RPC, consistent with a stalled child process here.
Trade-off
docker execcalls in the suite complete in ~100–200ms; 60s is comfortably above the worst observed normal case.EXEC_TIMEOUT_MSoverrides per-invocation.Test plan
execCommand timed out … <command>error instead of a 30‑min cancel — that error itself becomes the diagnostic.🤖 Generated with Claude Code