fix(tools/talis): wait-for-chain + atomic keyring + one-command driver by walldiss · Pull Request #3303 · evstack/ev-node

walldiss · 2026-05-01T13:29:33Z

Summary

Three race conditions surfaced repeatedly on a fresh AWS bring-up of the Fibre throughput experiment. Each had the same shape: a talis step "succeeded" at the CLI level before the chain actually applied the work, leaving downstream steps to fail in confusing ways. This PR makes each step verify outcome, not just invocation, so the experiment goes from talis up to a running loadgen without manual intervention.

Symptom matrix from the broken run:

setup-fibre "complete" → query valaddr providers empty → uploads cascade to voting power: collected 0
evnode crashes at startup with keyring entry "fibre-0" not found despite the file existing on disk
escrow account not found for signer celestia17ftrx… even though setup-fibre claimed to fund 100× fibre-N accounts

Fixes

setup-fibre script (fibre_setup.go) now:

polls celestia-appd status for latest_block_height>0 before submitting any tx
retries set-host until query valaddr providers reflects the new host
verifies fibre-0's escrow lands on-chain before the tmux session exits
CLI side cross-checks all validators are registered from a single vantage point before returning

fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) stages the keyring scp into a tmp dir then mvs atomically into place. The previous direct scp -r created /root/keyring-fibre/keyring-test/ before transferring its contents — evnode_init.sh's [ -d keyring-test ] poll passed mid-scp.

evnode_init.sh template (genesis.go) waits for the specific fibre-0.info file (not just the directory), as defence-in-depth.

New talis fibre-experiment umbrella command runs up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode in order. Operator goes from a prepared root dir to a running loadgen with one command.

Verified

5-min sustained loadgen on julien/fiber HEAD (which has PR #3287's concurrent submitter merged) after these fixes:

TXSIM: duration=5m0s tx_size=10240 concurrency=8
       sent=1463813 ok=1463796 err=17 wall_s=300.00
       sent_per_s=4879 mb_per_s=47.65
       rtt_p50_us=732 rtt_p99_us=51523

47.65 MB/s sustained, 99.999% ok rate — up from the prior 24.57 MB/s on the older base (the gap is #3287's concurrent submitter; these talis fixes just keep the deploy from silently breaking before throughput matters).

Full incident journal in experiments/julien-3287-perf/journal.md.

🤖 Generated with Claude Code

Three race conditions surfaced repeatedly on a fresh AWS bring-up of the Fibre throughput experiment. Each one had the same shape: a talis subcommand "succeeded" at the CLI level (or returned the txhash with --yes) before the chain had actually applied the work, leaving downstream steps to fail in confusing ways. This commit makes each step verify *outcome*, not just *invocation*, so the experiment can go from a fresh `talis up` to a running loadgen without manual intervention. • setup-fibre script (fibre_setup.go) now: - polls `celestia-appd status` for `latest_block_height>0` before submitting any tx — fixes the silent-noop where set-host + 100× deposit-to-escrow all bounced with "celestia-app is not ready; please wait for first block"; - retries `set-host` in a loop until the validator's host shows up in `query valaddr providers` — fixes the case where --yes returns the txhash before block inclusion and the tx silently lands in the mempool but never confirms; - verifies fibre-0's escrow account is funded on-chain before the tmux session exits — same silent-failure mode as set-host, but on the deposit side. The talis-CLI step also now cross-checks all validators are registered from a single vantage point before returning, so a concurrent set-host race surfaces as an error instead of a half-empty provider list start-fibre would cache forever. • fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages the keyring scp into a tmp directory and `mv`s it atomically into place. The previous direct `scp -r` to /root/keyring-fibre/keyring-test created the directory before transferring its contents — the evnode init script's `[ -d keyring-test ]` poll passed mid-transfer, the daemon launched with no fibre-0.info, and crashed with `keyring entry "fibre-0" not found`. • evnode_init.sh (genesis.go) now waits for the specific keyring-test/fibre-0.info file rather than just the keyring-test directory. Belt-and-braces: the bootstrap mv is already atomic on the same filesystem, but the file-level guard means a hand-pushed keyring (not via talis) can't trip the same race. • New `talis fibre-experiment` umbrella command runs up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode in order. Each step uses the same binary as a subprocess; failures in any step abort the chain. Operator goes from a prepared root dir to a running loadgen with one command, instead of remembering the sequence. Verified by 5-min sustained loadgen against julien/fiber HEAD with PR evstack#3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok, up from the prior 24.57 MB/s baseline (the gap is PR evstack#3287's overlapping uploads — these talis fixes just stop the deploy from silently breaking before throughput matters).

coderabbitai · 2026-05-01T13:29:42Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea3bc22d-5477-42ac-9068-b1f972547b38

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Three follow-up bugs surfaced from the PR evstack#3303 follow-up verification run on a 3-validator AWS Fibre cluster: - aws.go: CreateAWSInstances exited 0 even when individual instance launches failed, so `talis up` lied about success and downstream steps proceeded against a partial cluster. Returns a joined error now so failure cascades stop early. - download.go: sshExec used cmd.CombinedOutput, mixing SSH warnings (the "Warning: Permanently added '...'..." chatter on stderr) into bytes the caller hands to fmt.Sscanf("%d"). The CLI-side providers cross-check parsed those warnings as 0 and looped until its 5-min deadline even though a direct SSH query showed all 3 providers registered. Switch to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR` to silence the chatter for any caller that does combine streams. - fibre_setup.go: the per-validator escrow verification used `celestia-appd query fibre escrow` which doesn't exist — the actual subcommand is `escrow-account`. The query errored on every retry, the grep for "amount" never matched, and the script wedged on the 3-min deadline reporting `FATAL: fibre-0 escrow not present`. Switch to `escrow-account` and key on `"found":true` (the explicit existence flag in the response). Also wrap the fibre-0 deposit-to-escrow itself in a retry loop matching set-host — same `--yes`-returns-before-inclusion silent-failure mode bit it. fibre-1..N stay best-effort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reproduced an OOM at 63.8 GiB on c6in.8xlarge (matching the prior 63.9 GiB note in evnode-fibre's main.go that warned against raising UploadMemoryBudget). On a 3-validator cluster running evnode-txsim at 80 MiB/s, evnode-fibre's submitter queue grew to 50 pending data + 26 pending headers, each blob ≈32 MiB raw. With 3 fan-out targets and per-attempt retry buffers in flight, total in-memory upload state crossed 64 GiB and the kernel killed evnode 30 s into the load test. The loadgen-side TXSIM number collapsed from the prior journal's 47.65 MB/s to 0.66 MB/s as evnode died. Two changes, one root cause (a Fibre upload stall snowballs): - pkg/config: ApplyFiberDefaults sets MaxPendingHeadersAndData to 10 (was 50). The cap is what bounds in-flight blob copies + retry buffers; 50 was sized when evnode hadn't yet seen the pathological case where Fibre uploads time out, retries accumulate, and GC pressure pushes uploads even slower in a positive-feedback loop. 10 keeps the in-flight footprint bounded while still letting healthy uploads pipeline. - evnode-fibre: override SubmitConfig.Fibre to set RPCTimeout to 30 s (upstream default 15 s). Verified live: with the pending cap at 10, a 17-blob 115 MiB upload completes in ~1.5 s — well below the 15 s default. The 30 s margin only matters for transient slow paths during signature collection across 3 FSPs; the cap fix is the load-bearing change. Drop the stale main.go comment claiming we don't override SubmitConfig.Fibre. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Fibre Submit path was opaque: failures showed up as DeadlineExceeded with no signal of how long the upload actually took, and successes only logged at debug level inside the upstream library. During load-test debugging this turned into a guessing game — was the cluster slow, the deadline too tight, or something stuck mid-RPC? Add a single info-level (warn-on-failure) log line in fiberDAClient.Submit covering the Upload call: duration, flat blob bytes, blob count. Cheap (one time.Since) and gives the operator concrete numbers — e.g. "17 blobs / 115 MiB / 1.5 s" — to reason about whether RPCTimeout, pending cap, or batch sizing is the right knob to turn next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under sustained txsim load (~50 MiB/s) the DA submitter batched 10 block_data items into one Upload(), producing a flat payload of 144 MiB. Fibre's per-upload cap is hard at ~128 MiB ("blob size exceeds maximum allowed size: data size 144366912 exceeds maximum 134217723") and rejected every batched upload. With MaxPendingHeadersAndData=10 that took down 170 consecutive submissions before the node halted itself with "Data exceeds DA blob size limit". Wrap the Upload call in a chunker that groups input blobs into ≤120 MiB chunks (8 MiB headroom under Fibre's cap for the per-blob length-prefix overhead added by flattenBlobs) and uploads each chunk separately. Aggregates submitted counts and BlobIDs across chunks; on first chunk failure, returns the error with the partially-submitted count so the submitter's retry/backoff logic sees a coherent state instead of all-or-nothing. Single oversized blobs (already validated against DefaultMaxBlobSize earlier in Submit) still land alone and fail server-side, but at least don't drag healthy peers into the same rejected batch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to the submitter chunking fix. The submitter can split a multi-blob batch into ≤120 MiB Fibre uploads, but a *single* block_data item that exceeds 128 MiB still ends up alone in its own chunk and fails server-side ("blob size exceeds maximum allowed size"). Lower the per-block cap to 100 MiB so under high-throughput txsim a single block can't grow past Fibre's hard limit, and update the comment to explain the relationship between this cap and Fibre's ~128 MiB upload reject threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The stub executor used by the runner returned FilterOK for every transaction unconditionally, ignoring the maxBytes budget plumbed through SoloSequencer.GetNextBatch. Under sustained txsim load (~50 MiB/s, 8 concurrent senders) the mempool would accumulate ~50K txs while a 100 MiB upload was in flight; on the next batch the sequencer drained ALL of them into one block (~369 MiB raw), the submitter saw a single item exceeding the per-blob cap, and halted the node with `single item exceeds DA blob size limit`. Walk the input txs in arrival order, accumulate sizes against maxBytes, and return FilterPostpone past the budget so the sequencer puts the overflow back on its queue. Verified live: blocks now cap at ~10K txs / ~100 MiB and evnode sustains 58.77 MB/s DA upload throughput through a 5-min txsim run with zero crashes (was 0 → crash within 30 s before this fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot assigned walldiss May 1, 2026

walldiss and others added 6 commits May 1, 2026 17:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tools/talis): wait-for-chain + atomic keyring + one-command driver#3303

fix(tools/talis): wait-for-chain + atomic keyring + one-command driver#3303
walldiss wants to merge 7 commits intoevstack:julien/fiberfrom
walldiss:feat/talis-fibre-fixes

walldiss commented May 1, 2026

Uh oh!

coderabbitai Bot commented May 1, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

walldiss commented May 1, 2026

Summary

Fixes

Verified

Uh oh!

coderabbitai Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 1, 2026 •

edited

Loading