Skip to content

fix(payout): harden BTC amount serialization and add retry circuit-breaker#3789

Open
TaprootFreak wants to merge 3 commits into
developfrom
fix/bitcoin-payout-hardening
Open

fix(payout): harden BTC amount serialization and add retry circuit-breaker#3789
TaprootFreak wants to merge 3 commits into
developfrom
fix/bitcoin-payout-hardening

Conversation

@TaprootFreak
Copy link
Copy Markdown
Collaborator

Context

Today (2026-05-29) two BTC BuyCrypto payouts stalled for 43 min (Order 106888 / BuyCrypto 123873) and 8 min (Order 106891 / BuyCrypto 123874) in silent retry loops before random JS floating-point rolls happened to produce values Bitcoin Core's ParseFixedPoint accepted.

PR #3729 (merged 2026-05-29 15:59) fixed the fee_rate field — the immediate trigger. But three systemic gaps remain:

  1. amounts are still unsanitized. Bitcoin Core parses amount fields with ParseFixedPoint(decimals=8). Float artifacts like 0.1 + 0.2 = 0.30000000000000004 (after aggregation or fee-adjustments downstream of BitcoinBasedStrategy.aggregatePayout) trigger the same Invalid amount (error code -3) at a different field.
  2. No alerting. The cron retries every ~30 s indefinitely, no escalation. The 43-min outage was invisible to the operator until a customer asked.
  3. No retry tracking. Operators have no way to see which orders are chronically failing without grepping App Insights logs.

This PR fixes all three.

Changes

1. Amount sanitization + structured validation (BTC only)

  • PayoutBitcoinService.sendUtxoToMany now quantizes every payout amount to 8 decimals with Util.round(x, 8) before the RPC call.
  • fee_rate is re-rounded to 3 decimals as defense-in-depth — independent of fix(bitcoin): round send fee_rate to 3 decimals #3729, so a future regression on the fee service can't bring the bug back.
  • New InvalidPayoutAmountException: NaN, Infinity, zero or negative amounts now fail fast at the boundary with a clear error message instead of being lost as an opaque Invalid amount from the node.

2. Persistent retry tracking (all UTXO chains)

  • Migration AddPayoutOrderRetryTracking adds three columns to payout_order: retryCount (int, default 0), lastError (varchar 2048), lastAttemptDate (timestamp).
  • PayoutOrder.recordPayoutFailure(error) / resetPayoutRetry() helpers manage the lifecycle.
  • BitcoinBasedStrategy.send() increments the counter on RPC failure and resets it on successful broadcast.

3. Operator alert via existing ErrorMonitoringMail pattern

  • After 5 consecutive failures (≈2.5 min at the 30-s payout cron interval), BitcoinBasedStrategy fires a MailType.ERROR_MONITORING with MailContext.PAYOUT and isLiqMail: true to liq@dfx.swiss.
  • suppressRecurring: true + debounce: 3600000 cap the inbox at 1 alert per order-group per hour during long incidents.
  • Same pattern as sendNonRecoverableErrorMail, so no new infrastructure needed.

Scope decisions

  • Amount sanitization is BTC-isolated in PayoutBitcoinService — Firo and other UTXO chains inherit BitcoinRpcClient but have potentially different fee/amount conventions, so the RPC wrapper itself stays untouched.
  • Retry tracking + alerting lives in BitcoinBasedStrategy (base) — semantics-free, every UTXO chain that inherits gets the same protection.

Verification

Local CI parity:

  • npm run lint — clean
  • npm run format:check — clean
  • npm run build — pass
  • npx jest --testPathPattern='bitcoin-fee|payout'109/109 tests green, 7 suites
  • ✅ New tests added (8 in payout-bitcoin.service.spec.ts, 5 in payout-bitcoin-based.strategy.spec.ts)

Post-deploy on DEV:

  1. Watch a real BTC payout pass through end-to-end (cron interval ≈ 30 s).
  2. Force a failure by setting an impossible address in a debug payout → verify alert mail to liq@dfx.swiss after 5 retries.
  3. SELECT id, retryCount, lastError, lastAttemptDate FROM payout_order ORDER BY id DESC LIMIT 10 — counter visible.

Related

…eaker

PR #3729 fixed only the fee_rate field. The amounts field in send/sendmany
RPC is parsed with ParseFixedPoint(decimals=8) and silently rejects values
with more precision (e.g. 0.30000000000000004 from JS float arithmetic) via
"Invalid amount" (error code -3). On 2026-05-29 two BTC BuyCrypto payouts
stalled for 43 min and 8 min in stochastic retry loops before random
floating-point rolls let them through (Orders 106888, 106891).

This change applies three complementary hardenings:

* Quantize every payout amount to 8 decimals in PayoutBitcoinService.sendUtxoToMany
  before the RPC call and reject NaN/Infinity/non-positive amounts with a
  structured InvalidPayoutAmountException, so a bad input fails fast with a
  clear error instead of being lost as an opaque "Invalid amount" inside
  Bitcoin Core. fee_rate is re-rounded to 3 decimals as defense-in-depth.

* Persist a retry counter, last error and last-attempt timestamp on
  payout_order (new migration AddPayoutOrderRetryTracking) so operators can
  see which orders are chronically failing instead of inferring it from log
  pattern matching.

* Fire an operator alert via the existing ErrorMonitoringMail pattern as
  soon as a payout group has failed 5+ times (~2.5 min of silent retry).
  suppressRecurring + 1h debounce keep the inbox quiet on long incidents.

Tracking + alert live in BitcoinBasedStrategy.send() so they cover every
UTXO-derived chain that inherits the base; the amount sanitization is
deliberately scoped to BTC.
* notification options: drop suppressRecurring, keep debounce 1h — the entity
  short-circuits `suppressRecurring || isDebounced` so combining both would
  silence every retry forever instead of giving 1 alert/group/hour
* correlationId sorts ids before joining — Postgres findBy has no ORDER BY,
  so an unordered group would generate a fresh correlationId each retry and
  defeat the debounce
* min → max threshold semantic — a fresh order joining a stuck group should
  not silence the alert; the underlying RPC error fails the whole sendmany
  call, so any single order over threshold is signal worth surfacing
* guard empty orders array — Math.min/max on [] yields ±Infinity which
  would otherwise crash the alert path on orders[0].asset.name
* tighten the quantization test — assert the post-round value is observably
  distinct from the un-rounded raw and lies on the 8-decimal grid
* InvalidPayoutAmountException sets this.name so logs identify the class
* rename recordPayoutFailure parameter `error: string` → `message: string`
  to match what it actually receives
* move trackPayoutFailure + sendRecurringPayoutFailureAlert under the
  existing //*** HELPER METHODS ***// section header for consistency
@TaprootFreak TaprootFreak marked this pull request as ready for review May 29, 2026 20:56
@TaprootFreak TaprootFreak requested a review from davidleomay as a code owner May 29, 2026 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant