sweepbatcher: harden AddSweep against ctx closure #1130

Open
hieblmi wants to merge 1 commit into lightninglabs:master from hieblmi:fix-racae

Conversation

@hieblmi (Collaborator) commented Apr 28, 2026

This PR fixes a shutdown race in Batcher.AddSweep where caller cancellation during setup could surface backend-specific errors, such as Postgres returning driver: bad connection, instead of a normal cancellation/shutdown error.

AddSweep now re-checks the batcher/caller shutdown state when setup calls fail and returns context.Canceled or ErrBatcherShuttingDown when appropriate. A regression test covers cancellation during sweep fetching so driver-level errors do not escape during normal shutdown.

2026-04-28 08:37:34.697 [INF] SWEEP: [Batch 1] Stopping
--- PASS: TestSweepFetcher (2.92s)
    --- PASS: TestSweepFetcher/mocks (0.51s)
    --- PASS: TestSweepFetcher/loopdb (2.41s)
=== RUN   TestSweepBatcherCloseDuringAdding
=== RUN   TestSweepBatcherCloseDuringAdding/mocks
2026-04-28 08:37:34.813 [INF] SWEEP: Stopping Batcher: run context cancelled.
2026-04-28 08:37:34.816 [INF] SWEEP: Batcher adding sweep group of 1 sweeps with primarySweep 010101000000, presigned=false, fully_confirmed=false
=== RUN   TestSweepBatcherCloseDuringAdding/loopdb
    test_postgres.go:16: Creating new Postgres DB for testing
2026-04-28 08:37:36.946 [INF] SWEEP: Stopping Batcher: run context cancelled.
    sweep_batcher_test.go:3944: 
        	Error Trace:	/home/runner/work/loop/loop/sweepbatcher/sweep_batcher_test.go:3944
        	            				/opt/hostedtoolcache/go/1.26.2/x64/src/sync/waitgroup.go:258
        	            				/opt/hostedtoolcache/go/1.26.2/x64/src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	fetchSweeps failed: failed to load sweep 0000000000000000000000000000000000000000000000000000000000000101:1: failed to fetch sweep data for 010101000000: failed to fetch loop out for 010101000000: driver: bad connection
        	Test:       	TestSweepBatcherCloseDuringAdding/loopdb
--- FAIL: TestSweepBatcherCloseDuringAdding (2.24s)
    --- PASS: TestSweepBatcherCloseDuringAdding/mocks (0.01s)
    --- FAIL: TestSweepBatcherCloseDuringAdding/loopdb (2.23s)

@hieblmi hieblmi requested a review from starius April 28, 2026 10:40
@gemini-code-assist commented

Summary of Changes


This pull request addresses a race condition in the Batcher.AddSweep method where caller cancellation during the setup phase could result in the propagation of backend-specific errors rather than the expected context cancellation or shutdown errors. By introducing a centralized check for shutdown states, the batcher now correctly identifies and returns the appropriate terminal error, improving the robustness of the shutdown process.

Highlights

  • Shutdown Race Fix: Implemented a helper method addSweepExitErr to consistently check for batcher or caller shutdown states during AddSweep operations.
  • Error Handling Improvement: Updated AddSweep to re-verify shutdown status when backend operations fail, ensuring that context cancellation errors are returned instead of misleading driver-level errors.
  • Regression Testing: Added a new test case TestAddSweepReturnsContextErrorOnFetchCancellation to verify that caller cancellation during sweep fetching is correctly propagated.


@gemini-code-assist Bot left a comment

Code Review

This pull request enhances the AddSweep method in the Batcher to ensure that context cancellation or batcher shutdown errors are prioritized when internal operations fail. It introduces a helper method addSweepExitErr to unify this logic and adds a unit test to verify the behavior. Feedback suggests simplifying the repetitive error handling blocks by re-assigning the error variable and streamlining the ctx.Done() case, as the helper function is guaranteed to return an error in that scenario.

Comment thread sweepbatcher/sweep_batcher.go
Comment thread sweepbatcher/sweep_batcher.go Outdated
@hieblmi hieblmi self-assigned this Apr 29, 2026
@hieblmi (Collaborator, Author) commented Apr 29, 2026

@claude review this

@claude Bot commented Apr 29, 2026

Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real shutdown race in Batcher.AddSweep where a context cancellation during one of the setup calls (fetchSweeps, GetSweepStatus, GetParentBatch, MinRelayFee) causes the backend (Postgres) to surface driver-level errors (e.g. driver: bad connection) to the caller instead of a clean context.Canceled or ErrBatcherShuttingDown. The approach — re-checking the batcher/caller shutdown state after each fallible call — is sound and pragmatic.


Issues

1. Dead code in the final select's ctx.Done() arm

case <-ctx.Done():
    err := b.addSweepExitErr(ctx)
    if err != nil {
        return err
    }
    return ctx.Err()   // ← unreachable

Inside the <-ctx.Done() case ctx.Err() is always non-nil, so addSweepExitErr will always return a non-nil value, making the final return ctx.Err() dead code. A simple return b.addSweepExitErr(ctx) (with a nil fallback removed) would be clearer.

2. ensurePresigned path is missing an exit check

The fix covers fetchSweeps, GetSweepStatus, GetParentBatch, and MinRelayFee, but ensurePresigned — which itself makes context-sensitive calls — has no equivalent guard:

if sweep.presigned && !fullyConfirmed {
    err := ensurePresigned(ctx, sweeps, b.presignedHelper, ...)
    if err != nil {
        // no addSweepExitErr check here
        return fmt.Errorf(...)
    }
}

The presigned path is less common, but it carries the same potential to leak backend errors during shutdown.

3. Regression test doesn't run against the real Postgres backend

The new TestAddSweepReturnsContextErrorOnFetchCancellation is a standalone function that uses only NewStoreMock(). The rest of the sweepbatcher test suite uses the runTests helper to run each case against both the mock store and a real Postgres instance:

func runTests(t *testing.T, testFn func(t *testing.T, store testStore, batcherStore testBatcherStore)) {
    t.Run("mocks", ...)
    t.Run("loopdb", ...)  // real Postgres
}

Since the original failure was triggered specifically by Postgres returning driver: bad connection, the regression test ideally should also exercise the real backend to prove the fix holds end-to-end. Is TestSweepBatcherCloseDuringAdding/loopdb (the test that was failing in CI) now expected to pass as-is with this fix?

4. cancelingSweepFetcher discards the context argument (minor)

func (f *cancelingSweepFetcher) FetchSweep(context.Context, lntypes.Hash, wire.OutPoint) (*SweepInfo, error) {

The context is unnamed and discarded. A one-line comment explaining that the mock simulates Postgres returning an error after the caller cancels would improve readability for future maintainers.


Minor observations

  • Repetitive boilerplate: the same 4-line pattern appears four times. A helper wrapShutdownErr could reduce duplication, but not a blocker.
  • Priority of b.quit over ctx.Err() in addSweepExitErr is intentional and consistent with the file — no concern.
  • Test doesn't start batcher.Run(): acceptable for a targeted unit test since the failure happens before the addSweepsChan select.

Summary

The fix is correct and addresses the root cause. Three items worth addressing before merge:

  1. Remove the dead return ctx.Err() line in the ctx.Done() arm of the final select.
  2. Add an addSweepExitErr check after ensurePresigned for consistency.
  3. Consider whether the regression test should use runTests to also cover the Postgres backend.

@hieblmi (Collaborator, Author) commented Apr 29, 2026

@claude review this again

@claude Bot commented Apr 29, 2026

Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR correctly fixes a real shutdown race: when a caller cancels their context during AddSweep while the backing store (e.g. Postgres) is in flight, the store returns a driver-level error (driver: bad connection) that escapes upward instead of the expected context.Canceled or ErrBatcherShuttingDown. The approach — check shutdown/cancellation state whenever a store call fails — is the right fix.


Issues

1. Misleading error wrapping when substituting the exit error (minor-to-medium)

The current pattern for every error site is:

if err != nil {
    exitErr := b.addSweepExitErr(ctx)
    if exitErr != nil {
        err = exitErr
    }
    return fmt.Errorf("someOperation failed: %w", err)
}

When exitErr is non-nil, err is replaced and then still wrapped with the operation-specific prefix. This produces messages like:

  • "fetchSweeps failed: context canceled" — confusing but tolerable
  • "inputs with primarySweep X were not presigned (call PresignSweepsGroup first): context canceled" — this one is actively harmful; an operator seeing it during shutdown would think they forgot to call PresignSweepsGroup

The cleaner pattern (which also removes the need to re-assign err) is an early return:

if err != nil {
    if exitErr := b.addSweepExitErr(ctx); exitErr != nil {
        return exitErr
    }
    return fmt.Errorf("someOperation failed: %w", err)
}

This is also what the final select effectively does — it doesn't wrap, it just returns addSweepExitErr.

2. addSweepExitErr can silently return nil (nit)

The function name implies it returns an exit error, but it returns nil when neither b.quit nor ctx.Err() is set. This is safe because all callers guard with if exitErr != nil, but a name like checkExitErr or shutdownErr might be clearer about its nullable return semantics. Not a blocker.


What's good

  • Correct priority ordering: b.quit (batcher shutdown) is checked before ctx.Err() (caller cancellation) everywhere, which is the right precedence.
  • New case <-ctx.Done() in the final select (line ~887): Previously, a context cancellation between the initial guard and the channel send could block indefinitely. This case closes that gap cleanly.
  • cancelingSweepFetcher mock: The test accurately simulates the race — cancel the context inside FetchSweep before returning a driver-level error — and the assertions (ErrorIs(context.Canceled), NotErrorIs(driver.ErrBadConn)) are correct.
  • runTests harness: Running against both the mock and loopdb backends is exactly the right coverage for this class of bug.

Minor observations

  • Test coverage for the other four error sites: Only fetchSweeps cancellation is covered by the new test. The same race exists at GetSweepStatus, GetParentBatch, MinRelayFee, and ensurePresigned, but the fix pattern is consistent so this is a low-risk gap.
  • TestSweepBatcherCloseDuringAdding/loopdb: The PR description shows this pre-existing test failing before the fix. After the fix, driver.ErrBadConn gets masked by context.Canceled, which is already an accepted exit condition in that test (errors.Is(err, context.Canceled) → break). It would be worth mentioning in the PR description that this pre-existing test is expected to pass after the fix, for traceability.

Summary

The fix is correct and addresses a real problem. The main actionable feedback is to change the error-substitution pattern to return exitErr directly rather than wrapping it with the operation-specific message — the ensurePresigned case especially would produce very confusing logs in production if left as-is. Everything else is minor.

@hieblmi (Collaborator, Author) commented Apr 29, 2026

@claude review this again

@claude Bot commented Apr 29, 2026

Code Review: sweepbatcher: harden AddSweep against ctx closure

Overview

This PR fixes a real, observable shutdown race in Batcher.AddSweep where a context cancellation while a setup call is in-flight (e.g. fetchSweeps executing a Postgres query) causes driver: bad connection to surface to callers instead of the expected context.Canceled or ErrBatcherShuttingDown. The approach — introducing a central helper that re-checks the shutdown state after each fallible call — is sound and pragmatic.


Correctness

The coverage of error sites is complete:

  • Before fetchSweeps — covered (replaces the old bare b.quit select)
  • After fetchSweeps failure — covered
  • After GetSweepStatus failure — covered
  • After GetParentBatch failure — covered
  • After MinRelayFee failure — covered
  • After ensurePresigned failure — covered
  • Final channel select — covered (new <-ctx.Done() arm)

Priority order in addSweepExitErrIfAny is correct: checking b.quit first means that a simultaneous batcher-shutdown + ctx-cancel correctly returns ErrBatcherShuttingDown, the stronger signal.

Minor observation — <-ctx.Done() arm return value is always non-nil:

case <-ctx.Done():
    return b.addSweepExitErrIfAny(ctx)

Because ctx is already done at this point, ctx.Err() is guaranteed non-nil. The helper will always return non-nil here, so the call is correct. Just worth noting for readers: there is no path through addSweepExitErrIfAny that returns nil when reached from a ctx.Done() arm.


Possible Nit: unchecked race between successful operations

The guards are only installed after failing calls. If a context cancellation happens between two successful calls (e.g. after GetSweepStatus succeeds but before we reach the final select), the code continues unimpeded and the cancellation is only caught at the final select. This is unlikely to matter in practice since the final select does handle ctx.Done() now — it's more of a theoretical gap than a real bug. Calling addSweepExitErrIfAny after successful calls too would be overly defensive and noisy.


Test Coverage

The new test TestAddSweepReturnsContextErrorOnFetchCancellation directly reproduces the reported failure scenario:

func (f *cancelingSweepFetcher) FetchSweep(...) (*SweepInfo, error) {
    f.cancel()
    return nil, driver.ErrBadConn
}

This is clean and correctly asserts both:

  • require.ErrorIs(t, err, context.Canceled) — correct error is returned
  • require.NotErrorIs(t, err, driver.ErrBadConn) — driver error does not escape

The test follows project conventions — lowercase testAddSweepReturnsContextErrorOnFetchCancellation wired through runTests, Guard(t) deferred, cancelingSweepFetcher is a minimal mock.

One improvement worth considering: The other four guarded error paths (GetSweepStatus, GetParentBatch, MinRelayFee, ensurePresigned) are not tested. Given they all follow the identical guard pattern and the main regression (Postgres fetchSweeps) is covered, this is acceptable — but a brief comment in the test noting which path it exercises and that the same pattern applies to all others would help future readers.


Style / Minor

  • addSweepExitErrIfAny is a clear name; the godoc comment ("returns nil if AddSweep should continue") accurately captures the semantics.
  • The cancelingSweepFetcher ignores the context.Context arg in FetchSweep — that's intentional for the mock and fine.
  • Using _ testStore in testAddSweepReturnsContextErrorOnFetchCancellation correctly signals the store is unused at the sweep-fetcher failure point.

Summary

The fix is correct, well-scoped, and directly addresses the production-observed bug. The test provides meaningful regression coverage. No blocking issues. The one optional improvement is adding a comment to the test about which code path it covers.

Approved with the optional suggestion above.
