Skip to content

eventstore: fix checkpoint update race#5019

Open
lidezhu wants to merge 1 commit intomasterfrom
ldz/optimize-event-store001
Open

eventstore: fix checkpoint update race#5019
lidezhu wants to merge 1 commit intomasterfrom
ldz/optimize-event-store001

Conversation

@lidezhu
Copy link
Copy Markdown
Collaborator

@lidezhu lidezhu commented May 9, 2026

What problem does this PR solve?

Issue Number: close #4992

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved system stability by changing error handling for stale state updates; now gracefully warns instead of terminating, allowing continued operation under edge cases.
    • Enhanced checkpoint advancement mechanism with improved atomicity and thread safety.
  • Changes

    • Refined event iterator key boundary behavior for improved accuracy in edge cases.

Review Change Stack

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 9, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lidezhu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

Walkthrough

Subscription checkpoint updates are made thread-safe via atomic.Uint64 with compare-and-swap logic. Stale updates now warn and ignore instead of panicking. Iterator boundary filtering changes from inclusive to exclusive for EndKey. Test assertions align with comparable key representation.

Changes

Subscription Checkpoint Race Condition and Iterator Boundary Fix

Layer / File(s) Summary
Atomic Checkpoint Field
logservice/eventstore/event_store.go
dispatcherStat.checkpointTs is converted from uint64 to atomic.Uint64 for safe concurrent updates.
Dispatcher Initialization
logservice/eventstore/event_store.go
RegisterDispatcher uses Store(startTs) to initialize the new atomic checkpoint field.
Checkpoint Read/Write Operations
logservice/eventstore/event_store.go
UpdateDispatcherCheckpointTs reads dispatcher checkpoint via Load() and updates via Store().
Subscription Checkpoint CAS Loop
logservice/eventstore/event_store.go
updateSubStatCheckpoint is rewritten to advance checkpoint via compare-and-swap, enqueue GC item, and emit subscription update events on success. Removes previous panic behavior.
Stale Update Handling
logservice/eventstore/event_store.go
uploadStatePeriodically handles SubscriptionChangeTypeUpdate by warning and ignoring stale updates (where ts values would move backward) instead of panicking.
Iterator Boundary Filtering
logservice/eventstore/event_store.go
eventStoreIter.Next changes EndKey boundary check from inclusive (\<=) to exclusive (\<).
Test Span Bounds
logservice/eventstore/event_store_test.go
TestEventStoreIter_NextWithFiltering updates iterator span initialization to use common.ToComparableKey() for boundary representation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Atomics dance in sync,
CAS loops spin, no stale link,
Boundaries shift from wide to tight,
Tests align, all checks now right!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description includes the required issue reference (close #4992) but is missing detailed explanations of what changed and how it works, as well as a release note. Complete the 'What is changed and how it works?' section with technical details, and provide a concrete release note instead of leaving the template placeholder.
Out of Scope Changes check ⚠️ Warning The iterator span filtering boundary change from inclusive (<=) to exclusive (<) appears unrelated to the checkpoint race fix and may represent scope creep. Clarify whether the iterator span filtering change is necessary for the race fix, or separate it into a distinct PR focused on span filtering corrections.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'eventstore: fix checkpoint update race' clearly and concisely summarizes the main objective of the changeset, which addresses a race condition in checkpoint updates.
Linked Issues check ✅ Passed The changes implement atomic operations for checkpoint handling and use CAS loops to prevent out-of-order updates, directly addressing the race condition described in issue #4992.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ldz/optimize-event-store001

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 9, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the event store by implementing SST file filtering based on transaction commit timestamp ranges, allowing Pebble to skip irrelevant files during scans. It refactors key encoding and decoding logic, introduces shared Pebble caches with proper lifecycle management, and improves concurrency by using atomic types for checkpoint timestamps. Additionally, it refines the handling of subscription state updates to ensure monotonic advancement. Feedback was provided to improve the clarity of a log message regarding stale state updates.

log.Panic("should not happen",
subState := tableState.Subscriptions[targetIndex]
if change.CheckpointTs < subState.CheckpointTs || change.ResolvedTs < subState.ResolvedTs {
log.Warn("ignore stale subscription state update",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message "ignore stale subscription state update" is slightly misleading because the code does not ignore the update. It proceeds to update any timestamps that are advancing monotonically. A more accurate log message would be "received stale subscription state update", which correctly indicates that a stale value was detected while still processing the valid parts of the update. This would be clearer for anyone debugging potential message ordering issues.

Suggested change
log.Warn("ignore stale subscription state update",
log.Warn("received stale subscription state update",

@lidezhu lidezhu force-pushed the ldz/optimize-event-store001 branch from 926bb42 to 4bf750b Compare May 9, 2026 15:50
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 9, 2026
@lidezhu lidezhu marked this pull request as ready for review May 9, 2026 15:54
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@logservice/eventstore/event_store.go`:
- Around line 715-716: The dispatcher checkpoint write at
dispatcherStat.checkpointTs.Store(checkpointTs) can regress a newer
per-dispatcher value; change the update to only advance the per-dispatcher
checkpoint if checkpointTs is greater than the currently stored value by reading
dispatcherStat.checkpointTs (via its Load or atomic read) and using an atomic
compare-and-swap loop to store the new checkpoint only when it is strictly
larger (i.e., retry CAS until success or current >= checkpointTs) so
dispatcherStat.checkpointTs never moves backward and min recomputation reflects
true progress.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd2ec042-f448-4c1a-ba09-be650be4b428

📥 Commits

Reviewing files that changed from the base of the PR and between fe7febb and 4bf750b.

📒 Files selected for processing (2)
  • logservice/eventstore/event_store.go
  • logservice/eventstore/event_store_test.go

Comment on lines +715 to 716
dispatcherStat.checkpointTs.Store(checkpointTs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the per-dispatcher checkpoint monotonic too.

Line 715 can still overwrite a newer dispatcher checkpoint with a stale one. After that, the min recomputation can stay pinned below the real progress, so the subscription checkpoint and GC stop advancing until this dispatcher reports again.

Suggested fix
-	dispatcherStat.checkpointTs.Store(checkpointTs)
+	util.CompareAndMonotonicIncrease(&dispatcherStat.checkpointTs, checkpointTs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dispatcherStat.checkpointTs.Store(checkpointTs)
util.CompareAndMonotonicIncrease(&dispatcherStat.checkpointTs, checkpointTs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@logservice/eventstore/event_store.go` around lines 715 - 716, The dispatcher
checkpoint write at dispatcherStat.checkpointTs.Store(checkpointTs) can regress
a newer per-dispatcher value; change the update to only advance the
per-dispatcher checkpoint if checkpointTs is greater than the currently stored
value by reading dispatcherStat.checkpointTs (via its Load or atomic read) and
using an atomic compare-and-swap loop to store the new checkpoint only when it
is strictly larger (i.e., retry CAS until success or current >= checkpointTs) so
dispatcherStat.checkpointTs never moves backward and min recomputation reflects
true progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Event store has a race around subscription checkpoint updates

1 participant