Skip to content

fix(store): omit empty statuses from findOldestCreatedAt (false lag-gauge reading)#58

Merged
endrju19 merged 3 commits into
mainfrom
fix-lag-gauge-empty-status
Jun 5, 2026
Merged

fix(store): omit empty statuses from findOldestCreatedAt (false lag-gauge reading)#58
endrju19 merged 3 commits into
mainfrom
fix-lag-gauge-empty-status

Conversation

@endrju19
Copy link
Copy Markdown
Collaborator

@endrju19 endrju19 commented Jun 5, 2026

The bug

PostgresOutboxStore and MysqlOutboxStore both pre-seeded the result map with clock.instant() for every requested status before running the query:

val result = statuses.associateWith { clock.instant() }.toMutableMap()  // ← every status = "now"
// ... SQL overwrites only statuses that have rows ...

A status with no rows stayed at the seeded near-now timestamp instead of being absent from the map.

Why it matters

The lag gauge in okapi-micrometer relies on absence to report 0 for empty statuses:

snapshot.oldest[status]?.let { Duration.between(it, now).toMillis() / 1000.0 } ?: 0.0

With the pre-seed, an idle status (e.g. no PENDING entries) reported a small false lag — the elapsed time between the store query and the gauge publish — instead of 0. A dashboard would show a non-zero lag for a status that has nothing queued.

Why no test caught it

The existing unit test uses a stub whose findOldestCreatedAt does oldest.filterKeys { it in statuses } — i.e. it correctly omits absent statuses. So the stub and the real JDBC adapters disagreed: the test was green while production diverged. This is a stub-more-correct-than-impl trap.

The fix

Start from an empty map so absent statuses are omitted; the gauge's ?: 0.0 fallback then applies.

- val result = statuses.associateWith { clock.instant() }.toMutableMap()
+ val result = mutableMapOf<OutboxStatus, Instant>()

Applied identically to both adapters (the bug was copy-pasted from one to the other).

Test

Added one test to the shared OutboxStoreContractTests — so it runs against real Postgres and real MySQL (Testcontainers), not a stub. It persists PENDING entries only, then asserts findOldestCreatedAt returns a key only for statuses that have rows (PENDING present with the oldest created_at; DELIVERED/FAILED absent). The assertion is on the key set, which catches the structural bug even under a fixed test clock (where a value-only assertion would deceptively pass).

Test plan

  • ./gradlew :okapi-postgres:compileKotlin :okapi-mysql:compileKotlin — clean (no warnings)
  • :okapi-integration-tests:compileTestKotlin — new test compiles
  • ./gradlew ktlintCheck — green
  • ⚠️ The new integration test was not executed locally — Docker/Testcontainers is unavailable in the authoring environment. RED→GREEN was verified by reasoning (buggy code returns all 3 requested keys → key-set assertion fails; fixed code returns only populated keys → passes). Please run ./gradlew :okapi-integration-tests:test with Docker up before merging.

Note — deferred cleanup

clock is now unused by both stores (it existed only for the buggy pre-seed; Kotlin emits no warning for an unused constructor val). Removing it is a breaking constructor change touching ~20 call sites (Spring autoconfig bean factories, benchmarks, ~15 tests) and is intentionally left to a separate follow-up to keep this bugfix focused and easy to review/cherry-pick.

endrju19 added 3 commits June 5, 2026 09:51
…auge reading)

PostgresOutboxStore and MysqlOutboxStore pre-seeded the result map with
clock.instant() for every requested status before querying, then let the SQL
overwrite only the statuses that have rows. A status with no entries therefore
came back with a near-now timestamp instead of being absent.

The lag gauge (okapi-micrometer) relies on absence to report 0 for empty
statuses: `snapshot.oldest[status]?.let { Duration.between(it, now) } ?: 0.0`.
With the pre-seed, an idle status reported a small false lag (the elapsed time
between the store query and the gauge publish) instead of 0.

The unit test never caught this: it uses a stub whose findOldestCreatedAt does
`filterKeys { it in statuses }` (correctly omitting absent statuses), so the
stub and the real adapters disagreed. Fixed by starting from an empty map so
absent statuses are omitted and the gauge's `?: 0.0` fallback applies.

Added a shared OutboxStore contract test (runs against real Postgres AND MySQL
via Testcontainers) asserting that findOldestCreatedAt returns a key only for
statuses that actually have rows.

Note: `clock` is now unused by both stores. Removing it is a breaking
constructor change touching ~20 call sites (autoconfig, benchmarks, tests) and
is left to a separate follow-up to keep this fix focused.
The absent-means-no-backlog contract was implicit and was the root of the
lag-gauge bug just fixed. Stating it on the interface inoculates against a
future implementer re-seeding defaults and silently reintroducing it.
(Code review follow-up.)
Refines the previous KDoc: spells out that the map is status -> createdAt of
that status's oldest entry (min createdAt), not just 'oldest createdAt per
status'. Keeps the absent-means-no-backlog contract note.
@endrju19 endrju19 merged commit b827164 into main Jun 5, 2026
8 checks passed
@endrju19 endrju19 deleted the fix-lag-gauge-empty-status branch June 5, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant