Skip to content

TestJumpToDateEndpoint: Guard the parallel boundary subtests against millisecond collisions.#872

Open
jevolk wants to merge 1 commit intomatrix-org:mainfrom
matrix-construct:complement-868-ms-boundary-guard
Open

TestJumpToDateEndpoint: Guard the parallel boundary subtests against millisecond collisions.#872
jevolk wants to merge 1 commit intomatrix-org:mainfrom
matrix-construct:complement-868-ms-boundary-guard

Conversation

@jevolk
Copy link
Copy Markdown
Contributor

@jevolk jevolk commented May 6, 2026

Closes #868.

The two parallel boundary subtests should_find_event_after_given_timestamp and should_find_nothing_before_the_earliest_timestamp flake against any homeserver fast enough to handle a createRoom round-trip inside a single millisecond. The race is in the test, not the implementation: /timestamp_to_event is millisecond-granular by spec and resolves the boundary inclusively (forward picks the earliest event with ts >= query, backward picks the latest with ts <= query), so when a time.Now() sample collides with an origin_server_ts the homeserver just stamped, the inclusive match returns an event the subtest does not want.

should_find_event_after_given_timestamp

createTestRoom reads time.Now() immediately after MustCreateRoom returns. The trailing state event of the public_chat preset's createRoom batch (typically m.room.guest_access, after create, member/join, power_levels, join_rules, history_visibility) lands in the same millisecond as that sample on a fast deploy. The forward search anchored on timeBeforeEventA then iterates events with ts >= sample from the earliest, hits the trailing state event before Message A, and the assertion fails with the wrong event_id.

should_find_nothing_before_the_earliest_timestamp

The subtest samples timeBeforeRoomCreation := time.Now() and immediately calls createTestRoom. The m.room.create event the homeserver writes in response is stamped at the same millisecond. The backward search at that boundary inclusively returns the create event when the test wants 404.

Solution

The race direction is different for the two subtests, so the placement of the guard differs:

  • For the forward race in createTestRoom, the offending event has already been stamped before time.Now() is called. The sleep goes before the sample (between MustCreateRoom returning and timeBeforeEventA := time.Now()), advancing the sample past the trailing state event's millisecond.
  • For the backward race in the failing subtest, the offending event is stamped after time.Now(). The sleep goes after the sample (between timeBeforeRoomCreation := time.Now() and createTestRoom), pushing m.room.create into a strictly later millisecond.

A 2 ms pause is long enough to step past the 1 ms clock resolution exposed on every supported platform and short enough to be invisible against the normal homeserver round-trip. The constant tsBoundaryGuard and a paragraph above it document the invariant so neither sleep gets removed as cosmetic later.

The wire contract is unchanged: /timestamp_to_event continues to be millisecond-granular and continues to resolve the boundary inclusively. Only the test pacing changes.

Empirical evidence

A local soak against tuwunel (bench build, complement_count=N, complement_parallel=1, single docker invocation per batch, fresh OldDeploy per iteration):

Subtest Unpatched (N=100) Patched (N=100)
should_find_event_after_given_timestamp 23 fail / 77 pass 0 fail / 100 pass
should_find_nothing_before_the_earliest_timestamp 66 fail / 34 pass 0 fail / 100 pass

89 flakes vs 0 across the 200 paired trials. The unpatched tester image is the prior tuwunel-changes HEAD (9e8d447); the patched tester image is the same Complement plus this commit. The testee binary, the homeserver deployment shape (hs1 + hs2 + AS bridge), and the test invocation flags are identical between batches.

The same flake fired three times in matrix-construct/tuwunel CI in the days before we routed the two subtests onto a complement_skip allowlist:

The patch only adjusts tests/room_timestamp_to_event_test.go. The should_not_be_able_to_query_a_*_room_you_are_not_a_member_of subtests sample the same timeBeforeRoomCreation pattern but assert HTTP 403, which the homeserver returns regardless of which event a backward search would surface, so the boundary does not matter there. The parallel/federation/when_looking_backwards_before_the_room_was_created,_should_be_able_to_find_event_that_was_imported subtest has the same pattern but its dominant failure mode is federation backfill, not the boundary; the guard is harmless if added later but is out of scope here.

The matching guard is backported to the matrix-construct fork at matrix-construct/complement@28df927 on its tuwunel-changes branch and is wired into matrix-construct/tuwunel CI (complement_ref in docker/Dockerfile.complement is bumped to that hash, and the corresponding complement_skip entries are dropped) so the patched path is exercised against tuwunel ahead of merge here.

Pull Request Checklist

…t millisecond collisions.

Both subtests sample `time.Now()` and immediately probe a homeserver
that stamps events using its own millisecond clock; when the two
readings land in the same millisecond, `/timestamp_to_event`'s
inclusive boundary semantics return an event the subtest does not
want. `should_find_event_after_given_timestamp` races the trailing
createRoom state event ahead of the message that a forward search is
expected to find; `should_find_nothing_before_the_earliest_timestamp`
races `m.room.create` against a backward search expected to find
nothing. Pause around the relevant samples so the next homeserver
stamp lands in a strictly later millisecond. Both flakes have been
observed across multiple homeserver implementations on otherwise
green runs.

Closes matrix-org#868.

Signed-off-by: Jason Volk <jason@zemos.net>
@jevolk jevolk requested review from a team as code owners May 6, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestJumpToDateEndpoint/parallel boundary subtests flake at millisecond granularity

1 participant