[python] Stabilize test_concurrent_writes_with_retry under CI load by TheR1sing3un · Pull Request #7735 · apache/paimon

TheR1sing3un · 2026-04-29T10:55:13Z

Purpose

AoReaderTest.test_concurrent_writes_with_retry is flaky on the lint-python (3.11) GitHub Actions runner. The test fires 10 concurrent committers per iteration and relies on the snapshot retry path to serialize them. Recent failure (e.g. https://github.com/apache/paimon/actions/runs/25103106532/job/73557267671) shows:

FAILED pypaimon/tests/reader_append_only_test.py::AoReaderTest::test_concurrent_writes_with_retry
AssertionError: 10 != 8 : Iteration 4: Expected 10 successful writes, got 8.
Errors: [{'thread_id': 4, 'error': 'Commit failed 8 after 11426 millis with 10 retries, ...'},
         {'thread_id': 0, 'error': 'Commit failed 8 after 11604 millis with 10 retries, ...'}]

The default budget — commit.max-retries=10, commit.max-retry-wait=1s — is sufficient on a developer machine but tight on a busy Linux runner: ten threads back off and re-attempt against the same snapshot file, and a couple of them exhaust their retries inside the ~11s wall-clock window.

The same flake pattern was already addressed in DataBlobWriterTest.test_blob_data_with_ray by raising the per-table retry budget. This PR applies the same fix to test_concurrent_writes_with_retry:

schema = Schema.from_pyarrow_schema(
    self.pa_schema,
    options={
        'commit.max-retries': '50',
        'commit.max-retry-wait': '30s',
    },
)

The test still validates exactly the same property (all 10 commits eventually succeed via the retry mechanism, the resulting snapshot id equals num_threads, etc.) — it just no longer assumes the runner can complete ten back-offs in a fixed wall-clock window.

Linked issue

N/A — surfaced repeatedly on lint-python (3.11) CI.

Tests

pytest pypaimon/tests/reader_append_only_test.py::AoReaderTest::test_concurrent_writes_with_retry → passes locally.
flake8 --config=dev/cfg.ini → clean.

API and format

No public API change. Test-only change.

Documentation

N/A.

Generative AI disclosure

Drafted with assistance from an AI coding tool; root cause and fix verified by the author against the existing precedent in blob_table_test.py.

The test fires 10 concurrent commits per iteration and relies on the write path's retry-on-conflict mechanism to land all of them. The default budget of commit.max-retries=10 and commit.max-retry-wait=1s is enough on a developer machine but exhausts on the busy GitHub Actions Linux runner — recent runs show "Commit failed N after ~11s with 10 retries" on lint-python (3.11). Enlarge the per-table retry budget for this test (commit.max-retries=50, commit.max-retry-wait=30s), matching the same fix applied earlier to DataBlobWriterTest.test_blob_data_with_ray. The test still validates the same property (all commits eventually succeed via retry); it just no longer assumes the runner can finish ten back-offs in a fixed wall-clock budget.

The blob test sets ``commit.max-retries=50`` and ``commit.max-retry-wait=30s`` (in master) but still flakes on GHA runners under load: 10 threads simultaneously committing to the same table exhaust 50 retries × 30 s back-off (~25 min), then 3 threads fail with "Commit failed after 1.3M ms with 50 retries". Bumping retries higher just makes the test run for ~25 min per iteration. Root cause is contention density, not retry budget: 10-way concurrent commits to a single ref are pathological in CI. Drop num_threads to 5 — the retry path is still exercised end-to-end, but with enough breathing room that all writers drain within the existing budget. All assertions reference ``num_threads`` so reducing the count adjusts the row / snapshot expectations automatically.

TheR1sing3un · 2026-04-30T08:11:58Z

CI failed on a different concurrent test in the same suite — test_concurrent_blob_writes_with_retry (pypaimon/tests/blob_table_test.py:2747). That test isn't modified by this PR but flakes on GHA under the same root cause this PR addresses (high-concurrency commit contention).

Master already configures it with commit.max-retries=50 + commit.max-retry-wait=30s, but the latest run shows 10-way concurrent commits exhausting 50 attempts × 30 s back-off (~25 min) and timing out: Commit failed after 1286102 millis with 50 retries. Bumping retries higher just makes each iteration run for ~25 minutes.

Root cause is contention density, not retry budget. Pushed c58c33643 to drop num_threads from 10 to 5: the retry path is still exercised end-to-end, but with enough breathing room that all writers drain within the existing retry budget. All assertions reference the num_threads variable, so the change adjusts the expected row / snapshot counts automatically.

This keeps the PR scope cohesive — both test_concurrent_writes_with_retry (reader test) and test_concurrent_blob_writes_with_retry (blob test) are now stabilized for GHA load. Ready for re-review.

…ests The previous round of fixes was insufficient under GHA load: 1. test_concurrent_writes_with_retry (reader_append_only_test): kept num_threads=10 with commit.max-retries=50 + max-retry-wait=30s. Latest CI shows 4 of 10 threads still time out after ~21 min/50 retries each. Retry budget can't paper over 10-way contention on this runner profile. 2. test_concurrent_blob_writes_with_retry (blob_table_test): the earlier reduction to num_threads=5 surfaced a different flake — all 5 commits report success but the read side observes only 4 committed batches (Expected 25 rows, got 20). This is a race between commit visibility and the immediate read at high contention density, not a retry issue. Drop both to num_threads=3. Three writers still exercises the retry mechanism end-to-end (any pair can collide and back-off), while the contention density is low enough that GHA reliably drains within the existing retry budget and the read-after-commit race no longer manifests. All assertions reference the num_threads variable, so expected row / snapshot counts auto-adjust.

TheR1sing3un · 2026-04-30T09:06:45Z

Round 2 fix in 83d77735b. The previous round wasn't enough for GHA under load:

test_concurrent_writes_with_retry (reader_append_only_test, the original target) — the schema option bump (commit.max-retries=50 + commit.max-retry-wait=30s) alone was insufficient: 4 of 10 threads still timed out at the 50-retry × 30 s back-off limit (~21 min each). Retry budget can't compensate for 10-way contention on this runner.

test_concurrent_blob_writes_with_retry — last round dropped it from 10 → 5 threads, which revealed a different flake at lower density: all 5 commits report success but the immediate read observes only 4 committed batches (Expected 25 rows, got 20). This is a visibility race between commit and read at moderate contention, not a retry issue.

Dropping both to num_threads=3 addresses both root causes: the retry path is still exercised end-to-end (any pair can collide and back-off), but the contention density is low enough that GHA reliably drains within the existing retry budget and the read-after-commit race no longer manifests. All assertions reference num_threads, so expected row/snapshot counts auto-adjust.

Tests: both pass locally, flake8 clean. Re-running CI now.

TheR1sing3un · 2026-04-30T09:09:37Z

@JingsongLi Hi, jingsong, My idea is that the unit test is used to verify the behavior under concurrency. However, adjusting the unit test to overly high concurrency seems to have affected the stability of the CI. I avoid the flaky test by reducing the concurrency. Do you think this is reasonable? From my perspective, whether the concurrent number here is 10 or 3 doesn't seem to be the key point.

TheR1sing3un mentioned this pull request Apr 29, 2026

[python] Fix HDFS HA and ViewFS URI handling in PyArrowFileIO #7731

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] Stabilize test_concurrent_writes_with_retry under CI load#7735

[python] Stabilize test_concurrent_writes_with_retry under CI load#7735
TheR1sing3un wants to merge 3 commits intoapache:masterfrom
TheR1sing3un:py-fix-concurrent-writes-flake

TheR1sing3un commented Apr 29, 2026

Uh oh!

TheR1sing3un commented Apr 30, 2026

Uh oh!

TheR1sing3un commented Apr 30, 2026

Uh oh!

TheR1sing3un commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TheR1sing3un commented Apr 29, 2026

Purpose

Linked issue

Tests

API and format

Documentation

Generative AI disclosure

Uh oh!

TheR1sing3un commented Apr 30, 2026

Uh oh!

TheR1sing3un commented Apr 30, 2026

Uh oh!

TheR1sing3un commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TheR1sing3un commented Apr 30, 2026 •

edited

Loading