fix: Un-skipping tests in sagemaker-serve #5807

Open
mujtaba1747 wants to merge 5 commits into aws:master from mujtaba1747:fix-model-customization-tests

mujtaba1747 (Collaborator) commented Apr 29, 2026

Issue #, if available: #5802

Description of changes: Fix skipped tests in sagemaker-serve

test_model_customization_deployment.py

Problem:
cleanup_e2e_endpoints is a session-scoped fixture, but under pytest-xdist each worker gets its own session. This means multiple workers independently call Endpoint.get_all() and delete every e2e-* endpoint — including endpoints that another worker just created and is actively using.

Observed failure: a test running on worker gw2 deploys e2e-1777420521-4224 and its base inference component goes InService; a different worker's cleanup then deletes the endpoint. The test on gw2 fails when it calls Endpoint.get() on the endpoint it just created.

The teardown path had the same problem — it ran the same Endpoint.get_all() + delete loop after tests, risking a second round of collisions.
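
The collision described above can be reproduced without any AWS calls. The following is a minimal sketch using an in-memory stand-in for the endpoint service; `FakeEndpointRegistry` and `simulate_race` are hypothetical names for illustration and are not part of the SageMaker SDK or of this PR.

```python
class FakeEndpointRegistry:
    """In-memory stand-in for the endpoint service (hypothetical, not the SageMaker API)."""

    def __init__(self):
        self._names = set()

    def create(self, name):
        self._names.add(name)

    def get(self, name):
        if name not in self._names:
            raise KeyError(f"endpoint {name!r} not found")
        return name

    def get_all(self):
        return sorted(self._names)

    def delete(self, name):
        self._names.discard(name)


def cleanup_e2e_endpoints(registry):
    """Mimics the per-worker session fixture: delete every e2e-* endpoint."""
    for name in registry.get_all():
        if name.startswith("e2e-"):
            registry.delete(name)


def simulate_race():
    registry = FakeEndpointRegistry()

    # Worker gw2 starts its session: cleanup finds nothing, then its test deploys.
    cleanup_e2e_endpoints(registry)
    registry.create("e2e-1777420521-4224")

    # Another worker starts its own session later and runs the same cleanup.
    cleanup_e2e_endpoints(registry)

    # Back on gw2, the test looks up the endpoint it just created.
    try:
        registry.get("e2e-1777420521-4224")
        return "found"
    except KeyError:
        return "missing"
```

In this sketch `simulate_race()` returns `"missing"`: the second cleanup cannot tell a stale endpoint from one that another worker is actively using, because both match the `e2e-` prefix.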

Changes:

- Use filelock to coordinate cleanup across xdist workers. The first worker to acquire the lock performs the e2e-* endpoint cleanup and writes a sentinel file; remaining workers see the sentinel and skip.
- Remove the post-yield (after-test) cleanup entirely. It was the main source of the race condition, deleting endpoints that other workers were still using.
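
For readers unfamiliar with the pattern, the lock-plus-sentinel coordination could look roughly like the sketch below. It is not the code from this PR: the helper name `run_cleanup_once`, the sentinel file name, and the `shared_dir` parameter are assumptions chosen for illustration. Only `filelock.FileLock` used as a context manager is a real library API.

```python
from pathlib import Path
from typing import Callable

from filelock import FileLock


def run_cleanup_once(shared_dir: Path, cleanup: Callable[[], None]) -> bool:
    """Run `cleanup` in at most one process sharing `shared_dir`.

    Returns True if this caller performed the cleanup, False if it skipped.
    The sentinel file name below is a hypothetical choice.
    """
    sentinel = shared_dir / "e2e_cleanup.done"

    # The lock serializes workers, so only one at a time checks the sentinel.
    with FileLock(str(sentinel) + ".lock"):
        if sentinel.exists():
            # Another worker already cleaned up during this test run.
            return False
        cleanup()
        # Written while still holding the lock, so later workers see it.
        sentinel.write_text("done")
        return True
```

In a session-scoped fixture, `shared_dir` would need to be a directory that all xdist workers see, for example the parent of `tmp_path_factory.getbasetemp()`, as the pytest-xdist documentation suggests for run-once session fixtures. The fixture would then yield without any post-yield deletion, matching the second change above.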


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mujtaba1747 changed the title from "fix: Use filelock to synchronize endpoint cleanup" to "fix: Un-skipping tests in sagemaker-serve" on Apr 29, 2026
mujtaba1747 force-pushed the fix-model-customization-tests branch from 2ce2819 to fb43456 on April 30, 2026 01:20
mollyheamazon previously approved these changes on Apr 30, 2026
mujtaba1747 deployed to auto-approve with GitHub Actions on May 1, 2026 18:04 (Active)