Protect base venv cache save against cancellation#36
Open
TheJulianJES wants to merge 2 commits into
Open
Conversation
The `prepare-base` matrix saves the base venv cache per Python version. With the default `fail-fast: true`, a failing leg (e.g. a prerelease that can't resolve dependencies) cancels the sibling legs. If a sibling is mid-way through `actions/cache/save`, the cache key gets reserved but the upload never completes, leaving it stuck and unrestorable on later runs. Disable `fail-fast` so a failing leg no longer cancels its siblings, letting each base venv cache save run to completion.
Replace the previous `fail-fast: false` approach with a more targeted fix that mirrors home-assistant/core (PR #168310). `fail-fast: false` only stops matrix-internal cancellation and forces every leg to run to completion. Instead, guard the cache save step itself with `always()` so an in-flight `actions/cache/save` survives any cancellation source -- `fail-fast` from a sibling leg or a concurrency cancel -- and finishes its upload rather than leaving the cache key reserved but unpopulated. The default `fail-fast` behavior is restored. Gate the save on the `create-venv` step's success (via a new `id`) so a failed or cancelled venv build is never cached.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Proposed change
This fixes an issue where a cache-saving action could be cancelled while it's running, leaving a reserved cache key with nothing cached. This means that future runs can no longer reserve that cache key for potentially up to 24 hours then.
This issue is fixed by replacing the
ifwithalways(), which means this step always runs and cannot be cancelled whatsoever. We only want the second part of that, so we add another condition to that to make sure the previous step ran successfully. This also means we no longer need thecache-hit != truecondition, as that's transitively enforced by checking for success of the previous step.Additional information
This addresses the failure we saw in zigpy/zha#774 (comment), workflow attempt 1.
Apparently the Home Assistant Core repo also ran into the same issue a few months ago. See these PRs:
AI summary
AI explanation with more details (CLICK TO OPEN)
Problem
The
prepare-basejob runs a matrix over Python versions and saves a base virtual environment to the cache so later jobs (pre-commit,pytest,coverage) can restore it instead of rebuilding dependencies.With the matrix default
fail-fast: true, a failing leg — for example a prerelease Python that can't yet resolve dependencies — cancels its sibling legs. If a sibling is part-way throughactions/cache/savewhen that cancellation arrives, the runner tears the step down mid-upload. The cache key has already been reserved but never gets populated, leaving it stuck: later runs can neither restore it (fail-on-cache-misstrips) nor re-save to the same key.The goal: once the cache save step starts, it should always run to completion.
Approaches
This PR is split into two commits so reviewers can see both the discarded and the chosen approach. It is intended to be squash-merged, so only the final state below lands on
main.1.
fail-fast: false(first commit — reverted)Disabling
fail-faststops a failing leg from cancelling its siblings, so each save runs to completion. It works, but it is the weaker fix:concurrencygroup cancel could still interrupt a save mid-upload.2.
always()guard on the save step (second commit — kept)Instead of preventing the cancellation, protect the one step that must not be interrupted. On cancellation, GitHub re-evaluates the
if:of unfinished steps and exempts any that evaluate true — soalways()keeps an in-flightactions/cache/savefrom being torn down, and it finishes its upload regardless of the cancellation source (siblingfail-fastor aconcurrencycancel).The
steps.create-venv.outcome == 'success'guard is essential:always()also evaluates true on failure, so without it a failed or cancelled venv build would be cached and then restored as a broken environment by downstream jobs. Defaultfail-fastbehavior is restored.This mirrors the canonical fix in home-assistant/core#168310 ("Protect CI cache save against cancellation").
Notes
needthe wholeprepare-basematrix to succeed. The value is cache hygiene: the healthy legs' caches land fully populated, so the next run gets a clean restore instead of hitting a reserved-but-empty key.pre-commitcache save isn't a matrix and only runs afterprepare-basesucceeds, so it's never in this race.