Skip to content

Protect base venv cache save against cancellation#36

Open
TheJulianJES wants to merge 2 commits into
zigpy:mainfrom
TheJulianJES:tjj/ci-cache-save-cancellation
Open

Protect base venv cache save against cancellation#36
TheJulianJES wants to merge 2 commits into
zigpy:mainfrom
TheJulianJES:tjj/ci-cache-save-cancellation

Conversation

@TheJulianJES
Copy link
Copy Markdown
Contributor

Proposed change

This fixes an issue where a cache-saving action could be cancelled while it's running, leaving a reserved cache key with nothing cached. This means that future runs can no longer reserve that cache key for potentially up to 24 hours then.

This issue is fixed by replacing the if with always(), which means this step always runs and cannot be cancelled whatsoever. We only want the second part of that, so we add another condition to that to make sure the previous step ran successfully. This also means we no longer need the cache-hit != true condition, as that's transitively enforced by checking for success of the previous step.

Additional information

This addresses the failure we saw in zigpy/zha#774 (comment), workflow attempt 1.

Apparently the Home Assistant Core repo also ran into the same issue a few months ago. See these PRs:

AI summary

AI explanation with more details (CLICK TO OPEN)

Problem

The prepare-base job runs a matrix over Python versions and saves a base virtual environment to the cache so later jobs (pre-commit, pytest, coverage) can restore it instead of rebuilding dependencies.

With the matrix default fail-fast: true, a failing leg — for example a prerelease Python that can't yet resolve dependencies — cancels its sibling legs. If a sibling is part-way through actions/cache/save when that cancellation arrives, the runner tears the step down mid-upload. The cache key has already been reserved but never gets populated, leaving it stuck: later runs can neither restore it (fail-on-cache-miss trips) nor re-save to the same key.

The goal: once the cache save step starts, it should always run to completion.

Approaches

This PR is split into two commits so reviewers can see both the discarded and the chosen approach. It is intended to be squash-merged, so only the final state below lands on main.

1. fail-fast: false (first commit — reverted)

Disabling fail-fast stops a failing leg from cancelling its siblings, so each save runs to completion. It works, but it is the weaker fix:

  • It only addresses matrix-internal cancellation. A manual cancel or a concurrency group cancel could still interrupt a save mid-upload.
  • It forces every leg to run to completion even when the run is already doomed, wasting CI minutes.

2. always() guard on the save step (second commit — kept)

Instead of preventing the cancellation, protect the one step that must not be interrupted. On cancellation, GitHub re-evaluates the if: of unfinished steps and exempts any that evaluate true — so always() keeps an in-flight actions/cache/save from being torn down, and it finishes its upload regardless of the cancellation source (sibling fail-fast or a concurrency cancel).

- name: Create Python virtual environment
  id: create-venv
  if: steps.cache-venv.outputs.cache-hit != 'true'
  ...
- name: Cache base Python virtual environment
  if: always() && steps.create-venv.outcome == 'success'
  uses: actions/cache/save@v5

The steps.create-venv.outcome == 'success' guard is essential: always() also evaluates true on failure, so without it a failed or cancelled venv build would be cached and then restored as a broken environment by downstream jobs. Default fail-fast behavior is restored.

This mirrors the canonical fix in home-assistant/core#168310 ("Protect CI cache save against cancellation").

Notes

  • Neither approach makes a run with a genuinely broken Python version pass — downstream jobs need the whole prepare-base matrix to succeed. The value is cache hygiene: the healthy legs' caches land fully populated, so the next run gets a clean restore instead of hitting a reserved-but-empty key.
  • Only the base-venv save needs this. The pre-commit cache save isn't a matrix and only runs after prepare-base succeeds, so it's never in this race.

The `prepare-base` matrix saves the base venv cache per Python version.
With the default `fail-fast: true`, a failing leg (e.g. a prerelease
that can't resolve dependencies) cancels the sibling legs. If a sibling
is mid-way through `actions/cache/save`, the cache key gets reserved
but the upload never completes, leaving it stuck and unrestorable on
later runs.

Disable `fail-fast` so a failing leg no longer cancels its siblings,
letting each base venv cache save run to completion.
Replace the previous `fail-fast: false` approach with a more targeted
fix that mirrors home-assistant/core (PR #168310).

`fail-fast: false` only stops matrix-internal cancellation and forces
every leg to run to completion. Instead, guard the cache save step
itself with `always()` so an in-flight `actions/cache/save` survives
any cancellation source -- `fail-fast` from a sibling leg or a
concurrency cancel -- and finishes its upload rather than leaving the
cache key reserved but unpopulated. The default `fail-fast` behavior
is restored.

Gate the save on the `create-venv` step's success (via a new `id`) so
a failed or cancelled venv build is never cached.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant