
Fix listener deadlock on values-only AutoscalingRunnerSet updates with minRunners > 0#4458

Open
Loupeznik wants to merge 1 commit into actions:master from Loupeznik:fix/listener-deadlock-values-change

Conversation

@Loupeznik

Fixes #4432 (continuation of #4200, which was partially addressed by #4289).

The bug

When an AutoscalingRunnerSet spec is updated (e.g. via helm upgrade) and minRunners > 0, the controller can deadlock and stop picking up new jobs indefinitely.

Reproduction is trivial:

  1. Deploy gha-runner-scale-set at 0.13.x / 0.14.x with minRunners: 2, maxRunners: N, updateStrategy: eventual.
  2. Wait for the listener and the 2 idle min-runners to come up.
  3. helm upgrade changing any listener value that does not change the runner pod spec (e.g. maxRunners).
  4. The controller deletes the out-of-date listener and then never recreates it.
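For concreteness, step 1 might use a values file like the following (a minimal sketch; the config URL and secret name are placeholders, the field names follow the gha-runner-scale-set chart's documented values):

```yaml
# values.yaml for the gha-runner-scale-set chart (illustrative)
githubConfigUrl: https://github.com/my-org/my-repo   # placeholder repo
githubConfigSecret: my-github-secret                 # placeholder secret name
minRunners: 2
maxRunners: 4
```

Step 3 is then any edit that only changes listener inputs, e.g. bumping maxRunners here and re-running helm upgrade with the same file.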

Controller logs get stuck on:
AutoscalingListener does not exist.
Creating a new AutoscalingListener is waiting for the running and pending runners to finish.
{"running": 2, "pending": 0}

The 2 "running" runners are the idle min-runners pool — they will never drain on their own, so the deadlock is permanent. Jobs queued in GitHub sit in queued state forever.

Root cause

autoscalingrunnerset_controller.go has two drainingJobs() gates:

  • Lines 287–303 (RunnerSetSpecHash mismatch path): correctly drains and recreates the EphemeralRunnerSet when the runner pod spec has changed. After scale-to-zero and re-creation, the new ERS starts with Replicas: 0, so drainingJobs returns false and the listener is created normally.
  • Lines 317–321 (listener-missing path): called on every reconciliation when the listener doesn't exist. It blocks listener creation whenever there are any running or pending runners on the latest ERS.

By the time execution reaches line 317, the runner spec hash always matches the latest ERS (the mismatch branch above returns earlier). So the only runners that can ever trigger the line-318 gate are valid, current-spec runners — which, for minRunners > 0, includes the idle pool that exists by design. The gate blocks forever for no useful reason.
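The deadlock can be modeled in a few lines (a self-contained sketch; the types, field names, and helpers below are simplified stand-ins, not the controller's actual code):

```go
package main

import "fmt"

// Simplified stand-in for the EphemeralRunnerSet status; field names are illustrative.
type ephemeralRunnerSetStatus struct {
	RunningRunners int
	PendingRunners int
}

// drainingJobs, as described above: true while any runner is running or pending.
func drainingJobs(s ephemeralRunnerSetStatus) bool {
	return s.RunningRunners+s.PendingRunners > 0
}

// Pre-fix listener-missing gate: by the time it runs, the spec hash already
// matches the latest EphemeralRunnerSet, so only current-spec runners remain.
func listenerCreationBlocked(s ephemeralRunnerSetStatus) bool {
	return drainingJobs(s)
}

func main() {
	// minRunners: 2 keeps two idle runners "running" by design; they never
	// drain, so every reconciliation blocks listener creation: a deadlock.
	idlePool := ephemeralRunnerSetStatus{RunningRunners: 2, PendingRunners: 0}
	fmt.Println(listenerCreationBlocked(idlePool)) // true, on every reconcile
}
```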

The fix

Scope the drain check at the listener-creation gate to cases where the runner spec is actually outdated:

```go
runnerSpecOutdated := latestRunnerSet.Annotations[annotationKeyRunnerSpecHash] != autoscalingRunnerSet.RunnerSetSpecHash()
if runnerSpecOutdated && r.drainingJobs(&latestRunnerSet.Status) {
    // block and wait
}
```

In practice, runnerSpecOutdated is never true at this point (the earlier branch handles it), so this is effectively a no-op for the spec-change path and correctly stops blocking on valid idle runners for the values-only-change path. Kept as a guarded condition rather than removed outright so that future refactors to the reconciliation flow don't silently regress into overprovisioning.
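The effect of the guard can be seen in a tiny truth-table sketch (again with simplified stand-in types, not the real controller code):

```go
package main

import "fmt"

// Simplified stand-in for the EphemeralRunnerSet status.
type status struct{ Running, Pending int }

func drainingJobs(s status) bool { return s.Running+s.Pending > 0 }

// Guarded gate from the diff above: only block when the runner spec is outdated.
func blockListenerCreation(runnerSpecOutdated bool, s status) bool {
	return runnerSpecOutdated && drainingJobs(s)
}

func main() {
	idle := status{Running: 2} // the minRunners idle pool
	// Values-only change: spec hash still matches, so the listener is created.
	fmt.Println(blockListenerCreation(false, idle)) // false
	// Defensive case: an outdated spec with live runners would still wait.
	fmt.Println(blockListenerCreation(true, idle)) // true
}
```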

drainingJobs() itself is unchanged — still correctly prevents double-creation of EphemeralRunnerSets at line 288.

Testing

  • Unit/integration: added a Ginkgo regression test under Context("When updating an AutoscalingRunnerSet with running or pending jobs") that emulates 2 idle min-runners, applies a values-hash-only patch, and asserts the listener is recreated while the EphemeralRunnerSet stays intact.
  • Manual, multi-version: verified against three k3s clusters (k8s 1.33 / 1.34 / 1.35) spun up via k3d, using a minimal test repo (alpine:3.23 buildx workflow) targeting the scale set. Steps:
    1. Confirmed queued jobs get picked up on the unpatched 0.13.1 controller → bump maxRunners in the values file and helm upgrade -f values.yaml → deadlock, listener never returns, queued job stays queued.
    2. Same sequence with this patch → listener is recreated within seconds, queued job transitions to in_progress, runner executes the workflow.
    3. Runner-spec change (CPU requests bumped in the values file) still drains old runners, creates a new ERS, and recreates the listener afterward — no regression.

Copilot AI review requested due to automatic review settings April 17, 2026 13:38
Contributor

Copilot AI left a comment


Pull request overview

Fixes a controller deadlock where the AutoscalingListener can remain deleted after a values-only AutoscalingRunnerSet update when minRunners > 0, preventing new jobs from being picked up.

Changes:

  • Guard the listener-recreation “draining jobs” gate so it only blocks when the runner spec is actually outdated.
  • Add a Ginkgo regression test covering listener recreation with “warm” (idle-but-running) min-runners.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
controllers/actions.github.com/autoscalingrunnerset_controller.go Prevents listener recreation from blocking on current-spec idle runners by scoping the drain check to outdated runner-spec cases.
controllers/actions.github.com/autoscalingrunnerset_controller_test.go Adds regression coverage for listener recreation on values-only changes while runners are still running.


Comment on lines +673 to +680

```go
// The listener should be deleted (values hash changed)
Eventually(
    func() error {
        return k8sClient.Get(ctx, client.ObjectKey{Name: scaleSetListenerName(autoscalingRunnerSet), Namespace: autoscalingRunnerSet.Namespace}, listener)
    },
    autoscalingRunnerSetTestTimeout,
    autoscalingRunnerSetTestInterval,
).ShouldNot(Succeed(), "Old listener should be deleted")
```

Copilot AI Apr 17, 2026


Waiting for Get(...) to return NotFound to confirm the old listener was deleted can be flaky if deletion and recreation happen quickly between polling intervals: the Get may always succeed, just against a newly recreated object with the same name. A more robust pattern (used elsewhere in this file) is to capture the listener UID (or ResourceVersion) before patching and then assert it eventually changes after the update.



Development

Successfully merging this pull request may close these issues.

0.14.0: listener can remain deleted after AutoscalingRunnerSet patch when minRunners > 0, leaving jobs queued
