
Fix listener deadlock on values-only AutoscalingRunnerSet updates with minRunners > 0#4458

Open
Loupeznik wants to merge 1 commit into actions:master from Loupeznik:fix/listener-deadlock-values-change

Conversation

@Loupeznik

Fixes #4432 (continuation of #4200, which was partially addressed by #4289).

The bug

When an AutoscalingRunnerSet spec is updated (e.g. via helm upgrade) and minRunners > 0, the controller can deadlock and stop picking up new jobs indefinitely.

Reproduction is trivial:

  1. Deploy gha-runner-scale-set at 0.13.x / 0.14.x with minRunners: 2, maxRunners: N, updateStrategy: eventual.
  2. Wait for the listener and the 2 idle min-runners to come up.
  3. helm upgrade changing any listener value that does not change the runner pod spec (e.g. maxRunners).
  4. The controller deletes the out-of-date listener and then never recreates it.
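For concreteness, step 1 might use a values file like the following (a minimal sketch; the config URL and secret name are placeholders, the field names follow the gha-runner-scale-set chart's documented values):

```yaml
# values.yaml for the gha-runner-scale-set chart (illustrative)
githubConfigUrl: https://github.com/my-org/my-repo   # placeholder repo
githubConfigSecret: my-github-secret                 # placeholder secret name
minRunners: 2
maxRunners: 4
```

Step 3 is then any edit that only changes listener inputs, e.g. bumping maxRunners here and re-running helm upgrade with the same file.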

Controller logs get stuck on:
AutoscalingListener does not exist.
Creating a new AutoscalingListener is waiting for the running and pending runners to finish.
{"running": 2, "pending": 0}

The 2 "running" runners are the idle min-runners pool — they will never drain on their own, so the deadlock is permanent. Jobs queued in GitHub sit in queued state forever.

Root cause

autoscalingrunnerset_controller.go has two drainingJobs() gates:

  • Lines 287–303 (RunnerSetSpecHash mismatch path): correctly drains and recreates the EphemeralRunnerSet when the runner pod spec has changed. After scale-to-zero and re-creation, the new ERS starts with Replicas: 0, so drainingJobs returns false and the listener is created normally.
  • Lines 317–321 (listener-missing path): called on every reconciliation when the listener doesn't exist. It blocks listener creation whenever there are any running or pending runners on the latest ERS.

By the time execution reaches line 317, the runner spec hash always matches the latest ERS (the mismatch branch above returns earlier). So the only runners that can ever trigger the line-318 gate are valid, current-spec runners — which, for minRunners > 0, includes the idle pool that exists by design. The gate blocks forever for no useful reason.
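The deadlock can be modeled in a few lines (a self-contained sketch; the types, field names, and helpers below are simplified stand-ins, not the controller's actual code):

```go
package main

import "fmt"

// Simplified stand-in for the EphemeralRunnerSet status; field names are illustrative.
type ephemeralRunnerSetStatus struct {
	RunningRunners int
	PendingRunners int
}

// drainingJobs, as described above: true while any runner is running or pending.
func drainingJobs(s ephemeralRunnerSetStatus) bool {
	return s.RunningRunners+s.PendingRunners > 0
}

// Pre-fix listener-missing gate: by the time it runs, the spec hash already
// matches the latest EphemeralRunnerSet, so only current-spec runners remain.
func listenerCreationBlocked(s ephemeralRunnerSetStatus) bool {
	return drainingJobs(s)
}

func main() {
	// minRunners: 2 keeps two idle runners "running" by design; they never
	// drain, so every reconciliation blocks listener creation: a deadlock.
	idlePool := ephemeralRunnerSetStatus{RunningRunners: 2, PendingRunners: 0}
	fmt.Println(listenerCreationBlocked(idlePool)) // true, on every reconcile
}
```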

The fix

Scope the drain check at the listener-creation gate to cases where the runner spec is actually outdated:

```go
runnerSpecOutdated := latestRunnerSet.Annotations[annotationKeyRunnerSpecHash] != autoscalingRunnerSet.RunnerSetSpecHash()
if runnerSpecOutdated && r.drainingJobs(&latestRunnerSet.Status) {
    // block and wait
}
```

In practice, runnerSpecOutdated is never true at this point (the earlier branch handles it), so this is effectively a no-op for the spec-change path and correctly stops blocking on valid idle runners for the values-only-change path. Kept as a guarded condition rather than removed outright so that future refactors to the reconciliation flow don't silently regress into overprovisioning.
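The effect of the guard can be seen in a tiny truth-table sketch (again with simplified stand-in types, not the real controller code):

```go
package main

import "fmt"

// Simplified stand-in for the EphemeralRunnerSet status.
type status struct{ Running, Pending int }

func drainingJobs(s status) bool { return s.Running+s.Pending > 0 }

// Guarded gate from the diff above: only block when the runner spec is outdated.
func blockListenerCreation(runnerSpecOutdated bool, s status) bool {
	return runnerSpecOutdated && drainingJobs(s)
}

func main() {
	idle := status{Running: 2} // the minRunners idle pool
	// Values-only change: spec hash still matches, so the listener is created.
	fmt.Println(blockListenerCreation(false, idle)) // false
	// Defensive case: an outdated spec with live runners would still wait.
	fmt.Println(blockListenerCreation(true, idle)) // true
}
```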

drainingJobs() itself is unchanged — still correctly prevents double-creation of EphemeralRunnerSets at line 288.

Testing

  • Unit/integration: added a Ginkgo regression test under Context("When updating an AutoscalingRunnerSet with running or pending jobs") that emulates 2 idle min-runners, applies a values-hash-only patch, and asserts the listener is recreated while the EphemeralRunnerSet stays intact.
  • Manual, multi-version: verified against three k3s clusters (k8s 1.33 / 1.34 / 1.35) spun up via k3d, using a minimal test repo (alpine:3.23 buildx workflow) targeting the scale set. Steps:
    1. Confirmed queued jobs get picked up on the unpatched 0.13.1 controller → bump maxRunners in the values file and helm upgrade -f values.yaml → deadlock, listener never returns, queued job stays queued.
    2. Same sequence with this patch → listener is recreated within seconds, queued job transitions to in_progress, runner executes the workflow.
    3. Runner-spec change (CPU requests bumped in the values file) still drains old runners, creates a new ERS, and recreates the listener afterward — no regression.

Copilot AI review requested due to automatic review settings April 17, 2026 13:38
Contributor

Copilot AI left a comment


Pull request overview

Fixes a controller deadlock where the AutoscalingListener can remain deleted after a values-only AutoscalingRunnerSet update when minRunners > 0, preventing new jobs from being picked up.

Changes:

  • Guard the listener-recreation “draining jobs” gate so it only blocks when the runner spec is actually outdated.
  • Add a Ginkgo regression test covering listener recreation with “warm” (idle-but-running) min-runners.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
controllers/actions.github.com/autoscalingrunnerset_controller.go Prevents listener recreation from blocking on current-spec idle runners by scoping the drain check to outdated runner-spec cases.
controllers/actions.github.com/autoscalingrunnerset_controller_test.go Adds regression coverage for listener recreation on values-only changes while runners are still running.


Comment on lines +673 to +680

```go
// The listener should be deleted (values hash changed)
Eventually(
    func() error {
        return k8sClient.Get(ctx, client.ObjectKey{Name: scaleSetListenerName(autoscalingRunnerSet), Namespace: autoscalingRunnerSet.Namespace}, listener)
    },
    autoscalingRunnerSetTestTimeout,
    autoscalingRunnerSetTestInterval,
).ShouldNot(Succeed(), "Old listener should be deleted")
```

Copilot AI Apr 17, 2026


Waiting for Get(...) to return NotFound to confirm the old listener was deleted can be flaky if deletion and recreation happen quickly between polling intervals: the Get may always succeed, just against a newly recreated object with the same name. A more robust pattern (used elsewhere in this file) is to capture the listener UID (or ResourceVersion) before patching and then assert it eventually changes after the update.



Development

Successfully merging this pull request may close these issues.

0.14.0: listener can remain deleted after AutoscalingRunnerSet patch when minRunners > 0, leaving jobs queued
