Fix listener deadlock on values-only AutoscalingRunnerSet updates with minRunners > 0#4458
Loupeznik wants to merge 1 commit into `actions:master`
Conversation
Pull request overview
Fixes a controller deadlock where the AutoscalingListener can remain deleted after a values-only AutoscalingRunnerSet update when minRunners > 0, preventing new jobs from being picked up.
Changes:
- Guard the listener-recreation “draining jobs” gate so it only blocks when the runner spec is actually outdated.
- Add a Ginkgo regression test covering listener recreation with “warm” (idle-but-running) min-runners.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| controllers/actions.github.com/autoscalingrunnerset_controller.go | Prevents listener recreation from blocking on current-spec idle runners by scoping the drain check to outdated runner-spec cases. |
| controllers/actions.github.com/autoscalingrunnerset_controller_test.go | Adds regression coverage for listener recreation on values-only changes while runners are still running. |
```go
// The listener should be deleted (values hash changed)
Eventually(
	func() error {
		return k8sClient.Get(ctx, client.ObjectKey{Name: scaleSetListenerName(autoscalingRunnerSet), Namespace: autoscalingRunnerSet.Namespace}, listener)
	},
	autoscalingRunnerSetTestTimeout,
	autoscalingRunnerSetTestInterval,
).ShouldNot(Succeed(), "Old listener should be deleted")
```
Checking that the old listener was deleted by waiting for `Get(...)` to return NotFound can be flaky: if deletion and recreation happen quickly between polling intervals, the `Get` may succeed on every poll, just against a newly recreated object with the same name. A more robust pattern (used elsewhere in this file) is to capture the listener UID (or ResourceVersion) before patching and then assert it eventually changes after the update.
Fixes #4432 (continuation of #4200, which was partially addressed by #4289).
The bug
When an `AutoscalingRunnerSet` spec is updated (e.g. via `helm upgrade`) and `minRunners > 0`, the controller can deadlock and stop picking up new jobs indefinitely. Reproduction is trivial:

1. Install `gha-runner-scale-set` at 0.13.x / 0.14.x with `minRunners: 2, maxRunners: N` and `updateStrategy: eventual`.
2. Run `helm upgrade` changing any listener value that does not change the runner pod spec (e.g. `maxRunners`).

Controller logs get stuck on:
```
AutoscalingListener does not exist.
Creating a new AutoscalingListener is waiting for the running and pending runners to finish.
{"running": 2, "pending": 0}
```
The 2 "running" runners are the idle min-runners pool — they will never drain on their own, so the deadlock is permanent. Jobs queued in GitHub sit in the `queued` state forever.

Root cause
`autoscalingrunnerset_controller.go` has two `drainingJobs()` gates:

1. The `EphemeralRunnerSet` recreation path (`RunnerSetSpecHash` mismatch): correctly drains and recreates the `EphemeralRunnerSet` when the runner pod spec has changed. After scale-to-zero and re-creation, the new ERS starts with `Replicas: 0`, so `drainingJobs` returns `false` and the listener is created normally.
2. The listener-creation gate at lines 317–318.

By the time execution reaches line 317, the runner spec hash always matches the latest ERS (the mismatch branch above returns earlier). So the only runners that can ever trigger the line-318 gate are valid, current-spec runners — which, for `minRunners > 0`, includes the idle pool that exists by design. The gate blocks forever for no useful reason.

The fix
Scope the drain check at the listener-creation gate to cases where the runner spec is actually outdated:
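The PR's actual diff is not reproduced here; the following is a minimal, self-contained sketch of the guarded condition under stated assumptions (the `runnerSetStatus` stand-in and `shouldBlockListenerCreation` helper are simplifications of the controller's real types; the names `runnerSpecOutdated` and `drainingJobs` follow this description):

```go
package main

import "fmt"

// runnerSetStatus is a simplified stand-in for the EphemeralRunnerSet
// status the controller inspects.
type runnerSetStatus struct {
	CurrentReplicas int // running + pending ephemeral runners
}

// drainingJobs reports whether runners from the ephemeral runner set are
// still running or pending (the unchanged helper described below).
func drainingJobs(s runnerSetStatus) bool {
	return s.CurrentReplicas > 0
}

// shouldBlockListenerCreation applies the fix: listener recreation only
// waits on draining runners when the runner spec is actually outdated.
// Current-spec idle min-runners no longer block.
func shouldBlockListenerCreation(runnerSpecOutdated bool, s runnerSetStatus) bool {
	return runnerSpecOutdated && drainingJobs(s)
}

func main() {
	idleMinRunners := runnerSetStatus{CurrentReplicas: 2}

	// Values-only change (e.g. maxRunners): spec hash matches, so the
	// idle min-runners pool must not block listener recreation.
	fmt.Println(shouldBlockListenerCreation(false, idleMinRunners))

	// Actual runner-spec change: still wait for runners to drain.
	fmt.Println(shouldBlockListenerCreation(true, idleMinRunners))
}
```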
In practice, runnerSpecOutdated is never true at this point (the earlier branch handles it), so this is effectively a no-op for the spec-change path and correctly stops blocking on valid idle runners for the values-only-change path. Kept as a guarded condition rather than removed outright so that future refactors to the reconciliation flow don't silently regress into overprovisioning.
drainingJobs() itself is unchanged — still correctly prevents double-creation of EphemeralRunnerSets at line 288.
Testing
Manually verified on a live cluster with a real workflow (an `alpine:3.23` buildx workflow) targeting the scale set. Steps:

- Without the fix: change `maxRunners` in the values file and run `helm upgrade -f values.yaml` → deadlock, the listener never returns, and the queued job stays `queued`.
- With the fix: the same upgrade recreates the listener, the job transitions to `in_progress`, and the runner executes the workflow.