fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild by AlinsRan · Pull Request #13629 · apache/apisix

AlinsRan · 2026-06-30T00:40:59Z

Description

Fixes #13282: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the checks config is identical.

Root cause

fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version. A node-only change bumps that version, so the checker is queued for an asynchronous rebuild. During that window:

api_ctx.up_checker is nil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for ~1-2s.
The rebuild discards the checker's accumulated health state and re-probes every node from scratch.

timer_working_pool_check also destroyed the checker on any version mismatch, racing timer_create_checker and widening the gap.

Fix

compute_targets() / sync_checker_targets(): when the checks config is unchanged (core.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place via add_target/remove_target against the authoritative shm target list — keeping the checker and its health state alive instead of rebuilding.
timer_working_pool_check no longer destroys a checker for a node-only version change (it falls through to destroy only when the resource is gone or the node count drops to 0).
When a rebuild is genuinely required (the checks config changed), the new checker is created and inserted into the working pool before the old one is released, eliminating the up_checker == nil gap.
The working-pool entry now stores the checks config so node-only vs checks changes can be told apart.

Checker lifecycle after this change

Resource event	Path	Operation
Create	request → `fetch_checker` enqueues to `waiting_pool` → `timer_create_checker` (no existing checker)	`healthcheck.new()` + `add_target` per node
Update — nodes only (DNS / k8s scale)	`timer_create_checker` reuse branch	`sync_checker_targets`: `add_target`/`remove_target` on the delta only, no `delayed_clear`
Update — `checks` changed	`timer_create_checker` rebuild branch	`delayed_clear(old)` → `new()` + `add_target(new)` → publish → `stop(old)`
Delete	`timer_working_pool_check` (`fetch_latest_conf` = `nil`)	`delayed_clear(old)` + `stop` + `working_pool[path]=nil`; periodic cleanup removes the targets

On the checks-changed rebuild, delayed_clear(old) runs before create_checker(new) on purpose: the two checkers share the same shm target list, and add_target() only clears a target's purge_time when it is re-added after being marked. Clearing first lets surviving targets get un-marked on re-add, while genuinely dropped targets keep their purge_time and are cleaned up — so a checks-only update cannot make the live checker's targets disappear after DELAYED_CLEAR_TIMEOUT.

This is apisix-side only — it uses add_target/remove_target, which already exist in the current lua-resty-healthcheck-api7. No dependency bump is required.

Tests

t/node/healthcheck-incremental-update.t:

TEST 1: a node-only change must not destroy/rebuild the checker — asserts create new checker appears only once (initial) and clear checker (delayed_clear) does not.
TEST 2: a checks-config change must still rebuild — asserts clear checker appears.

Verified locally (OpenResty 1.29.2.4): TEST 1 fails on master (a node-only change logs clear checker, i.e. destroy-and-rebuild) and passes with this change. Existing healthcheck-leak-bugfix.t and the other adapted health-check tests pass.

Scope

This PR is independent of the library. The related #13385 / #13141 / #13235 are library-side bugs (stale-target cleanup and periodic-lock release) fixed in api7/lua-resty-healthcheck#55 and delivered to apisix by the dependency bump in #13627 — they are not part of this PR. (Supersedes the #13282 portion of the now-closed #13582.)

Fixes #13282.

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to this change
I have updated the documentation to reflect this change
I have verified that this change is backward compatible

When an upstream's nodes change but its `checks` config is unchanged, the health-check manager destroyed the existing checker and built a new one. Two problems followed: 1. fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version, so a node-only change makes it return nil until the timer rebuilds the checker. During that window api_ctx.up_checker is nil and the balancer routes traffic to nodes already known to be unhealthy (apache#13282). 2. The rebuild throws away the checker's accumulated health state and re-probes every node from scratch. The manager now reconciles the existing checker's targets in place with add_target/remove_target when the `checks` config is unchanged (compared with core.table.deep_eq), keeping the checker and its state alive. timer_working_pool_check no longer destroys a checker for a node-only version change, and when a rebuild is genuinely required (the `checks` config changed) the new checker is created and inserted into the working pool before the old one is released, so fetch_checker never observes a nil gap. Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0, which contains the companion library fix (clean every checker each window + release the periodic lock when idle) required by this change. Adds t/node/healthcheck-incremental-update.t: a node-only change must not destroy/rebuild the checker (no "clear checker"), while a checks-config change still rebuilds it.

When the checks config changes, install the freshly created checker into the working pool before stopping the previous one. This prevents a request from briefly fetching a stopped checker for the old version during the swap window.

…rop to 0 - compute_targets now returns an ordered array (preserving node order) so targets are added deterministically, keeping ordered error-log assertions stable. - Only keep/reuse a checker incrementally when the upstream still has nodes; when the node count drops to 0 the checker is destroyed as before. - Update existing healthcheck tests to assert the new incremental-reuse behaviour (a checker is reused instead of recreated when only the nodes change, and is not cleared in that case).

… config The incremental-reuse reconcile runs in timer_create_checker, which only acts on resources placed into the waiting pool by fetch_checker (i.e. when a request is routed). Re-route a request after the second PUT so the manager observes the new version and logs the incremental reuse.

…ured

… independent of the node-only incremental-update path (apache#13629)

…al sync failures - resty.healthcheck keys targets by ip+port+hostname (the Host header is not part of the identity). Remove stale targets before adding new ones so a Host-header-only change (pass_host/upstream_host) re-applies the new header instead of add_target() being a no-op on the still-present old target and the subsequent remove dropping the live target. - sync_checker_targets now returns whether every add/remove succeeded; the caller only commits the reused checker for the new version when the sync fully succeeds, otherwise it falls through to a full rebuild so the upstream still converges instead of being stuck on a stale target set.

membphis

This rebuild order is unsafe when the checks config changes while the target identity stays the same. The new checker is created and published first, so its add_target() calls reuse the existing shared shm target list entries for the same ip:port:hostname. Then the old checker calls delayed_clear(), which marks that same shared target list for purge. After the cleanup window, the targets that the newly published checker still depends on can be removed.

This blocks merge because checks-only updates can make active health-check targets disappear after DELAYED_CLEAR_TIMEOUT, especially once the cleanup fix in #13627 makes stale-target purging reliable. Please reorder the swap so delayed_clear() happens before the new checker re-adds targets, or otherwise ensure the new checker clears any inherited purge_time before it is considered published. It would also be good to add a regression test that waits past the cleanup window after a checks-config rebuild and verifies /v1/healthcheck still contains the live targets.

… are not purged On a checks-config change the manager rebuilds the checker. The new checker shares the old one's shm target list (same healthchecker name). add_target() only un-marks a target's purge_time when the target is re-added *after* being marked, so doing create_checker() (re-add) before the old checker's delayed_clear() left the surviving targets marked with a purge_time that nobody cleared -- the cleanup then purged the live checker's own targets once the delayed-clear window elapsed. Run delayed_clear() before create_checker() so the re-add un-marks surviving targets while genuinely dropped ones keep their purge_time. The old checker is still only stopped after the new one is published, so there is no rebuild gap. Add a regression test asserting the surviving targets remain in /v1/healthcheck after the delayed-clear window following a checks-config change.

membphis · 2026-07-01T08:30:47Z

This still leaves the health-filter bypass window on node-only updates.

The PR keeps the old checker alive and reconciles targets in timer_create_checker, but fetch_checker() still returns a checker only when working_item.version == resource_ver. When a discovery/DNS change bumps _nodes_ver, upstream.lua asks for the new version, fetch_checker() enqueues that version and returns nil until the 1s timer runs. During that window api_ctx.up_checker is nil; the balancer treats a nil checker as all upstream nodes being available. So a previously unhealthy node can still receive traffic immediately after a node-only update, even though the checker state was not destroyed.

Trigger:

Existing checker marks node A unhealthy.
Service discovery/DNS adds or removes another node without changing checks.
The first request after _nodes_ver changes calls fetch_checker() with the new version.
fetch_checker() returns nil before timer_create_checker reconciles and publishes the old checker under the new version.
The balancer builds the picker from all upstream nodes, including node A.

This blocks the fix for #13282 because the health state is preserved internally but is not used by requests during the version transition. Please let fetch_checker() return the existing live checker for the same resource when the latest config is reuse-eligible (checks unchanged and nodes non-empty), while still enqueueing the new version for target reconciliation. A regression test should make one node unhealthy, apply a node-only update, and assert that requests before the timer reconcile do not hit the unhealthy node.

…ransition A node-only change (discovery/DNS bumps _nodes_ver) queues the new version and timer_create_checker reconciles the existing checker in place. But fetch_checker only returned a checker on an exact version match, so during the ~1s until the timer ran it returned nil, api_ctx.up_checker was nil, and the balancer fell back to all upstream nodes -- a node already known unhealthy could take traffic during the transition (the request-path side of apache#13282). Return the existing live (non-dead) checker for the resource while the new version is queued for reconciliation, so requests keep using the preserved health state throughout the transition. Regression test: mark a node unhealthy, apply a node-only update, and assert a burst of requests in the transition window never hit the unhealthy node.

AlinsRan · 2026-07-01T08:43:49Z

Good catch — confirmed. fetch_checker() gated on an exact version match, so on a node-only _nodes_ver bump it returned nil (only enqueuing the new version) until the 1s timer ran; upstream.lua then set api_ctx.up_checker = nil and balancer.lua built the picker from all nodes. The health state was preserved internally but not used during the window — exactly as you described.

Fixed in 3688210: fetch_checker() now returns the existing live (non-dead) checker for the resource while it enqueues the new version for reconciliation, so requests keep filtering unhealthy nodes throughout the transition.

One deviation from your suggestion worth flagging: I return the live checker whenever one exists (and isn't dead), rather than gating on reuse-eligibility (checks unchanged + nodes non-empty). That guard would need a fetch_latest_conf() + deep_eq of the checks on the per-request hot path. Returning the live checker is safe even in the rarer checks-change window — its health status is real and current for the pre-change checks, which is strictly better than nil/no-filtering — and once the timer publishes the rebuilt checker the exact-version branch takes over. Happy to add the eligibility guard if you'd prefer it despite the hot-path cost.

Regression test (healthcheck-incremental-update.t TEST 4): mark a node unhealthy via active checks, apply a node-only update, then burst requests in the transition window and assert none hit the unhealthy node. Verified locally — it fails before the fix (errors: 3, the dead node is picked during the window) and passes after (errors: 0).

arunmat27 · 2026-07-01T11:03:37Z

I hit an issue while testing these changes.

Setup: APISIX container (with 2 NGINX worker processes) and 2 upstream nodes.

Steps:

Start APISIX and make a request. The worker that handles the request creates a new checker and adds both upstream nodes to shared memory.
Remove one of the upstream nodes.
Send another request and if that is handled by the other NGINX worker (i.e., not the worker from step 1) - it won't have the checker in its working_pool, it creates a new one. However, during the create_checker() flow, it only adds targets and never removes. As a result, the removed upstream node remains in shared memory and continues to appear in the /v1/healthcheck response.

Potential fix: Update create_checker() to call sync_checker_targets() instead of checker:add_targets().

…ulti-worker With multiple workers, a peer worker can create the checker and populate the shared shm target list, after which a node is removed from the upstream. A worker that never had the checker in its own working_pool then reaches create_checker(), which only added targets -- so the removed node lingered in the shm and kept being probed and returned by /v1/healthcheck. create_checker() now also removes targets present in the shm but absent from the config, reconciling it to the desired set. The add pass is kept as-is (re-adding un-marks any delayed_clear purge_time, preserving the checks-rebuild fix), so a straight switch to sync_checker_targets() -- which only adds missing targets -- is deliberately avoided. Regression test seeds a stale target into the shm and asserts create_checker() drops it. Reported-by: arunmat27

…ns it When only the checks config changes, a worker that did not serve traffic keeps its old-version checker in working_pool; timer_working_pool_check then destroys that stale local handle. Because the checker's shm target list is shared by name, a same-name replacement checker (built by whichever worker serves traffic) owns it. Unconditionally calling delayed_clear() on destroy marks that live checker's targets for purge on every worker. Only tear down the shm when no same-name checker will own it (resource deleted, or new config has no checks/nodes); otherwise drop the local handle with stop() and leave the shm to the replacement checker.

Add regression coverage for more real-world multi-worker scenarios: - recreating the same upstream id within the delayed-clear window keeps its targets (re-add un-marks the pending purge_time) - deleting an upstream cleans its shm target list - a new 2-worker test asserting the shared shm target set converges to the desired node set under node churn (incremental) and a checks-config change (rebuild), with no orphan and no purge of surviving targets

… the log Replace the blanket `--- ignore_error_log` with a targeted `--- no_error_log` that asserts the health-check timer and target-reconcile paths do not log an error. The timers are pcall-wrapped, so a Lua fault in the code under test was previously logged and swallowed while the test still passed on response_body alone. Targeting specific patterns keeps the missing-optional-dependency noise from failing the run while still catching a real crash. Also note in the multi-worker test that it is a convergence/non-regression check, not a deterministic reproduction of the cross-worker asymmetry.

Follow-up to the error-log hardening: convert the one remaining blanket `--- ignore_error_log` (the checks-config rebuild test) to the same targeted `--- no_error_log`, and add `failed to create healthcheck` to the pattern set so a healthcheck.new() failure on the create_checker path is also caught.

…rkers Add a 2-worker health-status test: a dead node (1970) alongside a healthy one (1980) with active checks. Assert the shared shm reports 1970 unhealthy and, once every worker's status cache has converged, retries=0 traffic never lands on 1970 -- proving both workers filter it. This covers health STATUS propagation, not just target membership.

…leeps The strict zero-error assertion previously relied on two fixed sleeps to let every worker's per-worker status cache converge. Under loaded CI a worker that built its checker late might not have received the unhealthy event by the burst, flaking. Drive traffic until several consecutive zero-error bursts confirm both workers filter the dead node (bounded); a genuine per-worker miss never converges and fails the test deterministically.

AlinsRan · 2026-07-02T09:13:20Z

I hit an issue while testing these changes.

Good catch, and thanks for the clear repro — confirmed. A worker that never held the checker reaches create_checker(), which only added targets, so a node removed after another worker populated the shared shm lingered there and kept showing up in /v1/healthcheck.

On the suggested sync_checker_targets() swap — I went a slightly different route, worth flagging why:

sync_checker_targets() only adds missing targets (if not current[key]); it never re-adds an already-present one.
But re-adding is exactly what clears a purge_time that a concurrent delayed_clear() sets during a checks-config rebuild.
So swapping to it would bring back the "surviving targets get purged after a rebuild" bug.

Instead create_checker() keeps its add-all pass (which un-marks purge_time) and adds a remove-stale pass — same net effect as your fix, without regressing the rebuild path.

AlinsRan added 5 commits June 30, 2026 08:36

test(healthcheck): run leak-bugfix at info level so reuse log is capt…

492e2bd

…ured

dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jun 30, 2026

AlinsRan added a commit to AlinsRan/apisix that referenced this pull request Jun 30, 2026

test: trigger cleanup via a checks-config change so the regression is…

9fbade2

… independent of the node-only incremental-update path (apache#13629)

AlinsRan mentioned this pull request Jun 30, 2026

fix(healthcheck): purge stale targets and release periodic lock (bump lua-resty-healthcheck-api7 to 3.2.2) #13627

Merged

5 tasks

membphis requested changes Jun 30, 2026

View reviewed changes

AlinsRan mentioned this pull request Jul 1, 2026

fix(healthcheck): use immediate clear() to ensure k8s endpoint change… #13029

Open

5 tasks

AlinsRan requested a review from membphis July 1, 2026 07:58

dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 1, 2026

AlinsRan added 5 commits July 2, 2026 09:11

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629

fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629
AlinsRan wants to merge 15 commits into
apache:masterfrom
AlinsRan:fix/healthcheck-13282-nil-window

AlinsRan commented Jun 30, 2026 •

edited

Loading

Uh oh!

membphis left a comment

Uh oh!

membphis commented Jul 1, 2026

Uh oh!

AlinsRan commented Jul 1, 2026

Uh oh!

arunmat27 commented Jul 1, 2026

Uh oh!

AlinsRan commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

AlinsRan commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Root cause

Fix

Checker lifecycle after this change

Tests

Scope

Checklist

Uh oh!

membphis left a comment

Choose a reason for hiding this comment

Uh oh!

membphis commented Jul 1, 2026

Uh oh!

AlinsRan commented Jul 1, 2026

Uh oh!

arunmat27 commented Jul 1, 2026

Uh oh!

AlinsRan commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

AlinsRan commented Jun 30, 2026 •

edited

Loading

AlinsRan commented Jul 2, 2026 •

edited

Loading