fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629
fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629AlinsRan wants to merge 15 commits into
Conversation
When an upstream's nodes change but its `checks` config is unchanged, the health-check manager destroyed the existing checker and built a new one. Two problems followed: 1. fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version, so a node-only change makes it return nil until the timer rebuilds the checker. During that window api_ctx.up_checker is nil and the balancer routes traffic to nodes already known to be unhealthy (apache#13282). 2. The rebuild throws away the checker's accumulated health state and re-probes every node from scratch. The manager now reconciles the existing checker's targets in place with add_target/remove_target when the `checks` config is unchanged (compared with core.table.deep_eq), keeping the checker and its state alive. timer_working_pool_check no longer destroys a checker for a node-only version change, and when a rebuild is genuinely required (the `checks` config changed) the new checker is created and inserted into the working pool before the old one is released, so fetch_checker never observes a nil gap. Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0, which contains the companion library fix (clean every checker each window + release the periodic lock when idle) required by this change. Adds t/node/healthcheck-incremental-update.t: a node-only change must not destroy/rebuild the checker (no "clear checker"), while a checks-config change still rebuilds it.
When the checks config changes, install the freshly created checker into the working pool before stopping the previous one. This prevents a request from briefly fetching a stopped checker for the old version during the swap window.
…rop to 0 - compute_targets now returns an ordered array (preserving node order) so targets are added deterministically, keeping ordered error-log assertions stable. - Only keep/reuse a checker incrementally when the upstream still has nodes; when the node count drops to 0 the checker is destroyed as before. - Update existing healthcheck tests to assert the new incremental-reuse behaviour (a checker is reused instead of recreated when only the nodes change, and is not cleared in that case).
… config The incremental-reuse reconcile runs in timer_create_checker, which only acts on resources placed into the waiting pool by fetch_checker (i.e. when a request is routed). Re-route a request after the second PUT so the manager observes the new version and logs the incremental reuse.
… independent of the node-only incremental-update path (apache#13629)
…al sync failures - resty.healthcheck keys targets by ip+port+hostname (the Host header is not part of the identity). Remove stale targets before adding new ones so a Host-header-only change (pass_host/upstream_host) re-applies the new header instead of add_target() being a no-op on the still-present old target and the subsequent remove dropping the live target. - sync_checker_targets now returns whether every add/remove succeeded; the caller only commits the reused checker for the new version when the sync fully succeeds, otherwise it falls through to a full rebuild so the upstream still converges instead of being stuck on a stale target set.
membphis
left a comment
There was a problem hiding this comment.
This rebuild order is unsafe when the checks config changes while the target identity stays the same. The new checker is created and published first, so its add_target() calls reuse the existing shared shm target list entries for the same ip:port:hostname. Then the old checker calls delayed_clear(), which marks that same shared target list for purge. After the cleanup window, the targets that the newly published checker still depends on can be removed.
This blocks merge because checks-only updates can make active health-check targets disappear after DELAYED_CLEAR_TIMEOUT, especially once the cleanup fix in #13627 makes stale-target purging reliable. Please reorder the swap so delayed_clear() happens before the new checker re-adds targets, or otherwise ensure the new checker clears any inherited purge_time before it is considered published. It would also be good to add a regression test that waits past the cleanup window after a checks-config rebuild and verifies /v1/healthcheck still contains the live targets.
… are not purged On a checks-config change the manager rebuilds the checker. The new checker shares the old one's shm target list (same healthchecker name). add_target() only un-marks a target's purge_time when the target is re-added *after* being marked, so doing create_checker() (re-add) before the old checker's delayed_clear() left the surviving targets marked with a purge_time that nobody cleared -- the cleanup then purged the live checker's own targets once the delayed-clear window elapsed. Run delayed_clear() before create_checker() so the re-add un-marks surviving targets while genuinely dropped ones keep their purge_time. The old checker is still only stopped after the new one is published, so there is no rebuild gap. Add a regression test asserting the surviving targets remain in /v1/healthcheck after the delayed-clear window following a checks-config change.
|
This still leaves the health-filter bypass window on node-only updates. The PR keeps the old checker alive and reconciles targets in Trigger:
This blocks the fix for #13282 because the health state is preserved internally but is not used by requests during the version transition. Please let |
…ransition A node-only change (discovery/DNS bumps _nodes_ver) queues the new version and timer_create_checker reconciles the existing checker in place. But fetch_checker only returned a checker on an exact version match, so during the ~1s until the timer ran it returned nil, api_ctx.up_checker was nil, and the balancer fell back to all upstream nodes -- a node already known unhealthy could take traffic during the transition (the request-path side of apache#13282). Return the existing live (non-dead) checker for the resource while the new version is queued for reconciliation, so requests keep using the preserved health state throughout the transition. Regression test: mark a node unhealthy, apply a node-only update, and assert a burst of requests in the transition window never hit the unhealthy node.
|
Good catch — confirmed. Fixed in 3688210: One deviation from your suggestion worth flagging: I return the live checker whenever one exists (and isn't dead), rather than gating on reuse-eligibility ( Regression test ( |
|
I hit an issue while testing these changes. Setup: APISIX container (with 2 NGINX worker processes) and 2 upstream nodes. Steps:
Potential fix: Update |
…ulti-worker With multiple workers, a peer worker can create the checker and populate the shared shm target list, after which a node is removed from the upstream. A worker that never had the checker in its own working_pool then reaches create_checker(), which only added targets -- so the removed node lingered in the shm and kept being probed and returned by /v1/healthcheck. create_checker() now also removes targets present in the shm but absent from the config, reconciling it to the desired set. The add pass is kept as-is (re-adding un-marks any delayed_clear purge_time, preserving the checks-rebuild fix), so a straight switch to sync_checker_targets() -- which only adds missing targets -- is deliberately avoided. Regression test seeds a stale target into the shm and asserts create_checker() drops it. Reported-by: arunmat27
…ns it When only the checks config changes, a worker that did not serve traffic keeps its old-version checker in working_pool; timer_working_pool_check then destroys that stale local handle. Because the checker's shm target list is shared by name, a same-name replacement checker (built by whichever worker serves traffic) owns it. Unconditionally calling delayed_clear() on destroy marks that live checker's targets for purge on every worker. Only tear down the shm when no same-name checker will own it (resource deleted, or new config has no checks/nodes); otherwise drop the local handle with stop() and leave the shm to the replacement checker.
Add regression coverage for more real-world multi-worker scenarios: - recreating the same upstream id within the delayed-clear window keeps its targets (re-add un-marks the pending purge_time) - deleting an upstream cleans its shm target list - a new 2-worker test asserting the shared shm target set converges to the desired node set under node churn (incremental) and a checks-config change (rebuild), with no orphan and no purge of surviving targets
… the log Replace the blanket `--- ignore_error_log` with a targeted `--- no_error_log` that asserts the health-check timer and target-reconcile paths do not log an error. The timers are pcall-wrapped, so a Lua fault in the code under test was previously logged and swallowed while the test still passed on response_body alone. Targeting specific patterns keeps the missing-optional-dependency noise from failing the run while still catching a real crash. Also note in the multi-worker test that it is a convergence/non-regression check, not a deterministic reproduction of the cross-worker asymmetry.
Follow-up to the error-log hardening: convert the one remaining blanket `--- ignore_error_log` (the checks-config rebuild test) to the same targeted `--- no_error_log`, and add `failed to create healthcheck` to the pattern set so a healthcheck.new() failure on the create_checker path is also caught.
…rkers Add a 2-worker health-status test: a dead node (1970) alongside a healthy one (1980) with active checks. Assert the shared shm reports 1970 unhealthy and, once every worker's status cache has converged, retries=0 traffic never lands on 1970 -- proving both workers filter it. This covers health STATUS propagation, not just target membership.
…leeps The strict zero-error assertion previously relied on two fixed sleeps to let every worker's per-worker status cache converge. Under loaded CI a worker that built its checker late might not have received the unhealthy event by the burst, flaking. Drive traffic until several consecutive zero-error bursts confirm both workers filter the dead node (bounded); a genuine per-worker miss never converges and fails the test deterministically.
Good catch, and thanks for the clear repro — confirmed. A worker that never held the checker reaches On the suggested
Instead |
Description
Fixes #13282: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the
checksconfig is identical.Root cause
fetch_checker()keys the working checker by a version derived from bothmodifiedIndexand the nodes version. A node-only change bumps that version, so the checker is queued for an asynchronous rebuild. During that window:api_ctx.up_checkerisnil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for ~1-2s.timer_working_pool_checkalso destroyed the checker on any version mismatch, racingtimer_create_checkerand widening the gap.Fix
compute_targets()/sync_checker_targets(): when thechecksconfig is unchanged (core.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place viaadd_target/remove_targetagainst the authoritative shm target list — keeping the checker and its health state alive instead of rebuilding.timer_working_pool_checkno longer destroys a checker for a node-only version change (it falls through to destroy only when the resource is gone or the node count drops to 0).checksconfig changed), the new checker is created and inserted into the working pool before the old one is released, eliminating theup_checker == nilgap.checksconfig so node-only vs checks changes can be told apart.Checker lifecycle after this change
fetch_checkerenqueues towaiting_pool→timer_create_checker(no existing checker)healthcheck.new()+add_targetper nodetimer_create_checkerreuse branchsync_checker_targets:add_target/remove_targeton the delta only, nodelayed_clearcheckschangedtimer_create_checkerrebuild branchdelayed_clear(old)→new()+add_target(new)→ publish →stop(old)timer_working_pool_check(fetch_latest_conf=nil)delayed_clear(old)+stop+working_pool[path]=nil; periodic cleanup removes the targetsOn the
checks-changed rebuild,delayed_clear(old)runs beforecreate_checker(new)on purpose: the two checkers share the same shm target list, andadd_target()only clears a target'spurge_timewhen it is re-added after being marked. Clearing first lets surviving targets get un-marked on re-add, while genuinely dropped targets keep theirpurge_timeand are cleaned up — so achecks-only update cannot make the live checker's targets disappear afterDELAYED_CLEAR_TIMEOUT.This is apisix-side only — it uses
add_target/remove_target, which already exist in the currentlua-resty-healthcheck-api7. No dependency bump is required.Tests
t/node/healthcheck-incremental-update.t:create new checkerappears only once (initial) andclear checker(delayed_clear) does not.checks-config change must still rebuild — assertsclear checkerappears.Verified locally (OpenResty 1.29.2.4): TEST 1 fails on master (a node-only change logs
clear checker, i.e. destroy-and-rebuild) and passes with this change. Existinghealthcheck-leak-bugfix.tand the other adapted health-check tests pass.Scope
This PR is independent of the library. The related #13385 / #13141 / #13235 are library-side bugs (stale-target cleanup and periodic-lock release) fixed in api7/lua-resty-healthcheck#55 and delivered to apisix by the dependency bump in #13627 — they are not part of this PR. (Supersedes the #13282 portion of the now-closed #13582.)
Fixes #13282.
Checklist