Skip to content

fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629

Open
AlinsRan wants to merge 15 commits into
apache:masterfrom
AlinsRan:fix/healthcheck-13282-nil-window
Open

fix(healthcheck): reuse the checker on node-only upstream changes instead of destroy-and-rebuild#13629
AlinsRan wants to merge 15 commits into
apache:masterfrom
AlinsRan:fix/healthcheck-13282-nil-window

Conversation

@AlinsRan

@AlinsRan AlinsRan commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #13282: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the checks config is identical.

Root cause

fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version. A node-only change bumps that version, so the checker is queued for an asynchronous rebuild. During that window:

  1. api_ctx.up_checker is nil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for ~1-2s.
  2. The rebuild discards the checker's accumulated health state and re-probes every node from scratch.

timer_working_pool_check also destroyed the checker on any version mismatch, racing timer_create_checker and widening the gap.

Fix

  • compute_targets() / sync_checker_targets(): when the checks config is unchanged (core.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place via add_target/remove_target against the authoritative shm target list — keeping the checker and its health state alive instead of rebuilding.
  • timer_working_pool_check no longer destroys a checker for a node-only version change (it falls through to destroy only when the resource is gone or the node count drops to 0).
  • When a rebuild is genuinely required (the checks config changed), the new checker is created and inserted into the working pool before the old one is released, eliminating the up_checker == nil gap.
  • The working-pool entry now stores the checks config so node-only vs checks changes can be told apart.

Checker lifecycle after this change

Resource event Path Operation
Create request → fetch_checker enqueues to waiting_pooltimer_create_checker (no existing checker) healthcheck.new() + add_target per node
Update — nodes only (DNS / k8s scale) timer_create_checker reuse branch sync_checker_targets: add_target/remove_target on the delta only, no delayed_clear
Update — checks changed timer_create_checker rebuild branch delayed_clear(old)new() + add_target(new) → publish → stop(old)
Delete timer_working_pool_check (fetch_latest_conf = nil) delayed_clear(old) + stop + working_pool[path]=nil; periodic cleanup removes the targets

On the checks-changed rebuild, delayed_clear(old) runs before create_checker(new) on purpose: the two checkers share the same shm target list, and add_target() only clears a target's purge_time when it is re-added after being marked. Clearing first lets surviving targets get un-marked on re-add, while genuinely dropped targets keep their purge_time and are cleaned up — so a checks-only update cannot make the live checker's targets disappear after DELAYED_CLEAR_TIMEOUT.

This is apisix-side only — it uses add_target/remove_target, which already exist in the current lua-resty-healthcheck-api7. No dependency bump is required.

Tests

t/node/healthcheck-incremental-update.t:

  • TEST 1: a node-only change must not destroy/rebuild the checker — asserts create new checker appears only once (initial) and clear checker (delayed_clear) does not.
  • TEST 2: a checks-config change must still rebuild — asserts clear checker appears.

Verified locally (OpenResty 1.29.2.4): TEST 1 fails on master (a node-only change logs clear checker, i.e. destroy-and-rebuild) and passes with this change. Existing healthcheck-leak-bugfix.t and the other adapted health-check tests pass.

Scope

This PR is independent of the library. The related #13385 / #13141 / #13235 are library-side bugs (stale-target cleanup and periodic-lock release) fixed in api7/lua-resty-healthcheck#55 and delivered to apisix by the dependency bump in #13627 — they are not part of this PR. (Supersedes the #13282 portion of the now-closed #13582.)

Fixes #13282.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible

AlinsRan added 5 commits June 30, 2026 08:36
When an upstream's nodes change but its `checks` config is unchanged,
the health-check manager destroyed the existing checker and built a new
one. Two problems followed:

1. fetch_checker() keys the working checker by a version derived from
   both modifiedIndex and the nodes version, so a node-only change makes
   it return nil until the timer rebuilds the checker. During that
   window api_ctx.up_checker is nil and the balancer routes traffic to
   nodes already known to be unhealthy (apache#13282).

2. The rebuild throws away the checker's accumulated health state and
   re-probes every node from scratch.

The manager now reconciles the existing checker's targets in place with
add_target/remove_target when the `checks` config is unchanged
(compared with core.table.deep_eq), keeping the checker and its state
alive. timer_working_pool_check no longer destroys a checker for a
node-only version change, and when a rebuild is genuinely required (the
`checks` config changed) the new checker is created and inserted into
the working pool before the old one is released, so fetch_checker never
observes a nil gap.

Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0,
which contains the companion library fix (clean every checker each
window + release the periodic lock when idle) required by this change.

Adds t/node/healthcheck-incremental-update.t: a node-only change must
not destroy/rebuild the checker (no "clear checker"), while a
checks-config change still rebuilds it.
When the checks config changes, install the freshly created checker into
the working pool before stopping the previous one. This prevents a request
from briefly fetching a stopped checker for the old version during the
swap window.
…rop to 0

- compute_targets now returns an ordered array (preserving node order) so
  targets are added deterministically, keeping ordered error-log assertions
  stable.
- Only keep/reuse a checker incrementally when the upstream still has nodes;
  when the node count drops to 0 the checker is destroyed as before.
- Update existing healthcheck tests to assert the new incremental-reuse
  behaviour (a checker is reused instead of recreated when only the nodes
  change, and is not cleared in that case).
… config

The incremental-reuse reconcile runs in timer_create_checker, which only
acts on resources placed into the waiting pool by fetch_checker (i.e. when a
request is routed). Re-route a request after the second PUT so the manager
observes the new version and logs the incremental reuse.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jun 30, 2026
AlinsRan added a commit to AlinsRan/apisix that referenced this pull request Jun 30, 2026
… independent of the node-only incremental-update path (apache#13629)
…al sync failures

- resty.healthcheck keys targets by ip+port+hostname (the Host header is not
  part of the identity). Remove stale targets before adding new ones so a
  Host-header-only change (pass_host/upstream_host) re-applies the new header
  instead of add_target() being a no-op on the still-present old target and the
  subsequent remove dropping the live target.
- sync_checker_targets now returns whether every add/remove succeeded; the
  caller only commits the reused checker for the new version when the sync fully
  succeeds, otherwise it falls through to a full rebuild so the upstream still
  converges instead of being stuck on a stale target set.

@membphis membphis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rebuild order is unsafe when the checks config changes while the target identity stays the same. The new checker is created and published first, so its add_target() calls reuse the existing shared shm target list entries for the same ip:port:hostname. Then the old checker calls delayed_clear(), which marks that same shared target list for purge. After the cleanup window, the targets that the newly published checker still depends on can be removed.

This blocks merge because checks-only updates can make active health-check targets disappear after DELAYED_CLEAR_TIMEOUT, especially once the cleanup fix in #13627 makes stale-target purging reliable. Please reorder the swap so delayed_clear() happens before the new checker re-adds targets, or otherwise ensure the new checker clears any inherited purge_time before it is considered published. It would also be good to add a regression test that waits past the cleanup window after a checks-config rebuild and verifies /v1/healthcheck still contains the live targets.

… are not purged

On a checks-config change the manager rebuilds the checker. The new checker
shares the old one's shm target list (same healthchecker name). add_target()
only un-marks a target's purge_time when the target is re-added *after* being
marked, so doing create_checker() (re-add) before the old checker's
delayed_clear() left the surviving targets marked with a purge_time that nobody
cleared -- the cleanup then purged the live checker's own targets once the
delayed-clear window elapsed.

Run delayed_clear() before create_checker() so the re-add un-marks surviving
targets while genuinely dropped ones keep their purge_time. The old checker is
still only stopped after the new one is published, so there is no rebuild gap.

Add a regression test asserting the surviving targets remain in /v1/healthcheck
after the delayed-clear window following a checks-config change.
@membphis

membphis commented Jul 1, 2026

Copy link
Copy Markdown
Member

This still leaves the health-filter bypass window on node-only updates.

The PR keeps the old checker alive and reconciles targets in timer_create_checker, but fetch_checker() still returns a checker only when working_item.version == resource_ver. When a discovery/DNS change bumps _nodes_ver, upstream.lua asks for the new version, fetch_checker() enqueues that version and returns nil until the 1s timer runs. During that window api_ctx.up_checker is nil; the balancer treats a nil checker as all upstream nodes being available. So a previously unhealthy node can still receive traffic immediately after a node-only update, even though the checker state was not destroyed.

Trigger:

  1. Existing checker marks node A unhealthy.
  2. Service discovery/DNS adds or removes another node without changing checks.
  3. The first request after _nodes_ver changes calls fetch_checker() with the new version.
  4. fetch_checker() returns nil before timer_create_checker reconciles and publishes the old checker under the new version.
  5. The balancer builds the picker from all upstream nodes, including node A.

This blocks the fix for #13282 because the health state is preserved internally but is not used by requests during the version transition. Please let fetch_checker() return the existing live checker for the same resource when the latest config is reuse-eligible (checks unchanged and nodes non-empty), while still enqueueing the new version for target reconciliation. A regression test should make one node unhealthy, apply a node-only update, and assert that requests before the timer reconcile do not hit the unhealthy node.

…ransition

A node-only change (discovery/DNS bumps _nodes_ver) queues the new version and
timer_create_checker reconciles the existing checker in place. But fetch_checker
only returned a checker on an exact version match, so during the ~1s until the
timer ran it returned nil, api_ctx.up_checker was nil, and the balancer fell back
to all upstream nodes -- a node already known unhealthy could take traffic during
the transition (the request-path side of apache#13282).

Return the existing live (non-dead) checker for the resource while the new
version is queued for reconciliation, so requests keep using the preserved health
state throughout the transition. Regression test: mark a node unhealthy, apply a
node-only update, and assert a burst of requests in the transition window never
hit the unhealthy node.
@AlinsRan

AlinsRan commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Good catch — confirmed. fetch_checker() gated on an exact version match, so on a node-only _nodes_ver bump it returned nil (only enqueuing the new version) until the 1s timer ran; upstream.lua then set api_ctx.up_checker = nil and balancer.lua built the picker from all nodes. The health state was preserved internally but not used during the window — exactly as you described.

Fixed in 3688210: fetch_checker() now returns the existing live (non-dead) checker for the resource while it enqueues the new version for reconciliation, so requests keep filtering unhealthy nodes throughout the transition.

One deviation from your suggestion worth flagging: I return the live checker whenever one exists (and isn't dead), rather than gating on reuse-eligibility (checks unchanged + nodes non-empty). That guard would need a fetch_latest_conf() + deep_eq of the checks on the per-request hot path. Returning the live checker is safe even in the rarer checks-change window — its health status is real and current for the pre-change checks, which is strictly better than nil/no-filtering — and once the timer publishes the rebuilt checker the exact-version branch takes over. Happy to add the eligibility guard if you'd prefer it despite the hot-path cost.

Regression test (healthcheck-incremental-update.t TEST 4): mark a node unhealthy via active checks, apply a node-only update, then burst requests in the transition window and assert none hit the unhealthy node. Verified locally — it fails before the fix (errors: 3, the dead node is picked during the window) and passes after (errors: 0).

@arunmat27

Copy link
Copy Markdown

I hit an issue while testing these changes.

Setup: APISIX container (with 2 NGINX worker processes) and 2 upstream nodes.

Steps:

  1. Start APISIX and make a request. The worker that handles the request creates a new checker and adds both upstream nodes to shared memory.
  2. Remove one of the upstream nodes.
  3. Send another request and if that is handled by the other NGINX worker (i.e., not the worker from step 1) - it won't have the checker in its working_pool, it creates a new one. However, during the create_checker() flow, it only adds targets and never removes. As a result, the removed upstream node remains in shared memory and continues to appear in the /v1/healthcheck response.

Potential fix: Update create_checker() to call sync_checker_targets() instead of checker:add_targets().

…ulti-worker

With multiple workers, a peer worker can create the checker and populate the
shared shm target list, after which a node is removed from the upstream. A worker
that never had the checker in its own working_pool then reaches create_checker(),
which only added targets -- so the removed node lingered in the shm and kept
being probed and returned by /v1/healthcheck.

create_checker() now also removes targets present in the shm but absent from the
config, reconciling it to the desired set. The add pass is kept as-is (re-adding
un-marks any delayed_clear purge_time, preserving the checks-rebuild fix), so a
straight switch to sync_checker_targets() -- which only adds missing targets --
is deliberately avoided. Regression test seeds a stale target into the shm and
asserts create_checker() drops it.

Reported-by: arunmat27
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 1, 2026
AlinsRan added 5 commits July 2, 2026 09:11
…ns it

When only the checks config changes, a worker that did not serve traffic
keeps its old-version checker in working_pool; timer_working_pool_check
then destroys that stale local handle. Because the checker's shm target
list is shared by name, a same-name replacement checker (built by whichever
worker serves traffic) owns it. Unconditionally calling delayed_clear() on
destroy marks that live checker's targets for purge on every worker.

Only tear down the shm when no same-name checker will own it (resource
deleted, or new config has no checks/nodes); otherwise drop the local
handle with stop() and leave the shm to the replacement checker.
Add regression coverage for more real-world multi-worker scenarios:
- recreating the same upstream id within the delayed-clear window keeps
  its targets (re-add un-marks the pending purge_time)
- deleting an upstream cleans its shm target list
- a new 2-worker test asserting the shared shm target set converges to the
  desired node set under node churn (incremental) and a checks-config change
  (rebuild), with no orphan and no purge of surviving targets
… the log

Replace the blanket `--- ignore_error_log` with a targeted `--- no_error_log`
that asserts the health-check timer and target-reconcile paths do not log an
error. The timers are pcall-wrapped, so a Lua fault in the code under test was
previously logged and swallowed while the test still passed on response_body
alone. Targeting specific patterns keeps the missing-optional-dependency noise
from failing the run while still catching a real crash.

Also note in the multi-worker test that it is a convergence/non-regression
check, not a deterministic reproduction of the cross-worker asymmetry.
Follow-up to the error-log hardening: convert the one remaining blanket
`--- ignore_error_log` (the checks-config rebuild test) to the same targeted
`--- no_error_log`, and add `failed to create healthcheck` to the pattern set
so a healthcheck.new() failure on the create_checker path is also caught.
…rkers

Add a 2-worker health-status test: a dead node (1970) alongside a healthy one
(1980) with active checks. Assert the shared shm reports 1970 unhealthy and,
once every worker's status cache has converged, retries=0 traffic never lands
on 1970 -- proving both workers filter it. This covers health STATUS
propagation, not just target membership.
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jul 2, 2026
…leeps

The strict zero-error assertion previously relied on two fixed sleeps to let
every worker's per-worker status cache converge. Under loaded CI a worker that
built its checker late might not have received the unhealthy event by the
burst, flaking. Drive traffic until several consecutive zero-error bursts
confirm both workers filter the dead node (bounded); a genuine per-worker miss
never converges and fails the test deterministically.
@AlinsRan

AlinsRan commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

I hit an issue while testing these changes.

Good catch, and thanks for the clear repro — confirmed. A worker that never held the checker reaches create_checker(), which only added targets, so a node removed after another worker populated the shared shm lingered there and kept showing up in /v1/healthcheck.

On the suggested sync_checker_targets() swap — I went a slightly different route, worth flagging why:

  • sync_checker_targets() only adds missing targets (if not current[key]); it never re-adds an already-present one.
  • But re-adding is exactly what clears a purge_time that a concurrent delayed_clear() sets during a checks-config rebuild.
  • So swapping to it would bring back the "surviving targets get purged after a rebuild" bug.

Instead create_checker() keeps its add-all pass (which un-marks purge_time) and adds a remove-stale pass — same net effect as your fix, without regressing the rebuild path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Health check state lost and checker not working after upstream node changes

3 participants