Skip to content

fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029

Open
tyq010101 wants to merge 1 commit into
apache:masterfrom
tyq010101:fix/k8s_ipaddress_no_update
Open

fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029
tyq010101 wants to merge 1 commit into
apache:masterfrom
tyq010101:fix/k8s_ipaddress_no_update

Conversation

@tyq010101

@tyq010101 tyq010101 commented Feb 25, 2026

Copy link
Copy Markdown

…s take effect

Replace delayed_clear() with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates.

Description

Which issue(s) this PR fixes:

Fixes #

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. bug Something isn't working labels Feb 25, 2026
@tyq010101 tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 03b48d9 to 3af6251 Compare February 25, 2026 03:24
@tyq010101

tyq010101 commented Feb 25, 2026

Copy link
Copy Markdown
Author

#12803
I think this pr can fix this issue @Baoyuantop

@Baoyuantop

Copy link
Copy Markdown
Contributor

The current method of clearing all data causes a temporary loss of health status with each upstream configuration update. The correct approach is to precisely remove nodes that no longer exist in the new upstream configuration. This clears the old stale IPs while preserving the historical health status of unchanged nodes.

@tyq010101

Copy link
Copy Markdown
Author

@Baoyuantop

You are correct,but clear_delayed also causes a temporary update. The current problem is:

When clear_delayed performs cleanup, the K8s endpoint is no longer in the cache, so even if clear_delayed is executed, the old endpoint IP will not be deleted. The current approach is an optimization based on the previous implementation and does not make the result worse.

@tyq010101 tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 3af6251 to 3ee8d5d Compare March 2, 2026 07:45
…s take effect

Replace delayed_clear with immediate clear() in healthcheck_manager to fix
an issue where k8s endpoint changes would not take effect immediately. The
delayed clear could cause healthcheck to continue using stale IP addresses
after endpoint updates.
Old K8s Endpoint ip has destroyed, but healthCheck manager always check old ip address
@tyq010101 tyq010101 force-pushed the fix/k8s_ipaddress_no_update branch from 3ee8d5d to 7476c07 Compare March 2, 2026 10:07
@Baoyuantop

Copy link
Copy Markdown
Contributor

Thanks for the explanation. I agree that delayed_clear has a problem in this scenario -- you're right that stale IPs can persist. But switching to clear() introduces a different issue: it wipes the health status of all nodes (including the ones that haven't changed), causing a temporary health status reset on every upstream update.
Here's what happens with clear():

  1. Upstream has nodes A, B, C, D (all with accumulated health status)
  2. K8s scales down to A, B
  3. clear() removes all target data from shm (A, B, C, D)
  4. New checker re-adds A, B as healthy by default
  5. If A was actually unhealthy, traffic briefly flows to a broken node until the next health check cycle detects the failure again

The correct approach is a diff-based removal: compare the old target list with the new node list, and only remove the nodes that no longer exist (C, D), while preserving A and B's health status.
Something like this in timer_create_checker:

local existing_checker = working_pool[resource_path]
if existing_checker then
    -- diff-based removal: only remove nodes that no longer exist
    local new_nodes_set = {}
    for _, node in ipairs(upstream.nodes) do
        new_nodes_set[node.host .. ":" .. (node.port or "")] = true
    end
    for _, target in ipairs(existing_checker.checker.targets) do
        local key = target.ip .. ":" .. (target.port or "")
        if not new_nodes_set[key] then
            existing_checker.checker:remove_target(target.ip, target.port, target.hostname)
        end
    end
    existing_checker.checker:stop()
end

This removes only the stale IPs while preserving health history for unchanged nodes.

@Baoyuantop Baoyuantop added the wait for update wait for the author's response in this issue/PR label Mar 12, 2026

@moonming moonming left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tyq010101, thank you for looking into the Kubernetes healthcheck issue!

The change: Replacing delayed_clear() with immediate clear() to ensure k8s endpoint changes take effect immediately.

My concern: The original code likely uses delayed_clear() intentionally to avoid race conditions during configuration updates. Switching to immediate clear() could introduce issues:

  • Multiple workers might try to clear simultaneously
  • In-flight health checks could be disrupted
  • The clear might happen while new endpoints are still being processed

Could you help clarify:

  1. What specific scenario led you to discover this issue?
  2. Have you observed the race condition I mentioned in testing?
  3. Why was delayed_clear() used originally — was there a comment or commit message explaining the choice?

If you can provide more context on the failure scenario, we can work together on a solution that's both immediate and safe. Thank you!

@Baoyuantop

Copy link
Copy Markdown
Contributor

Hi @tyq010101, following up on the previous review comments. Please let us know if you have any updates. Thank you.

@Baoyuantop

Copy link
Copy Markdown
Contributor

I traced the history of delayed_clear usage:

  1. PR fix: keep healthcheck target state when upstream changes #10312 (2023-10-13) originally changed clear() → delayed_clear(3) to fix a race condition: when upstream config changes, clear() immediately removes all targets from shared dict, causing concurrent requests to fail when looking
    up healthy nodes.

  2. PR refactor: add healthcheck manager to decouple upstream #12426 (2025-08-07) increased the timeout to 10s when refactoring healthcheck to async timer-driven architecture, since checker creation now takes longer.

Therefore, delayed_clear cannot be changed to clear(), otherwise the problem in #10312 will continue to occur. The correct practice on the Kubernetes side is to configure a preStop hook — ensuring that the pod does not exit before the healthcheck response, which is Kubernetes best practice.

Therefore, I will close this PR for now. If you still have this problem, you can submit an issue first. Thank you for your contribution.

@Baoyuantop Baoyuantop closed this Apr 14, 2026
@moonming moonming reopened this Jul 1, 2026
@moonming

moonming commented Jul 1, 2026

Copy link
Copy Markdown
Member

Reopening to capture a better fix for the underlying problem. After tracing the code end-to-end, I don't think the choice is between clear() and delayed_clear() at all — both are wrong for different reasons. The real fix is to stop tearing the checker down on every node change and reconcile targets incrementally instead.

1. What actually happens today

The checker is keyed by resource_path (upstream#<resource_key>), which is stable across config versions, so the old and new checker write to the same shm (TARGET_LIST plus the per-target state keys). On a k8s endpoint change (_nodes_ver bumps → version bumps), timer_create_checker does:

delayed_clear(10s)  -- stamp every old target with purge_time = now + 10s
stop()              -- stop the old checker
create_checker()    -- build a brand-new checker, add_target() for the new node set

So every node-set change tears the whole checker down and rebuilds it, laundering health status through the shm: a surviving node gets re-added, sees purge_time ~= nil, and inherits its old status; a removed node is left with a purge_time and is purged ~10s later. This teardown/rebuild + delayed purge is the root of the stale window and its side effects.

2. Why delayed_clear exists — and why that reason is now weak

delayed_clear came from #10312: an immediate clear() empties the target list, and concurrent in-flight requests that look up node health hit target not found → the node was treated as unavailable → the request failed. The 10s delay kept old targets alive until the new checker re-registered them.

But that justification is largely gone today. In fetch_node_status, target not found is now explicitly treated as unknown → usable (returns true):

https://github.com/apache/apisix/blob/master/apisix/healthcheck_manager.lua#L116-L127

So the #10312 failure mode is already absorbed at the APISIX layer. The only thing delayed_clear still buys us is preserving accumulated health status across the rebuild — not preventing request failures.

3. The harder bug nobody flagged: IP reuse → stale health status

A removed IP lingering in the shm for 10s is mostly harmless on its own — routing rebuilds its picker from the upstream nodes list, not from the healthcheck target list, so the departed IP is never selected and the residue self-cleans.

The genuinely harmful case is IP reuse. If k8s reassigns a just-terminated pod's ip:port to a new pod within the 10s window (dense clusters, small pod CIDRs, hostNetwork), the new checker's add_target(reused_ip, is_healthy = true) finds the still-present old entry, revives it (purge_time = nil), and inherits the previous pod's health status for the new pod:

https://github.com/api7/lua-resty-healthcheck/blob/master/lib/resty/healthcheck.lua#L503-L512

If the old pod was unhealthy, a brand-new healthy pod is starved of traffic; if it was healthy, traffic flows to a possibly-broken new pod until the next probe. Health status is indexed only by ip:port:hostname, and delayed_clear keeps that identity alive with stale state across a window where the identity can be reused. That is the real correctness defect delayed_clear introduces.

4. Proposed fix: incremental target reconciliation, reuse the checker

Don't rebuild — keep the existing checker and reconcile only the target diff. Gate it so that only node-set changes take the fast path; a change to the check parameters themselves (interval, http_path, type, …) still requires a fresh checker.

-- replace the delayed_clear + stop + create block in timer_create_checker
local existing = working_pool[resource_path]
if existing and checks_unchanged(existing, upstream) then
    -- node set changed only: reconcile in place, do NOT rebuild the checker
    local c = existing.checker
    local new_set = {}
    for _, n in ipairs(upstream.nodes) do
        new_set[n.host .. ":" .. (port or n.port)] = n
    end
    -- removed nodes: remove immediately, also clears their shm state keys
    for _, t in ipairs(c.targets) do
        if not new_set[t.ip .. ":" .. t.port] then
            c:remove_target(t.ip, t.port, t.hostname)   -- takes effect now, no 10s window
        else
            new_set[t.ip .. ":" .. t.port] = nil          -- unchanged: keep its health status
        end
    end
    -- added nodes: register immediately
    for _, n in pairs(new_set) do
        c:add_target(n.host, port or n.port, host, true, host_hdr)
    end
    existing.version = resource_ver                        -- bump version in place, keep the checker
else
    -- check parameters changed: a fresh checker is genuinely required
    if existing then
        existing.checker:delayed_clear(DELAYED_CLEAR_TIMEOUT)
        existing.checker:stop()
    end
    local checker = create_checker(upstream)
    -- ... add_working_pool(resource_path, resource_ver, checker)
end

remove_target writes the updated TARGET_LIST first and clears the per-target state only after — the same write ordering the #10312 fix relies on — so surviving nodes (never touched) see no race, and departed nodes have their state cleared immediately, which closes the IP-reuse hole:

https://github.com/api7/lua-resty-healthcheck/blob/master/lib/resty/healthcheck.lua#L585-L623

5. Why this beats both the status quo and immediate clear()

Approach Removed IP Surviving nodes' health status #10312 race
current delayed_clear lingers ~10s (+ IP-reuse bug) preserved none
immediate clear() (this PR) immediate all wiped reintroduced
incremental reconciliation immediate preserved none (survivors untouched)

Node-set changes vastly outnumber check-parameter changes in k8s, so the common path is the zero-race, zero-status-loss fast path; the rebuild path (with delayed_clear) only runs when the check config itself changes, where a fresh checker is actually warranted.

6. Note on the preStop suggestion

A preStop hook solves graceful connection draining (letting in-flight connections finish before the pod exits) — a different, orthogonal problem. It does not fix the control-plane defect here: the healthchecker holding a stale target (or inheriting status via IP reuse) after the endpoint is already gone from the upstream node set. Good practice, but not the fix for this issue.


@tyq010101 @Baoyuantop @moonming — I'd suggest we repurpose this PR toward the incremental-reconciliation approach (with a checks_unchanged gate + rebuild fallback), which fixes the reported k8s staleness at the root without reintroducing #10312 and without the health-status reset that plain clear() causes. Happy to help shape the checks_unchanged comparison and add e2e coverage for both the node-diff path and the check-config-change fallback.

@AlinsRan

AlinsRan commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Thanks for tackling this — #12803 is real. Flagging an overlap: #13629 reworks the same area and I think covers this case with fewer trade-offs.

Same goal — remove stale IPs immediately after a scale-down. This PR does it by swapping delayed_clear()clear(). #13629 does it by reconciling the checker's targets in place (remove_target() for dropped nodes — synchronous, so the IP disappears immediately) instead of destroy-and-rebuild.

#13029 (clear() + rebuild) #13629 (incremental reconcile)
Stale IP removed ✅ immediate ✅ immediate (remove_target)
Rebuild window (#13282) ❌ kept, and worse (list empty → unhealthy nodes look usable) ✅ none (no rebuild on node change)
Surviving nodes' health state ❌ wiped on every change ✅ preserved
Blast radius whole target list only the delta
Churn under frequent k8s/DNS updates ❌ clear + rebuild each time ✅ stable

Why the incremental shape is preferable

  • The operation matches the change. A scale-down drops one endpoint. clear() + rebuild throws away all N targets and re-adds them; incremental issues exactly one remove_target(). The work — and the risk — is proportional to what actually changed, not to the size of the upstream.
  • You can't hit a rebuild gap if you don't rebuild. The up_checker == nil window (bug: Health check state lost and checker not working after upstream node changes #13282) and the loss of accumulated health state are consequences of destroy-and-rebuild. Immediate clear() doesn't remove that path — it makes the window stricter (the shared target list is empty while the new checker re-probes). Reconciling in place never tears the checker down, so both problems disappear by construction rather than being mitigated.
  • clear() treats the symptom. The stale IP lingered because the delayed cleanup was unreliable (multi-checker bug, bug: After a node is deleted from the upstream, its information can still be queried from the health check interface. #13385). clear() routes around that by nuking everything; incremental removes only the dead endpoint and leaves the healthy ones — same immediacy for the dropped IP, no collateral damage to the rest.

The cleanup bug itself is fixed separately in api7/lua-resty-healthcheck#55 (shipped via the bump in #13627), so the delayed_clear() path is reliable again too.

Both PRs modify the same lines in apisix/healthcheck_manager.lua, so they'll conflict. Since #13629 covers this without the rebuild window or churn, would you be open to closing this in favor of #13629?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:S This PR changes 10-29 lines, ignoring generated files. wait for update wait for the author's response in this issue/PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants