fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029
fix(healthcheck): use immediate clear() to ensure k8s endpoint change…#13029tyq010101 wants to merge 1 commit into
Conversation
03b48d9 to
3af6251
Compare
|
#12803 |
|
The current method of clearing all data causes a temporary loss of health status with each upstream configuration update. The correct approach is to precisely remove nodes that no longer exist in the new upstream configuration. This clears the old stale IPs while preserving the historical health status of unchanged nodes. |
|
3af6251 to
3ee8d5d
Compare
…s take effect Replace delayed_clear with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates. Old K8s Endpoint ip has destroyed, but healthCheck manager always check old ip address
3ee8d5d to
7476c07
Compare
|
Thanks for the explanation. I agree that delayed_clear has a problem in this scenario -- you're right that stale IPs can persist. But switching to clear() introduces a different issue: it wipes the health status of all nodes (including the ones that haven't changed), causing a temporary health status reset on every upstream update.
The correct approach is a diff-based removal: compare the old target list with the new node list, and only remove the nodes that no longer exist (C, D), while preserving A and B's health status. This removes only the stale IPs while preserving health history for unchanged nodes. |
moonming
left a comment
There was a problem hiding this comment.
Hi @tyq010101, thank you for looking into the Kubernetes healthcheck issue!
The change: Replacing delayed_clear() with immediate clear() to ensure k8s endpoint changes take effect immediately.
My concern: The original code likely uses delayed_clear() intentionally to avoid race conditions during configuration updates. Switching to immediate clear() could introduce issues:
- Multiple workers might try to clear simultaneously
- In-flight health checks could be disrupted
- The clear might happen while new endpoints are still being processed
Could you help clarify:
- What specific scenario led you to discover this issue?
- Have you observed the race condition I mentioned in testing?
- Why was
delayed_clear()used originally — was there a comment or commit message explaining the choice?
If you can provide more context on the failure scenario, we can work together on a solution that's both immediate and safe. Thank you!
|
Hi @tyq010101, following up on the previous review comments. Please let us know if you have any updates. Thank you. |
|
I traced the history of delayed_clear usage:
Therefore, Therefore, I will close this PR for now. If you still have this problem, you can submit an issue first. Thank you for your contribution. |
|
Reopening to capture a better fix for the underlying problem. After tracing the code end-to-end, I don't think the choice is between 1. What actually happens todayThe checker is keyed by So every node-set change tears the whole checker down and rebuilds it, laundering health status through the shm: a surviving node gets re-added, sees 2. Why
|
| Approach | Removed IP | Surviving nodes' health status | #10312 race |
|---|---|---|---|
current delayed_clear |
lingers ~10s (+ IP-reuse bug) | preserved | none |
immediate clear() (this PR) |
immediate | all wiped | reintroduced |
| incremental reconciliation | immediate | preserved | none (survivors untouched) |
Node-set changes vastly outnumber check-parameter changes in k8s, so the common path is the zero-race, zero-status-loss fast path; the rebuild path (with delayed_clear) only runs when the check config itself changes, where a fresh checker is actually warranted.
6. Note on the preStop suggestion
A preStop hook solves graceful connection draining (letting in-flight connections finish before the pod exits) — a different, orthogonal problem. It does not fix the control-plane defect here: the healthchecker holding a stale target (or inheriting status via IP reuse) after the endpoint is already gone from the upstream node set. Good practice, but not the fix for this issue.
@tyq010101 @Baoyuantop @moonming — I'd suggest we repurpose this PR toward the incremental-reconciliation approach (with a checks_unchanged gate + rebuild fallback), which fixes the reported k8s staleness at the root without reintroducing #10312 and without the health-status reset that plain clear() causes. Happy to help shape the checks_unchanged comparison and add e2e coverage for both the node-diff path and the check-config-change fallback.
|
Thanks for tackling this — #12803 is real. Flagging an overlap: #13629 reworks the same area and I think covers this case with fewer trade-offs. Same goal — remove stale IPs immediately after a scale-down. This PR does it by swapping
Why the incremental shape is preferable
The cleanup bug itself is fixed separately in api7/lua-resty-healthcheck#55 (shipped via the bump in #13627), so the Both PRs modify the same lines in |
…s take effect
Replace delayed_clear() with immediate clear() in healthcheck_manager to fix an issue where k8s endpoint changes would not take effect immediately. The delayed clear could cause healthcheck to continue using stale IP addresses after endpoint updates.
Description
Which issue(s) this PR fixes:
Fixes #
Checklist