fix: competing owner lease takeover #178
Open
savme wants to merge 4 commits into
Contributor
Thanks @savme! For context to reviewers, this is (hopefully) going to address the remaining stability issue with tunnels. Check the linked issue for more detail, but essentially a tunnel can be killed by a zombie connector in a totally different user account.
Contributor
I wonder if we have a gap in the design? Seems odd that the same public key can be used across multiple connectors?
Related to #174
Problem
When multiple connectors share the same iroh public key, the iroh-dns controller uses a first-writer-wins claim on the downstream `DNSRecordSet`. The claim was tracked by UID label only - there was no way to tell whether the claimant was still alive. A connector in a foreign project with a dead or deleted agent could hold the claim indefinitely, blocking an active connector from publishing its DNS record and causing 5xx at the edge.
Two production issues confirmed this: one where the cited owner had been deleted and the active connector never recovered, and one where a cross-account connector with an expired lease displaced an actively-serving session.
Fix
Ownership arbitration needs a liveness signal that any reconciler can check regardless of which project cluster the claimant lives in. The downstream DNS cluster is the one place all reconcilers can reach by design, so liveness state now lives there: the claim holder writes a `Lease` next to its `DNSRecordSet` and renews it on every heartbeat. A competitor checks that `Lease` before deferring - if it's absent or expired, the competitor takes over.