fix: competing owner lease takeover#178

Open
savme wants to merge 4 commits into main from fix/tunnel-competing-claims
@savme savme commented Jun 9, 2026

Related to #174

Problem

When multiple connectors share the same iroh public key, the iroh-dns controller uses a first-writer-wins claim on the downstream DNSRecordSet. The claim was tracked only by a UID label, so there was no way to tell whether the claimant was still alive. A connector in a foreign project with a dead or deleted agent could hold the claim indefinitely, blocking an active connector from publishing its DNS record and causing 5xx responses at the edge.

Two production issues confirmed this: one where the cited owner had been deleted and the active connector never recovered, and one where a cross-account connector with an expired lease displaced an actively-serving session.

Fix

Ownership arbitration needs a liveness signal that any reconciler can check regardless of which project cluster the claimant lives in. The downstream DNS cluster is the one place all reconcilers can reach by design, so liveness state now lives there: the claim holder writes a Lease next to its DNSRecordSet and renews it on every heartbeat. A competitor checks that Lease before deferring. If the Lease is absent or expired, the competitor takes over.

@savme savme requested review from a team and privateip and removed request for a team June 10, 2026 21:32
@drewr drewr requested a review from scotwells June 11, 2026 18:37
drewr commented Jun 11, 2026

Contributor

Thanks @savme! For context to reviewers, this is (hopefully) going to address the remaining stability issue with tunnels. Check the linked issue for more detail, but essentially a tunnel can be killed by a zombie connector in a totally different user account.

@scotwells

Contributor

I wonder if we have a gap in the design? Seems odd that the same public key can be used across multiple connectors?

3 participants