Add CLI tunnel and auth commands by drewr · Pull Request #130 · datum-cloud/app

drewr · 2026-03-27T20:12:57Z

Summary

This PR ships the CLI client for Datum Connect tunneling — the headless equivalent of the desktop UI. It lets users authenticate, manage projects, and expose local services to public hostnames without launching the GUI.

Building

Rust tooling only (no Nix required):

cargo run -p datum-connect -- --help

Or with Nix:

nix run .#cli -- --help

Commands

auth

datum-connect auth login       # OAuth via browser; prompts to select a project after login
datum-connect auth logout
datum-connect auth status      # Shows authenticated user and active org/project
datum-connect auth list
datum-connect auth switch      # Logs out and re-authenticates; prompts to select a project

projects

datum-connect projects list    # Lists all orgs and projects; marks the active one with *
datum-connect projects switch  # Interactive prompt to change the active project

tunnel

datum-connect tunnel listen --endpoint 127.0.0.1:8080
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --label my-tunnel
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --project <project-id>
datum-connect tunnel list
datum-connect tunnel update --id <id> --label new-name
datum-connect tunnel delete --id <id>

tunnel listen runs in the foreground. It creates or reuses a tunnel for the given endpoint, starts the heartbeat agent so the gateway has routing info, enables the tunnel, and polls until it is accepted and programmed before printing the public hostname. Ctrl+C disables the tunnel and exits.

The --project flag overrides the active project for a single invocation without changing the stored selection.

Project selection

The active project is stored in config.yml (default: ~/.local/share/Datum/config.yml, overridable via $DATUM_CONNECT_REPO). It is set interactively after auth login or auth switch, or explicitly with projects switch.
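A minimal sketch of the path resolution described above, assuming $DATUM_CONNECT_REPO names the directory that contains config.yml (the description does not say whether it points at the directory or the file). The function name and both parameters are hypothetical; the environment value and home directory are passed in so the logic can be exercised without touching the real environment.

```rust
use std::path::PathBuf;

/// Hypothetical helper, not taken from the PR.
/// `repo_override` stands in for the value of $DATUM_CONNECT_REPO (assumed
/// to be a directory); `home` stands in for the user's home directory.
pub fn config_path(repo_override: Option<&str>, home: &str) -> PathBuf {
    let repo = match repo_override {
        // Assumption: an empty override is treated the same as an unset one.
        Some(dir) if !dir.trim().is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".local/share/Datum"),
    };
    repo.join("config.yml")
}
```

With no override, the result falls back to the default location under the home directory; with an override, config.yml is looked up inside the given directory.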

Example session

$ cargo run -p datum-connect -- auth login
# browser opens for OAuth
Logged in as Jane Smith (jane@example.com)

Select a project:
  [1] Acme Corp / production
  [2] Acme Corp / staging
Enter number [1-2]: 2
Selected project: Acme Corp / staging

$ cargo run -p datum-connect -- tunnel listen --endpoint 127.0.0.1:3000
Created tunnel:
  id: httpp-abc123
  label: f3a9c2e1b047

Your endpoint ID: 30a9ddf5...
Setting up tunnel...
Tunnel ready after 8 sec: https://f3a9c2e1b047.tunnels.datum.net
Press Ctrl+C to stop...

Bug fixes (found during testing)

Tunnels created from CLI never route traffic: CLI was missing the HeartbeatAgent that continuously patches status.connectionDetails on the connector. Without it the gateway has no routing info. Fixed: tunnel listen now starts the heartbeat and registers the project before enabling the tunnel.
Re-running tunnel listen on an existing endpoint always prompted for update: Random label was generated before checking for an existing tunnel, so it always differed. Fixed: label generation moved into the create-new path; existing tunnels reuse their stored label unless --label is explicitly given.
Tunnel delete silently no-ops when connector is missing: delete_project returned early if no connector was found, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Fixed: connector lookup is only needed for post-deletion cleanup and no longer gates resource deletion.
Auto-generated label used tunnel-<u16> format: Collided visually with resource ID format. Switched to 12 hex chars of random entropy (e.g. a3f9c2e1b047).
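The label fixes above can be sketched roughly as follows. This is an illustration only: both function names are hypothetical, the six random bytes are passed in by the caller so the example needs no external crates, and the precedence rule (explicit --label, then stored label, then a fresh one) is my reading of the description rather than the PR's actual code.

```rust
/// Hex-encodes six bytes of entropy into a 12-character lowercase label.
/// Hypothetical helper; the caller supplies the random bytes.
pub fn label_from_entropy(bytes: [u8; 6]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Picks the label for `tunnel listen`.
/// `flag` is the value of --label, `stored` is the label of an existing
/// tunnel for the endpoint, and `fresh` is only called on the create-new path.
pub fn choose_label<F: FnOnce() -> String>(
    flag: Option<&str>,
    stored: Option<&str>,
    fresh: F,
) -> String {
    match (flag, stored) {
        (Some(explicit), _) => explicit.to_string(),
        (None, Some(existing)) => existing.to_string(),
        (None, None) => fresh(),
    }
}
```

Because the fresh label is generated lazily, re-running against an existing tunnel never produces a label that differs from the stored one unless --label is given.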

Test plan

cargo run -p datum-connect -- auth login completes OAuth and prompts for project selection
projects list shows all orgs/projects with active one marked
projects switch persists new selection to config.yml
tunnel listen --endpoint 127.0.0.1:<port> creates tunnel, prints hostname, disables on Ctrl+C
Re-running tunnel listen on the same endpoint reuses the existing tunnel without prompting
tunnel listen --project <id> uses the specified project
tunnel list shows tunnels in the active project
tunnel delete removes a tunnel cleanly

zachsmith1 · 2026-03-27T20:20:09Z

Do we want a separate cli for tunnels or do we want to bake in functionality into datumctl?

drewr · 2026-03-27T20:41:57Z

Yeah, it's why this is a draft. I needed the functionality and didn't want to commit one way or the other yet. I explored doing it in datumctl and it would involve either replicating the Iroh sidecar in go or making the project hybrid with a rust component.

This method uses all the same machinery as the GUI which felt like a better first pass.

- Add 'tunnel' subcommand to datum-connect CLI with:
  - 'tunnel list': read-only listing of tunnels (no side effects)
  - 'tunnel listen': create/update and run tunnel in foreground
  - 'tunnel update': update tunnel label/endpoint
  - 'tunnel delete': delete a tunnel
- Add 'nix run .#connect' app to flake.nix
- Split find_connector_readonly for list operations
- Remove side effects from tunnel list (no patching Connector)
- Listen command:
  - Generates random label if not provided
  - Confirms before updating existing tunnel
  - Handles Ctrl+C to disable tunnel on exit

- Add 'auth' subcommand to CLI with:
  - 'auth status': Show current authentication and selected context
  - 'auth login': Log in via browser OAuth with account picker
  - 'auth logout': Log out and clear credentials
  - 'auth list': Show current authenticated user
  - 'auth switch': Log out current user and prompt for new login

Also add is_authenticated(), login(), logout() methods to DatumCloudClient.

zachsmith1 · 2026-03-27T21:00:12Z

Ya the challenge is the core stuff we need is in rust so we'll need some magic to make the UX good

scotwells · 2026-03-27T21:03:03Z

How does this interact with the GUI based application? Would auth be shared?

Since the GUI is locked to a specific project (because connectors are project-scoped resources), switching the authenticated user could break existing tunnels without the user knowing and it doesn't seem like we warn the user.

drewr · 2026-03-27T21:32:19Z

It's all shared. I'll show what it looks like when Rust is done compiling...

delete_project returned early when find_connector returned None, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Connector lookup is only needed for post-deletion cleanup (deciding whether to delete the shared connector). Move it into an Option and gate the cleanup block on Some, so resource deletion always proceeds.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Three interrelated bugs fixed in the tunnel listen command:

- Random label was generated before checking for an existing tunnel, so re-running listen on the same endpoint always triggered the update prompt. Moved label generation into the create-new path only; existing tunnels reuse their stored label unless --label is explicitly provided and differs.
- Default label format changed from tunnel-<u16> (collides with resource ID format) to 12 hex chars of random entropy (e.g. a3f9c2e1b047). Adds hex as a dependency.
- tunnel listen was missing the HeartbeatAgent that continuously patches status.connectionDetails on the connector (relay URL, addresses, public key). Without it the gateway has no routing info and tunnels never carry traffic. Now starts the heartbeat and registers the project before enabling the tunnel, then polls until accepted+programmed before printing the hostname.

Also simplifies tunnel delete output: connector cleanup is an internal detail, so "Deleted tunnel <id>" replaces "(connector deleted: false)".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- After auth login/switch, prompt user to select an org and project and persist the selection as the active context
- Store the selected context in config.yml instead of a separate file
- Add --project flag to the tunnel command to override the active project for a single invocation
- Add projects list and projects switch commands for managing the active project outside of the auth flow
- Fix tunnel listen to print id and label after creation

drewr · 2026-03-31T20:57:06Z

Here's a short demo of where I've gotten with this:

headless-tunnels-demo.mp4

bmertens-datum · 2026-04-01T15:18:50Z

@drewr Nice demo.

zachsmith1 · 2026-04-01T16:21:59Z

@drewr this is slick. I'm planning on splitting off the app repo from the gateway repo and we should consider where we'd want this cli to live. last piece there would be a small enhancement around how we could inject this rust binary into datumctl (if we want to)

richardhenwood · 2026-04-03T18:20:40Z

I've had a moment to try this. Following your excellent demo video, I typed datum-connect -- tunnel listen --endpoint localhost:8080 and got a 'connector' appearing in the Datum cloud UI. This is a really powerful way to think about connections for me, so I'm very excited to play around :)

FYI: this is on my Fedora 43 workstation.

drewr · 2026-04-09T18:09:52Z

Great feedback @richardhenwood, thanks!

drewr · 2026-04-09T18:15:46Z

@zachsmith1 wrote:

where we'd want this cli to live

I think if we factor out the local process to a standalone rust utility like you're proposing it makes more sense for this to live in datumctl. I originally went that direction but didn't want to either rewrite the iroh integration in go or repackage this in an awkward way.

drewr · 2026-04-09T20:15:24Z

I've had some instability with this and had both gpt-5.4 and sonnet-4.6 chewing on it:

Found it. The UpstreamProxy authorizes incoming iroh connections by checking self.state.get().proxies, but the CLI tunnel listen flow never calls listen.set_proxy() to register the tunnel in local state. The gateway connects over iroh, the auth handler finds no matching proxy, returns Forbidden, and the gateway sees a connection reset.

Fix incoming.

gianarb · 2026-04-09T21:26:39Z

There is a lot to unpack here in my opinion, much of it around product, so I am not sure I have enough context to help.

Part of this is an old discussion we had in datum-cloud/enhancements#582; look for the ecosystem chapter:

Now my attention turns to "do we want to keep consistency in the ecosystem?". Do we want for example to get Datum Desktop to look at that file as well? So a switch context in the CLI will switch context in desktop?

In practice, what I was trying to highlight is the model that Kubernetes and other CLI tools follow: everything you run starts from a single source of truth (for Kubernetes it is the ~/.kube/config file). If we can agree on something similar, it will be a lot easier to bring other CLIs or applications into a consistent state.

It will also be a lot easier to push for a plugin ecosystem like the one kubectl and others developed, where binaries whose names start with kubectl- get called from the main CLI. In this case we could release a binary named datumctl-connect that is callable as datumctl connect.

But if we cannot agree on some common practices, like authentication, the outcome for a user will be pretty poor. In that case I feel like we should just "give up" and release different binaries, each working its own way.

I am not saying that we should already have the ability to switch between accounts/instances and persist the choice, because I know we do not know that yet (datum-cloud/enhancements#653 (comment)). But since we do not know, maybe we can take what we have today as the common denominator until we figure out what's next.

So the way I envision this PR evolving is a binary that contains only the business logic to manage tunnels and connections and delegates authentication to the same login used by datumctl (or datumctl changes to use the same login used here and by the desktop app).

This is what I am trying to push toward, but as I said, product-wise I am not sure I have enough context to push in one direction over another.

The gateway sends `CONNECT localhost:<port>` regardless of whether the tunnel was registered with `localhost` or `127.0.0.1`, causing auth to fail with Forbidden and the caller to see "upstream connect error or disconnect/reset before headers." Normalize `localhost`, `127.0.0.1`, and `::1` to a canonical form on both sides of the host comparison in `tcp_proxy_exists`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
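As a rough illustration of the comparison-time normalization this commit describes: the sketch below maps the three loopback spellings to one canonical value before comparing. The function names are hypothetical, and choosing "localhost" as the canonical form is an assumption; the commit only says "a canonical form".

```rust
/// Maps the loopback spellings named in the commit to one canonical form.
/// Assumption: "localhost" is the canonical value; any single choice works
/// as long as both sides of the comparison use it.
pub fn canonical_host(host: &str) -> String {
    let trimmed = host.trim();
    if matches!(trimmed, "localhost" | "127.0.0.1" | "::1") {
        "localhost".to_string()
    } else {
        trimmed.to_string()
    }
}

/// Compares a registered host:port against the CONNECT target after
/// normalizing both hosts.
pub fn targets_match(registered: (&str, u16), requested: (&str, u16)) -> bool {
    canonical_host(registered.0) == canonical_host(requested.0)
        && registered.1 == requested.1
}
```

With this in place, a tunnel registered as 127.0.0.1:3300 matches a CONNECT for localhost:3300, while non-loopback hosts still require an exact match.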

drewr · 2026-04-09T21:50:53Z

Gateway hostname normalization — needs investigation

While debugging an "upstream connect error or disconnect/reset before headers" report, I found the root cause in commit e2d868d: the gateway sends CONNECT localhost:<port> in the iroh HTTP CONNECT request regardless of what address was stored in the ConnectorAdvertisement (in this case 127.0.0.1). The strict string comparison in tcp_proxy_exists failed, returning Forbidden, which the gateway surfaces as a connection reset.

The client-side fix normalizes localhost, 127.0.0.1, and ::1 to a canonical form at comparison time, which handles the mismatch. But the gateway behavior is worth examining:

Observable behavior (from /tmp/datum-2026-04-09T19:45:27+00:00.log, line 8132):

handle request req=HttpRequest { version: HTTP/1.1, headers: {}, uri: localhost:3300, method: CONNECT }

The ConnectorAdvertisement had address: "127.0.0.1", but the gateway sent CONNECT localhost:3300.

Questions for the gateway team:

Is this intentional? Does the gateway always normalize 127.0.0.1 → localhost for loopback addresses?
Should the gateway instead use the ConnectorAdvertisement address verbatim in the CONNECT target?
If the gateway normalizes to localhost, should it normalize to 127.0.0.1 instead (since that's what users typically specify as the --endpoint)?

The client-side fix is defensive and correct either way, but if the gateway is doing unintended normalization, fixing it there would be cleaner and might surface other subtle issues.

When the Lease the heartbeat owns is removed server-side (TTL cleanup, namespace reap, manual delete), the renew loop kept patching the dead name forever — only logging a warn each tick — because the cache was preserved on every error. The tunnel went silently dark. Route both the fetch-lease and renew error arms through a single classifier that resets the cache on 404 so the next iteration re-resolves the connector and lease from scratch, while still force-refreshing on 401 and retaining the cache on transient errors.

The CLI persisted a single iroh secret key at the repo root and reused it for every project the user touched. The network-services-operator explicitly treats two Connectors with the same iroh public key as a collision: the iroh DNS controller picks one winner and marks the losers with IrohDNSPublished=False; Reason=DeferredToOwner. The losing project's tunnel reports Ready but silently drops data because the iroh DNS record points at the wrong Connector. Move listen_key under <repo>/projects/<project_id>/ so each project has a distinct iroh identity. On first per-project access, migrate any legacy flat listen_key into that project's directory — the first project the user runs against keeps continuity with its existing server-side Connector; subsequent projects get fresh keys and stop joining the race. Leave connect_key (no server-side Connector) and gateway_key (separate daemon identity) flat. The UI and Serve paths continue to use the flat listen_key for now; converting them needs its own pass. The Tunnel command now requires a selected project and fails with a clear message if none is set, since the per-project key path needs a project id at node construction time.

The setup loop only checked accepted && programmed && !hostnames.is_empty() and slept 2s at a time, so the user saw "Setting up tunnel..." then a 2s silence and then "Tunnel ready" — even when the Connector's IrohDNSPublished=False; DeferredToOwner condition meant the data plane was silently unreachable. Surface the six controller conditions that already exist on the HTTPProxy and Connector (Accepted, CertificatesReady, ConnectorReady, IrohDNSPublished, Programmed, ConnectorMetadataProgrammed) through a typed TunnelProgress, and stream each transition as a checklist line. Bail immediately when IrohDNSPublished comes back False with reason DeferredToOwner — that's the cross-project iroh-key collision case where waiting longer can't help, so we print the operator's message naming the owning Connector and exit non-zero. Also warn on stdout when any step stays pending past 30 seconds, since the controller's reason string is the most useful diagnostic when something genuinely stalls. Polling at 750ms is fine: get_active_progress does two reads (HTTPProxy + Connector) on an already-warm PCP client, and server-side reconcile latency dominates.

…reakage Setup-time progress checks aren't enough. Today's failure mode: a tunnel came up cleanly, ran for ~9 minutes, then the iroh DNS controller re-reconciled and flipped IrohDNSPublished from True back to False because a deleted Connector's DNS claim was never cleaned up server-side. The data plane went dark while the CLI kept reporting healthy — Ready was still True, the heartbeat was still renewing the lease, and there was no client-side signal that anything had changed. Poll progress every 10s alongside the existing login-state watch. When terminal_failure() trips (currently IrohDNSPublished=False with reason DeferredToOwner), print the same message the setup path emits and break out of the run loop cleanly so the operator gets disable+cleanup instead of a silent zombie. Tunnel-deleted-from-under-us also breaks out; transient poll errors only warn and retry. Factor the failure message into format_terminal_failure() so setup-time and runtime emit identical wording — the user shouldn't have to learn two error shapes for the same diagnosis.

drewr · 2026-06-07T18:32:50Z

Diagnosis notes from a debugging session

Over a single CLI session I hit four distinct failure modes that all presented the same way: "Tunnel ready" with the data plane silently dropping. Recording the chain here so the cause-effect is preserved alongside the fix commits.

1. Heartbeat wedge on deleted Lease

Symptom. Tunnel went silent with no visible errors; logs eventually showed a warn loop:

heartbeat: lease renew failed: ApiError: leases.coordination.k8s.io "datum-connect-jttwh" not found: NotFound

firing every 30s indefinitely.

Root cause. run_for_project cached the resolved lease_name once and the renew error arms unconditionally retained the cache. A server-side delete of the Lease (TTL cleanup, namespace reap, manual) put the loop in a state it couldn't recover from — the cache never cleared, so the loop never went back through the connector-probe / lease-resolve path at the top.

Fix. c901b01 classifies the kube error per arm: 404 → reset cache, 401 → force refresh + retain, anything else → retain. Three unit tests cover the decision.
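The three-way decision can be sketched as a small pure function. The enum and function names here are hypothetical, and the HTTP status is modeled as an Option<u16> (None for errors with no status, such as a network failure), which is an assumption about how the kube error is inspected.

```rust
/// What the renew loop should do with its cached lease name after an error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheAction {
    /// 404: the Lease is gone; drop the cache and re-resolve from scratch.
    Reset,
    /// 401: force a credential refresh but keep the cached lease name.
    RefreshAuthAndRetain,
    /// Anything else is treated as transient; keep the cache and retry.
    Retain,
}

/// Hypothetical classifier mirroring the rule stated in the fix.
pub fn classify_heartbeat_error(status: Option<u16>) -> CacheAction {
    match status {
        Some(404) => CacheAction::Reset,
        Some(401) => CacheAction::RefreshAuthAndRetain,
        _ => CacheAction::Retain,
    }
}
```

Routing both the fetch-lease and renew arms through one function like this keeps the two paths from drifting apart.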

2. Machine-wide iroh identity colliding across projects

Symptom. A freshly-created tunnel reported Tunnel ready cleanly, but curl returned Connection reset by peer at the TLS handshake. The Connector's status held:

Type: IrohDNSPublished     Status: False
Reason: DeferredToOwner
Message: iroh DNS record is owned by Connector /<other-project>/default/datum-connect-<other>

Root cause. Repo::listen_key() resolved to a single flat path (<repo>/listen_key) regardless of project. Running the CLI against multiple projects from the same machine registered the same iroh public key under multiple Connector objects; the iroh DNS controller (network-services-operator/internal/controller/iroh_dns_controller.go) explicitly treats that as a collision ("Multiple Connectors that share the same iroh keypair … collapse to one DNSRecordSet — the first to claim wins, and the loser surfaces a DeferredToOwner condition"). The losing project's Connector is Ready=True but unreachable.

Fix. 76987b7 scopes listen_key under <repo>/projects/<project_id>/ with a one-shot migration that moves the legacy flat key into the first project that requests it (so the user keeps continuity with one existing server-side Connector; the rest get fresh keys). connect_key and gateway_key stay flat — neither registers a per-project server-side identity.
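A minimal sketch of the per-project path and the one-shot migration, assuming the key is a plain file that can be moved with a rename. The function names are hypothetical; generating a fresh key when none exists is left to the caller and not shown.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// <repo>/projects/<project_id>/listen_key
pub fn project_listen_key_path(repo: &Path, project_id: &str) -> PathBuf {
    repo.join("projects").join(project_id).join("listen_key")
}

/// Returns the per-project key path. On first access, a legacy flat
/// <repo>/listen_key is moved into the requesting project's directory,
/// so only the first project inherits the old identity.
pub fn ensure_project_key_location(repo: &Path, project_id: &str) -> io::Result<PathBuf> {
    let target = project_listen_key_path(repo, project_id);
    let legacy = repo.join("listen_key");
    if !target.exists() && legacy.exists() {
        if let Some(dir) = target.parent() {
            fs::create_dir_all(dir)?;
        }
        fs::rename(&legacy, &target)?;
    }
    Ok(target)
}
```

After the first call the flat file no longer exists, so a second project finds nothing to migrate and would get a fresh key.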

The audit of network-services-operator confirmed per-project (per-Connector) identity is the intended design model, not just an acceptable workaround.

3. Setup readiness check too shallow

Symptom. Tunnels reported Tunnel ready after 2 sec even when the connector's IrohDNSPublished was DeferredToOwner — i.e., when they were guaranteed not to carry traffic.

Root cause. The setup loop only checked accepted && programmed && !hostnames.is_empty() on the HTTPProxy. The Connector's conditions weren't read at all, so collisions and stuck states never surfaced to the user.

Fix. 90a7ab3 exposes a typed TunnelProgress over six controller conditions (ProxyAccepted, CertificatesReady, ConnectorReady, IrohDnsPublished, ProxyProgrammed, ConnectorMetadataProgrammed) and streams each transition during setup. terminal_failure() matches specifically on IrohDNSPublished=False; DeferredToOwner and bails immediately rather than waiting forever.
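The terminal-failure match can be illustrated with a simplified condition type. The Condition struct below is a hypothetical stand-in for the Kubernetes condition objects, and the function signature is an assumption; only the matching rule (IrohDNSPublished=False with reason DeferredToOwner) comes from the description above.

```rust
/// Simplified stand-in for a Kubernetes status condition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Condition {
    pub kind: String,    // the condition's "type" field
    pub status: String,  // "True", "False", or "Unknown"
    pub reason: String,
    pub message: String,
}

/// Returns the condition that makes further waiting pointless, if any.
pub fn terminal_failure(conditions: &[Condition]) -> Option<&Condition> {
    conditions.iter().find(|c| {
        c.kind == "IrohDNSPublished" && c.status == "False" && c.reason == "DeferredToOwner"
    })
}
```

A caller would print the matched condition's message, which names the owning Connector, and exit non-zero instead of continuing to poll.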

4. Stale DNS owner re-emerging post-setup

Symptom. A tunnel that came up cleanly worked for ~9 minutes and then quietly went dark with no client-side signal. Ready=True, lease being heartbeated on schedule, no errors in CLI logs.

Root cause. A previously-deleted Connector's iroh DNS claim was never cleaned up server-side. When the iroh DNS controller re-reconciled, it flipped the live Connector's IrohDNSPublished from True back to False with DeferredToOwner, naming the dead Connector's UID as the owner. The DNS claim outliving its owning Connector looks like either a missed owner-reference cascade on Connector delete or a controller-side cache that isn't invalidated when the Connector goes away.

Fix (client-side, this PR). 6264818 adds a runtime watch alongside the login-state watch — poll get_active_progress every 10s after setup completes, surface terminal failures with the same format_terminal_failure message the setup path uses, and break out of the run loop so the disable/cleanup path runs. Tunnel-deleted-server-side also breaks out; transient query errors just warn and retry.

Server-side follow-up (out of scope for this PR). The iroh DNS controller leaking a claim past its owning Connector's lifetime is a real bug. Worth filing against the operator team — concrete instance from today's session was datum-connect-jttwh (UID 226a90b6-3cad-4242-9eff-c2c71a335545) showing up as Owner of a DNS record after the Connector itself returned 404.

Recovery playbook this PR enables

If a user's tunnel goes terminal mid-session, the CLI now:

Prints the operator's message naming the conflicting Connector,
Exits the run loop cleanly,
Disables the tunnel server-side on shutdown.

Operator action: delete the per-project listen_key (<repo>/projects/<project_id>/listen_key), then rerun. The new identity sidesteps the stale claim entirely.

…jects HeartbeatAgent::start() auto-enrolls every project the user has access to, which is correct for the UI (it surfaces tunnels across projects) but wrong for the CLI tunnel-listen command, which owns exactly one project. Today's logs showed the fan-out clearly:

heartbeat: no connector yet project_id=drewr-y4nd1b
heartbeat: no connector yet project_id=drewr3-ceu4gt

The CLI silently maintained presence in drewr3-ceu4gt (a project the user never mentioned) for the lifetime of the tunnel. Harmless today, but it makes logs misleading, multiplies API load, and would create real risk if a misconfigured token granted access to a project that shouldn't be touched.

Add start_manual() that skips the watcher entirely. The CLI now starts manual mode and explicitly registers its single project. Per-project loops still handle 401s via force-refresh, so transient auth blips are tolerated; the CLI's own login-state watch surfaces permanent logout.

Keeps start() unchanged so the UI continues to auto-enroll. The new entry point is documented with a pointer to start_manual for callers in the CLI-style pattern.

…stname The endpoint-match adoption path is anchored on the connector record matching the agent's current iroh endpoint id, then filtered to HTTPProxies whose backend references that connector by NAME. Any time the connector record gets renamed server-side (delete + ensure_connector recreates with a new generateName suffix) or the iroh identity rotates, all the previous HTTPProxies become invisible to list_project and adoption silently misses, spawning a fresh tunnel with a fresh hostname. Add --id <tunnel-id> to tunnel listen for the "I have already shared this URL with others" case. The path looks up the HTTPProxy by name (direct API call, no connector filter), calls update_active to re-point its backend at the current connector, and re-enables its ConnectorAdvertisement. The hostname lives on the HTTPProxy resource and is preserved across connector identity changes. Also refactor TunnelService::get_active to do a direct fetch instead of filtering list_active. The previous implementation filtered out tunnels whose backend pointed at any connector other than this agent's current one, which made tunnel update/delete by id silently fail-as-NotFound after a connector rename. Direct fetch matches the user's intent when they explicitly name a tunnel. Factor summary_from_proxy() so get_active and list_project don't drift.

--endpoint is now optional. The four input shapes:

  --id <id>
      resume the tunnel verbatim using its stored endpoint; re-point the connector backend at this agent's current iroh identity.
  --id <id> --endpoint <e>
      same, but assert the user-provided endpoint matches the stored one (after the same normalization the lib applies). A mismatch fails hard with a message pointing at 'tunnel update' for explicit changes; we don't silently retarget a tunnel whose URL the user may have shared with others.
  --endpoint <e>
      existing endpoint-match adoption path.
  neither
      error with a hint at both flags.

Expose lib::normalize_endpoint so the CLI compares endpoint strings using the same canonicalization the stored TunnelSummary value was produced with (trim + prepend http:// if no scheme), instead of doing naïve string equality that would spuriously fail on scheme/whitespace differences.
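The four input shapes map naturally onto a match over two Options. This is a sketch under assumptions: ListenPlan, plan_listen, and the lookup closure (standing in for the API call that fetches a tunnel's stored endpoint) are hypothetical, the error strings are illustrative, and detecting a scheme by searching for "://" is my guess at what "if no scheme" means.

```rust
/// Trim, then prepend http:// when the input carries no scheme.
/// Assumption: a scheme is detected by the presence of "://".
pub fn normalize_endpoint(raw: &str) -> String {
    let trimmed = raw.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{}", trimmed)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ListenPlan {
    ResumeById { id: String },
    AdoptByEndpoint { endpoint: String },
}

/// `lookup` is a hypothetical stand-in for fetching a tunnel's stored endpoint.
pub fn plan_listen<F>(id: Option<&str>, endpoint: Option<&str>, lookup: F) -> Result<ListenPlan, String>
where
    F: Fn(&str) -> Option<String>,
{
    match (id, endpoint) {
        (Some(id), None) => Ok(ListenPlan::ResumeById { id: id.to_string() }),
        (Some(id), Some(given)) => {
            let stored = lookup(id).ok_or_else(|| format!("tunnel {} not found", id))?;
            if normalize_endpoint(given) == normalize_endpoint(&stored) {
                Ok(ListenPlan::ResumeById { id: id.to_string() })
            } else {
                Err(format!("endpoint does not match tunnel {}; use 'tunnel update' to change it", id))
            }
        }
        (None, Some(given)) => Ok(ListenPlan::AdoptByEndpoint { endpoint: normalize_endpoint(given) }),
        (None, None) => Err("pass --id or --endpoint".to_string()),
    }
}
```

Normalizing both sides before comparing is what keeps " 127.0.0.1:3000 " and "http://127.0.0.1:3000" from being reported as a mismatch.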

Running 'tunnel listen' bare now pops an interactive picker on a TTY, listing existing tunnels with hostname → endpoint [label]. Pick one with ↑↓/Enter to resume it (treated as if --id had been given). Tunnels without hostnames (still pending) are excluded — picking them would just produce 'tunnel not found'. Enabled tunnels sort above disabled (marked '○') so the most-likely-relevant ones come first. Non-TTY (CI, piped) keeps the existing fail-fast error so scripts don't hang on stdin. Empty candidate list (no tunnels in the project, or no tunnels with hostnames yet) also falls through to the error — there's nothing to pick. Adds inquire 0.9 for the picker. Considered rolling raw crossterm to avoid a dep, but the cost/benefit didn't justify it for one prompt.
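The candidate filtering and ordering described here can be sketched without the picker itself. The Tunnel struct is a reduced, hypothetical version of the real summary type, and the use of a stable sort to preserve the original order within each group is an assumption.

```rust
/// Reduced stand-in for the tunnel summary shown in the picker.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tunnel {
    pub id: String,
    pub hostname: Option<String>,
    pub enabled: bool,
}

/// Drops tunnels that have no hostname yet and lists enabled ones first.
pub fn picker_candidates(mut tunnels: Vec<Tunnel>) -> Vec<Tunnel> {
    tunnels.retain(|t| t.hostname.is_some());
    // `false` sorts before `true`, so enabled tunnels (key = false) come first.
    // sort_by_key is stable, so relative order within each group is preserved.
    tunnels.sort_by_key(|t| !t.enabled);
    tunnels
}
```

An empty result corresponds to the "nothing to pick" case, where the command falls through to the existing error.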

The iroh relay-actor (and other chatty modules under iroh::magicsock / lib::*) write to the same TTY inquire is repainting on. The collision looked like:

? Resume which tunnel?
2026-06-07T19:10:30Z INFO relay-actor: ... roxy.net → http://127.0.0.1:11434 [d38f5f413beb]

That is a log line spliced across the picker's first option, leaving the terminal unreadable.

Wrap the EnvFilter in a reload handle exposed via a OnceLock so the picker can engage a RAII QuietTracing guard that swaps the filter to 'error' for the lifetime of the inquire prompt and restores it on drop. Captures the previous filter via EnvFilter's Display impl since it doesn't implement Clone; round-trip through to_string()/try_new() is the supported path.

This is a targeted fix for the symptom. A cleaner long-term move would be to defer ListenNode construction past the picker so iroh hasn't booted yet, but that requires TunnelService to support read-only listing without a node, which is a larger refactor.
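The RAII pattern behind the guard can be shown in isolation. This sketch deliberately does not use tracing-subscriber: a Mutex<String> stands in for the reloadable filter, and QuietGuard is a hypothetical name. It only illustrates the swap-on-create, restore-on-drop shape, not the real reload-handle calls.

```rust
use std::sync::Mutex;

/// Swaps the filter directive to "error" while alive, restores it on drop.
/// `slot` is a stand-in for the reloadable filter, not the real handle.
pub struct QuietGuard<'a> {
    slot: &'a Mutex<String>,
    previous: String,
}

impl<'a> QuietGuard<'a> {
    pub fn engage(slot: &'a Mutex<String>) -> Self {
        let mut current = slot.lock().unwrap();
        // Capture the old directive as a string, mirroring the
        // to_string()/try_new() round-trip described in the commit.
        let previous = std::mem::replace(&mut *current, "error".to_string());
        QuietGuard { slot: slot, previous: previous }
    }
}

impl<'a> Drop for QuietGuard<'a> {
    fn drop(&mut self) {
        // Restore even on early return or panic unwinding.
        if let Ok(mut current) = self.slot.lock() {
            *current = std::mem::take(&mut self.previous);
        }
    }
}
```

Tying the restore to Drop means the prompt code cannot forget to re-enable logging on any exit path.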

…ration 'tunnel listen --id' PATCHes the HTTPProxy spec to re-point its backend at the current connector. That bumps metadata.generation, but the controllers' prior True conditions still carry observedGeneration from the previous generation until they re-reconcile. The progress check was reading those as Ready, so the CLI reported "Tunnel ready after 0 sec" while the data plane was still serving 503s from the old Envoy config.

Mark a step Ready only when status == "True" AND observed_generation >= metadata.generation. Stale-but-True conditions become Pending (still progressing), the same code path as the 30s stuck warning, so the user gets the controller's reason rather than a false-positive Ready. None observedGeneration is treated as 0, which falls through to Pending on any non-zero generation (correct: the controller hasn't reconciled this resource yet).

Test progress_pending_when_status_is_stale_for_current_generation covers both halves: stale-True is Pending; once the controller catches up, the same condition flips to Ready. Existing tests still pass because their fixtures default to generation=None == 0 on both sides.
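The readiness rule in this commit reduces to one comparison. The sketch below uses hypothetical names and only models the two states the commit message discusses (Ready and Pending); the real type may carry more.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepState {
    Ready,
    Pending,
}

/// A condition counts as Ready only if it is True *and* was observed at the
/// resource's current generation. A missing value on either side is read as 0.
pub fn step_state(status: &str, observed_generation: Option<i64>, generation: Option<i64>) -> StepState {
    let observed = observed_generation.unwrap_or(0);
    let current = generation.unwrap_or(0);
    if status == "True" && observed >= current {
        StepState::Ready
    } else {
        StepState::Pending
    }
}
```

For example, a True condition observed at generation 1 on a resource now at generation 2 stays Pending until the controller reconciles and reports generation 2.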

A 0-second "Tunnel ready" with a 503-serving data plane (the observedGeneration bug we just fixed) made it clear that users need a fast pivot from a stuck progress line to 'datumctl describe ...' on the exact resource. Without that, the operator's reason string is useful text but its provenance is buried.

Add a 'resource: Option<String>' field on ProgressStep, pre-formatted as "HTTPProxy/<tunnel-id>" or "Connector/<connector-name>", populated from the live resource metadata in from_resources. Mapping per step:

  tunnel accepted → HTTPProxy/<tunnel-id>
  TLS certificate issued → HTTPProxy/<tunnel-id>
  connector ready → Connector/<connector-name>
  iroh DNS published → Connector/<connector-name>
  route programmed → HTTPProxy/<tunnel-id>
  envoy metadata propagated → HTTPProxy/<tunnel-id>

CLI renders the label inline:

  ✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
  … route programmed still pending after 30s [HTTPProxy/tunnel-gchhg]: …

ProgressStepKind::resource_kind() is the source of truth for which kind backs each step, used by the test that asserts the wiring is correct across all six steps. No extra API call needed: the connector name was already in scope inside TunnelService::get_active_progress.
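A sketch of the step-to-resource mapping, using the six variant names listed earlier in this thread. Two things are assumptions: that "envoy metadata propagated" corresponds to the ConnectorMetadataProgrammed variant, and the resource_label helper, which is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProgressStepKind {
    ProxyAccepted,
    CertificatesReady,
    ConnectorReady,
    IrohDnsPublished,
    ProxyProgrammed,
    ConnectorMetadataProgrammed,
}

impl ProgressStepKind {
    /// Which resource kind backs each step, per the mapping in the commit.
    pub fn resource_kind(self) -> &'static str {
        match self {
            ProgressStepKind::ConnectorReady | ProgressStepKind::IrohDnsPublished => "Connector",
            ProgressStepKind::ProxyAccepted
            | ProgressStepKind::CertificatesReady
            | ProgressStepKind::ProxyProgrammed
            // Assumed to be the "envoy metadata propagated" step.
            | ProgressStepKind::ConnectorMetadataProgrammed => "HTTPProxy",
        }
    }
}

/// Hypothetical formatter producing "Kind/name" for the progress line.
pub fn resource_label(step: ProgressStepKind, tunnel_id: &str, connector_name: &str) -> String {
    let name = if step.resource_kind() == "Connector" { connector_name } else { tunnel_id };
    format!("{}/{}", step.resource_kind(), name)
}
```

Listing every variant explicitly, rather than using a wildcard arm, makes the compiler flag any new step that has not been assigned a resource kind.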

… picker When there's exactly one tunnel to resume, inquire still rendered the prompt with a forced '>' marker on the lone row and arrow keys were no-ops. The terminal cursor sat on the '? Resume which tunnel?' line — visually it looked like the selector hadn't moved 'to the correct line' because there was no other line to move to. The picker only earns its keep when there's a choice to make. Short-circuit at one candidate: print 'Resuming the only tunnel ...' naming the hostname and id, return it directly. The >1 case still pops the picker, where arrow-key movement was verified working under pty with cursor-position+selected-index alignment.
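The short-circuit can be expressed as a small decision over the candidate count. Pick and decide_pick are hypothetical names; the actual prompt and the "Resuming the only tunnel ..." message are left to the caller.

```rust
/// Outcome of inspecting the candidate list before showing any prompt.
#[derive(Debug, PartialEq, Eq)]
pub enum Pick<T> {
    /// No candidates: fall through to the existing error.
    Nothing,
    /// Exactly one candidate: resume it directly, no prompt.
    Only(T),
    /// Two or more: show the interactive picker.
    Prompt(Vec<T>),
}

pub fn decide_pick<T>(mut candidates: Vec<T>) -> Pick<T> {
    match candidates.len() {
        0 => Pick::Nothing,
        1 => Pick::Only(candidates.remove(0)),
        _ => Pick::Prompt(candidates),
    }
}
```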

The operator's quota service occasionally times out the admission check on Create requests and returns 403 with: "Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support." The error message itself says "try again". Until today the CLI just surfaced the raw 403 and bailed mid-listen, which the user has to recover from manually.

Add is_quota_check_timeout() that matches the specific 403 message (distinct from real quota exhaustion, which uses different wording), and with_quota_check_retry() that retries up to ~15s (1s, 2s, 4s, 8s, final attempt) on that exact class. Other 403s (real exhaustion, IAM denials, admission rejections) return immediately so genuine failures still surface fast. Prints a one-line stderr notice on first retry so the user knows we're waiting on the server, not wedged.

Apply at every kube .create() site in the tunnel lifecycle:

- HTTPProxy create (fresh tunnel)
- ConnectorAdvertisement create (fresh tunnel)
- TrafficProtectionPolicy create (fresh tunnel)
- ConnectorAdvertisement create (set_enabled when resuming)
- Connector create (ensure_connector first run)

Also tighten format_quota_error to skip the timeout phrase: when retries exhaust, the user should see the actual server message rather than "Quota limit exceeded for ConnectorAdvertisement", which is the wrong diagnosis. Real "Insufficient quota" exhaustion still gets the helpful message.

Test covers the classifier and the formatter carve-out on both the timeout and real-exhaustion shapes.
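A synchronous sketch of the classifier and retry wrapper, under several assumptions: ApiError is a simplified stand-in for the kube error type, the substring used to recognize the timeout is my choice from the quoted message, the stderr notice text is illustrative, and the sleep function is injected so the schedule can be tested without waiting. The real code is presumably async.

```rust
use std::time::Duration;

/// Simplified stand-in for the API error returned by a create call.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApiError {
    pub code: u16,
    pub message: String,
}

/// True only for the transient quota-check timeout, not real exhaustion.
pub fn is_quota_check_timeout(err: &ApiError) -> bool {
    err.code == 403 && err.message.contains("took too long to be checked against your quota")
}

/// Retries `op` after 1s, 2s, 4s, and 8s (about 15s total), then makes one
/// final attempt. Any other error is returned immediately.
pub fn with_quota_check_retry<T, Op, Sleep>(mut op: Op, mut sleep: Sleep) -> Result<T, ApiError>
where
    Op: FnMut() -> Result<T, ApiError>,
    Sleep: FnMut(Duration),
{
    let delays = [1u64, 2, 4, 8];
    for (attempt, secs) in delays.iter().enumerate() {
        let result = op();
        if !matches!(&result, Err(e) if is_quota_check_timeout(e)) {
            return result;
        }
        if attempt == 0 {
            eprintln!("quota check timed out on the server; retrying...");
        }
        sleep(Duration::from_secs(*secs));
    }
    op()
}
```

In production the sleep argument would be something like std::thread::sleep; in tests a closure that records the requested delays keeps the run instant.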

drewr · 2026-06-07T21:13:28Z

Quota 403 diagnosis

While shipping d8f7c96 (CLI-side retry for the transient quota-check timeout), I traced the chain that produces this error:

ApiError: connectoradvertisements.networking.datumapis.com "tunnel-gchhg" is forbidden:
Your request took too long to be checked against your quota.
Please try again in a moment — if this keeps happening, contact support.: Forbidden (code: 403)

⚠ Best-effort diagnosis. Reconstructed from network-services-operator/config/quota/ resources, compute:api/v1alpha/instance_types.go reason codes, and engineering ops reviews. @drewr — please sanity-check against Milo internals (the actual webhook + quota backend code lives in a Milo repo I don't have local access to) before relying on this.

The chain

API server receives Create ConnectorAdvertisement.
Milo's quota admission webhook (quota.miloapis.com) intercepts.
Webhook reads the matching ClaimCreationPolicy (e.g. connector-claim-policy.yaml), constructs a ResourceClaim of amount: 1 against ResourceType: networking.datumapis.com/connectors.
Webhook calls a quota backend to check the project's remaining capacity.
If the backend round-trip exceeds the webhook's timeoutSeconds, admission fails closed and returns the 403 above.

The failure class is already named in datum-cloud code

From datum-cloud/compute:api/v1alpha/instance_types.go:

// InstanceQuotaGrantedReasonBackendUnavailable indicates quota enforcement
// is configured but the Milo quota backend is unreachable (network error,
// timeout, transient failure).
//
// InstanceQuotaGrantedReasonMisconfigured indicates the ResourceClaim was
// rejected by the Milo admission plugin (403/422): ResourceRegistration absent
// or the policy is malformed.

Our 403 is the user-facing surface of BackendUnavailable — the webhook itself is healthy, but the backend it consults is slow/unreachable.

Most likely root causes (in order of probability)

Webhook → quota backend round-trip latency. Admission webhooks have a hard cap (timeoutSeconds, max 30s). If the backend call routinely lands in the slow tail (GC pause on the backend pod, network blip, cold start), some fraction of admissions will time out. The user hitting this in a tight resume loop — same project, same resource type — is consistent with the backend being intermittently slow.
Expensive per-request counting. If the webhook computes "current usage" by listing/aggregating all ResourceClaims for the project each request, latency scales with project state. Caching the count between admissions would fix it.
Resource limits on the webhook or backend. CPU throttling or memory pressure on those pods drives request latency up unpredictably.
Storage-side latency on the quota service (etcd or whatever Milo's quota backend uses). The 5/18 ops review tracks 409s on quota.miloapis.com as elevated — suggests this surface has known stability concerns.

How to confirm

Pull quota.miloapis.com webhook latency metrics — p50/p95/p99 over the period when this user hit it. If p99 is near the webhook's timeoutSeconds, that's the root cause.
Check whether the webhook caches per-project usage or recomputes per admission.
Inspect resource limits and throttling on the quota-webhook and quota-backend pods.
Look at whether the webhook does any internal retry on transient backend failures before returning the user-visible 403.

Server-side fixes worth considering

(Not in this PR's scope — the actual implementation is in Milo — but flagging for whoever owns it.)

Cache project quota usage in the webhook with a short TTL (1–5s); coalesce admissions of the same project so a burst doesn't N+1 the backend.
Raise timeoutSeconds to the Kubernetes max (30s) so a slow tail doesn't immediately fail. Cheap mitigation.
Async claim model for resources where transient over-allocation is tolerable: admit immediately, create the claim async, revoke if it turns out to be over. The compute repo's reason codes (Misconfigured, BackendUnavailable) already accommodate this shape; the design supports it.
Internal webhook retry on backend transient failures before returning the user-visible 403. The error message already says "Please try again in a moment" — the webhook is asking the client to do work that should be done in the server.

Mitigation on the CLI side (this PR)

d8f7c96 adds is_quota_check_timeout / with_quota_check_retry and wraps every .create() site in the tunnel listen flow. Up to ~15s of backoff before propagating the error. Distinct from real quota exhaustion (which uses different wording — Insufficient quota — and is preserved as the friendly "exceeded" message). This makes the experience tolerable but it's a workaround; the right fix is in Milo.

Action for @drewr

If the Milo-side analysis above looks plausible, this is worth a separate placeholder issue against whichever Milo repo owns the quota webhook + backend, including the operator-side mitigations as suggestions. Happy to draft that issue too if you confirm the diagnosis.

Controllers reporting Ready (with observedGeneration in sync) still doesn't mean the data plane is actually carrying traffic: Envoy programming a route is not the same as Envoy serving it. The user reported a ~2-minute window where every condition was True but https://<proxy>/ returned 503. Whatever's behind it (xDS push lag, edge config-not-yet-loaded, iroh peer connection still settling), it's invisible from the controller's view.

Add a "Verifying connectivity..." phase between the condition checklist and "Tunnel ready". Every 10s, probe in parallel:

- the origin URL the user gave (so a downed local service is named explicitly instead of being blamed on the tunnel)
- the public proxy URL (https://<hostname>/)

Any response under 500 counts as "reachable". 4xx responses like 401/404 are fine because the edge is forwarding; only 5xx and transport errors block. On each tick we print a ✓ line for newly-reachable endpoints and a … line for ones still failing, with the controller's last error so the user can act ("origin connection refused" vs "proxy 503").

New --timeout flag (default 10m, humantime) caps total setup including verification. On expiry the command exits non-zero with a per-side summary so an unverified tunnel doesn't get treated as healthy. Sleep is clamped to the remaining budget so an early success on one side doesn't waste the last 10s before bailing on the other.
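The two rules in this commit message (anything under 500 is reachable; sleep is clamped to the remaining budget) can be illustrated with a small sketch. The `ProbeOutcome` enum and the function names are hypothetical, chosen for this example only. The actual probe code issues HTTP requests and is not reproduced here.

```rust
use std::time::Duration;

/// Result of one probe: an HTTP status, or a transport-level failure
/// such as a refused connection. (Hypothetical type for this sketch.)
#[derive(Debug, Clone, PartialEq)]
pub enum ProbeOutcome {
    Status(u16),
    TransportError(String),
}

/// Any response under 500 counts as reachable: a 401 or 404 still proves
/// that the request was forwarded. 5xx and transport errors do not.
pub fn is_reachable(outcome: &ProbeOutcome) -> bool {
    matches!(outcome, ProbeOutcome::Status(code) if *code < 500)
}

/// Probe cadence taken from the commit message.
const PROBE_INTERVAL: Duration = Duration::from_secs(10);

/// Sleep before the next tick, clamped to what is left of the overall
/// timeout. Returns None once the budget is spent.
pub fn next_sleep(elapsed: Duration, timeout: Duration) -> Option<Duration> {
    let remaining = timeout.checked_sub(elapsed)?;
    if remaining.is_zero() {
        return None;
    }
    Some(remaining.min(PROBE_INTERVAL))
}
```

With a 10m timeout and 596s elapsed, `next_sleep` yields 4s rather than a full 10s interval, which is the clamping behavior the message describes.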

@drewr

When the CLI's --id / picker-resume calls update_active, it almost always passes the same label, endpoint, and current connector that the existing HTTPProxy already references. The previous behavior PATCHed HTTPProxy.spec.rules and metadata.annotations unconditionally with that identical content. Whether the apiserver bumps metadata.generation on content-identical Patch::Merge is implementation-dependent, and in practice we've seen the spec touch correlate with downstream Envoy re-reconciles and a 5xx data-plane window of 1–3 minutes after the controller conditions all flip Ready.

Make update_project (and the ad sub-step) skip the PATCH when the existing spec already matches what we'd write. Comparison is on serde_json::Value so it's stable against Option<...> serde-default quirks that would otherwise trip naive structural equality.

This makes the lib's update verb idempotent at the lib boundary, which is the property the upcoming "extract shared connect logic into lib for cli + ui + datumctl plugin" work depends on. As a bonus the UI's Edit-tunnel dialog (which currently PATCHes even when the user hits Save without changing anything) gets the no-churn behavior for free, with no UI-side changes.

This is hygiene, not the cold-resume latency fix: even with the PATCH skipped, runs continue to show intermittent multi-minute 5xx windows caused by edge-side iroh peer-establishment latency (separate issue, not yet filed pending @drewr's review).

Tests cover both comparators across the relevant drift axes (different connector, different endpoint, different label, missing annotation, ad port change, ad connector change).

drewr · 2026-06-07T22:11:04Z

Status update

Picking up where the diagnosis-arc comment left off. Significant ground covered since the runtime-watch fix.

Landed since last update

Tunnel selection & resume UX

ca4470f — `tunnel listen --id <id>` pins an existing tunnel and re-points its connector backend at the current iroh identity. Direct-API lookup; bypasses the connector filter that was hiding tunnels from list_active after identity rotation. get_active refactored to a direct fetch (no filter), shared summary_from_proxy helper extracted.
a68d8ae — --endpoint is now optional. --id alone resumes verbatim from stored endpoint; --id + --endpoint must agree exactly or fail hard (preserving any URL already shared with others). lib::normalize_endpoint exposed so the CLI uses the same canonicalization as TunnelSummary.endpoint.
7de50c7 — arrow-key picker (inquire) when called with no flags, showing hostname → endpoint [label] rows. Adds `inquire = 0.9.4` to CLI deps.
fe57b24 — picker uses a tracing_subscriber::reload::Handle + RAII guard to silence tracing while inquire is repainting (iroh log lines were splicing into the picker's first option line).
cff37e7 — single-candidate auto-adopts with a one-liner instead of popping a picker that has no movement possible (terminal-cursor confusion).

Progress / verification

c39d9ee — each progress checkpoint annotated with its underlying Kubernetes resource ([HTTPProxy/tunnel-…] / [Connector/datum-connect-…]). Lets a user pivot from a stuck step straight into datumctl describe ….
d65ec4d — progress steps require observedGeneration ≥ metadata.generation for Ready. Closed the "Ready after 0 sec but data plane is dead" gap on resume-time spec patches (still relevant for any caller that does spec patch — even though we now skip the patch entirely, see below).
1ed969e — new `Verifying connectivity...` phase. Probes origin and proxy URL every 10s after controller conditions all flip Ready. New `--timeout` flag (default 10m) caps total setup. Distinguishes "your local server is down" from "edge still 503ing" in the failure summary. This is the visibility surface that exposed all the iroh-dial-latency findings below.
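The observedGeneration rule from d65ec4d can be sketched as a single predicate. The `Condition` struct below is a hypothetical stand-in for a Kubernetes status condition, and treating a missing observedGeneration as stale is an assumption of this sketch, not something the commit summary states.

```rust
/// Minimal stand-in for a status condition (hypothetical type for this sketch).
pub struct Condition {
    /// Whether the condition's status is "True".
    pub status: bool,
    /// Generation the controller had seen when it wrote this condition.
    pub observed_generation: Option<i64>,
}

/// A step counts as Ready only if the condition is True and was computed
/// for the current spec, i.e. observedGeneration >= metadata.generation.
/// A missing observedGeneration is treated as stale (assumption).
pub fn is_ready_for_generation(cond: &Condition, generation: i64) -> bool {
    cond.status && cond.observed_generation.map_or(false, |g| g >= generation)
}
```

This is why a resume-time spec patch no longer reports "Ready after 0 sec": a True condition left over from the previous generation fails the comparison until the controller reconciles again.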

Lib hardening

d8f7c96 — quota-check-timeout retry. Operator's 403 "took too long to be checked against your quota" is now retried with ~15s of backoff at every .create() in the listen flow (HTTPProxy, ConnectorAdvertisement, TrafficProtectionPolicy, Connector). Real quota exhaustion still surfaces the friendly "Quota limit exceeded" message; the carve-out is on the exact transient phrase.
b7e9d6b — HeartbeatAgent::start_manual() skips the auto-fan-out across every accessible project. CLI tunnel-listen now heartbeats only the project the tunnel lives in. UI's start() is unchanged (it actually wants the multi-project model).
0311960 — update_project is idempotent at the lib boundary. Skips the HTTPProxy.spec and ConnectorAdvertisement.spec patches when the existing state already matches what we'd write. Comparison is on serde_json::Value for stable representation. This is hygiene for the upcoming factor-shared-tunnel-logic-into-lib work (cli + ui + future datumctl plugin all benefit automatically); the UI's Edit-without-changes case also stops causing data-plane churn for free.

Issues filed against other repos (all placeholders awaiting @drewr review)

Issue	Scope
datum-cloud/network-services-operator#174	Stale iroh DNS claim outlives its owning Connector — `DeferredToOwner` cites a UID that 404s in the API.
datum-cloud/network-services-operator#175	Narrowed today: fresh-tunnel HTTPProxy `Programmed` / `ConnectorMetadataProgrammed` blocked ~3min on `EnvoyPatchPolicy has no status yet`. Original conflation with the resume-side 5xx symptom is split out (next row).
datum-cloud/iroh-gateway#12	New today: variable multi-minute iroh dial latency from edge to a freshly-started listen node. Two contrasting runs (530ms vs 170s, same machine, 15min apart) as evidence. All control-plane and CLI-side causes are explicitly ruled out.
(PR-comment)	Quota 403 diagnosis — apiserver → Milo quota admission webhook → quota backend chain analysis with concrete mitigations on the Milo side. Not filed against a Milo repo yet, pending @drewr direction.

Known still-flaky behavior (mitigated, not fixed)

On a resume run, the proxy URL still occasionally returns 5xx for 1–3 minutes after the point where Tunnel ready would have fired without the new verify phase. The verify phase makes this visible and waits for an actual non-5xx response instead of falsely claiming Ready, so the user experience is "honest progress + slow first request" rather than "instant Ready + broken tunnel." The underlying iroh-gateway dial latency is tracked in iroh-gateway#12.

Test coverage on this branch

47 lib tests, all green. Notable new coverage:

quota_check_timeout_classifier_matches_transient_403 — retry classifier + format carve-out.
progress_pending_when_status_is_stale_for_current_generation — observedGeneration check.
progress_step_carries_resource_label — every step backed by the right Kubernetes kind.
http_proxy_spec_matches_* / advertisement_spec_matches_* — idempotency comparators across drift axes.
start_manual_does_not_auto_enroll — heartbeat manual mode.
listen_key_for_project_* (3 cases) — per-project key migration.

What's next

A separate work stream to factor shared tunnel-management logic out of cli/ and ui/ into the lib, so a future datumctl connect plugin can share the same authoritative implementation. The idempotency landing was a prerequisite for that — lib verbs are now the right shape to be the shared abstraction. Will be tracked separately when it kicks off.

A single 503 from the Datum API server's Envoy front-end ("upstream connect error or disconnect/reset before headers. reset reason: connection termination", typical when kube apiserver briefly drops connections behind Envoy) was killing in-progress tunnel setups that the next 750ms poll tick would have ridden over. Observed mid-EnvoyPatchPolicy-reconcile wait on a fresh tunnel: setup conditions were on the slow-but-working path and the run aborted at the unrelated transient.

The runtime watch already handles this correctly: log on error and keep going. Mirror that in await_tunnel_progress with a bounded retry: up to MAX_CONSECUTIVE_POLL_ERRORS (10 ≈ 7.5s at the current cadence) before bailing. Long enough to ride out a brief blip; short enough that a genuinely unreachable control plane still surfaces fast.

The change lives in await_tunnel_progress (cli/src/main.rs) but the function is on the future connect-lib side of the boundary discussed in datum-cloud/enhancements#756 comment 4644292554: it's pure orchestration over TunnelService::get_active_progress, no rendering, no clap. The shape (consecutive-error counter + bounded retry + bail-fast on hard signals) is the one the lib will inherit.
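The consecutive-error counter described above might look roughly like this. `PollErrorBudget` and `PollDecision` are hypothetical names for this sketch, and bailing exactly when the count reaches the limit is an assumption. The real loop in await_tunnel_progress also has to distinguish hard signals that bail immediately, which is omitted here.

```rust
/// Limit named in the commit message; at a 750ms cadence, 10 errors ≈ 7.5s.
const MAX_CONSECUTIVE_POLL_ERRORS: u32 = 10;

#[derive(Debug, PartialEq)]
pub enum PollDecision {
    /// Log the error and wait for the next poll tick.
    Continue,
    /// Too many failures in a row: give up and surface the error.
    Bail,
}

/// Tracks consecutive poll failures; any success resets the count.
#[derive(Default)]
pub struct PollErrorBudget {
    consecutive: u32,
}

impl PollErrorBudget {
    /// A successful poll clears the streak.
    pub fn record_success(&mut self) {
        self.consecutive = 0;
    }

    /// Records one failed poll and says whether to keep going.
    pub fn record_error(&mut self) -> PollDecision {
        self.consecutive += 1;
        if self.consecutive >= MAX_CONSECUTIVE_POLL_ERRORS {
            PollDecision::Bail
        } else {
            PollDecision::Continue
        }
    }
}
```

Because only consecutive failures count, an isolated 503 between successful polls never accumulates toward the limit.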

…s work

The CLI accepts --endpoint 127.0.0.1:11434 (no scheme) and passes that string through to verify_endpoints, which hands it to reqwest. Reqwest's request builder refuses to build a request from a URL without a scheme and returns a "builder error", which our probe was reporting as "origin not reachable" indefinitely:

✓ proxy responding (0.4s) [https://...]: HTTP 200
… origin not reachable (0s) [127.0.0.1:11434]: builder error
… origin not reachable (10s) [127.0.0.1:11434]: builder error
...

The actual origin was reachable the whole time: the proxy probe got HTTP 200 through the tunnel back to the same host:port. Only the CLI's local probe was wedged.

Apply lib::normalize_endpoint (the same canonicalization that TunnelSummary.endpoint stores) at the top of verify_endpoints so any bare host:port works as input. The displayed URL becomes the canonical form (http://127.0.0.1:11434), matching what's stored on the HTTPProxy.

verify_endpoints is on the connect-lib side of the boundary we sketched in datum-cloud/enhancements#756 comment 4644292554. Defensive normalization belongs here so other callers (UI Edit dialog, the future plugin foreground listen path) don't have to remember to canonicalize.
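The effect of the normalization can be shown with a deliberately small sketch. `normalize_endpoint_sketch` is a hypothetical helper written for this example. It is not lib::normalize_endpoint, whose actual rules may be stricter or handle more cases.

```rust
/// Illustrative only: prefix a bare host:port with http:// so that an HTTP
/// client can parse it as a URL. Inputs that already carry a scheme are
/// returned unchanged. (Assumed behavior, not the library's implementation.)
pub fn normalize_endpoint_sketch(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{trimmed}")
    }
}
```

Under these assumptions, `127.0.0.1:11434` becomes `http://127.0.0.1:11434`, the canonical form the commit message says is displayed and stored on the HTTPProxy.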

cargo-zigbuild for aarch64-unknown-linux-gnu failed at openssl-sys because pkg-config can't find target-arch openssl headers, and we can't easily provide them outside the workspace. The transitive pull is:

iroh -> pkarr -> reqwest 0.13 (default features = "default-tls") -> hyper-tls -> native-tls -> openssl-sys

reqwest 0.13 ships from pkarr's lockfile with native-tls included. Patching iroh/pkarr to switch reqwest features isn't on our path; the workspace's own reqwest 0.12 dep is unrelated (and adding default-features = false there doesn't reach the 0.13 instance, since they're separate version-graph nodes).

Add `openssl = { version = "0.10", features = ["vendored"] }` to the CLI. Cargo feature unification enables the vendored build for the transitive openssl-sys, so cross-compiling no longer needs target-arch system headers: openssl gets compiled from source as part of the build. Static link, no runtime libssl/libcrypto dependency.

Verified: native check passes, aarch64-unknown-linux-gnu cross-build via cargo-zigbuild produces a valid 251MB unstripped ARM aarch64 ELF.

The Parser-derived Args lacked a version attribute, so clap rejected both --version and -V. Add #[command(version)], which sources the version from Cargo.toml's package.version via env!("CARGO_PKG_VERSION") at compile time. This gives recipients of distributed binaries a built-in "which build do I have?" check without depending on filename, mtime, or sha256 verification.

$ datum-connect --version
datum-connect 0.1.0

The existing `auth login` uses an authorization-code-with-PKCE flow that binds a localhost HTTP server to receive the OIDC redirect. On a remote machine over SSH, in CI, or in a container, that pattern is unreachable: the browser running on the operator's laptop can't reach a port bound on the remote box without a separate SSH port-forward.

The standard escape hatch is RFC 8628 OAuth2 device authorization, which is what datumctl's own `login --no-browser` uses. Mirror that here:

- StatelessClient::login_device_code(): fetches the OIDC discovery JSON directly (openidconnect's CoreProviderMetadata doesn't surface device_authorization_endpoint), rebuilds the OIDC client with set_device_authorization_url(), starts the grant, hands the DeviceCodeInfo (verification URL + user code + expiry) to a caller-supplied display callback, and polls exchange_device_access_token via tokio::time::sleep. Token-response parsing reuses the existing parse_token_response with a nonce verifier that allows missing-nonce (device flow doesn't bind one).
- AuthClient::login_device_code(): always performs a fresh login; callers wanting refresh-eligible token reuse should use the normal login() instead.
- DatumCloudClient::login_device_code(): top-level entry point.
- DeviceCodeInfo re-exported from lib::datum_cloud so the CLI doesn't take a direct dep on openidconnect's Core types.

CLI side, AuthCommands::Login and AuthCommands::Switch both gain a --no-browser flag that routes to the new method. The display callback prints the verification URL + user code prominently to stderr so it doesn't tangle with structured stdout (relevant for future plugin modes).

Verified against the production auth server's OIDC discovery (Datum's Zitadel exposes device_authorization_endpoint and lists urn:ietf:params:oauth:grant-type:device_code in grant_types_supported).

Adds --no-browser device-flow login (e14d689) since cli-v0.1.0.

Our own OIDC client (datum-desktop-app, configured in datum-cloud/infra apps/datum-iam-system/.../zitadel-setup/pulumi/index.ts) has only AUTHORIZATION_CODE + REFRESH_TOKEN in its allow-listed grantTypes. Zitadel correctly rejects the device-code grant against it:

unauthorized_client: grant_type "...device_code" not allowed

datumctl-cli (a sibling OIDC app in the same Zitadel project) already has DEVICE_CODE in its grantTypes and has stable, well-known IDs in datumctl's source:

Staging: 325848904128073754
Production: 328728232771788043

Borrow them for the --no-browser path until the planned datumctl connect plugin ships with its own properly-scoped client. Tokens are minted by Zitadel against the same project, so downstream Datum API calls don't care which client minted them. The audience verifier on id_token_verifier already allows any audience.

Regular `auth login` (browser flow) is unchanged: it stays on the datum-desktop-app client.

drewr marked this pull request as draft March 27, 2026 20:14

drewr force-pushed the cli-tunnel-and-auth branch from ea2df66 to 80ffdf7 Compare March 27, 2026 20:39

drewr added 2 commits March 27, 2026 15:51

drewr force-pushed the cli-tunnel-and-auth branch from 80ffdf7 to 01c3ab8 Compare March 27, 2026 20:51

drewr self-assigned this Mar 27, 2026

drewr added 2 commits March 27, 2026 16:23

docs: Add nix instructions

bdb635b

chore: update nix targets

e586049

drewr and others added 5 commits March 27, 2026 16:37

fix: show which user auth'd

9546bee

chore: don't show INFO logs by default

3f977da

drewr assigned gianarb and unassigned drewr Apr 9, 2026

drewr added 5 commits June 6, 2026 18:29

chore: clear unused-import and dead-code warnings in lib and CLI

1bfc67e

drewr mentioned this pull request Jun 7, 2026

IrohDNSPublished=True flips to DeferredToOwner mid-session on an actively-serving Connector datum-cloud/network-services-operator#174

Open

drewr mentioned this pull request Jun 7, 2026

[PLACEHOLDER — needs review] Fresh-tunnel: HTTPProxy Programmed/ConnectorMetadataProgrammed blocked ~3min on 'Downstream EnvoyPatchPolicy has no status yet' datum-cloud/network-services-operator#175

Open

drewr added 8 commits June 7, 2026 18:52

drewr added 2 commits June 7, 2026 21:23

drewr mentioned this pull request Jun 7, 2026

[PLACEHOLDER — needs review] Variable multi-minute iroh dial latency from edge to a freshly-started listen node datum-cloud/iroh-gateway#12

Open

drewr mentioned this pull request Jun 7, 2026

feat: add Headless Tunnel CLI enhancement datum-cloud/enhancements#756

Draft

4 tasks

drewr added 7 commits June 8, 2026 02:48

chore(cli): bump version to 0.2.0

e0ce7c0

Adds --no-browser device-flow login (e14d689) since cli-v0.1.0.

Conversation

drewr commented Mar 27, 2026 (edited)

zachsmith1 commented Mar 27, 2026

drewr commented Mar 27, 2026

zachsmith1 commented Mar 27, 2026

scotwells commented Mar 27, 2026

drewr commented Mar 27, 2026

drewr commented Mar 31, 2026

bmertens-datum commented Apr 1, 2026

zachsmith1 commented Apr 1, 2026

richardhenwood commented Apr 3, 2026

drewr commented Apr 9, 2026

drewr commented Apr 9, 2026 (edited)

drewr commented Apr 9, 2026 (edited)

gianarb commented Apr 9, 2026

drewr commented Apr 9, 2026

Gateway hostname normalization — needs investigation

drewr commented Jun 7, 2026

Diagnosis notes from a debugging session

1. Heartbeat wedge on deleted Lease

2. Machine-wide iroh identity colliding across projects

3. Setup readiness check too shallow

4. Stale DNS owner re-emerging post-setup

Recovery playbook this PR enables

6 participants
