Add CLI tunnel and auth commands #130
Conversation
|
Do we want a separate cli for tunnels or do we want to bake in functionality into datumctl? |
ea2df66 to
80ffdf7
Compare
|
Yeah, it's why this is a draft. I needed the functionality and didn't want to commit one way or the other yet. I explored doing it in datumctl and it would involve either replicating the Iroh sidecar in go or making the project hybrid with a rust component. This method uses all the same machinery as the GUI which felt like a better first pass. |
- Add 'tunnel' subcommand to datum-connect CLI with:
  - 'tunnel list': read-only listing of tunnels (no side effects)
  - 'tunnel listen': create/update and run tunnel in foreground
  - 'tunnel update': update tunnel label/endpoint
  - 'tunnel delete': delete a tunnel
- Add 'nix run .#connect' app to flake.nix
- Split find_connector_readonly for list operations
- Remove side effects from tunnel list (no patching Connector)
- Listen command:
  - Generates random label if not provided
  - Confirms before updating existing tunnel
  - Handles Ctrl+C to disable tunnel on exit
- Add 'auth' subcommand to CLI with:
  - 'auth status': Show current authentication and selected context
  - 'auth login': Log in via browser OAuth with account picker
  - 'auth logout': Log out and clear credentials
  - 'auth list': Show current authenticated user
  - 'auth switch': Log out current user and prompt for new login

Also add is_authenticated(), login(), logout() methods to DatumCloudClient.
80ffdf7 to
01c3ab8
Compare
|
Ya the challenge is the core stuff we need is in rust so we'll need some magic to make the UX good |
|
How does this interact with the GUI based application? Would auth be shared? Since the GUI is locked to a specific project (because connectors are project-scoped resources), switching the authenticated user could break existing tunnels without the user knowing and it doesn't seem like we warn the user. |
|
It's all shared. I'll show what it looks like when Rust is done compiling... |
delete_project returned early when find_connector returned None, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Connector lookup is only needed for post-deletion cleanup (deciding whether to delete the shared connector). Move it into an Option and gate the cleanup block on Some, so resource deletion always proceeds. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Three interrelated bugs fixed in the tunnel listen command:
- Random label was generated before checking for an existing tunnel, so re-running listen on the same endpoint always triggered the update prompt. Moved label generation into the create-new path only; existing tunnels reuse their stored label unless --label is explicitly provided and differs.
- Default label format changed from tunnel-<u16> (collides with resource ID format) to 12 hex chars of random entropy (e.g. a3f9c2e1b047). Adds hex as a dependency.
- tunnel listen was missing the HeartbeatAgent that continuously patches status.connectionDetails on the connector (relay URL, addresses, public key). Without it the gateway has no routing info and tunnels never carry traffic. Now starts the heartbeat and registers the project before enabling the tunnel, then polls until accepted+programmed before printing the hostname.

Also simplifies tunnel delete output: connector cleanup is an internal detail, so "Deleted tunnel <id>" replaces "(connector deleted: false)".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- After auth login/switch, prompt user to select an org and project
and persist the selection as the active context
- Store the selected context in config.yml instead of a separate file
- Add --project flag to the tunnel command to override the active
project for a single invocation
- Add projects list and projects switch commands for managing the
active project outside of the auth flow
- Fix tunnel listen to print id and label after creation
|
Here's a short demo of where I've gotten with this: headless-tunnels-demo.mp4 |
|
@drewr Nice demo. |
|
@drewr this is slick. I'm planning on splitting off the app repo from the gateway repo, and we should consider where we'd want this CLI to live. The last piece there would be a small enhancement around how we could inject this Rust binary into datumctl (if we want to) |
|
Great feedback @richardhenwood, thanks! |
|
@zachsmith1 wrote:
I think if we factor out the local process to a standalone rust utility like you're proposing it makes more sense for this to live in datumctl. I originally went that direction but didn't want to either rewrite the iroh integration in go or repackage this in an awkward way. |
|
I've had some instability with this and had both gpt-5.4 and sonnet-4.6 chewing on it:
Fix incoming. |
|
There is a lot to unwrap in my opinion here, a lot around product so I am not sure I have enough context to help here. Something is an old discussion we had here datum-cloud/enhancements#582 if you look for the
In practice what I was trying to highlight here is the mood kubernetes and other cli tool develop when you do that everything you run starts from a unique source of truth (for kubernetes it is the It will feel a lot easier to push for a plugin ecosystem like the one kubectl and others developed where binaries starting with But if we can not agree on some common practices, like authentication the outcome for a user will be pretty poor, in this case I feel like we should just "give up" and release different binaries working their own way. I am not saying that we should have in place the ability to switch and persist in between accounts/instances because I know we do not know yet datum-cloud/enhancements#653 (comment) but maybe since we do not know we can just take what we have today as common denominator until we figure out what's next. So the way I envision the evolution of this PR is a binary that serves only the business logic to manage tunnels and connections and demands authentication to the same login used by the datumctl (or the datumctl changes to turn to the same used here and from desktop) This is what I am trying to push to but as I said product wise I am not sure I have enough context to push into a direction vs another. |
The gateway sends `CONNECT localhost:<port>` regardless of whether the tunnel was registered with `localhost` or `127.0.0.1`, causing auth to fail with Forbidden and the caller to see "upstream connect error or disconnect/reset before headers." Normalize `localhost`, `127.0.0.1`, and `::1` to a canonical form on both sides of the host comparison in `tcp_proxy_exists`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
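A minimal sketch of the loopback normalization described in this commit, assuming the comparison works on bare host strings. The helper names `canonical_host` and `hosts_match` and the choice of `localhost` as the canonical form are illustrative assumptions, not the code in `tcp_proxy_exists`.

```rust
/// Map the loopback spellings to one canonical host so both sides of the
/// comparison agree. Using "localhost" as the canonical form is an assumption.
fn canonical_host(host: &str) -> String {
    // Tolerate surrounding whitespace and a bracketed IPv6 literal like "[::1]".
    let h = host.trim().trim_start_matches('[').trim_end_matches(']');
    match h.to_ascii_lowercase().as_str() {
        "localhost" | "127.0.0.1" | "::1" => "localhost".to_string(),
        other => other.to_string(),
    }
}

/// Compare the registered host with the host the gateway asked for,
/// normalizing both sides first.
fn hosts_match(registered: &str, requested: &str) -> bool {
    canonical_host(registered) == canonical_host(requested)
}
```

With this shape, a tunnel registered as `127.0.0.1` matches a `CONNECT localhost:<port>` request, while non-loopback hosts are compared as written (case-insensitively).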
Gateway hostname normalization: needs investigation
While debugging an "upstream connect error or disconnect/reset before headers" report, I found the root cause in commit e2d868d: the gateway sends The client-side fix normalizes Observable behavior (from The Questions for the gateway team:
The client-side fix is defensive and correct either way, but if the gateway is doing unintended normalization, fixing it there would be cleaner and might surface other subtle issues. |
When the Lease the heartbeat owns is removed server-side (TTL cleanup, namespace reap, manual delete), the renew loop kept patching the dead name forever — only logging a warn each tick — because the cache was preserved on every error. The tunnel went silently dark. Route both the fetch-lease and renew error arms through a single classifier that resets the cache on 404 so the next iteration re-resolves the connector and lease from scratch, while still force-refreshing on 401 and retaining the cache on transient errors.
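The single classifier this commit describes can be sketched as a pure function over the HTTP status of the failed call. The enum and function names here are hypothetical, and the real code classifies a kube error rather than a bare status code.

```rust
/// What the renew loop should do with its cached connector/lease after an error.
#[derive(Debug, PartialEq, Eq)]
enum CacheAction {
    /// 404: the Lease is gone, so drop the cache and re-resolve from scratch.
    ResetCache,
    /// 401: refresh credentials but keep the cached names.
    ForceRefreshAndRetain,
    /// Anything else is treated as transient: keep the cache and retry.
    Retain,
}

/// Classify an error by HTTP status; `None` stands for non-HTTP failures
/// such as timeouts or connection resets.
fn classify_heartbeat_error(http_status: Option<u16>) -> CacheAction {
    match http_status {
        Some(404) => CacheAction::ResetCache,
        Some(401) => CacheAction::ForceRefreshAndRetain,
        _ => CacheAction::Retain,
    }
}
```

Routing both the fetch-lease and renew arms through one function like this keeps the 404 handling from drifting between them.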
The CLI persisted a single iroh secret key at the repo root and reused it for every project the user touched. The network-services-operator explicitly treats two Connectors with the same iroh public key as a collision: the iroh DNS controller picks one winner and marks the losers with IrohDNSPublished=False; Reason=DeferredToOwner. The losing project's tunnel reports Ready but silently drops data because the iroh DNS record points at the wrong Connector. Move listen_key under <repo>/projects/<project_id>/ so each project has a distinct iroh identity. On first per-project access, migrate any legacy flat listen_key into that project's directory — the first project the user runs against keeps continuity with its existing server-side Connector; subsequent projects get fresh keys and stop joining the race. Leave connect_key (no server-side Connector) and gateway_key (separate daemon identity) flat. The UI and Serve paths continue to use the flat listen_key for now; converting them needs its own pass. The Tunnel command now requires a selected project and fails with a clear message if none is set, since the per-project key path needs a project id at node construction time.
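A rough sketch of the per-project key layout and the one-time legacy migration, assuming plain files under the repo root. The function names are hypothetical and the real implementation may handle permissions and races differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: <repo>/projects/<project_id>/listen_key.
fn project_listen_key_path(repo: &Path, project_id: &str) -> PathBuf {
    repo.join("projects").join(project_id).join("listen_key")
}

/// On first per-project access, move a legacy flat <repo>/listen_key into
/// that project's directory. Returns Ok(true) if a migration happened.
fn migrate_legacy_listen_key(repo: &Path, project_id: &str) -> io::Result<bool> {
    let legacy = repo.join("listen_key");
    let target = project_listen_key_path(repo, project_id);
    // Nothing to do if this project already has a key or no legacy key exists.
    if target.exists() || !legacy.exists() {
        return Ok(false);
    }
    if let Some(parent) = target.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::rename(&legacy, &target)?;
    Ok(true)
}
```

Because the legacy file is moved rather than copied, only the first project inherits the old identity; later projects find no legacy key and would generate a fresh one.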
The setup loop only checked accepted && programmed && !hostnames.is_empty() and slept 2s at a time, so the user saw "Setting up tunnel..." then a 2s silence and then "Tunnel ready" — even when the Connector's IrohDNSPublished=False; DeferredToOwner condition meant the data plane was silently unreachable. Surface the six controller conditions that already exist on the HTTPProxy and Connector (Accepted, CertificatesReady, ConnectorReady, IrohDNSPublished, Programmed, ConnectorMetadataProgrammed) through a typed TunnelProgress, and stream each transition as a checklist line. Bail immediately when IrohDNSPublished comes back False with reason DeferredToOwner — that's the cross-project iroh-key collision case where waiting longer can't help, so we print the operator's message naming the owning Connector and exit non-zero. Also warn on stdout when any step stays pending past 30 seconds, since the controller's reason string is the most useful diagnostic when something genuinely stalls. Polling at 750ms is fine: get_active_progress does two reads (HTTPProxy + Connector) on an already-warm PCP client, and server-side reconcile latency dominates.
…reakage Setup-time progress checks aren't enough. Today's failure mode: a tunnel came up cleanly, ran for ~9 minutes, then the iroh DNS controller re-reconciled and flipped IrohDNSPublished from True back to False because a deleted Connector's DNS claim was never cleaned up server-side. The data plane went dark while the CLI kept reporting healthy — Ready was still True, the heartbeat was still renewing the lease, and there was no client-side signal that anything had changed. Poll progress every 10s alongside the existing login-state watch. When terminal_failure() trips (currently IrohDNSPublished=False with reason DeferredToOwner), print the same message the setup path emits and break out of the run loop cleanly so the operator gets disable+cleanup instead of a silent zombie. Tunnel-deleted-from-under-us also breaks out; transient poll errors only warn and retry. Factor the failure message into format_terminal_failure() so setup-time and runtime emit identical wording — the user shouldn't have to learn two error shapes for the same diagnosis.
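The terminal-failure check shared by the setup path and the runtime watch can be sketched as below. The `Condition` struct is a simplified stand-in for the controller conditions, and the signatures and message wording are assumptions; only the names `terminal_failure` and `format_terminal_failure` come from the commit messages.

```rust
/// Minimal stand-in for a Kubernetes-style status condition.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Condition {
    kind: String,
    status: String,
    reason: String,
    message: String,
}

/// Return the condition that makes further waiting pointless, if any.
/// Currently only IrohDNSPublished=False with reason DeferredToOwner qualifies.
fn terminal_failure(conditions: &[Condition]) -> Option<&Condition> {
    conditions.iter().find(|c| {
        c.kind == "IrohDNSPublished" && c.status == "False" && c.reason == "DeferredToOwner"
    })
}

/// One formatter for both call sites, so the user sees identical wording.
/// The exact text here is illustrative.
fn format_terminal_failure(c: &Condition) -> String {
    format!(
        "tunnel cannot carry traffic: {}={} ({}): {}",
        c.kind, c.status, c.reason, c.message
    )
}
```

A runtime watch would call `terminal_failure` on each poll and break out of the run loop when it returns `Some`, printing the shared message before cleanup.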
Diagnosis notes from a debugging session
Over a single CLI session I hit three distinct failure modes that all presented the same way: "Tunnel ready" with the data plane silently dropping. Recording the chain here so the cause-effect is preserved alongside the fix commits.

1. Heartbeat wedge on deleted Lease
Symptom. Tunnel went silent after no visible errors; logs eventually showed a warn loop: firing every 30s indefinitely.
Root cause.
Fix. c901b01 classifies the kube error per arm: 404 → reset cache, 401 → force refresh + retain, anything else → retain. Three unit tests cover the decision.

2. Machine-wide iroh identity colliding across projects
Symptom. A freshly-created tunnel reported
Root cause.
Fix. 76987b7 scopes
The audit of

3. Setup readiness check too shallow
Symptom. Tunnels reported
Root cause. The setup loop only checked
Fix. 90a7ab3 exposes a typed

4. Stale DNS owner re-emerging post-setup
Symptom. A tunnel that came up cleanly worked for ~9 minutes and then quietly went dark with no client-side signal.
Root cause. A previously-deleted Connector's iroh DNS claim was never cleaned up server-side. When the iroh DNS controller re-reconciled, it flipped the live Connector's
Fix (client-side, this PR). 6264818 adds a runtime watch alongside the login-state watch: poll
Server-side follow-up (out of scope for this PR). The iroh DNS controller leaking a claim past its owning Connector's lifetime is a real bug. Worth filing against the operator team; concrete instance from today's session was

Recovery playbook this PR enables
If a user's tunnel goes terminal mid-session, the CLI now:
Operator action: delete the per-project listen_key ( |
…jects
HeartbeatAgent::start() auto-enrolls every project the user has access
to, which is correct for the UI (it surfaces tunnels across projects)
but wrong for the CLI tunnel-listen command, which owns exactly one
project. Today's logs showed the fan-out clearly:
heartbeat: no connector yet project_id=drewr-y4nd1b
heartbeat: no connector yet project_id=drewr3-ceu4gt
The CLI silently maintained presence in drewr3-ceu4gt — a project the
user never mentioned — for the lifetime of the tunnel. Harmless today,
but it makes logs misleading, multiplies API load, and would create
real risk if a misconfigured token granted access to a project that
shouldn't be touched.
Add start_manual() that skips the watcher entirely. The CLI now starts
manual mode and explicitly registers its single project. Per-project
loops still handle 401s via force-refresh, so transient auth blips are
tolerated; the CLI's own login-state watch surfaces permanent logout.
Keeps start() unchanged so the UI continues to auto-enroll. Its doc
comment now points to start_manual for callers that follow the
CLI-style pattern.
…stname The endpoint-match adoption path is anchored on the connector record matching the agent's current iroh endpoint id, then filtered to HTTPProxies whose backend references that connector by NAME. Any time the connector record gets renamed server-side (delete + ensure_connector recreates with a new generateName suffix) or the iroh identity rotates, all the previous HTTPProxies become invisible to list_project and adoption silently misses, spawning a fresh tunnel with a fresh hostname. Add --id <tunnel-id> to tunnel listen for the "I have already shared this URL with others" case. The path looks up the HTTPProxy by name (direct API call, no connector filter), calls update_active to re-point its backend at the current connector, and re-enables its ConnectorAdvertisement. The hostname lives on the HTTPProxy resource and is preserved across connector identity changes. Also refactor TunnelService::get_active to do a direct fetch instead of filtering list_active. The previous implementation filtered out tunnels whose backend pointed at any connector other than this agent's current one, which made tunnel update/delete by id silently fail-as-NotFound after a connector rename. Direct fetch matches the user's intent when they explicitly name a tunnel. Factor summary_from_proxy() so get_active and list_project don't drift.
--endpoint is now optional. The four input shapes:
--id <id> resume the tunnel verbatim using its stored
endpoint; re-point the connector backend at
this agent's current iroh identity.
--id <id> --endpoint <e> same, but assert the user-provided endpoint
matches the stored one (after the same
normalization the lib applies). A mismatch
fails hard with a message pointing at
'tunnel update' for explicit changes — we
don't silently retarget a tunnel whose URL
the user may have shared with others.
--endpoint <e> existing endpoint-match adoption path.
neither error with a hint at both flags.
Expose lib::normalize_endpoint so the CLI compares endpoint strings
using the same canonicalization the stored TunnelSummary value was
produced with (trim + prepend http:// if no scheme), instead of doing
naïve string equality that would spuriously fail on scheme/whitespace
differences.
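The canonicalization described above (trim, then prepend http:// when no scheme is present) could look roughly like this. It is a sketch under that description only; the actual lib::normalize_endpoint may detect schemes differently.

```rust
/// Sketch of endpoint canonicalization: trim whitespace and default to http://
/// when the input carries no scheme. Scheme detection via "://" is an assumption.
fn normalize_endpoint(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{trimmed}")
    }
}

/// Compare a user-supplied endpoint with the stored one after normalizing both.
fn endpoints_match(user_supplied: &str, stored: &str) -> bool {
    normalize_endpoint(user_supplied) == normalize_endpoint(stored)
}
```

Comparing normalized forms means `--endpoint 127.0.0.1:11434` matches a stored `http://127.0.0.1:11434`, while a genuinely different port or scheme still fails the assertion.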
Running 'tunnel listen' bare now pops an interactive picker on a TTY, listing existing tunnels with hostname → endpoint [label]. Pick one with ↑↓/Enter to resume it (treated as if --id had been given). Tunnels without hostnames (still pending) are excluded — picking them would just produce 'tunnel not found'. Enabled tunnels sort above disabled (marked '○') so the most-likely-relevant ones come first. Non-TTY (CI, piped) keeps the existing fail-fast error so scripts don't hang on stdin. Empty candidate list (no tunnels in the project, or no tunnels with hostnames yet) also falls through to the error — there's nothing to pick. Adds inquire 0.9 for the picker. Considered rolling raw crossterm to avoid a dep, but the cost/benefit didn't justify it for one prompt.
The iroh relay-actor (and other chatty modules under iroh::magicsock /
lib::*) write to the same TTY inquire is repainting on. The collision
looked like:
? Resume which tunnel? 2026-06-07T19:10:30Z INFO relay-actor: ...
roxy.net → http://127.0.0.1:11434 [d38f5f413beb]
— a log line spliced across the picker's first option, leaving the
terminal unreadable.
Wrap the EnvFilter in a reload handle exposed via a OnceLock so the
picker can engage a RAII QuietTracing guard that swaps the filter to
'error' for the lifetime of the inquire prompt and restores it on drop.
Captures the previous filter via EnvFilter's Display impl since it
doesn't implement Clone; round-trip through to_string()/try_new() is
the supported path.
This is a targeted fix for the symptom. A cleaner long-term move would
be to defer ListenNode construction past the picker so iroh hasn't
booted yet — but that requires TunnelService to support read-only
listing without a node, which is a larger refactor.
…ration

'tunnel listen --id' PATCHes the HTTPProxy spec to re-point its backend at the current connector. That bumps metadata.generation, but the controllers' prior True conditions still carry observedGeneration from the previous generation until they re-reconcile. The progress check was reading those as Ready, so the CLI reported "Tunnel ready after 0 sec" while the data plane was still serving 503s from the old Envoy config.

Mark a step Ready only when status == "True" AND observed_generation >= metadata.generation. Stale-but-True conditions become Pending (still progressing), the same code path as the 30s stuck warning, so the user gets the controller's reason rather than a false-positive Ready. None observedGeneration is treated as 0, which falls through to Pending on any non-zero generation (correct: the controller hasn't reconciled this resource yet).

Test progress_pending_when_status_is_stale_for_current_generation covers both halves: stale-True is Pending; once the controller catches up, the same condition flips to Ready. Existing tests still pass because their fixtures default to generation=None == 0 on both sides.
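The readiness rule in this commit reduces to a small predicate. The sketch below uses hypothetical names and collapses every non-Ready case into Pending, which is a simplification of whatever states the real TunnelProgress tracks.

```rust
#[derive(Debug, PartialEq, Eq)]
enum StepState {
    Ready,
    Pending,
}

/// A step is Ready only if the condition is True AND the controller has
/// observed the resource's current generation. A missing value counts as 0.
fn step_state(
    status: &str,
    observed_generation: Option<i64>,
    generation: Option<i64>,
) -> StepState {
    let observed = observed_generation.unwrap_or(0);
    let current = generation.unwrap_or(0);
    if status == "True" && observed >= current {
        StepState::Ready
    } else {
        StepState::Pending
    }
}
```

After a PATCH bumps the generation from 3 to 4, a condition still stamped with observedGeneration 3 reads as Pending until the controller re-reconciles.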
A 0-second "Tunnel ready" with a 503-serving data plane (the observedGeneration bug we just fixed) made it clear that users need a fast pivot from a stuck progress line to 'datumctl describe ...' on the exact resource. Without that, the operator's reason string is useful text but its provenance is buried.

Add a 'resource: Option<String>' field on ProgressStep, pre-formatted as "HTTPProxy/<tunnel-id>" or "Connector/<connector-name>", populated from the live resource metadata in from_resources. Mapping per step:

  tunnel accepted            → HTTPProxy/<tunnel-id>
  TLS certificate issued     → HTTPProxy/<tunnel-id>
  connector ready            → Connector/<connector-name>
  iroh DNS published         → Connector/<connector-name>
  route programmed           → HTTPProxy/<tunnel-id>
  envoy metadata propagated  → HTTPProxy/<tunnel-id>

CLI renders the label inline:

  ✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
  … route programmed still pending after 30s [HTTPProxy/tunnel-gchhg]: …

ProgressStepKind::resource_kind() is the source of truth for which kind backs each step, used by the test that asserts the wiring is correct across all six steps. No extra API call is needed: the connector name was already in scope inside TunnelService::get_active_progress.
… picker When there's exactly one tunnel to resume, inquire still rendered the prompt with a forced '>' marker on the lone row and arrow keys were no-ops. The terminal cursor sat on the '? Resume which tunnel?' line — visually it looked like the selector hadn't moved 'to the correct line' because there was no other line to move to. The picker only earns its keep when there's a choice to make. Short-circuit at one candidate: print 'Resuming the only tunnel ...' naming the hostname and id, return it directly. The >1 case still pops the picker, where arrow-key movement was verified working under pty with cursor-position+selected-index alignment.
The operator's quota service occasionally times out the admission check on Create requests and returns 403 with:

  "Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support."

The error message itself says "try again". Until today the CLI just surfaced the raw 403 and bailed mid-listen, which the user has to recover from manually.

Add is_quota_check_timeout() that matches the specific 403 message (distinct from real quota exhaustion, which uses different wording), and with_quota_check_retry() that retries up to ~15s (1s, 2s, 4s, 8s, final attempt) on that exact class. Other 403s (real exhaustion, IAM denials, admission rejections) return immediately so genuine failures still surface fast. Prints a one-line stderr notice on first retry so the user knows we're waiting on the server, not wedged.

Apply at every kube .create() site in the tunnel lifecycle:
- HTTPProxy create (fresh tunnel)
- ConnectorAdvertisement create (fresh tunnel)
- TrafficProtectionPolicy create (fresh tunnel)
- ConnectorAdvertisement create (set_enabled when resuming)
- Connector create (ensure_connector first run)

Also tighten format_quota_error to skip the timeout phrase: when retries exhaust, the user should see the actual server message rather than "Quota limit exceeded for ConnectorAdvertisement", which is the wrong diagnosis. Real "Insufficient quota" exhaustion still gets the helpful message.

Test covers the classifier and the formatter carve-out on both the timeout and real-exhaustion shapes.
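The retry schedule above (1s, 2s, 4s, 8s, then a final attempt) can be sketched synchronously as follows. The function names come from the commit message, but the signatures, the `(status, message)` error shape, and the injected `sleep` callback are assumptions made so the sketch stays self-contained; the real code is presumably async and works on kube errors.

```rust
use std::time::Duration;

/// Phrase assumed to identify the admission-check timeout (not real exhaustion).
const TIMEOUT_PHRASE: &str = "took too long to be checked against your quota";

fn is_quota_check_timeout(status: u16, message: &str) -> bool {
    status == 403 && message.contains(TIMEOUT_PHRASE)
}

/// Retry `op` only on the quota-check-timeout class; any other error returns
/// immediately. Waits 1s, 2s, 4s, 8s between attempts, then tries once more.
fn with_quota_check_retry<T>(
    mut op: impl FnMut() -> Result<T, (u16, String)>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, (u16, String)> {
    let delays = [1u64, 2, 4, 8];
    for (i, secs) in delays.iter().enumerate() {
        match op() {
            Ok(v) => return Ok(v),
            Err((status, msg)) if is_quota_check_timeout(status, &msg) => {
                if i == 0 {
                    eprintln!("quota check timed out on the server; retrying...");
                }
                sleep(Duration::from_secs(*secs));
            }
            Err(e) => return Err(e),
        }
    }
    // Final attempt: whatever happens here is returned to the caller as-is.
    op()
}
```

Passing the sleep function in keeps the backoff testable without waiting; a caller in production would pass a real sleep.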
Quota 403 diagnosis
While shipping
The chain
The failure class is already named in datum-cloud code
From
// InstanceQuotaGrantedReasonBackendUnavailable indicates quota enforcement
// is configured but the Milo quota backend is unreachable (network error,
// timeout, transient failure).
//
// InstanceQuotaGrantedReasonMisconfigured indicates the ResourceClaim was
// rejected by the Milo admission plugin (403/422): ResourceRegistration absent
// or the policy is malformed.
Our 403 is the user-facing surface of
Most likely root causes (in order of probability)
How to confirm
Server-side fixes worth considering
(Not in this PR's scope, since the actual implementation is in Milo, but flagging for whoever owns it.)
Mitigation on the CLI side (this PR)
d8f7c96 adds
Action for @drewr
If the Milo side analysis above looks plausible, this is worth a separate placeholder issue against whichever Milo repo owns the quota webhook + backend, including the operator-side mitigations as suggestions. Happy to draft that issue too if you confirm the diagnosis. |
Controllers reporting Ready (with observedGeneration in sync) still
doesn't mean the data plane is actually carrying traffic — Envoy
programming a route is not the same as Envoy serving it. The user
reported a ~2-minute window where every condition was True but
https://<proxy>/ returned 503. Whatever's behind it (xDS push lag, edge
config-not-yet-loaded, iroh peer connection still settling), it's
invisible from the controller's view.
Add a "Verifying connectivity..." phase between the condition checklist
and "Tunnel ready". Every 10s, probe in parallel:
- the origin URL the user gave (so a downed local service is named
explicitly instead of being blamed on the tunnel)
- the public proxy URL (https://<hostname>/)
Any response under 500 counts as "reachable" — 4xx like 401/404 are
fine because the edge is forwarding; only 5xx + transport errors block.
On each tick we print a ✓ line for newly-reachable endpoints and a …
line for ones still failing, with the controller's last error so the
user can act ("origin connection refused" vs "proxy 503").
New --timeout flag (default 10m, humantime) caps total setup including
verification. On expiry the command exits non-zero with a per-side
summary so an unverified tunnel doesn't get treated as healthy.
Sleep is clamped to the remaining budget so an early success on one
side doesn't waste the last 10s before bailing on the other.
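Two small rules carry most of the verification phase: what counts as reachable, and how long to sleep between ticks. A minimal sketch, with a hypothetical `ProbeOutcome` type standing in for the result of an HTTP probe:

```rust
use std::time::Duration;

/// Result of one probe attempt; a hypothetical type for this sketch.
#[derive(Debug)]
enum ProbeOutcome {
    /// The far side answered with this HTTP status.
    Status(u16),
    /// Connection refused, TLS failure, timeout, and similar.
    TransportError(String),
}

/// Any response under 500 means traffic is being forwarded, so 401 or 404
/// still count as reachable. Only 5xx and transport errors block.
fn is_reachable(outcome: &ProbeOutcome) -> bool {
    matches!(outcome, ProbeOutcome::Status(code) if *code < 500)
}

/// Clamp the sleep to the remaining --timeout budget so the last tick
/// does not overshoot the deadline.
fn next_sleep(interval: Duration, remaining: Duration) -> Duration {
    interval.min(remaining)
}
```

With a 10s interval and 3s of budget left, `next_sleep` yields 3s, so the command reports the timeout promptly instead of sleeping through it.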
When the CLI's --id / picker-resume calls update_active, it almost always passes the same label, endpoint, and current connector that the existing HTTPProxy already references. The previous behavior PATCHed HTTPProxy.spec.rules and metadata.annotations unconditionally with that identical content. Whether the apiserver bumps metadata.generation on content-identical Patch::Merge is implementation-dependent — and in practice we've seen the spec touch correlate with downstream Envoy re-reconciles and a 5xx data-plane window of 1–3 minutes after the controller conditions all flip Ready. Make update_project (and the ad sub-step) skip the PATCH when the existing spec already matches what we'd write. Comparison is on serde_json::Value so it's stable against Option<...> serde-default quirks that would otherwise trip naive structural equality. This makes the lib's update verb idempotent at the lib boundary — which is the property the upcoming "extract shared connect logic into lib for cli + ui + datumctl plugin" work depends on. As a bonus the UI's Edit-tunnel dialog (which currently PATCHes even when the user hits Save without changing anything) gets the no-churn behavior for free, with no UI-side changes. This is hygiene, not the cold-resume latency fix: even with the PATCH skipped, runs continue to show intermittent multi-minute 5xx windows caused by edge-side iroh peer-establishment latency (separate issue, not yet filed pending @drewr's review). Tests cover both comparators across the relevant drift axes (different connector, different endpoint, different label, missing annotation, ad port change, ad connector change).
Status update
Picking up where the diagnosis-arc comment left off. Significant ground covered since the runtime-watch fix.
Landed since last update
Tunnel selection & resume UX
Progress / verification
Lib hardening
Issues filed against other repos (all placeholders awaiting @drewr review)
Known still-flaky behavior (mitigated, not fixed)
On a resume run, the proxy URL still occasionally returns 5xx for 1–3 minutes after
Test coverage on this branch
47 lib tests, all green. Notable new coverage:
What's next
A separate work stream to factor shared tunnel-management logic out of |
A single 503 from the Datum API server's Envoy front-end ("upstream
connect error or disconnect/reset before headers. reset reason:
connection termination" — typical when kube apiserver briefly drops
connections behind Envoy) was killing in-progress tunnel setups that
the next 750ms poll tick would have ridden over. Observed during an
EnvoyPatchPolicy-reconcile wait on a fresh tunnel: setup conditions were on the
slow-but-working path and the run aborted at the unrelated transient.
The runtime watch already handles this correctly — log on error and
keep going. Mirror that in await_tunnel_progress with a bounded retry:
up to MAX_CONSECUTIVE_POLL_ERRORS (10 ≈ 7.5s at the current cadence)
before bailing. Long enough to ride out a brief blip; short enough that
a genuinely unreachable control plane still surfaces fast.
The change lives in await_tunnel_progress (cli/src/main.rs) but the
function is on the future connect-lib side of the boundary discussed in
datum-cloud/enhancements#756 comment 4644292554 — it's pure orchestration
over TunnelService::get_active_progress, no rendering, no clap. The
shape (consecutive-error counter + bounded retry + bail-fast on hard
signals) is the one the lib will inherit.
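The shape described here (consecutive-error counter plus bounded retry) can be sketched as a generic polling loop. Only MAX_CONSECUTIVE_POLL_ERRORS and its value come from the commit message; the function name, signature, and injected `sleep` callback are assumptions, and the real await_tunnel_progress is presumably async and also bails on hard signals.

```rust
const MAX_CONSECUTIVE_POLL_ERRORS: u32 = 10;

/// Poll until `poll` reports done. A successful poll resets the error streak;
/// only MAX_CONSECUTIVE_POLL_ERRORS failures in a row abort the wait.
fn await_with_bounded_retry<E>(
    mut poll: impl FnMut() -> Result<bool, E>,
    mut sleep: impl FnMut(),
) -> Result<(), E> {
    let mut consecutive_errors: u32 = 0;
    loop {
        match poll() {
            Ok(true) => return Ok(()),
            Ok(false) => consecutive_errors = 0,
            Err(e) => {
                consecutive_errors += 1;
                if consecutive_errors >= MAX_CONSECUTIVE_POLL_ERRORS {
                    return Err(e);
                }
            }
        }
        // One tick of the polling cadence (750ms in the CLI).
        sleep();
    }
}
```

A single transient 503 therefore costs one tick, while a control plane that stays unreachable still surfaces after ten failed polls in a row.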
…s work

The CLI accepts --endpoint 127.0.0.1:11434 (no scheme) and passes that string through to verify_endpoints, which hands it to reqwest. Reqwest's request builder refuses to build a request from a URL without a scheme and returns a "builder error", which our probe was reporting as "origin not reachable" indefinitely:

  ✓ proxy responding (0.4s) [https://...]: HTTP 200
  … origin not reachable (0s) [127.0.0.1:11434]: builder error
  … origin not reachable (10s) [127.0.0.1:11434]: builder error
  ...

The actual origin was reachable the whole time: the proxy probe got HTTP 200 through the tunnel back to the same host:port. Only the CLI's local probe was wedged.

Apply lib::normalize_endpoint (the same canonicalization that TunnelSummary.endpoint stores) at the top of verify_endpoints so any bare host:port works as input. The displayed URL becomes the canonical form (http://127.0.0.1:11434), matching what's stored on the HTTPProxy.

verify_endpoints is on the connect-lib side of the boundary we sketched in datum-cloud/enhancements#756 comment 4644292554. Defensive normalization belongs here so other callers (UI Edit dialog, the future plugin foreground listen path) don't have to remember to canonicalize.
cargo-zigbuild for aarch64-unknown-linux-gnu failed at openssl-sys
because pkg-config can't find target-arch openssl headers, and we
can't easily provide them outside the workspace. The transitive pull
is:
iroh -> pkarr -> reqwest 0.13 (default features = "default-tls") ->
hyper-tls -> native-tls -> openssl-sys
reqwest 0.13 ships from pkarr's lockfile with native-tls included.
Patching iroh/pkarr to switch reqwest features isn't on our path; the
workspace's own reqwest 0.12 dep is unrelated (and adding
default-features = false there doesn't reach the 0.13 instance —
they're separate version-graph nodes).
Add `openssl = { version = "0.10", features = ["vendored"] }` to the
CLI. Cargo feature unification enables the vendored build for the
transitive openssl-sys, so cross-compiling no longer needs target-arch
system headers — openssl gets compiled from source as part of the
build. Static link, no runtime libssl/libcrypto dependency.
Verified: native check passes, aarch64-unknown-linux-gnu cross-build
via cargo-zigbuild produces a valid 251MB unstripped ARM aarch64 ELF.
The Parser-derived Args lacked a version attribute, so clap rejected
both --version and -V. Add #[command(version)] which sources the
version from Cargo.toml's package.version via env!("CARGO_PKG_VERSION")
at compile time, giving recipients of distributed binaries a built-in
"which build do I have?" check without depending on filename, mtime,
or sha256 verification.
$ datum-connect --version
datum-connect 0.1.0
The existing `auth login` uses an authorization-code-with-PKCE flow that
binds a localhost HTTP server to receive the OIDC redirect. On a remote
machine over SSH, in CI, or in a container, that pattern is unreachable
— the browser running on the operator's laptop can't reach a port bound
on the remote box without a separate SSH port-forward. The standard
escape hatch is RFC 8628 OAuth2 device authorization, which is what
datumctl's own `login --no-browser` uses.
Mirror that here:
- StatelessClient::login_device_code() — fetches the OIDC discovery
JSON directly (openidconnect's CoreProviderMetadata doesn't surface
device_authorization_endpoint), rebuilds the OIDC client with
set_device_authorization_url(), starts the grant, hands the
DeviceCodeInfo (verification URL + user code + expiry) to a caller-
supplied display callback, and polls exchange_device_access_token
via tokio::time::sleep. Token-response parsing reuses the existing
parse_token_response with a nonce verifier that allows missing-nonce
(device flow doesn't bind one).
- AuthClient::login_device_code() — always performs a fresh login;
callers wanting refresh-eligible token reuse should use the normal
login() instead.
- DatumCloudClient::login_device_code() — top-level entry point.
- DeviceCodeInfo re-exported from lib::datum_cloud so the CLI doesn't
take a direct dep on openidconnect's Core types.
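The polling step can be sketched as follows. This is a synchronous, standard-library-only stand-in for the loop that the commit delegates to exchange_device_access_token and tokio::time::sleep, not the code in this PR. `PollResult`, `DeviceFlowError`, and `poll_for_token` are hypothetical names; the handling of each response follows RFC 8628 section 3.5.

```rust
use std::time::Duration;

/// Outcome of one request to the token endpoint (RFC 8628, section 3.5).
#[derive(Debug, Clone, PartialEq)]
enum PollResult {
    Token(String),
    AuthorizationPending,
    SlowDown,
    AccessDenied,
    ExpiredToken,
}

#[derive(Debug, PartialEq)]
enum DeviceFlowError {
    Denied,
    Expired,
}

/// Poll until the user approves, denies, or the device code expires.
/// `poll` and `sleep` are injected so the loop can run without a network
/// connection or real delays.
fn poll_for_token<P, S>(
    mut poll: P,
    mut sleep: S,
    mut interval: Duration,
    expires_in: Duration,
) -> Result<String, DeviceFlowError>
where
    P: FnMut() -> PollResult,
    S: FnMut(Duration),
{
    let mut waited = Duration::ZERO;
    loop {
        // Stop once the device code's lifetime has been used up.
        if waited >= expires_in {
            return Err(DeviceFlowError::Expired);
        }
        sleep(interval);
        waited += interval;
        match poll() {
            PollResult::Token(token) => return Ok(token),
            // The user has not finished in the browser yet; keep waiting.
            PollResult::AuthorizationPending => {}
            // The server asks for slower polling: add 5 seconds to the
            // interval for this and all later requests.
            PollResult::SlowDown => interval += Duration::from_secs(5),
            PollResult::AccessDenied => return Err(DeviceFlowError::Denied),
            PollResult::ExpiredToken => return Err(DeviceFlowError::Expired),
        }
    }
}
```

Injecting `sleep` keeps the sketch testable; an async version would await a timer at the same point instead.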
CLI side, AuthCommands::Login and AuthCommands::Switch both gain a
--no-browser flag that routes to the new method. The display callback
prints the verification URL + user code prominently to stderr so it
doesn't tangle with structured stdout (relevant for future plugin
modes).
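A display callback along these lines would keep the prompt off stdout. This is a minimal sketch: the struct below is a hypothetical stand-in for the re-exported DeviceCodeInfo, its field names are assumptions, and the message wording is invented for illustration.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for DeviceCodeInfo; field names are assumptions.
struct DeviceCodeInfo {
    verification_url: String,
    user_code: String,
    expires_in_secs: u64,
}

/// Write the login prompt to any writer. The CLI would pass `io::stderr()`
/// so that stdout stays free for structured output.
fn show_device_code(out: &mut impl Write, info: &DeviceCodeInfo) -> io::Result<()> {
    writeln!(out)?;
    writeln!(out, "To sign in, open: {}", info.verification_url)?;
    writeln!(out, "and enter code:   {}", info.user_code)?;
    writeln!(out, "(code expires in {} minutes)", info.expires_in_secs / 60)?;
    out.flush()
}
```

Taking a generic writer rather than calling `eprintln!` directly makes the output easy to capture in a test.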
Verified against the production auth server's OIDC discovery (Datum's
Zitadel exposes device_authorization_endpoint and lists
urn:ietf:params:oauth:grant-type:device_code in grant_types_supported).
Adds --no-browser device-flow login (e14d689) since cli-v0.1.0.
Our own OIDC client (datum-desktop-app, configured in datum-cloud/infra
apps/datum-iam-system/.../zitadel-setup/pulumi/index.ts) has only
AUTHORIZATION_CODE + REFRESH_TOKEN in its allow-listed grantTypes.
Zitadel correctly rejects the device-code grant against it:
    unauthorized_client: grant_type "...device_code" not allowed
datumctl-cli (a sibling OIDC app in the same Zitadel project) already
has DEVICE_CODE in its grantTypes and has stable, well-known IDs in
datumctl's source:
    Staging:    325848904128073754
    Production: 328728232771788043
Borrow them for the --no-browser path until the planned datumctl
connect plugin ships with its own properly-scoped client. Tokens are
minted by Zitadel against the same project, so downstream Datum API
calls don't care which client minted them. The audience verifier on
id_token_verifier already allows any audience.
Regular `auth login` (browser flow) is unchanged; it stays on the
datum-desktop-app client.
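The client selection described in this commit could look roughly like the sketch below. The two datumctl-cli IDs are the ones quoted above; `Environment`, `LoginFlow`, and `client_id_for` are hypothetical names, and the datum-desktop-app ID is passed in as a parameter because it is not given here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Environment {
    Staging,
    Production,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LoginFlow {
    Browser,
    DeviceCode,
}

// datumctl-cli client IDs as quoted in the commit message.
const DATUMCTL_CLI_STAGING: &str = "325848904128073754";
const DATUMCTL_CLI_PRODUCTION: &str = "328728232771788043";

/// Choose the OIDC client ID for a login attempt.
/// The browser flow stays on the desktop client; the device-code flow
/// borrows the datumctl-cli client, which allows the DEVICE_CODE grant.
fn client_id_for<'a>(env: Environment, flow: LoginFlow, desktop_client_id: &'a str) -> &'a str {
    match (flow, env) {
        (LoginFlow::Browser, _) => desktop_client_id,
        (LoginFlow::DeviceCode, Environment::Staging) => DATUMCTL_CLI_STAGING,
        (LoginFlow::DeviceCode, Environment::Production) => DATUMCTL_CLI_PRODUCTION,
    }
}
```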

Summary
This PR ships the CLI client for Datum Connect tunneling — the headless equivalent of the desktop UI. It lets users authenticate, manage projects, and expose local services to public hostnames without launching the GUI.
Building
Rust tooling only (no Nix required):
Or with Nix:
nix run .#cli -- --help
Commands
auth
projects
tunnel
`tunnel listen` runs in the foreground. It creates or reuses a tunnel for the given endpoint, starts the heartbeat agent so the gateway has routing info, enables the tunnel, and polls until it is accepted and programmed before printing the public hostname. `Ctrl+C` disables the tunnel and exits.
The `--project` flag overrides the active project for a single invocation without changing the stored selection.
Project selection
The active project is stored in `config.yml` (default: `~/.local/share/Datum/config.yml`, overridable via `$DATUM_CONNECT_REPO`). It is set interactively after `auth login` or `auth switch`, or explicitly with `projects switch`.
Example session
Bug fixes (found during testing)
- `HeartbeatAgent` that continuously patches `status.connectionDetails` on the connector. Without it the gateway has no routing info. Fixed: `tunnel listen` now starts the heartbeat and registers the project before enabling the tunnel.
- `tunnel listen` on an existing endpoint always prompted for update: Random label was generated before checking for an existing tunnel, so it always differed. Fixed: label generation moved into the create-new path; existing tunnels reuse their stored label unless `--label` is explicitly given.
- `delete_project` returned early if no connector was found, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Fixed: connector lookup is only needed for post-deletion cleanup and no longer gates resource deletion.
- `tunnel-<u16>` format: Collided visually with resource ID format. Switched to 12 hex chars of random entropy (e.g. `a3f9c2e1b047`).
Test plan
- `cargo run -p datum-connect -- auth login` completes OAuth and prompts for project selection
- `projects list` shows all orgs/projects with active one marked
- `projects switch` persists new selection to `config.yml`
- `tunnel listen --endpoint 127.0.0.1:<port>` creates tunnel, prints hostname, disables on Ctrl+C
- `tunnel listen` on the same endpoint reuses the existing tunnel without prompting
- `tunnel listen --project <id>` uses the specified project
- `tunnel list` shows tunnels in the active project
- `tunnel delete` removes a tunnel cleanly
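The label handling described under Bug fixes can be sketched as below. This is a minimal illustration, not the code in this PR: `label_from_bytes` and `resolve_label` are hypothetical names, and the random bytes are supplied by the caller so the sketch needs no external crate.

```rust
/// Format six random bytes as a 12-character lowercase hex label.
fn label_from_bytes(bytes: [u8; 6]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Pick the label for `tunnel listen`:
/// an explicit `--label` wins, an existing tunnel keeps its stored label,
/// and a random label is generated only on the create-new path.
fn resolve_label(
    explicit: Option<&str>,
    existing: Option<&str>,
    random_bytes: impl FnOnce() -> [u8; 6],
) -> String {
    match (explicit, existing) {
        (Some(label), _) => label.to_string(),
        (None, Some(stored)) => stored.to_string(),
        (None, None) => label_from_bytes(random_bytes()),
    }
}
```

Because the random source is only called in the last arm, re-running against an existing endpoint never produces a differing label, which is the behaviour the second bug fix describes.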