Add CLI tunnel and auth commands #130

Draft
drewr wants to merge 42 commits into main from cli-tunnel-and-auth
@drewr

@drewr drewr commented Mar 27, 2026

Contributor

Summary

This PR ships the CLI client for Datum Connect tunneling — the headless equivalent of the desktop UI. It lets users authenticate, manage projects, and expose local services to public hostnames without launching the GUI.

Building

Rust tooling only (no Nix required):

cargo run -p datum-connect -- --help

Or with Nix:

nix run .#cli -- --help

Commands

auth

datum-connect auth login       # OAuth via browser; prompts to select a project after login
datum-connect auth logout
datum-connect auth status      # Shows authenticated user and active org/project
datum-connect auth list
datum-connect auth switch      # Logs out and re-authenticates; prompts to select a project

projects

datum-connect projects list    # Lists all orgs and projects; marks the active one with *
datum-connect projects switch  # Interactive prompt to change the active project

tunnel

datum-connect tunnel listen --endpoint 127.0.0.1:8080
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --label my-tunnel
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --project <project-id>
datum-connect tunnel list
datum-connect tunnel update --id <id> --label new-name
datum-connect tunnel delete --id <id>

tunnel listen runs in the foreground. It creates or reuses a tunnel for the given endpoint, starts the heartbeat agent so the gateway has routing info, enables the tunnel, and polls until it is accepted and programmed before printing the public hostname. Ctrl+C disables the tunnel and exits.

The --project flag overrides the active project for a single invocation without changing the stored selection.

Project selection

The active project is stored in config.yml (default: ~/.local/share/Datum/config.yml, overridable via $DATUM_CONNECT_REPO). It is set interactively after auth login or auth switch, or explicitly with projects switch.

Example session

$ cargo run -p datum-connect -- auth login
# browser opens for OAuth
Logged in as Jane Smith (jane@example.com)

Select a project:
  [1] Acme Corp / production
  [2] Acme Corp / staging
Enter number [1-2]: 2
Selected project: Acme Corp / staging

$ cargo run -p datum-connect -- tunnel listen --endpoint 127.0.0.1:3000
Created tunnel:
  id: httpp-abc123
  label: f3a9c2e1b047

Your endpoint ID: 30a9ddf5...
Setting up tunnel...
Tunnel ready after 8 sec: https://f3a9c2e1b047.tunnels.datum.net
Press Ctrl+C to stop...

Bug fixes (found during testing)

  • Tunnels created from CLI never route traffic: CLI was missing the HeartbeatAgent that continuously patches status.connectionDetails on the connector. Without it the gateway has no routing info. Fixed: tunnel listen now starts the heartbeat and registers the project before enabling the tunnel.
  • Re-running tunnel listen on an existing endpoint always prompted for update: Random label was generated before checking for an existing tunnel, so it always differed. Fixed: label generation moved into the create-new path; existing tunnels reuse their stored label unless --label is explicitly given.
  • Tunnel delete silently no-ops when connector is missing: delete_project returned early if no connector was found, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Fixed: connector lookup is only needed for post-deletion cleanup and no longer gates resource deletion.
  • Auto-generated label used tunnel-<u16> format: Collided visually with resource ID format. Switched to 12 hex chars of random entropy (e.g. a3f9c2e1b047).
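
The label rules in the last two fixes can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `label_from_bytes` and `choose_label` are hypothetical names, and the caller is assumed to supply the random bytes (the commit only says that the `hex` crate was added as a dependency).

```rust
/// Hex-encodes six bytes into a 12-character lowercase label.
/// Where the bytes come from (the entropy source) is left to the caller.
fn label_from_bytes(bytes: [u8; 6]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Picks the label for `tunnel listen`:
/// an explicit --label wins, an existing tunnel keeps its stored label,
/// and a fresh label is generated only on the create-new path.
fn choose_label(
    explicit: Option<&str>,
    existing: Option<&str>,
    fresh: impl FnOnce() -> String,
) -> String {
    match (explicit, existing) {
        (Some(label), _) => label.to_string(),
        (None, Some(stored)) => stored.to_string(),
        (None, None) => fresh(),
    }
}
```

Because `fresh` is only invoked in the last arm, re-running on an existing endpoint never produces a new random label that could differ from the stored one.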

Test plan

  • cargo run -p datum-connect -- auth login completes OAuth and prompts for project selection
  • projects list shows all orgs/projects with active one marked
  • projects switch persists new selection to config.yml
  • tunnel listen --endpoint 127.0.0.1:<port> creates tunnel, prints hostname, disables on Ctrl+C
  • Re-running tunnel listen on the same endpoint reuses the existing tunnel without prompting
  • tunnel listen --project <id> uses the specified project
  • tunnel list shows tunnels in the active project
  • tunnel delete removes a tunnel cleanly

@drewr drewr marked this pull request as draft March 27, 2026 20:14
@zachsmith1

Contributor

Do we want a separate cli for tunnels or do we want to bake in functionality into datumctl?

@drewr drewr force-pushed the cli-tunnel-and-auth branch from ea2df66 to 80ffdf7 Compare March 27, 2026 20:39
@drewr

drewr commented Mar 27, 2026

Contributor Author

Yeah, it's why this is a draft. I needed the functionality and didn't want to commit one way or the other yet. I explored doing it in datumctl and it would involve either replicating the Iroh sidecar in go or making the project hybrid with a rust component.

This method uses all the same machinery as the GUI which felt like a better first pass.

drewr added 2 commits March 27, 2026 15:51
- Add 'tunnel' subcommand to datum-connect CLI with:
  - 'tunnel list': read-only listing of tunnels (no side effects)
  - 'tunnel listen': create/update and run tunnel in foreground
  - 'tunnel update': update tunnel label/endpoint
  - 'tunnel delete': delete a tunnel
- Add 'nix run .#connect' app to flake.nix
- Split find_connector_readonly for list operations
- Remove side effects from tunnel list (no patching Connector)
- Listen command:
  - Generates random label if not provided
  - Confirms before updating existing tunnel
  - Handles Ctrl+C to disable tunnel on exit
- Add 'auth' subcommand to CLI with:
  - 'auth status': Show current authentication and selected context
  - 'auth login': Log in via browser OAuth with account picker
  - 'auth logout': Log out and clear credentials
  - 'auth list': Show current authenticated user
  - 'auth switch': Log out current user and prompt for new login

Also add is_authenticated(), login(), logout() methods to DatumCloudClient.
@drewr drewr force-pushed the cli-tunnel-and-auth branch from 80ffdf7 to 01c3ab8 Compare March 27, 2026 20:51
@drewr drewr self-assigned this Mar 27, 2026
@zachsmith1

Contributor

Ya the challenge is the core stuff we need is in rust so we'll need some magic to make the UX good

@scotwells

Contributor

How does this interact with the GUI based application? Would auth be shared?

Since the GUI is locked to a specific project (because connectors are project-scoped resources), switching the authenticated user could break existing tunnels without the user knowing and it doesn't seem like we warn the user.

@drewr

drewr commented Mar 27, 2026

Contributor Author

It's all shared. I'll show what it looks like when Rust is done compiling...

drewr and others added 5 commits March 27, 2026 16:37
delete_project returned early when find_connector returned None,
skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy.

Connector lookup is only needed for post-deletion cleanup (deciding
whether to delete the shared connector). Move it into an Option and
gate the cleanup block on Some, so resource deletion always proceeds.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Three interrelated bugs fixed in the tunnel listen command:

- Random label was generated before checking for an existing tunnel,
  so re-running listen on the same endpoint always triggered the update
  prompt. Moved label generation into the create-new path only; existing
  tunnels reuse their stored label unless --label is explicitly provided
  and differs.

- Default label format changed from tunnel-<u16> (collides with resource
  ID format) to 12 hex chars of random entropy (e.g. a3f9c2e1b047).
  Adds hex as a dependency.

- tunnel listen was missing the HeartbeatAgent that continuously patches
  status.connectionDetails on the connector (relay URL, addresses, public
  key). Without it the gateway has no routing info and tunnels never
  carry traffic. Now starts the heartbeat and registers the project before
  enabling the tunnel, then polls until accepted+programmed before
  printing the hostname.

Also simplifies tunnel delete output: connector cleanup is an internal
detail, so "Deleted tunnel <id>" replaces "(connector deleted: false)".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  - After auth login/switch, prompt user to select an org and project
    and persist the selection as the active context
  - Store the selected context in config.yml instead of a separate file
  - Add --project flag to the tunnel command to override the active
    project for a single invocation
  - Add projects list and projects switch commands for managing the
    active project outside of the auth flow
  - Fix tunnel listen to print id and label after creation
@drewr

drewr commented Mar 31, 2026

Contributor Author

Here's a short demo of where I've gotten with this:

headless-tunnels-demo.mp4

@bmertens-datum


@drewr Nice demo.

@zachsmith1

Contributor

@drewr this is slick. I'm planning on splitting off the app repo from the gateway repo and we should consider where we'd want this cli to live. last piece there would be a small enhancement around how we could inject this rust binary into datumctl (if we want to)

@richardhenwood


I've had a moment to try this - and following your excellent demo video, and typing datum-connect -- tunnel listen --endpoint localhost:8080 I got a 'connector' appearing in the Datum cloud UI. This is a really powerful way to think about connections for me - so I'm very excited to play around :)

image

FYI: this is on my Fedora 43 workstation.

@drewr

drewr commented Apr 9, 2026

Contributor Author

Great feedback @richardhenwood, thanks!

@drewr drewr assigned gianarb and unassigned drewr Apr 9, 2026
@drewr

drewr commented Apr 9, 2026

Contributor Author

@zachsmith1 wrote:

where we'd want this cli to live

I think if we factor out the local process to a standalone rust utility like you're proposing it makes more sense for this to live in datumctl. I originally went that direction but didn't want to either rewrite the iroh integration in go or repackage this in an awkward way.

@drewr

drewr commented Apr 9, 2026

Contributor Author

I've had some instability with this and had both gpt-5.4 and sonnet-4.6 chewing on it:

Found it. The UpstreamProxy authorizes incoming iroh connections by checking self.state.get().proxies, but the CLI tunnel listen flow never calls listen.set_proxy() to register the tunnel in local state. The gateway connects over iroh, the auth handler finds no matching proxy, returns Forbidden, and the gateway sees a connection reset.

Fix incoming.

@gianarb

gianarb commented Apr 9, 2026

Collaborator

There is a lot to unwrap in my opinion here, a lot around product so I am not sure I have enough context to help here.

Some of this goes back to an old discussion we had in datum-cloud/enhancements#582; see the ecosystem chapter:

Now my attention turns to "do we want to keep consistency in the ecosystem?". Do we want for example to get Datum Desktop to look at that file as well? So a switch context in the CLI will switch context in desktop?

In practice, what I was trying to highlight is the model Kubernetes and other CLI tools follow: everything you run starts from a single source of truth (for Kubernetes, the ~/.kube/config file). If we can agree on something similar, it will be a lot easier to bring other CTLs or applications into a consistent state.

It will also be a lot easier to push for a plugin ecosystem like the one kubectl and others developed, where binaries whose names start with kubectl- get called from the main ctl. In this case we could release a binary datumctl-connect that would be callable as datumctl connect.

But if we can not agree on some common practices, like authentication the outcome for a user will be pretty poor, in this case I feel like we should just "give up" and release different binaries working their own way.

I am not saying that we should have in place the ability to switch and persist in between accounts/instances because I know we do not know yet datum-cloud/enhancements#653 (comment) but maybe since we do not know we can just take what we have today as common denominator until we figure out what's next.

So the way I envision the evolution of this PR is a binary that serves only the business logic to manage tunnels and connections and demands authentication to the same login used by the datumctl (or the datumctl changes to turn to the same used here and from desktop)

This is what I am trying to push to but as I said product wise I am not sure I have enough context to push into a direction vs another.

The gateway sends `CONNECT localhost:<port>` regardless of whether the
tunnel was registered with `localhost` or `127.0.0.1`, causing auth to
fail with Forbidden and the caller to see "upstream connect error or
disconnect/reset before headers."

Normalize `localhost`, `127.0.0.1`, and `::1` to a canonical form on
both sides of the host comparison in `tcp_proxy_exists`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewr

drewr commented Apr 9, 2026

Contributor Author

Gateway hostname normalization — needs investigation

While debugging an "upstream connect error or disconnect/reset before headers" report, I found the root cause in commit e2d868d: the gateway sends CONNECT localhost:<port> in the iroh HTTP CONNECT request regardless of what address was stored in the ConnectorAdvertisement (in this case 127.0.0.1). The strict string comparison in tcp_proxy_exists failed, returning Forbidden, which the gateway surfaces as a connection reset.

The client-side fix normalizes localhost, 127.0.0.1, and ::1 to a canonical form at comparison time, which handles the mismatch. But the gateway behavior is worth examining:

Observable behavior (from /tmp/datum-2026-04-09T19:45:27+00:00.log, line 8132):

handle request req=HttpRequest { version: HTTP/1.1, headers: {}, uri: localhost:3300, method: CONNECT }

The ConnectorAdvertisement had address: "127.0.0.1", but the gateway sent CONNECT localhost:3300.

Questions for the gateway team:

  • Is this intentional? Does the gateway always normalize 127.0.0.1 → localhost for loopback addresses?
  • Should the gateway instead use the ConnectorAdvertisement address verbatim in the CONNECT target?
  • If the gateway normalizes to localhost, should it normalize to 127.0.0.1 instead (since that's what users typically specify as the --endpoint)?

The client-side fix is defensive and correct either way, but if the gateway is doing unintended normalization, fixing it there would be cleaner and might surface other subtle issues.
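
A minimal sketch of the comparison-time normalization described above. The function names are illustrative rather than the ones in the PR (the source only names tcp_proxy_exists), and choosing `localhost` as the canonical spelling is an assumption; any single canonical form would work as long as both sides of the comparison use it.

```rust
/// Maps the three loopback spellings named in the PR to one canonical host.
/// Other hosts are only lowercased. Brackets around an IPv6 literal are
/// stripped so "[::1]" and "::1" compare equal (an added convenience).
fn canonical_host(host: &str) -> String {
    let bare = host.trim().trim_start_matches('[').trim_end_matches(']');
    match bare.to_ascii_lowercase().as_str() {
        "localhost" | "127.0.0.1" | "::1" => "localhost".to_string(),
        other => other.to_string(),
    }
}

/// Compares a registered target with a requested CONNECT target,
/// normalizing the host on both sides before the equality check.
fn same_target(reg_host: &str, reg_port: u16, req_host: &str, req_port: u16) -> bool {
    reg_port == req_port && canonical_host(reg_host) == canonical_host(req_host)
}
```

With this shape, a tunnel registered as 127.0.0.1:3300 matches a CONNECT for localhost:3300, while the port and any non-loopback host are still compared strictly.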

drewr added 5 commits June 6, 2026 18:29
When the Lease the heartbeat owns is removed server-side (TTL cleanup,
namespace reap, manual delete), the renew loop kept patching the dead
name forever — only logging a warn each tick — because the cache was
preserved on every error. The tunnel went silently dark.

Route both the fetch-lease and renew error arms through a single
classifier that resets the cache on 404 so the next iteration re-resolves
the connector and lease from scratch, while still force-refreshing on 401
and retaining the cache on transient errors.
The CLI persisted a single iroh secret key at the repo root and reused it
for every project the user touched. The network-services-operator
explicitly treats two Connectors with the same iroh public key as a
collision: the iroh DNS controller picks one winner and marks the losers
with IrohDNSPublished=False; Reason=DeferredToOwner. The losing project's
tunnel reports Ready but silently drops data because the iroh DNS record
points at the wrong Connector.

Move listen_key under <repo>/projects/<project_id>/ so each project has a
distinct iroh identity. On first per-project access, migrate any legacy
flat listen_key into that project's directory — the first project the
user runs against keeps continuity with its existing server-side
Connector; subsequent projects get fresh keys and stop joining the race.

Leave connect_key (no server-side Connector) and gateway_key (separate
daemon identity) flat. The UI and Serve paths continue to use the flat
listen_key for now; converting them needs its own pass.

The Tunnel command now requires a selected project and fails with a
clear message if none is set, since the per-project key path needs a
project id at node construction time.
The setup loop only checked accepted && programmed && !hostnames.is_empty()
and slept 2s at a time, so the user saw "Setting up tunnel..." then a 2s
silence and then "Tunnel ready" — even when the Connector's
IrohDNSPublished=False; DeferredToOwner condition meant the data plane
was silently unreachable.

Surface the six controller conditions that already exist on the HTTPProxy
and Connector (Accepted, CertificatesReady, ConnectorReady,
IrohDNSPublished, Programmed, ConnectorMetadataProgrammed) through a
typed TunnelProgress, and stream each transition as a checklist line.
Bail immediately when IrohDNSPublished comes back False with reason
DeferredToOwner — that's the cross-project iroh-key collision case
where waiting longer can't help, so we print the operator's message
naming the owning Connector and exit non-zero. Also warn on stdout when
any step stays pending past 30 seconds, since the controller's reason
string is the most useful diagnostic when something genuinely stalls.

Polling at 750ms is fine: get_active_progress does two reads
(HTTPProxy + Connector) on an already-warm PCP client, and server-side
reconcile latency dominates.
…reakage

Setup-time progress checks aren't enough. Today's failure mode: a tunnel
came up cleanly, ran for ~9 minutes, then the iroh DNS controller
re-reconciled and flipped IrohDNSPublished from True back to False
because a deleted Connector's DNS claim was never cleaned up server-side.
The data plane went dark while the CLI kept reporting healthy — Ready
was still True, the heartbeat was still renewing the lease, and there
was no client-side signal that anything had changed.

Poll progress every 10s alongside the existing login-state watch.
When terminal_failure() trips (currently IrohDNSPublished=False with
reason DeferredToOwner), print the same message the setup path emits and
break out of the run loop cleanly so the operator gets disable+cleanup
instead of a silent zombie. Tunnel-deleted-from-under-us also breaks
out; transient poll errors only warn and retry.

Factor the failure message into format_terminal_failure() so setup-time
and runtime emit identical wording — the user shouldn't have to learn
two error shapes for the same diagnosis.
@drewr

drewr commented Jun 7, 2026

Contributor Author

Diagnosis notes from a debugging session

Over a single CLI session I hit four distinct failure modes that all presented the same way: "Tunnel ready" with the data plane silently dropping traffic. Recording the chain here so the cause and effect is preserved alongside the fix commits.

1. Heartbeat wedge on deleted Lease

Symptom. Tunnel went silent with no visible errors; logs eventually showed a warn loop:

heartbeat: lease renew failed: ApiError: leases.coordination.k8s.io "datum-connect-jttwh" not found: NotFound

firing every 30s indefinitely.

Root cause. run_for_project cached the resolved lease_name once and the renew error arms unconditionally retained the cache. A server-side delete of the Lease (TTL cleanup, namespace reap, manual) put the loop in a state it couldn't recover from — the cache never cleared, so the loop never went back through the connector-probe / lease-resolve path at the top.

Fix. c901b01 classifies the kube error per arm: 404 → reset cache, 401 → force refresh + retain, anything else → retain. Three unit tests cover the decision.
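
The decision table in that fix can be sketched as below. This is an illustration under assumptions: the real classifier works on the kube client's error type, whereas here the HTTP status is passed as a plain `Option<u16>`, and `CacheAction`, `classify` and `next_cache` are hypothetical names.

```rust
/// What the renew loop should do with its cached lease name after an error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheAction {
    /// 404: the Lease is gone; drop the cache and re-resolve from scratch.
    Reset,
    /// 401: force a token refresh but keep the cached lease name.
    RefreshAuthAndRetain,
    /// Anything else is treated as transient; keep the cache and retry.
    Retain,
}

/// Classifies an API error by HTTP status (None = no status, e.g. a network error).
fn classify(status: Option<u16>) -> CacheAction {
    match status {
        Some(404) => CacheAction::Reset,
        Some(401) => CacheAction::RefreshAuthAndRetain,
        _ => CacheAction::Retain,
    }
}

/// Applies the action to the cached lease name.
fn next_cache(cache: Option<String>, action: CacheAction) -> Option<String> {
    match action {
        CacheAction::Reset => None,
        CacheAction::RefreshAuthAndRetain | CacheAction::Retain => cache,
    }
}
```

Routing both error arms through one function is what keeps the fetch-lease and renew paths from drifting apart.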

2. Machine-wide iroh identity colliding across projects

Symptom. A freshly-created tunnel reported Tunnel ready cleanly, but curl returned Connection reset by peer at the TLS handshake. The Connector's status held:

Type: IrohDNSPublished     Status: False
Reason: DeferredToOwner
Message: iroh DNS record is owned by Connector /<other-project>/default/datum-connect-<other>

Root cause. Repo::listen_key() resolved to a single flat path (<repo>/listen_key) regardless of project. Running the CLI against multiple projects from the same machine registered the same iroh public key under multiple Connector objects; the iroh DNS controller (network-services-operator/internal/controller/iroh_dns_controller.go) explicitly treats that as a collision ("Multiple Connectors that share the same iroh keypair … collapse to one DNSRecordSet — the first to claim wins, and the loser surfaces a DeferredToOwner condition"). The losing project's Connector is Ready=True but unreachable.

Fix. 76987b7 scopes listen_key under <repo>/projects/<project_id>/ with a one-shot migration that moves the legacy flat key into the first project that requests it (so the user keeps continuity with one existing server-side Connector; the rest get fresh keys). connect_key and gateway_key stay flat — neither registers a per-project server-side identity.
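
A rough sketch of the per-project key path with the one-shot migration, using only the standard library. The function name and the use of `fs::rename` are assumptions; the PR's `Repo` type and its key loading and generation code are not shown here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns <repo>/projects/<project_id>/listen_key, creating the directory.
/// If a legacy flat <repo>/listen_key exists and this project has no key yet,
/// the legacy key is moved into this project's directory. Only the first
/// project to ask inherits it; later projects find no legacy file and the
/// caller is expected to generate a fresh key for them.
fn project_listen_key_path(repo: &Path, project_id: &str) -> io::Result<PathBuf> {
    let project_dir = repo.join("projects").join(project_id);
    fs::create_dir_all(&project_dir)?;

    let scoped = project_dir.join("listen_key");
    let legacy = repo.join("listen_key");
    if !scoped.exists() && legacy.exists() {
        // Same filesystem, so a rename is enough for the migration.
        fs::rename(&legacy, &scoped)?;
    }
    Ok(scoped)
}
```

The returned path may point at a file that does not exist yet, which is the signal for the caller to create a new identity for that project.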

The audit of network-services-operator confirmed per-project (per-Connector) identity is the intended design model, not just an acceptable workaround.

3. Setup readiness check too shallow

Symptom. Tunnels reported Tunnel ready after 2 sec even when the connector's IrohDNSPublished was DeferredToOwner — i.e., when they were guaranteed not to carry traffic.

Root cause. The setup loop only checked accepted && programmed && !hostnames.is_empty() on the HTTPProxy. The Connector's conditions weren't read at all, so collisions and stuck states never surfaced to the user.

Fix. 90a7ab3 exposes a typed TunnelProgress over six controller conditions (ProxyAccepted, CertificatesReady, ConnectorReady, IrohDnsPublished, ProxyProgrammed, ConnectorMetadataProgrammed) and streams each transition during setup. terminal_failure() matches specifically on IrohDNSPublished=False; DeferredToOwner and bails immediately rather than waiting forever.
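
The terminal-failure match can be illustrated with a small sketch. The `Condition` struct below is a simplified stand-in for the Kubernetes-style condition on the Connector (the field is called `kind` because `type` is a Rust keyword), and the free function is an assumption about shape only; in the PR, terminal_failure() lives on the typed TunnelProgress.

```rust
/// Simplified stand-in for a status condition read from the Connector.
#[derive(Debug, Clone)]
struct Condition {
    kind: String,
    status: String,
    reason: String,
    message: String,
}

/// Returns the condition that makes further waiting pointless, if any.
/// Currently only IrohDNSPublished=False with reason DeferredToOwner qualifies.
fn terminal_failure(conditions: &[Condition]) -> Option<&Condition> {
    conditions.iter().find(|c| {
        c.kind == "IrohDNSPublished" && c.status == "False" && c.reason == "DeferredToOwner"
    })
}
```

Matching on the reason as well as the status matters: a plain False while the controller is still working is a pending state, not a terminal one.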

4. Stale DNS owner re-emerging post-setup

Symptom. A tunnel that came up cleanly worked for ~9 minutes and then quietly went dark with no client-side signal. Ready=True, lease being heartbeated on schedule, no errors in CLI logs.

Root cause. A previously-deleted Connector's iroh DNS claim was never cleaned up server-side. When the iroh DNS controller re-reconciled, it flipped the live Connector's IrohDNSPublished from True back to False with DeferredToOwner, naming the dead Connector's UID as the owner. The DNS claim outliving its owning Connector looks like either a missed owner-reference cascade on Connector delete or a controller-side cache that isn't invalidated when the Connector goes away.

Fix (client-side, this PR). 6264818 adds a runtime watch alongside the login-state watch — poll get_active_progress every 10s after setup completes, surface terminal failures with the same format_terminal_failure message the setup path uses, and break out of the run loop so the disable/cleanup path runs. Tunnel-deleted-server-side also breaks out; transient query errors just warn and retry.

Server-side follow-up (out of scope for this PR). The iroh DNS controller leaking a claim past its owning Connector's lifetime is a real bug. Worth filing against the operator team — concrete instance from today's session was datum-connect-jttwh (UID 226a90b6-3cad-4242-9eff-c2c71a335545) showing up as Owner of a DNS record after the Connector itself returned 404.

Recovery playbook this PR enables

If a user's tunnel goes terminal mid-session, the CLI now:

  1. Prints the operator's message naming the conflicting Connector,
  2. Exits the run loop cleanly,
  3. Disables the tunnel server-side on shutdown.

Operator action: delete the per-project listen_key (<repo>/projects/<project_id>/listen_key), then rerun. The new identity sidesteps the stale claim entirely.

…jects

HeartbeatAgent::start() auto-enrolls every project the user has access
to, which is correct for the UI (it surfaces tunnels across projects)
but wrong for the CLI tunnel-listen command, which owns exactly one
project. Today's logs showed the fan-out clearly:

    heartbeat: no connector yet project_id=drewr-y4nd1b
    heartbeat: no connector yet project_id=drewr3-ceu4gt

The CLI silently maintained presence in drewr3-ceu4gt — a project the
user never mentioned — for the lifetime of the tunnel. Harmless today,
but it makes logs misleading, multiplies API load, and would create
real risk if a misconfigured token granted access to a project that
shouldn't be touched.

Add start_manual() that skips the watcher entirely. The CLI now starts
manual mode and explicitly registers its single project. Per-project
loops still handle 401s via force-refresh, so transient auth blips are
tolerated; the CLI's own login-state watch surfaces permanent logout.

Keeps start() unchanged so the UI continues to auto-enroll. The new
entry point is documented with a pointer to start_manual for callers
in the CLI-style pattern.
drewr added 8 commits June 7, 2026 18:52
…stname

The endpoint-match adoption path is anchored on the connector record
matching the agent's current iroh endpoint id, then filtered to
HTTPProxies whose backend references that connector by NAME. Any time
the connector record gets renamed server-side (delete + ensure_connector
recreates with a new generateName suffix) or the iroh identity rotates,
all the previous HTTPProxies become invisible to list_project and
adoption silently misses, spawning a fresh tunnel with a fresh hostname.

Add --id <tunnel-id> to tunnel listen for the "I have already shared
this URL with others" case. The path looks up the HTTPProxy by name
(direct API call, no connector filter), calls update_active to re-point
its backend at the current connector, and re-enables its
ConnectorAdvertisement. The hostname lives on the HTTPProxy resource
and is preserved across connector identity changes.

Also refactor TunnelService::get_active to do a direct fetch instead of
filtering list_active. The previous implementation filtered out tunnels
whose backend pointed at any connector other than this agent's current
one, which made tunnel update/delete by id silently fail-as-NotFound
after a connector rename. Direct fetch matches the user's intent when
they explicitly name a tunnel.

Factor summary_from_proxy() so get_active and list_project don't drift.
--endpoint is now optional. The four input shapes:

  --id <id>                   resume the tunnel verbatim using its stored
                              endpoint; re-point the connector backend at
                              this agent's current iroh identity.
  --id <id> --endpoint <e>    same, but assert the user-provided endpoint
                              matches the stored one (after the same
                              normalization the lib applies). A mismatch
                              fails hard with a message pointing at
                              'tunnel update' for explicit changes — we
                              don't silently retarget a tunnel whose URL
                              the user may have shared with others.
  --endpoint <e>              existing endpoint-match adoption path.
  neither                     error with a hint at both flags.

Expose lib::normalize_endpoint so the CLI compares endpoint strings
using the same canonicalization the stored TunnelSummary value was
produced with (trim + prepend http:// if no scheme), instead of doing
naïve string equality that would spuriously fail on scheme/whitespace
differences.
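
The four input shapes and the endpoint comparison can be sketched like this. It is a simplified illustration: `ListenMode`, `resolve_listen_mode` and the `lookup_stored` callback are hypothetical, the error strings are made up, and detecting a scheme by looking for "://" is an assumption about how normalize_endpoint decides whether to prepend http://.

```rust
/// Trim, then prepend http:// if no scheme is present.
fn normalize_endpoint(raw: &str) -> String {
    let trimmed = raw.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{trimmed}")
    }
}

#[derive(Debug, PartialEq, Eq)]
enum ListenMode {
    /// --id given: resume that tunnel using its stored endpoint.
    ResumeById(String),
    /// Only --endpoint given: existing endpoint-match adoption path.
    AdoptByEndpoint(String),
}

/// `lookup_stored` returns the stored endpoint for a tunnel id, if it exists.
fn resolve_listen_mode(
    id: Option<&str>,
    endpoint: Option<&str>,
    lookup_stored: impl Fn(&str) -> Option<String>,
) -> Result<ListenMode, String> {
    match (id, endpoint) {
        (Some(id), provided) => {
            let stored = lookup_stored(id).ok_or_else(|| format!("tunnel {id} not found"))?;
            if let Some(p) = provided {
                // Compare with the same canonicalization on both sides.
                if normalize_endpoint(p) != normalize_endpoint(&stored) {
                    return Err(format!(
                        "endpoint does not match tunnel {id}; use 'tunnel update' to change it"
                    ));
                }
            }
            Ok(ListenMode::ResumeById(id.to_string()))
        }
        (None, Some(e)) => Ok(ListenMode::AdoptByEndpoint(normalize_endpoint(e))),
        (None, None) => Err("pass --id or --endpoint".to_string()),
    }
}
```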
Running 'tunnel listen' bare now pops an interactive picker on a TTY,
listing existing tunnels with hostname → endpoint [label]. Pick one
with ↑↓/Enter to resume it (treated as if --id had been given).

Tunnels without hostnames (still pending) are excluded — picking them
would just produce 'tunnel not found'. Enabled tunnels sort above
disabled (marked '○') so the most-likely-relevant ones come first.

Non-TTY (CI, piped) keeps the existing fail-fast error so scripts don't
hang on stdin. Empty candidate list (no tunnels in the project, or no
tunnels with hostnames yet) also falls through to the error — there's
nothing to pick.

Adds inquire 0.9 for the picker. Considered rolling raw crossterm to
avoid a dep, but the cost/benefit didn't justify it for one prompt.
The iroh relay-actor (and other chatty modules under iroh::magicsock /
lib::*) write to the same TTY inquire is repainting on. The collision
looked like:

    ? Resume which tunnel? 2026-06-07T19:10:30Z INFO relay-actor: ...
    roxy.net  →  http://127.0.0.1:11434  [d38f5f413beb]

— a log line spliced across the picker's first option, leaving the
terminal unreadable.

Wrap the EnvFilter in a reload handle exposed via a OnceLock so the
picker can engage a RAII QuietTracing guard that swaps the filter to
'error' for the lifetime of the inquire prompt and restores it on drop.
Captures the previous filter via EnvFilter's Display impl since it
doesn't implement Clone; round-trip through to_string()/try_new() is
the supported path.

This is a targeted fix for the symptom. A cleaner long-term move would
be to defer ListenNode construction past the picker so iroh hasn't
booted yet — but that requires TunnelService to support read-only
listing without a node, which is a larger refactor.
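
The RAII shape of the QuietTracing guard can be sketched without the tracing crates. In this stand-in, an `Arc<Mutex<String>>` plays the role of the reload handle and the filter is just its directive string; the real code swaps an EnvFilter through tracing-subscriber's reload handle, which is not reproduced here. `QuietGuard` is a hypothetical name.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the reloadable filter: the current directive string.
type SharedFilter = Arc<Mutex<String>>;

/// While alive, the filter is "error"; on drop the previous value is restored.
struct QuietGuard {
    filter: SharedFilter,
    previous: String,
}

impl QuietGuard {
    fn engage(filter: &SharedFilter) -> Self {
        let mut current = filter.lock().unwrap();
        // Capture the old directive as a string, mirroring the
        // to_string() round-trip the commit describes for EnvFilter.
        let previous = std::mem::replace(&mut *current, "error".to_string());
        QuietGuard { filter: Arc::clone(filter), previous }
    }
}

impl Drop for QuietGuard {
    fn drop(&mut self) {
        let mut current = self.filter.lock().unwrap();
        *current = std::mem::take(&mut self.previous);
    }
}
```

Binding the guard to a local variable for the duration of the prompt is enough: the filter is restored on every exit path, including early returns and panics that unwind.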
…ration

'tunnel listen --id' PATCHes the HTTPProxy spec to re-point its backend
at the current connector. That bumps metadata.generation, but the
controllers' prior True conditions still carry observedGeneration from
the previous generation until they re-reconcile. The progress check was
reading those as Ready, so the CLI reported "Tunnel ready after 0 sec"
while the data plane was still serving 503s from the old Envoy config.

Mark a step Ready only when status == "True" AND
observed_generation >= metadata.generation. Stale-but-True conditions
become Pending (still progressing) — same code path as the 30s stuck
warning, so the user gets the controller's reason rather than a false-
positive Ready. None observedGeneration is treated as 0, which falls
through to Pending on any non-zero generation (correct — the controller
hasn't reconciled this resource yet).

Test progress_pending_when_status_is_stale_for_current_generation
covers both halves: stale-True is Pending; once the controller catches
up, the same condition flips to Ready. Existing tests still pass
because their fixtures default to generation=None == 0 on both sides.
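
The readiness rule from this commit, as a minimal sketch. `StepState` and `step_state` are illustrative names, and the sketch collapses everything that is not Ready into Pending, which is a simplification of whatever states the real TunnelProgress tracks.

```rust
#[derive(Debug, PartialEq, Eq)]
enum StepState {
    Ready,
    Pending,
}

/// A step is Ready only when the condition is True AND the controller has
/// observed the current generation. A missing generation on either side is
/// treated as 0, as the commit message describes.
fn step_state(status: &str, observed_generation: Option<i64>, generation: Option<i64>) -> StepState {
    let observed = observed_generation.unwrap_or(0);
    let current = generation.unwrap_or(0);
    if status == "True" && observed >= current {
        StepState::Ready
    } else {
        StepState::Pending
    }
}
```

For example, after a PATCH bumps the resource to generation 2, a condition still carrying observedGeneration 1 stays Pending until the controller re-reconciles.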
A 0-second "Tunnel ready" with a 503-serving data plane (the
observedGeneration bug we just fixed) made it clear that users need a
fast pivot from a stuck progress line to 'datumctl describe ...' on the
exact resource. Without that, the operator's reason string is useful
text but its provenance is buried.

Add a 'resource: Option<String>' field on ProgressStep, pre-formatted
as "HTTPProxy/<tunnel-id>" or "Connector/<connector-name>", populated
from the live resource metadata in from_resources. Mapping per step:

  tunnel accepted              → HTTPProxy/<tunnel-id>
  TLS certificate issued       → HTTPProxy/<tunnel-id>
  connector ready              → Connector/<connector-name>
  iroh DNS published           → Connector/<connector-name>
  route programmed             → HTTPProxy/<tunnel-id>
  envoy metadata propagated    → HTTPProxy/<tunnel-id>

CLI renders the label inline:

  ✓ tunnel accepted (0.1s) [HTTPProxy/tunnel-gchhg]
  … route programmed still pending after 30s [HTTPProxy/tunnel-gchhg]: …

ProgressStepKind::resource_kind() is the source of truth for which kind
backs each step, used by the test that asserts the wiring is correct
across all six steps. No extra API call needed — the connector name was
already in scope inside TunnelService::get_active_progress.
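
A minimal sketch of the step-to-resource mapping listed above. `ProgressStepKind::resource_kind()` is named in this commit, but the variant names, the `ResourceKind` enum, and the `resource_label` helper are assumptions made for illustration.

```rust
/// Kubernetes kind that backs a progress step (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind {
    HttpProxy,
    Connector,
}

/// The six progress steps; variant names are assumed, not taken from the lib.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProgressStepKind {
    TunnelAccepted,
    TlsCertificateIssued,
    ConnectorReady,
    IrohDnsPublished,
    RouteProgrammed,
    EnvoyMetadataPropagated,
}

impl ProgressStepKind {
    /// Single source of truth for which kind backs each step.
    pub fn resource_kind(self) -> ResourceKind {
        match self {
            Self::ConnectorReady | Self::IrohDnsPublished => ResourceKind::Connector,
            Self::TunnelAccepted
            | Self::TlsCertificateIssued
            | Self::RouteProgrammed
            | Self::EnvoyMetadataPropagated => ResourceKind::HttpProxy,
        }
    }
}

/// Pre-format the label shown in brackets next to a progress line.
pub fn resource_label(step: ProgressStepKind, tunnel_id: &str, connector_name: &str) -> String {
    match step.resource_kind() {
        ResourceKind::HttpProxy => format!("HTTPProxy/{tunnel_id}"),
        ResourceKind::Connector => format!("Connector/{connector_name}"),
    }
}
```

Because the match is exhaustive, adding a seventh step without deciding its backing resource would fail to compile.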
… picker

When there's exactly one tunnel to resume, inquire still rendered the
prompt with a forced '>' marker on the lone row and arrow keys were
no-ops. The terminal cursor sat on the '? Resume which tunnel?' line —
visually it looked like the selector hadn't moved 'to the correct line'
because there was no other line to move to. The picker only earns its
keep when there's a choice to make.

Short-circuit at one candidate: print 'Resuming the only tunnel ...'
naming the hostname and id, return it directly. The >1 case still pops
the picker, where arrow-key movement was verified working under pty
with cursor-position+selected-index alignment.
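
The short-circuit can be sketched with a slice pattern. `Selection` and `choose` are hypothetical names; the real code calls inquire for the multi-candidate case, which this sketch leaves to the caller.

```rust
/// What the resume flow should do for a given candidate list (hypothetical).
pub enum Selection<'a, T> {
    /// Nothing to resume.
    Nothing,
    /// Exactly one candidate: adopt it without prompting.
    Only(&'a T),
    /// Two or more candidates: the caller should show the picker.
    NeedsPicker(&'a [T]),
}

/// Decide whether a picker is worth showing at all.
pub fn choose<T>(candidates: &[T]) -> Selection<'_, T> {
    match candidates {
        [] => Selection::Nothing,
        [only] => Selection::Only(only),
        many => Selection::NeedsPicker(many),
    }
}
```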
The operator's quota service occasionally times out the admission check
on Create requests and returns 403 with:

  "Your request took too long to be checked against your quota.
   Please try again in a moment — if this keeps happening, contact support."

The error message itself says "try again". Until today the CLI just
surfaced the raw 403 and bailed mid-listen, which the user had to
recover from manually.

Add is_quota_check_timeout() that matches the specific 403 message
(distinct from real quota exhaustion, which uses different wording), and
with_quota_check_retry() that retries up to ~15s (1s, 2s, 4s, 8s, final
attempt) on that exact class. Other 403s — real exhaustion, IAM denials,
admission rejections — return immediately so genuine failures still
surface fast. Prints a one-line stderr notice on first retry so the
user knows we're waiting on the server, not wedged.

Apply at every kube .create() site in the tunnel lifecycle:

  - HTTPProxy create (fresh tunnel)
  - ConnectorAdvertisement create (fresh tunnel)
  - TrafficProtectionPolicy create (fresh tunnel)
  - ConnectorAdvertisement create (set_enabled when resuming)
  - Connector create (ensure_connector first run)

Also tighten format_quota_error to skip the timeout phrase: when retries
exhaust, the user should see the actual server message rather than
"Quota limit exceeded for ConnectorAdvertisement", which is the wrong
diagnosis. Real "Insufficient quota" exhaustion still gets the helpful
message.

Test covers the classifier and the formatter carve-out on both the
timeout and real-exhaustion shapes.
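
A synchronous sketch of the classifier and the retry wrapper, under stated assumptions: `ApiError` is a stand-in for the kube error type, the matched phrase is taken from the 403 message quoted above, and the sleep function is injected so the schedule can be exercised without waiting. The real helper is presumably async.

```rust
use std::time::Duration;

/// Phrase assumed to identify the transient quota-check timeout.
const QUOTA_TIMEOUT_PHRASE: &str = "took too long to be checked against your quota";

/// Stand-in for the API error type (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApiError {
    pub code: u16,
    pub message: String,
}

/// True only for the transient 403, not for real quota exhaustion.
pub fn is_quota_check_timeout(err: &ApiError) -> bool {
    err.code == 403 && err.message.contains(QUOTA_TIMEOUT_PHRASE)
}

/// Retry `op` on the transient class with 1s, 2s, 4s, 8s backoff (~15s),
/// then make one final attempt. Any other result returns immediately.
pub fn with_quota_check_retry<T>(
    mut op: impl FnMut() -> Result<T, ApiError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, ApiError> {
    let delays = [1u64, 2, 4, 8];
    for (attempt, secs) in delays.iter().enumerate() {
        match op() {
            Err(e) if is_quota_check_timeout(&e) => {
                if attempt == 0 {
                    // One-line notice so the user knows we are waiting.
                    eprintln!("quota check timed out on the server; retrying...");
                }
                sleep(Duration::from_secs(*secs));
            }
            other => return other,
        }
    }
    op() // final attempt; its error (if any) is surfaced as-is
}
```

In production the second argument would be a real sleep such as `std::thread::sleep`; passing a closure keeps the backoff schedule testable.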
@drewr

drewr commented Jun 7, 2026

Contributor Author

Quota 403 diagnosis

While shipping d8f7c96 (CLI-side retry for the transient quota-check timeout), I traced the chain that produces this error:

ApiError: connectoradvertisements.networking.datumapis.com "tunnel-gchhg" is forbidden:
Your request took too long to be checked against your quota.
Please try again in a moment — if this keeps happening, contact support.: Forbidden (code: 403)

⚠ Best-effort diagnosis. Reconstructed from network-services-operator/config/quota/ resources, compute:api/v1alpha/instance_types.go reason codes, and engineering ops reviews. @drewr — please sanity-check against Milo internals (the actual webhook + quota backend code lives in a Milo repo I don't have local access to) before relying on this.

The chain

  1. API server receives Create ConnectorAdvertisement.
  2. Milo's quota admission webhook (quota.miloapis.com) intercepts.
  3. Webhook reads the matching ClaimCreationPolicy (e.g. connector-claim-policy.yaml), constructs a ResourceClaim of amount: 1 against ResourceType: networking.datumapis.com/connectors.
  4. Webhook calls a quota backend to check project remaining capacity.
  5. If the backend round-trip exceeds the webhook's timeoutSeconds, admission fails closed and returns the 403 above.

The failure class is already named in datum-cloud code

From datum-cloud/compute:api/v1alpha/instance_types.go:

// InstanceQuotaGrantedReasonBackendUnavailable indicates quota enforcement
// is configured but the Milo quota backend is unreachable (network error,
// timeout, transient failure).
//
// InstanceQuotaGrantedReasonMisconfigured indicates the ResourceClaim was
// rejected by the Milo admission plugin (403/422): ResourceRegistration absent
// or the policy is malformed.

Our 403 is the user-facing surface of BackendUnavailable — the webhook itself is healthy, but the backend it consults is slow/unreachable.

Most likely root causes (in order of probability)

  1. Webhook → quota backend round-trip latency. Admission webhooks have a hard cap (timeoutSeconds, max 30s). If the backend call routinely lands in the slow tail (GC pause on the backend pod, network blip, cold start), some fraction of admissions will time out. The user hitting this in a tight resume loop — same project, same resource type — is consistent with the backend being intermittently slow.
  2. Expensive per-request counting. If the webhook computes "current usage" by listing/aggregating all ResourceClaims for the project each request, latency scales with project state. Caching the count between admissions would fix it.
  3. Resource limits on the webhook or backend. CPU throttling or memory pressure on those pods drives request latency up unpredictably.
  4. Storage-side latency on the quota service (etcd or whatever Milo's quota backend uses). The 5/18 ops review tracks 409s on quota.miloapis.com as elevated — suggests this surface has known stability concerns.

How to confirm

  • Pull quota.miloapis.com webhook latency metrics — p50/p95/p99 over the period when this user hit it. If p99 is near the webhook's timeoutSeconds, that's the root cause.
  • Check whether the webhook caches per-project usage or recomputes per admission.
  • Inspect resource limits and throttling on the quota-webhook and quota-backend pods.
  • Look at whether the webhook does any internal retry on transient backend failures before returning the user-visible 403.

Server-side fixes worth considering

(Not in this PR's scope — the actual implementation is in Milo — but flagging for whoever owns it.)

  • Cache project quota usage in the webhook with a short TTL (1–5s); coalesce admissions of the same project so a burst doesn't N+1 the backend.
  • Raise timeoutSeconds to the Kubernetes max (30s) so a slow tail doesn't immediately fail. Cheap mitigation.
  • Async claim model for resources where transient over-allocation is tolerable: admit immediately, create the claim async, revoke if it turns out to be over. The compute repo's reason codes (Misconfigured, BackendUnavailable) already accommodate this shape; the design supports it.
  • Internal webhook retry on backend transient failures before returning the user-visible 403. The error message already says "Please try again in a moment" — the webhook is asking the client to do work that should be done in the server.

Mitigation on the CLI side (this PR)

d8f7c96 adds is_quota_check_timeout / with_quota_check_retry and wraps every .create() site in the tunnel listen flow. Up to ~15s of backoff before propagating the error. Distinct from real quota exhaustion (which uses different wording — Insufficient quota — and is preserved as the friendly "exceeded" message). This makes the experience tolerable but it's a workaround; the right fix is in Milo.

Action for @drewr

If the Milo side analysis above looks plausible, this is worth a separate placeholder issue against whichever Milo repo owns the quota webhook + backend, including the operator-side mitigations as suggestions. Happy to draft that issue too if you confirm the diagnosis.

drewr added 2 commits June 7, 2026 21:23
Controllers reporting Ready (with observedGeneration in sync) still
doesn't mean the data plane is actually carrying traffic — Envoy
programming a route is not the same as Envoy serving it. The user
reported a ~2-minute window where every condition was True but
https://<proxy>/ returned 503. Whatever's behind it (xDS push lag, edge
config-not-yet-loaded, iroh peer connection still settling), it's
invisible from the controller's view.

Add a "Verifying connectivity..." phase between the condition checklist
and "Tunnel ready". Every 10s, probe in parallel:

  - the origin URL the user gave (so a downed local service is named
    explicitly instead of being blamed on the tunnel)
  - the public proxy URL (https://<hostname>/)

Any response under 500 counts as "reachable" — 4xx like 401/404 are
fine because the edge is forwarding; only 5xx + transport errors block.
On each tick we print a ✓ line for newly-reachable endpoints and a …
line for ones still failing, with the controller's last error so the
user can act ("origin connection refused" vs "proxy 503").

New --timeout flag (default 10m, humantime) caps total setup including
verification. On expiry the command exits non-zero with a per-side
summary so an unverified tunnel doesn't get treated as healthy.

Sleep is clamped to the remaining budget so an early success on one
side doesn't waste the last 10s before bailing on the other.
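
Two of the rules above, the reachability threshold and the clamped sleep, can be sketched as pure helpers. The function names are hypothetical and the HTTP probing itself is omitted.

```rust
use std::time::Duration;

/// Any status under 500 means something answered, so 401/404 still count
/// as reachable; only 5xx (and transport errors, not modeled here) block.
pub fn is_reachable(status: u16) -> bool {
    status < 500
}

/// How long to sleep before the next probe tick, clamped to the remaining
/// --timeout budget. Returns None once the budget is spent.
pub fn next_sleep(interval: Duration, elapsed: Duration, timeout: Duration) -> Option<Duration> {
    let remaining = timeout.checked_sub(elapsed)?;
    if remaining.is_zero() {
        return None;
    }
    Some(interval.min(remaining))
}
```

With a 10m timeout and 595s elapsed, the next sleep is 5s rather than the full 10s interval, so the command does not overshoot its deadline.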
When the CLI's --id / picker-resume calls update_active, it almost
always passes the same label, endpoint, and current connector that the
existing HTTPProxy already references. The previous behavior PATCHed
HTTPProxy.spec.rules and metadata.annotations unconditionally with that
identical content. Whether the apiserver bumps metadata.generation on
content-identical Patch::Merge is implementation-dependent — and in
practice we've seen the spec touch correlate with downstream Envoy
re-reconciles and a 5xx data-plane window of 1–3 minutes after the
controller conditions all flip Ready.

Make update_project (and the ad sub-step) skip the PATCH when the
existing spec already matches what we'd write. Comparison is on
serde_json::Value so it's stable against Option<...> serde-default
quirks that would otherwise trip naive structural equality.

This makes the lib's update verb idempotent at the lib boundary —
which is the property the upcoming "extract shared connect logic into
lib for cli + ui + datumctl plugin" work depends on. As a bonus the
UI's Edit-tunnel dialog (which currently PATCHes even when the user
hits Save without changing anything) gets the no-churn behavior for
free, with no UI-side changes.

This is hygiene, not the cold-resume latency fix: even with the PATCH
skipped, runs continue to show intermittent multi-minute 5xx windows
caused by edge-side iroh peer-establishment latency (separate issue,
not yet filed pending @drewr's review).

Tests cover both comparators across the relevant drift axes (different
connector, different endpoint, different label, missing annotation,
ad port change, ad connector change).
@drewr

drewr commented Jun 7, 2026

Contributor Author

Status update

Picking up where the diagnosis-arc comment left off. Significant ground covered since the runtime-watch fix.

Landed since last update

Tunnel selection & resume UX

  • ca4470f — `tunnel listen --id <id>` pins an existing tunnel and re-points its connector backend at the current iroh identity. Direct-API lookup; bypasses the connector filter that was hiding tunnels from list_active after identity rotation. get_active refactored to a direct fetch (no filter), shared summary_from_proxy helper extracted.
  • a68d8ae — --endpoint is now optional. --id alone resumes verbatim from stored endpoint; --id + --endpoint must agree exactly or fail hard (preserving any URL already shared with others). lib::normalize_endpoint exposed so the CLI uses the same canonicalization as TunnelSummary.endpoint.
  • 7de50c7 — arrow-key picker (inquire) when called with no flags, showing hostname → endpoint [label] rows. Adds `inquire = 0.9.4` to CLI deps.
  • fe57b24 — picker uses a tracing_subscriber::reload::Handle + RAII guard to silence tracing while inquire is repainting (iroh log lines were splicing into the picker's first option line).
  • cff37e7 — single-candidate auto-adopts with a one-liner instead of popping a picker that has no movement possible (terminal-cursor confusion).

Progress / verification

  • c39d9ee — each progress checkpoint annotated with its underlying Kubernetes resource ([HTTPProxy/tunnel-…] / [Connector/datum-connect-…]). Lets a user pivot from a stuck step straight into datumctl describe ….
  • d65ec4d — progress steps require observedGeneration ≥ metadata.generation for Ready. Closed the "Ready after 0 sec but data plane is dead" gap on resume-time spec patches (still relevant for any caller that patches the spec, even though we now skip no-op patches; see below).
  • 1ed969e — new `Verifying connectivity...` phase. Probes origin and proxy URL every 10s after controller conditions all flip Ready. New `--timeout` flag (default 10m) caps total setup. Distinguishes "your local server is down" from "edge still 503ing" in the failure summary. This is the visibility surface that exposed all the iroh-dial-latency findings below.

Lib hardening

  • d8f7c96 — quota-check-timeout retry. Operator's 403 "took too long to be checked against your quota" is now retried with ~15s of backoff at every .create() in the listen flow (HTTPProxy, ConnectorAdvertisement, TrafficProtectionPolicy, Connector). Real quota exhaustion still surfaces the friendly "Quota limit exceeded" message; the carve-out is on the exact transient phrase.
  • b7e9d6b — HeartbeatAgent::start_manual() skips the auto-fan-out across every accessible project. CLI tunnel-listen now heartbeats only the project the tunnel lives in. UI's start() is unchanged (it actually wants the multi-project model).
  • 0311960 — update_project is idempotent at the lib boundary. Skips the HTTPProxy.spec and ConnectorAdvertisement.spec patches when the existing state already matches what we'd write. Comparison is on serde_json::Value for stable representation. This is hygiene for the upcoming factor-shared-tunnel-logic-into-lib work (cli + ui + future datumctl plugin all benefit automatically); the UI's Edit-without-changes case also stops causing data-plane churn for free.

Issues filed against other repos (all placeholders awaiting @drewr review)

| Issue | Scope |
| --- | --- |
| datum-cloud/network-services-operator#174 | Stale iroh DNS claim outlives its owning Connector — DeferredToOwner cites a UID that 404s in the API. |
| datum-cloud/network-services-operator#175 | Narrowed today: fresh-tunnel HTTPProxy Programmed / ConnectorMetadataProgrammed blocked ~3min on EnvoyPatchPolicy has no status yet. Original conflation with the resume-side 5xx symptom is split out (next row). |
| datum-cloud/iroh-gateway#12 | New today: variable multi-minute iroh dial latency from edge to a freshly-started listen node. Two contrasting runs (530ms vs 170s, same machine, 15min apart) as evidence. All control-plane and CLI-side causes are explicitly ruled out. |
| (PR-comment) | Quota 403 diagnosis — apiserver → Milo quota admission webhook → quota backend chain analysis with concrete mitigations on the Milo side. Not filed against a Milo repo yet, pending @drewr direction. |

Known still-flaky behavior (mitigated, not fixed)

On a resume run, the proxy URL still occasionally returns 5xx for 1–3 minutes after the point where "Tunnel ready" would have fired without the new verify phase. The verify phase makes this visible and waits for a non-5xx response instead of reporting a false-positive Ready, so the user experience is "honest progress + slow first request" rather than "instant Ready + broken tunnel." The underlying iroh-gateway dial latency is tracked in iroh-gateway#12.

Test coverage on this branch

47 lib tests, all green. Notable new coverage:

  • quota_check_timeout_classifier_matches_transient_403 — retry classifier + format carve-out.
  • progress_pending_when_status_is_stale_for_current_generation — observedGeneration check.
  • progress_step_carries_resource_label — every step backed by the right Kubernetes kind.
  • http_proxy_spec_matches_* / advertisement_spec_matches_* — idempotency comparators across drift axes.
  • start_manual_does_not_auto_enroll — heartbeat manual mode.
  • listen_key_for_project_* (3 cases) — per-project key migration.

What's next

A separate work stream to factor shared tunnel-management logic out of cli/ and ui/ into the lib, so a future datumctl connect plugin can share the same authoritative implementation. The idempotency landing was a prerequisite for that — lib verbs are now the right shape to be the shared abstraction. Will be tracked separately when it kicks off.

drewr added 7 commits June 8, 2026 02:48
A single 503 from the Datum API server's Envoy front-end ("upstream
connect error or disconnect/reset before headers. reset reason:
connection termination" — typical when kube apiserver briefly drops
connections behind Envoy) was killing in-progress tunnel setups that
the next 750ms poll tick would have ridden over. Observed during the
EnvoyPatchPolicy reconcile wait on a fresh tunnel: setup conditions were
on the slow-but-working path and the run aborted at the unrelated
transient.

The runtime watch already handles this correctly — log on error and
keep going. Mirror that in await_tunnel_progress with a bounded retry:
up to MAX_CONSECUTIVE_POLL_ERRORS (10 ≈ 7.5s at the current cadence)
before bailing. Long enough to ride out a brief blip; short enough that
a genuinely unreachable control plane still surfaces fast.

The change lives in await_tunnel_progress (cli/src/main.rs) but the
function is on the future connect-lib side of the boundary discussed in
datum-cloud/enhancements#756 comment 4644292554 — it's pure orchestration
over TunnelService::get_active_progress, no rendering, no clap. The
shape (consecutive-error counter + bounded retry + bail-fast on hard
signals) is the one the lib will inherit.
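
A minimal sketch of the consecutive-error counter, assuming the budget is exhausted when the tenth consecutive error arrives. `PollErrorBudget` is a hypothetical name, and the bail-fast path for hard signals is not modeled.

```rust
/// Consecutive poll failures tolerated before giving up
/// (10 ticks at 750ms is roughly 7.5s).
pub const MAX_CONSECUTIVE_POLL_ERRORS: u32 = 10;

/// Tracks consecutive poll failures (hypothetical helper).
#[derive(Debug, Default)]
pub struct PollErrorBudget {
    consecutive: u32,
}

impl PollErrorBudget {
    /// Record one poll result.
    /// - Ok(Some(v)): the poll succeeded; the counter resets.
    /// - Ok(None): a transient error was absorbed; poll again next tick.
    /// - Err(e): the budget is exhausted; surface the last error.
    pub fn record<T, E>(&mut self, result: Result<T, E>) -> Result<Option<T>, E> {
        match result {
            Ok(value) => {
                self.consecutive = 0;
                Ok(Some(value))
            }
            Err(err) => {
                self.consecutive += 1;
                if self.consecutive >= MAX_CONSECUTIVE_POLL_ERRORS {
                    Err(err)
                } else {
                    Ok(None)
                }
            }
        }
    }
}
```

Resetting on every success is what makes the bound apply to consecutive errors only: a single 503 between healthy ticks never accumulates toward the limit.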
…s work

The CLI accepts --endpoint 127.0.0.1:11434 (no scheme) and passes that
string through to verify_endpoints, which hands it to reqwest. Reqwest's
request builder refuses to build a request from a URL without a scheme
and returns a "builder error" — which our probe was reporting as
"origin not reachable" indefinitely:

  ✓ proxy responding (0.4s) [https://...]: HTTP 200
  … origin not reachable (0s) [127.0.0.1:11434]: builder error
  … origin not reachable (10s) [127.0.0.1:11434]: builder error
  ...

The actual origin was reachable the whole time — the proxy probe got
HTTP 200 through the tunnel back to the same host:port. Only the CLI's
local probe was wedged.

Apply lib::normalize_endpoint (the same canonicalization that
TunnelSummary.endpoint stores) at the top of verify_endpoints so any
bare host:port works as input. The displayed URL becomes the canonical
form (http://127.0.0.1:11434), matching what's stored on the HTTPProxy.

verify_endpoints is on the connect-lib side of the boundary we sketched
in datum-cloud/enhancements#756 comment 4644292554 — defensive
normalization belongs here so other callers (UI Edit dialog, the future
plugin foreground listen path) don't have to remember to canonicalize.
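
For illustration, the scheme-defaulting part of that canonicalization might look like the sketch below. `with_default_scheme` is a hypothetical stand-in; the real lib::normalize_endpoint may do more (or differ in detail), and defaulting to http:// is inferred from the canonical form shown above.

```rust
/// Prefix a bare host:port with http:// so an HTTP client can build a
/// request from it; inputs that already carry a scheme pass through.
pub fn with_default_scheme(endpoint: &str) -> String {
    let trimmed = endpoint.trim();
    if trimmed.contains("://") {
        trimmed.to_string()
    } else {
        format!("http://{trimmed}")
    }
}
```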
cargo-zigbuild for aarch64-unknown-linux-gnu failed at openssl-sys
because pkg-config can't find target-arch openssl headers, and we
can't easily provide them outside the workspace. The transitive pull
is:

  iroh -> pkarr -> reqwest 0.13 (default features = "default-tls") ->
  hyper-tls -> native-tls -> openssl-sys

reqwest 0.13 ships from pkarr's lockfile with native-tls included.
Patching iroh/pkarr to switch reqwest features isn't on our path; the
workspace's own reqwest 0.12 dep is unrelated (and adding
default-features = false there doesn't reach the 0.13 instance —
they're separate version-graph nodes).

Add `openssl = { version = "0.10", features = ["vendored"] }` to the
CLI. Cargo feature unification enables the vendored build for the
transitive openssl-sys, so cross-compiling no longer needs target-arch
system headers — openssl gets compiled from source as part of the
build. Static link, no runtime libssl/libcrypto dependency.

Verified: native check passes, aarch64-unknown-linux-gnu cross-build
via cargo-zigbuild produces a valid 251MB unstripped ARM aarch64 ELF.
The Parser-derived Args lacked a version attribute, so clap rejected
both --version and -V. Add #[command(version)] which sources the
version from Cargo.toml's package.version via env!("CARGO_PKG_VERSION")
at compile time, giving recipients of distributed binaries a built-in
"which build do I have?" check without depending on filename, mtime,
or sha256 verification.

  $ datum-connect --version
  datum-connect 0.1.0
The existing `auth login` uses an authorization-code-with-PKCE flow that
binds a localhost HTTP server to receive the OIDC redirect. On a remote
machine over SSH, in CI, or in a container, that pattern is unreachable
— the browser running on the operator's laptop can't reach a port bound
on the remote box without a separate SSH port-forward. The standard
escape hatch is RFC 8628 OAuth2 device authorization, which is what
datumctl's own `login --no-browser` uses.

Mirror that here:

  - StatelessClient::login_device_code() — fetches the OIDC discovery
    JSON directly (openidconnect's CoreProviderMetadata doesn't surface
    device_authorization_endpoint), rebuilds the OIDC client with
    set_device_authorization_url(), starts the grant, hands the
    DeviceCodeInfo (verification URL + user code + expiry) to a caller-
    supplied display callback, and polls exchange_device_access_token
    via tokio::time::sleep. Token-response parsing reuses the existing
    parse_token_response with a nonce verifier that allows missing-nonce
    (device flow doesn't bind one).
  - AuthClient::login_device_code() — always performs a fresh login;
    callers wanting refresh-eligible token reuse should use the normal
    login() instead.
  - DatumCloudClient::login_device_code() — top-level entry point.
  - DeviceCodeInfo re-exported from lib::datum_cloud so the CLI doesn't
    take a direct dep on openidconnect's Core types.

CLI side, AuthCommands::Login and AuthCommands::Switch both gain a
--no-browser flag that routes to the new method. The display callback
prints the verification URL + user code prominently to stderr so it
doesn't tangle with structured stdout (relevant for future plugin
modes).

Verified against the production auth server's OIDC discovery (Datum's
Zitadel exposes device_authorization_endpoint and lists
urn:ietf:params:oauth:grant-type:device_code in grant_types_supported).
Adds --no-browser device-flow login (e14d689) since cli-v0.1.0.
Our own OIDC client (datum-desktop-app, configured in
datum-cloud/infra apps/datum-iam-system/.../zitadel-setup/pulumi/index.ts)
has only AUTHORIZATION_CODE + REFRESH_TOKEN in its allow-listed
grantTypes. Zitadel correctly rejects the device-code grant against it:

  unauthorized_client: grant_type "...device_code" not allowed

datumctl-cli (a sibling OIDC app in the same Zitadel project) already
has DEVICE_CODE in its grantTypes and has stable, well-known IDs in
datumctl's source:

  Staging:    325848904128073754
  Production: 328728232771788043

Borrow them for the --no-browser path until the planned datumctl
connect plugin ships with its own properly-scoped client. Tokens are
minted by Zitadel against the same project, so downstream Datum API
calls don't care which client minted them. The audience verifier on
id_token_verifier already allows any audience.

Regular `auth login` (browser flow) is unchanged — it stays on the
datum-desktop-app client.