Skip to content

feat(kiloclaw): route requests by Host on *.kiloclaw.ai#2994

Open
pandemicsyn wants to merge 6 commits intomainfrom
florian/feat/name-based-routing
Open

feat(kiloclaw): route requests by Host on *.kiloclaw.ai#2994
pandemicsyn wants to merge 6 commits intomainfrom
florian/feat/name-based-routing

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented May 1, 2026

Summary

Route *.kiloclaw.ai requests by Host header in the KiloClaw worker. The wildcard route and per-instance origin allowlist landed in #2879; this PR wires the worker side up so <label>.kiloclaw.ai resolves to the owning instance. Runs before the existing cookie/path-based branches — those still handle claw.kilosessions.ai and any request whose Host doesn't match the new suffix. Dashboard links still point at claw.kilosessions.ai; flipping user-facing URLs is a follow-up.

What's new

  • KILOCLAW_INSTANCE_HOST_SUFFIX + KILOCLAW_INSTANCE_URL_SCHEME env vars drive a single helper set (instanceUrl, parseInstanceHost, hostMatchesInstanceSuffix). Same config feeds the catch-all host branch and the per-instance origin injector. Both required — no silent fallback; validateRequiredEnv 503s if either is unset.
  • Host-based branch: ownership mismatch → 403. Unparseable / multi-label / bare-suffix → 404. Pre-cutover instance (controllerCapabilitiesVersion < 2) → 404 with a "needs restart" hint; legacy host keeps working meanwhile.
  • Access-gateway skips KILOCLAW_ACTIVE_INSTANCE_COOKIE on matched hosts and derives the gateway token from the host-encoded sandbox with ownership verification, ignoring any mismatched ?instanceId= query param.
  • New proxyThroughTarget helper encapsulates the WS/HTTP proxy dance for the host branch. The three pre-existing duplicated proxy sites are left inline — follow-up should DRY them.

Dev

Default .dev.vars.example is now KILOCLAW_INSTANCE_HOST_SUFFIX=.kiloclaw.localhost:8795 + KILOCLAW_INSTANCE_URL_SCHEME=http. Existing devs: re-run env sync or add those two lines to .dev.vars. Then open http://i-<hex>.kiloclaw.localhost:8795/ to exercise the host branch; http://localhost:8795/ keeps working unchanged.

Verification

  • Local end-to-end on docker-local: opened an instance at http://i-<hex>.kiloclaw.localhost:8795/, host-scoped auth cookie set, no active-instance cookie, SPA + WS upgrade fine.
  • http://localhost:8795/ still routes via the cookie branch.
  • Staging: real Fly-backed instance through the *.kiloclaw.ai wildcard route + CF TLS.
  • v1 gate: pre-cutover instance returns 404 with the restart hint; legacy host still serves it.
  • [ ]

Visual Changes

N/A

Reviewer Notes

  • Capability gate: v1 instances 404 rather than 302 — keeps the "needs restart" signal visible and avoids the URL-rewriting code path that previously hosted an open-redirect bug.
  • No silent fallback for host suffix/scheme. Misconfigured preview 503s instead of injecting prod defaults. gateway/env.ts wraps the injection so a misconfigured worker logs-and-skips instead of aborting machine boot.
  • Risk area: host parser. parseInstanceHost does raw suffix comparison (not DNS parsing) so dev suffixes can include :port. Case-insensitive. Multi-label and bare-suffix hosts 404 inside the instance-host space rather than falling through.
  • Token-mint path (buildRedirectUrlresolveHostSandboxId) is worth a spot-check — ownership is verified before minting, and the host-encoded sandbox wins over any query-param instance.

Extend the catch-all proxy to recognise per-instance virtual hosts
(`i-{hex}.kiloclaw.ai` / `u-{base32hex}.kiloclaw.ai`) as an implicit
instance selector. PR1 made the hostnames reachable and put the
per-instance origin in each machine's allowlist; this change wires
the worker to route by Host so the URL itself carries the routing
signal and each subdomain gets its own cookie jar. Dashboard links
still point at `claw.kilosessions.ai` — PR3 flips those.

- `parseInstanceHost` / `hostMatchesInstanceSuffix` / `instanceUrl`
  share a single `KILOCLAW_INSTANCE_HOST_SUFFIX` + `_URL_SCHEME`
  config so prod can use `.kiloclaw.ai` + `https` and devs can
  emulate per-instance hosts with `.kiloclaw.localhost:8795` + `http`
  (auto-resolves to 127.0.0.1, no /etc/hosts).
- Host branch runs before the cookie branch; ownership mismatch
  returns 403, unparseable/multi-label returns 404, pre-cutover
  (`controllerCapabilitiesVersion < 2`) 302-redirects to the legacy
  host so the user keeps working until the instance restarts onto v2.
- Access-gateway skips `KILOCLAW_ACTIVE_INSTANCE_COOKIE` on matched
  hosts — self-referential when the Host header already identifies
  the instance.
- `gateway/env.ts` now builds the per-instance origin via
  `instanceUrl(label, env)` so dev containers get an Origin that
  matches what the dev browser actually sends.
- New `proxyThroughTarget` helper backs the host branch; existing
  duplicate proxy blocks can be consolidated in a follow-up.
…ance hosts

Two security issues flagged in review of the host-based routing branch:

1. The v1-capability fallback redirect built its legacy-host target via
   `new URL(pathname+search, 'https://claw.kilosessions.ai')`. A request
   path starting with `//` (e.g. `//evil.example/path`) is interpreted
   as scheme-relative by the URL parser, producing
   `https://evil.example/path` — an authenticated open redirect. Build
   the URL via the pathname/search setters instead so the leading `//`
   stays in the path component.

2. On per-instance virtual hosts, the access-gateway still derived its
   gateway token from the `?instanceId=` query param (or the user's
   default sandbox). That could mint a token for one sandbox while the
   catch-all proxy routed by host to a different one — effectively
   handing the OpenClaw SPA a token it can't use. Derive the token from
   the host-encoded sandbox when the request lands on a matched virtual
   host, and ignore any mismatched query param. Added ownership checks:
     - `i-<hex>` label: DO `status.userId` must equal the authenticated
       user; otherwise 403.
     - `u-<base32hex>` label: decoded userId must equal the authenticated
       user; otherwise 403.

Regression tests cover both paths.
… 302

The capability-gate branch previously returned a 302 redirect to
`claw.kilosessions.ai` when the owning instance was pre-cutover (v1). Per
review, we'd rather surface the misconfiguration explicitly than silently
bounce users — a 302 hides the "needs restart" signal from the dashboard
and the user, and any URL-construction logic inside that branch is
potential attack surface.

Return a 404 with an explanatory body instead. The instance stays
reachable at the legacy host until its next restart brings it to v2; no
silent URL rewriting, and the `new URL(...)` construction that risked
the open-redirect class of bugs is gone entirely.

The open-redirect regression test is removed because the code path it
covered no longer exists.
…heme

The host-suffix and URL-scheme helpers in `auth/hostname-label.ts` used to
fall back to `.kiloclaw.ai` / `https` when their env vars were missing.
That hid misconfigurations: a preview environment without the vars set
would quietly inject the production suffix into every machine's origin
allowlist and every catch-all response. Prefer loud failure.

- Helpers now throw when `KILOCLAW_INSTANCE_HOST_SUFFIX` or
  `KILOCLAW_INSTANCE_URL_SCHEME` is unset or empty.
- `validateRequiredEnv` checks both, so request traffic 503s early
  instead of throwing in the host parser.
- `gateway/env.ts` wraps the per-instance origin injection so a
  misconfigured worker doesn't abort machine boot — it logs a warning and
  skips the injection, matching the "no per-instance origin" behaviour
  the catch-all gate produces anyway.
- Tests now provide the vars explicitly (including via
  `createMockEnv`), and a new test asserts the missing-env throw.
- `.dev.vars.example` sets the vars uncommented with a comment pointing
  to the `.kiloclaw.localhost:8795` variant for dev parity.
- `DEVELOPMENT.md` and `AGENTS.md` document both vars and the dev-parity
  recipe.
The previous defaults in `.dev.vars.example` copied production values
(`.kiloclaw.ai` / `https`) into the dev environment. Because
`.dev.vars.example` is what dev-start uses to seed new developers'
`.dev.vars`, this meant the generated dev config never actually
exercised host-based routing — localhost Host headers don't match
`.kiloclaw.ai`, so the new branch stayed inert unless the dev manually
edited their file.

Flip the defaults to the loopback-parity pair:

  KILOCLAW_INSTANCE_HOST_SUFFIX=.kiloclaw.localhost:8795
  KILOCLAW_INSTANCE_URL_SCHEME=http

`.kiloclaw.localhost` auto-resolves to 127.0.0.1 per RFC 6761, so
opening `http://i-<hex>.kiloclaw.localhost:8795/` routes through the
host branch in dev with zero additional setup. Hitting
`http://localhost:8795` directly still works — that host doesn't match
the suffix, so traffic falls through to the existing cookie/path-based
flow.
@pandemicsyn pandemicsyn marked this pull request as ready for review May 1, 2026 16:04
Comment thread services/kiloclaw/src/routes/access-gateway.ts
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 1, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (2 files)
  • services/kiloclaw/src/routes/access-gateway.ts
  • services/kiloclaw/src/routes/access-gateway.test.ts
Resolved Findings
  • services/kiloclaw/src/routes/access-gateway.ts - Previous warning about stale/mismatched ?instanceId= on per-instance hosts is fixed by skipping query-param ownership checks when the Host header is authoritative.

Reviewed by gpt-5.5-20260423 · 480,036 tokens

@pandemicsyn
Copy link
Copy Markdown
Contributor Author

SCR-20260501-jxtd

…anceId=`

The pre-flight `assertInstanceOwnership` on the query-param `instanceId`
used to run before the host-based logic took over. On a per-instance
virtual host the host itself is the routing signal — `buildRedirectUrl`
already ignores the query param via `resolveHostSandboxId` — so a
stale/mismatched `?instanceId=` shouldn't reject an otherwise-valid
request. Before this fix it 403'd the access-gateway redirect.

Gate both pre-flight checks (the POST redeem-and-set-cookie path and the
GET valid-cookie path) on `!requestIsOnInstanceHost(c)`. Added a
regression test that uses separate DO stubs for the host-encoded and
query-param instances, asserts a 302 redirect, and confirms the
query-param instance's DO is never consulted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants