This repository was archived by the owner on Jul 13, 2025. It is now read-only.

Fork Sync: Update from parent repository #36

Open
github-actions[bot] wants to merge 1698 commits into MultiMx:main from tailscale:main

Conversation

@github-actions

No description provided.

bradfitz and others added 30 commits May 14, 2026 15:55
…scale CI

cibuild.On() returns true for any CI environment that sets CI=true,
including Alpine Linux's package build CI. TestTsgoRevInCacheKey was
guarded by cibuild.On() (or use of tsgo), so it ran under Alpine's CI
with stock Go, where go.toolchain.rev isn't blended into build cache
keys, and unsurprisingly failed.

Add cibuild.OnTailscaleCI, which keys off GITHUB_REPOSITORY_OWNER to
distinguish tailscale/tailscale's own GitHub Actions CI from arbitrary
downstream CI, and use it in TestTsgoRevInCacheKey.
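
As a rough illustration of the distinction drawn above, the sketch below separates "some CI" from "Tailscale's own CI". The function names, the injected getenv parameter, and the expected owner value "tailscale" are assumptions made for this example; the real cibuild package may be structured differently.

```go
package main

import (
	"fmt"
	"os"
)

// onCI reports whether any CI system appears to be running, judged only by
// the conventional CI=true variable. getenv is injected so the sketch can be
// exercised without touching the real environment.
func onCI(getenv func(string) string) bool {
	return getenv("CI") == "true"
}

// onTailscaleCI additionally requires GITHUB_REPOSITORY_OWNER to name the
// upstream owner. The value "tailscale" is an assumption for illustration.
func onTailscaleCI(getenv func(string) string) bool {
	return onCI(getenv) && getenv("GITHUB_REPOSITORY_OWNER") == "tailscale"
}

func main() {
	fmt.Println("CI:", onCI(os.Getenv), "Tailscale CI:", onTailscaleCI(os.Getenv))
}
```

A test that only makes sense with the upstream toolchain could then skip itself unless the stricter check holds.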

Fixes #19754

Change-Id: Id31cfe71903a235f1460dca1e2fdf334e3ba1ee5
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: License Updater <noreply+license-updater@tailscale.com>
…ls (#19757)

linuxRouter has two blocks (connmark rules and the CGNAT drop rule) that
gate on cfg.NetfilterMode, the requested config state. This may cause an
error when setNetfilterModeLocked fails, since it may keep assuming this
config is valid.

We now gate both blocks on r.netfilterMode, matching the pattern used by
SNAT, stateful, and loopback paths.

Fixes #19737

Change-Id: Ia6003a082db99c376e662132d725661afbac0ee9

Signed-off-by: Fernando Serboncini <fserb@tailscale.com>
Updates tailscale/corp#37904

Change-Id: I09e73b3248b9ddf86dafe33dfb621bd560f6596d
Signed-off-by: Alex Chan <alexc@tailscale.com>
Move the inline CSS and JS into separate files to be more friendly
to Content Security Policies. ServeHTTP is updated to serve these
assets from the '/static/' path.

Updates tailscale/corp#32398

Signed-off-by: Noel O'Brien <noel@tailscale.com>
RouteCheck, which checks that overlapping routers are reachable, is
enabled by default for both tailscaled and tsnet.

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
The Engine watchdog wrapped every wgengine.Engine method call in a
goroutine with a 45s timeout and crashed the process on timeout. It
was added years ago to surface deadlocks during development, but the
underlying deadlocks have long since been fixed, and even when it did
fire it produced obscure stack traces (from inside the watchdog
goroutine, not the original caller) without buying much.

Audit of userspaceEngine's methods shows none have cyclic locking or
unbounded blocking now that ResetAndStop no longer loops waiting for
DERPs to drain (fa49009). The watchdog is dead weight; remove it
along with the TS_DEBUG_DISABLE_WATCHDOG escape hatch.

Updates #19759

Change-Id: Iba9d718fe1f8718a6631296e336b138c31b99ff1
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Issue #19737 ran into a nil pointer dereference, the cause of which was fixed
by #19761. If we end up on this code path with a nil table again, we should
bubble that up as an error (which is logged by the health warning system)
rather than failing catastrophically.

Signed-off-by: Naman Sood <mail@nsood.in>
If the context given to DialContext has a shorter lifetime than the OS
TCP SYN timeout, and TCP SYNs are dropped from the path to the remote,
DialContext would never fall back to try IPv6 after IPv4.

Instead, use the normal happy eyeballs race if there is more than one
address. This does remove the implicit prioritization of IPv4 over IPv6
in cases where there is only a single IPv4 remote address.

Updates #13346

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
A data race in a package matters more than any individual test
result. Two related problems:

1. Where go test's race detector text ("WARNING: DATA RACE" plus
   the goroutine stack traces) lands in JSON output is timing-
   dependent: it can be attributed to a test that ends up reporting
   PASS (e.g. when the racing goroutines outlive the test that
   spawned them and TSan prints during a different test's window).
   testwrapper's main loop only flushes the logs of failed tests,
   so the race report ends up stuck in a passing test's buffer and
   is silently dropped. The race builders just see a bare
   "FAIL\nFAIL\tpkg\ttime".

2. If the failing test in such a package happens to be marked flaky,
   testwrapper retries it. That is the worst possible response to a
   race: the flaky test might not even be the racy code, and a
   second run without the racy goroutines could "succeed" while
   hiding the real bug.

Address both: scan every output line for the race detector's first-
line marker. Track whether the package observed a race at all, on
the pkgFinished testAttempt. When a race was seen, fold every per-
test log buffer into the package-level logs (so the full report
surfaces from the existing pkg-fail flush path), and drop any
flaky-test retry plans for that package so we fail immediately
instead of running another attempt.
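
The two decisions described above can be sketched as follows. This is a simplified model, not testwrapper's code: sawRace and shouldRetry are hypothetical helpers, and the output is treated as one string rather than a stream of test2json events.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// raceMarker is the first line the Go race detector prints for a report.
const raceMarker = "WARNING: DATA RACE"

// sawRace scans every output line, regardless of which test it was
// attributed to, and reports whether the marker appears anywhere.
func sawRace(output string) bool {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		if strings.Contains(sc.Text(), raceMarker) {
			return true
		}
	}
	return false
}

// shouldRetry allows a flaky-test retry only when the package saw no race.
func shouldRetry(testIsFlaky, pkgSawRace bool) bool {
	return testIsFlaky && !pkgSawRace
}

func main() {
	out := "=== RUN   TestA\nWARNING: DATA RACE\n--- PASS: TestA (0.00s)\nFAIL\n"
	race := sawRace(out)
	fmt.Println("race seen:", race, "retry flaky test:", shouldRetry(true, race))
}
```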

Two new tests:
- TestRaceSuppressesFlakyRetry verifies that a flaky test alongside
  a racy test does NOT get retried.
- TestRaceAttributedToPassingTest verifies that a race attributed by
  test2json to a passing test still surfaces in the output.

Also add a corpus of captured raw test binary outputs under
cmd/testwrapper/testdata/, with one subdirectory per scenario,
documenting the six representative shapes that go test -race can
emit (race in test body, race in goroutines that outlive a test,
race forced into a later test, race in TestMain post-m.Run, and a
parallel-tests split-attribution case via a "=== NAME" redirect
line). See its README.md for details.

Fixes #19603

Change-Id: Ifbfcd67fb3b1882c4907bd9cb2d68a8b5a91dd54
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
…cale/connect

Add Go tests that drive a real headless Chromium (via chromedp) against
the built cmd/tsconnect/pkg/ artifact and verify the @tailscale/connect
public API surface end-to-end. The package has not been republished in
three years, in part because no test exercises the produced artifact at
runtime — only tsc --noEmit and a Go build run in CI.

TestCreateIPN loads pkg.js into the browser, calls createIPN with a junk
auth key, and asserts that pkg.createIPN / pkg.runSSHSession are
functions and that createIPN() returns an IPN with the documented
run/login/logout/ssh/fetch methods. No control-plane traffic.

TestFetchTailnetPeer stands up a full local tailnet (testcontrol +
DERP + a tsnet.Server peer) and verifies that the browser-side WASM
client can join over WebSocket-noise to the same control, connect to
DERP over WSS, and then ipn.fetch() an HTTP service hosted on the tsnet
peer through the tailnet. The test asserts the response body matches a
known string. Browser state transitions are logged: NoState -> NeedsLogin
-> Starting -> Running.

Tests are opt-in via --run-headless-browser-tests (matching the existing
--run-vm-tests pattern in tstest/natlab/vmtest) so they never fire in
casual `go test ./...` runs. When the flag is set, a test is skipped if
cmd/tsconnect/pkg/ has not been built, and fails with t.Error if no
chromium binary is found on $PATH (honoring $CHROME_BIN as an override).
findChromium also falls back to /Applications/Google Chrome.app and
/Applications/Chromium.app on darwin, since macOS Chrome's executable
lives inside an .app bundle and is not on $PATH by default. The
.github/workflows/test.yml wasm job is extended to install
google-chrome-stable and run the tests with the flag after build-pkg.

To prevent silently testing a stale pkg/main.wasm (built from an older
checkout than the rest of the test invocation), build-pkg now writes
pkg/build-info.json recording the sha256 of the raw (pre-wasm-opt)
go-build output. The test does its own `go build` of
cmd/tsconnect/wasm with the same -tags/-trimpath/-ldflags (factored
into a new cmd/tsconnect/wasmbuild package shared by both call sites)
and t.Fatalfs with a "rebuild" instruction on mismatch. Cost is
near-zero because the Go build cache from the prior build-pkg makes
the rebuild a cache hit.
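
The staleness check amounts to comparing a recorded hash against the hash of a fresh build. A minimal sketch, assuming a hypothetical JSON field name (rawWasmSHA256) and hypothetical helper names; the actual layout of build-info.json is not given here:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// buildInfo mirrors a hypothetical build-info.json layout.
type buildInfo struct {
	RawWasmSHA256 string `json:"rawWasmSHA256"`
}

func hashOf(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// recordBuildInfo is what the build step would write next to the artifact.
func recordBuildInfo(rawWasm []byte) ([]byte, error) {
	return json.Marshal(buildInfo{RawWasmSHA256: hashOf(rawWasm)})
}

// checkFresh compares the recorded hash with the hash of a fresh build and
// returns an error asking for a rebuild when they differ.
func checkFresh(infoJSON, freshWasm []byte) error {
	var bi buildInfo
	if err := json.Unmarshal(infoJSON, &bi); err != nil {
		return fmt.Errorf("reading build info: %w", err)
	}
	if got := hashOf(freshWasm); got != bi.RawWasmSHA256 {
		return fmt.Errorf("stale artifact: recorded %s, fresh build %s; rebuild the package", bi.RawWasmSHA256, got)
	}
	return nil
}

func main() {
	info, err := recordBuildInfo([]byte("old build"))
	if err != nil {
		panic(err)
	}
	fmt.Println("same build:", checkFresh(info, []byte("old build")))
	fmt.Println("newer checkout:", checkFresh(info, []byte("new build")))
}
```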

The new wasmbuild package also replaces cmd/tsconnect's hardcoded -tags
string with a minimal-feature-set computation. wasmbuild.Keep names the
small set of feature/featuretags entries the browser client actually
needs (netstack, logtail, dns, health, c2n, ipnbus); wasmbuild.Tags()
emits a ts_omit_<f> for every other omittable feature in
feature/featuretags.Features, with transitive deps
expanded via featuretags.Requires. An init() panics if Keep references
a feature unknown to feature/featuretags so a rename there fails
loudly. Net effect on size: 32M raw / 9.4M brotli before this change,
25M raw / 4.4M brotli after — vs the last-published 1.39.98 at 21M /
3.8M. The transitive package-import graph is unchanged (176
tailscale.com/* packages either way): featuretags omits eliminate
dead code via `const HasX = false`, not imports. Trimming the import
graph would require a separate, larger refactor splitting interface
packages by build tag.

Writing TestFetchTailnetPeer surfaced several real issues, all fixed
here:

  * cmd/tsconnect built the wasm with the nethttpomithttp2 tag, but
    control/ts2021 (since commit 1d93bdc, "control/controlclient:
    remove x/net/http2, use net/http", Oct 2025) requires HTTP/2 from
    net/http's bundled implementation. With nethttpomithttp2 set, the
    bundle is excluded and the wasm client cannot speak HTTP/2 to any
    control plane, including production. Drop the tag. Wasm size grows
    ~1 MB raw / ~300 KB brotli (more than offset by the feature
    pruning above). The last published @tailscale/connect (1.39.98,
    early 2023) pre-dates the regression, which is why no consumer has
    reported the breakage.

  * tstest/integration/testcontrol.Server's /ts2021 noise upgrade
    endpoint rejected anything but POST. WebSocket clients (the only
    transport available to browser-WASM) come in as GET. Allow both;
    the controlhttp AcceptHTTP path dispatches on the Upgrade header,
    so the websocket library still enforces GET for WS upgrades.
    This matches production, where the same controlhttpserver.AcceptHTTP
    routes purely on the Upgrade header without checking method.

  * derp/derphttp's urlString built the DERP URL from node.HostName
    only, dropping node.DERPPort. Non-WS clients use a separate code
    path (connectToHost) that honors DERPPort, but WebSocket-only
    clients (browser-WASM) went through urlString and so could not
    reach a DERP running on any port other than 443. Include the port
    when it differs from the scheme default.
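
The DERP URL fix in the last item can be illustrated with a small sketch. The function name, its parameters, and the "/derp" path are assumptions for this example, and host is assumed to be a DNS name rather than an IPv6 literal:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// derpURL builds a URL for a DERP node, appending the port only when one is
// set and it differs from the scheme's default.
func derpURL(useTLS bool, host string, port int) string {
	scheme, defaultPort := "https", 443
	if !useTLS {
		scheme, defaultPort = "http", 80
	}
	hostPort := host
	if port != 0 && port != defaultPort {
		hostPort = net.JoinHostPort(host, strconv.Itoa(port))
	}
	return fmt.Sprintf("%s://%s/derp", scheme, hostPort)
}

func main() {
	fmt.Println(derpURL(true, "derp.example.com", 0))
	fmt.Println(derpURL(true, "derp.example.com", 8443))
}
```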

Also move addWebSocketSupport from cmd/derper (where it was main-only)
to derp/derpserver.AddWebSocketSupport so tstest/integration.RunDERPAndSTUN
can wrap its DERP handler with WebSocket support — without that, the
test DERP would not accept the browser's wss connection.

Fixes #9394

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Iff9cdee303e3b239924249b5bffb2fd04e02f391
…19807)

The TestShouldUseOneCGNATRoute test fails when the system's network
interfaces don't match the test's underlying assumption: that there
would only ever be one CGNAT interface, the Tailscale one.

This breaks on Linux when border0 is installed because border0 also
creates an interface with a CGNAT route.

This patch stubs netmon.RegisterInterfaceGetter to replace the system
interfaces and netmon.SetTailscaleInterfaceProps to identify the test
data that defines the Tailscale interface.

This patch also tests the control knob override for CGNAT for every
combination of operating system and system interfaces, instead of just
a couple of combinations.

Fixes #19731

Signed-off-by: Simon Law <sfllaw@tailscale.com>
Some netmap updates are guaranteed to affect only the "static" parts of the
netmap, and so should not require us to walk through all the peers and user
profiles when updating the cache. To support this, the new UpdateSelfOnly
method updates only the Self node and other tailnet settings that are not
dependent on the peers and profiles.

Use this when updating the cache on DERP home changes.

Updates #12542

Change-Id: Ifed522b29d579fb76e010b4ff738cc4e0a72d27f
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Fixes #19338

Signed-off-by: Aria Stewart <aredridel@dinhe.net>
serveMap cloned s.nodes[nk], mutated the clone outside the mutex,
then wrote it back via updateNodeLocked. A concurrent UpdateNode,
SetNodeCapMap, or other writer landing between the clone and the
writeback would be silently clobbered. Mutate the live node under
the mutex instead.
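
A reduced model of the bug and the fix, using hypothetical types rather than testcontrol's own. The between hook stands in for a concurrent writer that lands after the clone and before the write-back, which makes the lost update deterministic:

```go
package main

import (
	"fmt"
	"sync"
)

type node struct {
	Tags     []string
	LastSeen int
}

type server struct {
	mu    sync.Mutex
	nodes map[string]*node
}

func newServer() *server {
	return &server{nodes: map[string]*node{"n1": {}}}
}

// racyTouch clones the node, mutates the clone outside the mutex, then
// writes it back. Anything written in between is lost. key must exist.
func (s *server) racyTouch(key string, seen int, between func()) {
	s.mu.Lock()
	clone := *s.nodes[key]
	s.mu.Unlock()
	clone.LastSeen = seen
	if between != nil {
		between()
	}
	s.mu.Lock()
	s.nodes[key] = &clone
	s.mu.Unlock()
}

// safeTouch mutates the live node while holding the mutex.
func (s *server) safeTouch(key string, seen int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n, ok := s.nodes[key]; ok {
		n.LastSeen = seen
	}
}

func (s *server) addTag(key, tag string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n, ok := s.nodes[key]; ok {
		n.Tags = append(n.Tags, tag)
	}
}

func (s *server) tags(key string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.nodes[key].Tags...)
}

func main() {
	s := newServer()
	s.racyTouch("n1", 1, func() { s.addTag("n1", "tag:web") })
	fmt.Println("after racy write-back:", s.tags("n1"))

	s = newServer()
	s.safeTouch("n1", 1)
	s.addTag("n1", "tag:web")
	fmt.Println("after in-place update:", s.tags("n1"))
}
```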

Surfaces in tsnet's TestListenService as a flaky ErrUntaggedServiceHost
panic: the test calls control.UpdateNode to attach a tag, a concurrent
updateRoutine map request from the host races, and the host's next
netmap arrives with Tags=[].

Updates #19822

Change-Id: I6c5ebd5e5bf79a40316f53f627157230773cb469
Signed-off-by: James Tucker <james@tailscale.com>
When tailscaled is running in userspace-networking mode behind an
exit node (e.g. as a SOCKS5 proxy), it resolves a hostname and then
dials a single resolved IP through the tunnel. If the name has both
A and AAAA, Go's net.Resolver merges them and we pick ips[0], which
on an IPv6-native host is usually AAAA. If the exit node has no IPv6
egress (or vice versa), the dial fails silently through the tunnel
and the user sees a hang.

Resolve all candidates and race connect attempts across address
families with a 300ms happy-eyeballs delay, matching Go's net.Dialer
default and the existing pattern in net/dnscache (commit ee0a03b).
First success wins; losers are cancelled and any conns they produce
are closed. A failBoost channel wakes the launcher when a connect
fails fast (e.g. ICMP "no route" via the tunnel) so we don't sit on
the 300ms timer when the answer is already known.
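
The launch logic can be sketched roughly as below. This is not the tailscaled implementation: raceDial, dialFunc, and demo are hypothetical names, the dialer returns a string instead of a net.Conn so no network is needed, and the fail-fast wake-up is folded into the result channel rather than a separate failBoost channel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialFunc stands in for a real dialer.
type dialFunc func(ctx context.Context, addr string) (string, error)

type result struct {
	conn string
	err  error
}

// raceDial starts with addrs[0] and starts the next candidate either after
// delay or as soon as an earlier attempt fails. The first success wins.
func raceDial(ctx context.Context, addrs []string, delay time.Duration, dial dialFunc) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no addresses to dial")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tells attempts still in flight to give up

	results := make(chan result, len(addrs)) // buffered so losers never block
	launched, pending := 0, 0
	launch := func() {
		addr := addrs[launched]
		launched++
		pending++
		go func() {
			conn, err := dial(ctx, addr)
			results <- result{conn, err}
		}()
	}
	launch()

	var firstErr error
	for pending > 0 {
		var next <-chan time.Time // nil (blocks forever) once all are launched
		if launched < len(addrs) {
			next = time.After(delay)
		}
		select {
		case <-next:
			launch()
		case r := <-results:
			pending--
			if r.err == nil {
				// A real implementation must also close any conns that
				// losing attempts produce after this point.
				return r.conn, nil
			}
			if firstErr == nil {
				firstErr = r.err
			}
			if launched < len(addrs) {
				launch() // fail fast: do not wait out the timer
			}
		}
	}
	return "", firstErr
}

// demo runs raceDial against a fake dialer that fails for unreachable addrs.
func demo(addrs []string, unreachable map[string]bool) (string, error) {
	dial := func(ctx context.Context, addr string) (string, error) {
		if unreachable[addr] {
			return "", fmt.Errorf("dial %s: no route to host", addr)
		}
		return addr, nil
	}
	return raceDial(context.Background(), addrs, 300*time.Millisecond, dial)
}

func main() {
	v6, v4 := "[2001:db8::1]:443", "192.0.2.10:443"
	fmt.Println(demo([]string{v6, v4}, map[string]bool{v6: true}))
}
```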

userDialResolve is refactored into userDialResolveAll (returns the
full candidate list) plus a thin single-IP wrapper for callers like
UserDialPlan that don't race. UserDial's per-IP dispatch (netstack
vs peer dialer vs SystemDial vs std) is extracted to dialOneUser so
each candidate can route correctly on its own merits.

Also fix serveDial in localapi to pass the original hostname to
UserDial rather than a pre-resolved IP, so the race can fire.

This fix is single-ended: it works against any exit node, including
old ones, with no protocol changes. The trade-off versus filtering
on the exit-node side via PeerAPI DoH is that every dial through an
unreachable-family exit node costs one failed connect attempt per
cache window, rather than zero, which is acceptable given the
simplicity.

Fixes #19792
Fixes #13257

Change-Id: I9d7645d0034caf3ee22ecdd8070798353f77e94b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Updates tailscale/corp#39975

Signed-off-by: Fran Bull <fran@tailscale.com>
The traffic package contains helpers for evaluating traffic steering
scores and picking appropriate nodes. These were extracted from
ipnlocal.suggestExitNodeUsingTrafficSteering so they can be reused by
the new routecheck package to probe exit nodes in priority order.

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
SetDERPMap spawns a goroutine that calls ReSTUN, which logs via the
test logger. If the test returns before that goroutine logs, the
goroutine races with testing cleanup.

Use tstest.WhileTestRunningLogger so the goroutine's logf call becomes
a no-op once the test finishes.

Fixes #19829

Change-Id: I1097f98e40ffd1c5dd7fb7a715c918255853e3c6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
…stant time

For large tailnets (~50k+ nodes) with frequent peer churn (ephemeral
GitHub Actions workers etc.), tailscaled used to rebuild the full
netmap and fan it out on the IPN bus on every MapResponse that
added or removed a peer. There were two O(N) costs per delta: the
full netmap rebuild + every Notify.NetMap encode to every bus watcher.

This change tackles both:

  1. Plumb O(1) peer add/remove through the delta path. PeersChanged
     and PeersRemoved no longer prevent the delta happy path; instead,
     they mutate the per-node-backend peer map in place.

  2. Restrict ipn.Notify.NetMap emission to the platforms whose host
     GUIs still depend on it (Windows, macOS, iOS) and migrate
     in-tree consumers off it everywhere else:

     - Migrate reactive consumers (containerboot, kube agents,
       sniproxy, tsconsensus, etc.) off Notify.NetMap to the
       previously-added Notify.SelfChange signal so they no longer
       have to subscribe to the full netmap.
     - Add ipn.NotifyNoNetMap so GUI clients on "legacy-emit" platforms
       that have already migrated can opt out of the per-watcher
       NetMap encode.
     - Gate Notify.NetMap emission on the producer side by a compile-
       time GOOS check, so the supporting code is dead-code-eliminated
       on Linux and other geese where no GUI consumer needs it.
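
The first point, mutating the peer map in place, can be pictured with a minimal sketch. The types and the applyDelta helper are hypothetical stand-ins, not the node backend's real data structures; the point is only that the work done is proportional to the size of the delta, not to the number of peers.

```go
package main

import "fmt"

type nodeID int64

type peer struct {
	ID   nodeID
	Name string
}

// delta models the add/remove portion of a map response.
type delta struct {
	PeersChanged []peer
	PeersRemoved []nodeID
}

// applyDelta updates peers in place instead of rebuilding the whole map.
func applyDelta(peers map[nodeID]peer, d delta) {
	for _, p := range d.PeersChanged {
		peers[p.ID] = p
	}
	for _, id := range d.PeersRemoved {
		delete(peers, id)
	}
}

func main() {
	peers := map[nodeID]peer{1: {1, "a"}, 2: {2, "b"}}
	applyDelta(peers, delta{
		PeersChanged: []peer{{3, "ci-worker"}},
		PeersRemoved: []nodeID{2},
	})
	fmt.Println(len(peers), peers[3].Name)
}
```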

Re-running BenchmarkGiantTailnet from tstest/largetailnet, which was
added along with baseline numbers on unmodified main in ad5436a,
the per-delta cost (one peer add+remove pair) is now ~O(1) regardless
of tailnet size N:

    N         no-watcher (ms/op)            bus-watcher (ms/op)
              before    now     factor      before    now     factor
     10000        32   0.11       300x         166   0.13      1300x
     50000       222   0.11      2000x         865   0.13      6700x
    100000       504   0.12      4100x        1765   0.13     13400x
    250000      1551   0.12     12500x        4696   0.15     32400x

Updates #12542

Change-Id: I94e34b37331d1a8ec74c299deffadf4d061fda9e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
In PR tailscale/corp#30448, we originally decided to break ties using
SHA256 for our rendezvous hashing algorithm. Now that we’ve had some
experience with it, we think that FNV-1a is a better choice. It
distributes bits evenly, it’s much faster, and it doesn’t need to be
cryptographically secure. The FNV designers recommend FNV-1a over the
deprecated FNV-1.
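
For readers unfamiliar with the technique, here is a minimal sketch of a rendezvous-hash tie-break using FNV-1a from Go's standard library. The key (some stable client identifier), the separator byte, and the function names are assumptions; the real traffic steering code may combine its inputs differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes the key together with a candidate name using 64-bit FNV-1a.
func score(key, candidate string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(candidate))
	return h.Sum64()
}

// pick returns the candidate with the highest score for key. Equal scores
// fall back to the smaller name so the result never depends on input order.
func pick(key string, candidates []string) string {
	best := ""
	var bestScore uint64
	for i, c := range candidates {
		s := score(key, c)
		if i == 0 || s > bestScore || (s == bestScore && c < best) {
			best, bestScore = c, s
		}
	}
	return best
}

func main() {
	tied := []string{"exit-a", "exit-b", "exit-c"}
	fmt.Println(pick("client-1", tied))
}
```

Because each candidate is scored independently, removing a candidate that was not the winner leaves the pick unchanged, which is what makes the choice stable.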

This PR makes the switch and updates the related tests, since changing
the algorithm changes which stable pick gets selected. As of 2026-05,
this is the best time to make this change, since there are almost no
clients in the wild with traffic steering enabled.

Updates #17366
Updates tailscale/corp#29964
Updates tailscale/corp#29966
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
…19832)

Updates #19831

Signed-off-by: Simon Law <sfllaw@tailscale.com>
…ssing (#19828)

Holding an exclusive lock while writing to the unbuffered changequeue chan
is likely to deadlock, because the run() path may try to grab the same lock
before reading from the chan to drain it (on map session close). This causes
the client to stop processing new map responses and TSMP disco key advertisements.

There is a good probability of inducing this deadlock using the old code and new
test added in this commit: TestUpdateDiscoForNodeCallback/test_deadlock.

Also fix an unintentional regression in how the client responds to a mapResponse sleep
command. 85bb5f8 moved the processing of mapResponses into a new goroutine,
serialized via mapSession's changequeue. Thus, controlclient stopped sleeping in the
same goroutine servicing mapResponses/control connections. This commit brings us back
to sleeping synchronously in the same goroutine as controlclient.

Updates #12639

Signed-off-by: Amal Bansode <amal@tailscale.com>
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Co-authored-by: Claus Lensbøl <claus@tailscale.com>
In aa5da2e we made the IPN bus include deltas, including the
PeersRemoved, sending a slice of integer NodeIDs that were
removed. But when updating xcode, I realized there was no way to map
those integers to the stable node IDs used in other places.

I was considering changing the just-added ipn.Notify.PeersRemoved from
an IntID to a string StableID, but then it wouldn't match the MapResponse
wire protocol, which we've tried to match so far.

Instead, just add the integer ID as well. Callers can use whichever
world they want, having both. It's a little regrettable that we still
have two worlds of IDs, but oh well. Neither is really suitable to a
hypothetical future fully federated world of control servers anyway,
so we'll need a third type later anyway, so just live with the two we
have for now.

Updates #12542

Change-Id: Ib8fd48a265e1da1f8779152f141f624a7f7260e9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Fixes #19834

Change-Id: I4d48efed00cd080b14c6fd713ff21e53a5a6ee3c
Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
#19846)

There are two situations in which tailscaled transitions into a paused state:
1. when tailscaled’s controlclient is initially created,
2. when tailscale down, or the GUI equivalent, commands it to.

This patch unifies the implementation of both scenarios into
LocalBackend.shouldPauseControlClientLocked to prevent the
implementation from drifting.

The flaky tstest/integration.TestNoControlConnWhenDown test exposed
this mismatch, but only by accident. This patch also changes
TestNode.MustDown so that it runs `tailscale down` and then waits for
the testcontrol server to finish handling any associated /machine/map
requests.

Fixes #19831

Signed-off-by: Simon Law <sfllaw@tailscale.com>
Updates #cleanup

Signed-off-by: Simon Law <sfllaw@tailscale.com>
Signed-off-by: Yago Raña Gayoso <yago.rana.gayoso@gmail.com>
Previously we had two maps keyed on a direction-specific tuple, with
distinct values containing the data (action) for that direction.
Values pointed at each other across maps to ensure they were removed
at the same time in the case of tuple overwrite, but LRU eviction
was per-map. So if LRU was turned on, it was possible for one
direction's data (action) to be evicted and leave the other direction
dangling.

NewFlow replaces the two direction-specific flow constructors, and
lookups return the direction-specific PacketAction directly.

Now the values in each map point to the same element, with data for both
directions in the element. A linked list also points to the elements to
implement LRU. The previous flowtrack.Cache is removed.
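
The shape of the new structure can be sketched with container/list. All names here (tuple, flow, table, add, lookup) are hypothetical and string actions stand in for PacketAction; the sketch assumes a capacity of at least one and shows only how both tuples share one element and one LRU list.

```go
package main

import (
	"container/list"
	"fmt"
)

// tuple identifies one direction of a flow.
type tuple struct{ src, dst string }

func (t tuple) reverse() tuple { return tuple{src: t.dst, dst: t.src} }

// flow holds the data for both directions in a single element.
type flow struct {
	fwd                  tuple
	fwdAction, revAction string
}

type table struct {
	max     int
	lru     *list.List              // front is the most recently used flow
	byTuple map[tuple]*list.Element // both directions map to one element
}

func newTable(max int) *table {
	return &table{max: max, lru: list.New(), byTuple: make(map[tuple]*list.Element)}
}

// removeElem drops a flow and both of its map entries together.
func (t *table) removeElem(e *list.Element) {
	f := t.lru.Remove(e).(*flow)
	delete(t.byTuple, f.fwd)
	delete(t.byTuple, f.fwd.reverse())
}

// add inserts a flow, replacing any flow that already uses either tuple.
func (t *table) add(fwd tuple, fwdAction, revAction string) {
	for _, k := range []tuple{fwd, fwd.reverse()} {
		if e, ok := t.byTuple[k]; ok {
			t.removeElem(e)
		}
	}
	e := t.lru.PushFront(&flow{fwd: fwd, fwdAction: fwdAction, revAction: revAction})
	t.byTuple[fwd] = e
	t.byTuple[fwd.reverse()] = e
	for t.lru.Len() > t.max {
		t.removeElem(t.lru.Back()) // evict the least recently used flow
	}
}

// lookup returns the action for the direction that k represents.
func (t *table) lookup(k tuple) (string, bool) {
	e, ok := t.byTuple[k]
	if !ok {
		return "", false
	}
	t.lru.MoveToFront(e)
	f := e.Value.(*flow)
	if k == f.fwd {
		return f.fwdAction, true
	}
	return f.revAction, true
}

func main() {
	tb := newTable(1)
	ab := tuple{"10.0.0.1:1234", "10.0.0.2:80"}
	tb.add(ab, "accept", "accept-reply")
	fmt.Println(tb.lookup(ab.reverse()))
	tb.add(tuple{"10.0.0.3:1", "10.0.0.4:2"}, "accept", "accept-reply")
	fmt.Println(tb.lookup(ab)) // evicted in both directions at once
}
```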

The single LRU structure will allow us to implement idle time expiration
by walking the list backward starting with the least recently used flow, and
stopping after a fixed number of flows, or at the first non-expired flow.

We add commented-out unused placeholder fields for tracking the
"last seen" timestamp, and an on-removal hook, to document the intent for
the follow-up expiry work.

Updates tailscale/corp#38630

Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
When we use assigned addresses in response to a DNS request, extend the
expiry on the assignment.

Updates tailscale/corp#39975

Signed-off-by: Fran Bull <fran@tailscale.com>
alexwlchan and others added 30 commits June 26, 2026 11:12
Occasionally CI jobs will flake because downloading from GitHub fails.
Allow retrying up to 3 times to reduce CI flakiness.

Updates #cleanup

Change-Id: Ib019e89ac74b81d78f71a40099b20ff60014a81f
Signed-off-by: Alex Chan <alexc@tailscale.com>
…err (#19968)

On optimistic lock error, requeue the event after a short duration.

Resolves a case where a failure to acquire an optimistic lock on the
dnsrecords configmap will cause the operator to drop a reconcile event
and leave the configmap in an undesirable state.

Updates #19946

Signed-off-by: Alex Freestone <freestone.alex@gmail.com>
Updates tailscale/corp#44019

WebClient is very useful for remote management on tvOS (which cannot
do SSH). Let's include it there. Minimal corresponding tailscale/corp
changes to add UI for setting the required prefs will follow.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
We stopped reading this field nearly two years ago, with a TODO comment
to remove it sometime in 2025.

It is now 2026.

Updates #12058

Change-Id: I8ddf1c2e4c3c428e8d45a6491d3899368ec52c30
Signed-off-by: Alex Chan <alexc@tailscale.com>
…nsion

The ACME serialization mutex (acmeMu) was a package-level global, and
several ACME-related fields lived on LocalBackend even though the
cert code is conditional and not linked into every binary. With
multiple tsnet.Servers in one process (each its own LocalBackend),
a process-wide acmeMu also serialized unrelated backends.

Introduce a new feature/acme extension that owns the per-LocalBackend
ACME/cert state in an ipnlocal.CertState value:

  - acmeMu, renewMu, renewCertAt (previously package globals)
  - pendingACMETLSALPNCerts, pendingCertDomains{,Mu},
    getCertForTest, certRefreshCancel (previously LocalBackend
    fields, only meaningful when ACME was compiled in)

ipnlocal/cert.go now reaches the state through b.certState(), which
is routed by a feature.Hook installed at init by feature/acme. The
CertState type lives in ipnlocal so cert.go can access its fields
directly without a method explosion; the extension in feature/acme
constructs and owns it.

This is a baby step. The end goal is for the entire cert/ACME code
to live in feature/acme, with ipnlocal only retaining whatever thin
hooks the rest of LocalBackend needs to call into it. The current
split (CertState and most of cert.go in ipnlocal, extension wrapper
in feature/acme) is a deliberately temporary middle ground that
keeps this PR small while making the next moves mechanical.

The package is named feature/acme to match the existing HasACME /
ts_omit_acme naming. condregister/maybe_acme.go wires it in for
non-js builds.

Updates #12614
Updates #20248
Updates #20249

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I520909f24ad11a9622ef33c2290fe36ad44d6f71
GitHub's built-in CODEOWNERS only supports a hard "block until a team
member reviews" rule, with no way to leave an audit trail when the
requirement is intentionally bypassed. Move review enforcement to
palantir/policy-bot (https://github.com/palantir/policy-bot) running
at https://policybot.corp.ts.net, which lets us express the same
tailcfg/ -> control-protocol-owners rule plus an explicit override:
any other @tailscale/dev member can post

    policybot-override: <reason>

as a PR comment and that comment counts as their approval, with the
reason recorded in the PR conversation as a permanent audit trail.

CODEOWNERS is kept as a one-screen comment so anyone landing on it
expecting the old behavior is directed to .policy.yml.

Updates tailscale/corp#13972

Change-Id: I2dc3619c498d4c4a6decae29aa123f6d67905eed
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The override comment didn't work as expected.
(I'll be updating the policytest package to handle this)

Updates tailscale/corp#13972

Change-Id: Ic5c16eed09c8cb5fa8dab37d43cf05f8dfa75d49
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
prometheus/common v0.66/v0.67 introduced a mandatory
model.ValidationScheme on expfmt.TextParser as part of
prepping for UTF-8 metric/label names in Prometheus 3.0. The
zero value is intentionally UnsetValidation, which panics on
the first call to IsValidMetricName / IsValidLabelName with

  Invalid name validation scheme requested: unset

so the long-standing "var parser expfmt.TextParser" pattern
crashes at runtime. Several big downstreams have hit the same
sharp edge:

  thanos-io/thanos#8823
  grafana/loki#21401

Switch our two callers (parseMetrics in tsnet's
TestUserMetricsByteCounters and the client-metrics scraper in
tstest/natlab/vmtest) to the new expfmt.NewTextParser
constructor with model.LegacyValidation. LegacyValidation
matches the classic ASCII metric/label naming rules that
tailscaled's exporter uses today; if and when we ever emit a
metric with a UTF-8 name, we can revisit.

Goes to v0.69.0 (the latest at the time of writing) rather
than v0.67.5 so we pick up the unrelated security fixes for
cross-host redirects.

Done in advance so a follow-up change can pull in
github.com/tailscale/policybottest (which depends on
palantir/policy-bot, which transitively requires
prometheus/common at v0.67+) without dragging this debugging
into that PR.

Updates tailscale/corp#13972

Change-Id: I4b37db9ad3bebef1a32d9020bf6f8790bab25336
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a .policy-tests.yml file with tests exercising the policy
that was just landed: the tailcfg/ control-protocol-owners gate,
the "policybot-override:" comment escape hatch (including
defaults-regression guards so the override rule does not
silently accept a normal review or a 👍 comment), and the
always-on "any tailscale/dev review" baseline.

Updates tailscale/corp#13972

Change-Id: I42afb06b0771658c803512cb5de4701450c8a704
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
WhoIs lookups for an IPv6-mapped IPv4 address such as
"::ffff:100.87.98.86" failed to match the node's canonical IPv4
address. Unmap the address before looking it up so these resolve.
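
The fix relies on netip.Addr.Unmap from the standard library. A minimal sketch of the idea, with a hypothetical lookup table and helper names standing in for the real WhoIs code:

```go
package main

import (
	"fmt"
	"net/netip"
)

// sampleNodes is a hypothetical table keyed by canonical node address.
func sampleNodes() map[netip.Addr]string {
	return map[netip.Addr]string{
		netip.MustParseAddr("100.87.98.86"): "node-a",
	}
}

// whoIs parses s and unmaps IPv4-mapped IPv6 addresses before the lookup,
// so "::ffff:100.87.98.86" and "100.87.98.86" hit the same key.
func whoIs(nodes map[netip.Addr]string, s string) (string, bool) {
	ip, err := netip.ParseAddr(s)
	if err != nil {
		return "", false
	}
	name, ok := nodes[ip.Unmap()]
	return name, ok
}

func main() {
	nodes := sampleNodes()
	fmt.Println(whoIs(nodes, "::ffff:100.87.98.86"))
	fmt.Println(whoIs(nodes, "100.87.98.86"))
}
```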

Fixes #20235

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Bouke van der Bijl <i@bou.ke>
Move all the FooForTest methods on LocalBackend to instead be
methods on a new unexported forTest type which is then given out
to callers in other packages via an exported ForTest method
(panicking in non-test contexts) that returns that unexported type.

This is unusual style (exported returning unexported) but declutters
godoc and makes call sites both more explicit and easier to read
without the "ForTest" suffix polluting the symbols. Now FooForTest()
changes into ForTest().Foo().
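
A small sketch of the pattern, using a hypothetical Backend type rather than LocalBackend. testing.Testing requires Go 1.21 or later; the inTest variable is an addition of this sketch so the guard can be exercised outside a test binary.

```go
package main

import (
	"fmt"
	"testing"
)

// inTest reports whether we are running under "go test". It is a variable
// only so this sketch can be driven from a plain program.
var inTest = testing.Testing

type Backend struct{ state string }

func (b *Backend) State() string { return b.state }

// forTest is unexported, so its methods stay out of Backend's method set
// and godoc, yet remain reachable through ForTest.
type forTest struct{ b *Backend }

// ForTest returns the test-only accessor and panics in non-test contexts.
func (b *Backend) ForTest() forTest {
	if !inTest() {
		panic("ForTest called outside of a test")
	}
	return forTest{b}
}

// SetState replaces what would otherwise be SetStateForTest.
func (f forTest) SetState(s string) { f.b.state = s }

func main() {
	b := &Backend{}
	defer func() { fmt.Println("recovered:", recover()) }()
	b.ForTest().SetState("Running") // panics here: not a test binary
}
```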

This was motivated by a pending change moving a bunch of code out of
LocalBackend into other packages that required adding more ForTest
methods to LocalBackend to keep the tests (now in other packages)
working. Instead, do this refactor now so the future change is prettier.

Updates #12614
Updates #cleanup

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Ib25e6d76d48dc8622ac3a955e0b1220d582e63a8
This was missing in the earlier f5eac39 and meant that tsnet users weren't
getting (all of) acme support.

Thanks to @ChaosInTheCRD and @BeckyPauley for debugging.

Updates #12614
Updates #20252

Change-Id: I176a7b179b2ad3726aca484057f0aae7cc3561c8
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Tests in magicsock_test.go would routinely emit this warning:

	## WARNING: (non-fatal) nil health.Tracker (being strict in CI):

because they would run NewConn without initializing a health.Tracker.

This patch initializes Conn correctly with a health.Tracker. It also
fixes some missing Close calls that can be handled in t.Cleanup.

Fixes #20263

Signed-off-by: Simon Law <sfllaw@tailscale.com>
`TestNetworkSendErrors/network-down` causes a data race because it
tried to `tstest.Replace` the `checkNetworkDownDuringTests` global
while `wgengine.Conn.networkDown` would read from it. This patch moves
this flag into a field within the `wgengine.Conn` struct, so there’s
no chance that two tests could trample on each other.

It also renames this field to `Conn.checkNetworkUpDuringTests`,
because `Conn.networkUp` is the name of the field that gets checked.

Fixes #20260

Signed-off-by: Simon Law <sfllaw@tailscale.com>
…e/acme

f5eac39 ("feature/acme, ipn/ipnlocal: start moving ACME/cert state
into an extension") started to move the cert code into feature/acme
but was meant as a baby step.

This goes further, moving almost everything, leaving only some hooks
in ipnlocal.

When we later move "serve" support out to feature/serve, this will
look a bit different in that the hooks currently in ipnlocal will move
to feature/serve (cert support already depends on serve).

As part of this, cert-related tests move to feature/acme too, which
means some test infra from ipnlocal now moves to shared ipnlocaltest.
(it's not big at the moment, but I imagine it growing)

Updates #12614

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I9ea89aa9754f12d54b81751b6bd830f2664241ff
Currently, PeerAPI DNS is only allowed if
1. The peer is owned by the same user as this device, or
2. The node is an exit node or app connector
  a. and the peer has access to a hypothetical DNS server at 0.0.0.0:53
     (which approximately means "the peer has access to
     autogroup:internet")

None of this is useful for conn25. This adds the most basic of hooks
(and converts the existing logic to a hook, which should improve clarity
and lead to the possibility of moving the existing checks into feature
packages in future).

There is an extra filter based on the name being queried that is
performed later. It refuses names in
tailcfg.DNSConfig.ExitNodeFilteredSet. That filter is not modified by
this change.

With this change, if conn25 is configured as a connector, then all
PeerAPI DNS queries are permitted (still subject to the
ExitNodeFilteredSet as noted above).

More work is required: the goal before release (i.e. before the WIPCode
check is removed) is that each query should be checked against the list of
domains in the requested conn25 app. For now, this only verifies that
conn25 is configured (and does not include the autogroup:internet
check, which is not how conn25 grants will operate when implemented,
soon).

This change has been manually tested against the scenario outlined in
tailscale/corp#40117; unfortunately the code's structure makes writing a
unit test difficult. The more comprehensive changes needed for
tailscale/corp#40076 should include an integration test that covers this
case.

The hook must go in the ipnlocal package rather than the usual extension
host to prevent a circular dependency on the ipnlocal.PeerAPIHandler
interface. Registering PeerAPI handlers uses a similar strategy, likely
due at least in part to this same problem.

Updates tailscale/corp#40076
Fixes tailscale/corp#40117

Change-Id: I367714170b509d7a421f62672e5824b3590c2b9c
Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
All issuances serialise through a single mutex in tailscaled. The old
300s timeout fired while a predecessor was legitimately mid-ACME,
causing the queued loop to advance retryCount on a non-failure. 30m
covers ~15 queued flows and works as a wedge detector against true
hangs.

Updates #20288
Updates #42164

Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
This adds a Created field to LoginProfile to normalize the sort order
of login profiles presented in the various client GUIs. The default
sort order for existing profiles remains unchanged and continues to be
based on Name. Newly added profiles will be stamped at creation time
and returned at the top of the list of unstamped profiles, sorted by
creation date in descending order.

The rationale is to ensure that all clients present the user's profile
list in the same order, regardless of newly added accounts, name
changes, or nickname overrides.

The Mac client was recently updated to remove various custom profile
sorting behaviors (tailscale/corp#43847).
iOS, Android, and Windows do not currently perform GUI-level sorting,
so this change should propagate to them seamlessly.

Updates tailscale/corp#43843

Signed-off-by: Will Hannah <willh@tailscale.com>
Adds two Gokrazy-based vmtests covering the tailscaled web client at
port 5252:

* TestWebClientLocalAccess enables the web client on a single node
  and exercises the canonical owner session flow against the node's
  own Tailscale IP: an unauthenticated GET /api/auth that identifies
  the caller, a GET /api/auth/session/new that issues a
  TS-Web-Session cookie, and a final GET /api/auth that reports
  authorized=true with the cookie.

* TestWebClientRemoteAccess runs the same session flow from a peer
  node on the same tailnet against a second target node's web
  client, exercising netstack interception of incoming :5252
  traffic, cross-node WhoIs, and the same-user "owner" path. It
  then flips the test control server's AllNodesSameUser off,
  re-logs in the client under a fresh identity, and asserts that
  GET /api/auth/session/new returns 401 with body "not-owner" --
  exercising the cross-user rejection in client/web/auth.go.

To make the natlab test environment exercise the same code path
as production (check mode, where the web client posts to
/machine/webclient/init via Noise and waits on a control-issued
auth URL), this also:

* Allowlists the natlab fake control hostname "control.tailscale"
  in client/web/auth.go's controlSupportsCheckMode so the web
  client follows the check-mode branch rather than the
  no-check-mode shortcut that immediately marks new sessions
  authenticated.

* Adds /machine/webclient/{init,wait} handlers to testcontrol.
  init returns a placeholder auth ID and URL; wait returns
  Complete=true immediately, so the web client's awaitUserAuth
  resolves on its first call. Together these let the tests drive
  the full check-mode session lifecycle without a real
  browser-click loop.

To support the multi-request HTTP flows from the test harness,
this also adds:

* vmtest.Env.HTTPGetStatus, a sister of HTTPGet that returns the
  upstream status code, body, and Set-Cookie cookies (as a
  vmtest.HTTPResponse) and accepts cookies on the outgoing
  request, so tests can drive flows that depend on cookie
  continuity.

* Cookie pass-through in cmd/tta's /http-get handler: it forwards
  the Cookie request header upstream and surfaces upstream
  Set-Cookie response headers downstream. This is what lets
  HTTPGetStatus carry a session cookie across requests.

Previously the only tests of the web client were in-process
httptest-based handler tests in client/web/web_test.go; nothing
exercised the actual port 5252 listener wiring, the cross-node
auth path, cookie-driven session state transitions through the
check-mode control round-trip, or the not-owner rejection end
to end.

Updates #13038

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Idb01486a89b53ac02c6ad3358bcfcceca90dbc36
…date for Gokrazy

Builds on top of the unsigned URL-based GAF update flow added previously
(see referenced issue for context). The pkgs.tailscale.com server now
publishes signed GAFs for the unstable track, with detached ed25519
signatures produced by pkgsign's signdist path (the same distsign scheme
used for every other release artifact). This change consumes them.

The URL-based path (tailscale update --gokrazy-update-from-url=URL) now
verifies the signature by default using clientupdate/distsign.Client,
which fetches distsign.pub from the root of the host serving the GAF and
checks the .sig against the root keys embedded in this binary. The
--unsigned flag stays for TestGokrazyUpdatesItselfToSameImage, whose
in-test fileserver does not publish distsign.pub.

The bare tailscale update path is now wired up for the Tailscale
appliance image. It fetches <pkgs>/<track>/?mode=json, picks the GAF
whose key matches the local device (vm-amd64, vm-arm64, or pi-arm64,
where arm64 is split via /sys/firmware/devicetree/base/model), confirms
the version with the user, and reuses the verified download path above.

To avoid wiping a user's custom Gokrazy build that happens to include
tailscaled, the bare update path is gated on hostinfo.Package == "tsapp",
which is only set when the new ts_appliance build tag is present
(mirroring the existing ts_package_container tag). The
gokrazy/tsapp*/config.json files now pass GoBuildTags ["ts_appliance"]
for the tailscale and tailscaled packages so monogok bakes the tag into
the official appliance builds. The TS_FORCE_ALLOW_TSAPP_UPDATE env var
is an escape hatch for callers who want to force the appliance update
path on a non-appliance build. The URL-based path stays ungated since it
requires explicit user intent (and is exercised by the natlab vmtest).

Updates #20002

Change-Id: I7c7856a88bf3dffb9eb8d3e9111fad0b3906743c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This adds the NotifyInitialPolicy watch option and the Policy field in
Notify so that clients can receive the effective policy snapshot via IPN
bus.

This extends policyclient.Client so ipnlocal can get and watch policy
snapshots, which is used by sysPolicyChanged to notify watchers.

User-scoped policy store registration, management, and cleanup will be
added in a follow-up.

Updates tailscale/corp#42259

Signed-off-by: kari <kari@tailscale.com>
Tailscaled had no way to seed device-scope syspolicy settings short of
environment variables or a custom store wired up out of tree. Add a
--syspolicy-file flag whose default points at a well-known JSON file
that, when present, is parsed as a map[string]any and registered as a
device-scope policy source. The default path is
/etc/tailscale/syspolicy.json on every non-Windows platform (Linux, the
BSDs, illumos/Solaris, and tailscaled-without-the-GUI on macOS) and
%ProgramData%\Tailscale\syspolicy.json on Windows. The flag lets users
running tailscaled by hand (development, custom installs) point it at
an alternate file, and "" disables the load entirely.

JSON values map to setting types as expected: strings to
StringValue/PreferenceOptionValue/VisibilityValue/DurationValue (e.g.
"24h" parsed by time.ParseDuration), booleans to BooleanValue, numbers
to IntegerValue, and string arrays to StringListValue. The file is
validated against the registered setting definitions at load time so
unknown keys and value/type mismatches fail startup loudly rather than
producing surprising defaults at first read.

When HuJSON support is linked into the build (default; opt out with
ts_omit_hujsonconf), the file may use HuJSON (comments, trailing
commas). With ts_omit_hujsonconf it must be pure standard JSON. This
mirrors the pattern used by ipn/conffile.

On Windows the JSON file and the existing HKLM registry store both
register at DeviceScope. rsop merges later-registered same-scope
sources over earlier ones, so per-key values in the file override the
registry while keys absent from the file fall back to the registry.

The loader is registered via a feature.Hook from a file gated by
!ts_omit_syspolicy, and called from main after flag parsing. tsnet
still does not depend on the root syspolicy package, so embedders
don't pick this up implicitly.

Fixes #20305

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Ie6326461c14efb226979ac162998a9c6373ce493
We can use them for traffic until they are actually removed from the
table.

Updates tailscale/corp#43180

Co-authored-by: Fran Bull <fran@tailscale.com>
Co-authored-by: Michael Ben-Ami <mzb@tailscale.com>
Signed-off-by: Fran Bull <fran@tailscale.com>
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
Conn25 hands out dummy IP addresses for use in the connector flow from
limited address pools. When the addresses are no longer in use we expire
the corresponding entry from our table of address mappings and return
the addresses to their pools for reuse.

We currently expire addresses after the DNS TTL for the DNS response
that caused the mappings to be created.

Stop expiring mappings when there are active packet flows for the
addresses in the mappings.

Fixes tailscale/corp#43180

Co-authored-by: Fran Bull <fran@tailscale.com>
Co-authored-by: Michael Ben-Ami <mzb@tailscale.com>
Signed-off-by: Fran Bull <fran@tailscale.com>
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
Replace the doubling backoff (1m, 2m, 4m, ...) with LE's recommended
1m, 10m, 100m, daily. The old schedule burned retry attempts inside
the rate-limit window without speeding recovery.

Updates #20288
Updates #19895

Signed-off-by: chaosinthecrd <tom@tmlabs.co.uk>
Adds a CLI subcommand that downloads a signed Tailscale appliance
image (Gokrazy archive format, GAF) from pkgs.tailscale.com,
constructs a fresh GPT-partitioned disk from it (mbr.img + a
synthesized partition table + boot.img + root.img), formats /perm
as ext4 in pure Go via go-diskfs, and ejects the disk so a user
running on a regular workstation can flash an SD card or homelab
VM disk in one command without installing e2fsprogs.

On macOS the target disk is auto-discovered via diskutil, skipping
the boot disk and anything bigger than 256 GB out of paranoia. On
Linux the user passes --disk=/dev/sdX explicitly. Windows is not
supported yet and the command returns an error.

The GPT layout matches monogok's full-disk layout via the new
public github.com/bradfitz/monogok/disklayout package; a drift-
guard test inside monogok asserts the two implementations stay
byte-identical so OTA updates against monogok-built images keep
working.

Enabled by default; building with the ts_omit_flashappliance tag omits it.

Updates #1866

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: Ic1a8cd185e7039edccb7702ab4104544fcb58d29
Add three new helpers to the existing progresstracking package:

  - Ticker: spawns a 1 Hz goroutine that calls a report function with
    the current value of an atomic counter and a total. Returns a stop
    function (safe to call multiple times via sync.OnceFunc) that fires
    one final report and blocks until the goroutine exits.

  - NewWriter: wraps an io.Writer and calls onProgress at most once per
    interval with the cumulative byte count.

  - CountingWriter: an io.Writer that atomically counts bytes written,
    for use with Ticker.

These will be used by the appliance flash and OTA update code in
subsequent commits.

Updates #1866

Change-Id: If353cea6506f5351b6fb19bfdb7bc9b78fe7855e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
We borked this in 30a89ad and started including skipped extensions
(e.g., conn25 when TAILSCALE_USE_WIP_CODE != 1) in the list of active
ones.

This doesn't have any impact other than on logging, though.

Updates #cleanup

Signed-off-by: Nick Khyl <nickk@tailscale.com>
Update ts-gokrazy to b83088f which includes:
      - Skip hardware watchdog when nowatchdog is on kernel cmdline
      - gokrazy.log_to_serial=1 tees service logs to /dev/console
      - Fix /etc/resolv.conf symlink (point at /tmp/resolv.conf where
        userspace DHCP writes, not /proc/net/pnp which is always empty)

All of these changes are mostly about emulating a Raspberry Pi in qemu
when doing local development of the appliance image.

Updates #1866

Change-Id: Iba7847e5deb237b1e485b74a4126e31fd118333a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>