Open
63 commits
492eb6c
feat: implement federated deployment scheduling across POP cells
scotwells May 20, 2026
400144d
feat: add kustomize overlays for federated deployment to staging and …
scotwells May 26, 2026
0f69956
feat: replace cert-manager certificate resources with CSI volume moun…
scotwells May 26, 2026
a11861e
feat: remove webhook CA injection — Milo trusts the cert issuer directly
scotwells May 26, 2026
721e5a9
feat: make CSI webhook cert component generic, patch issuer per overlay
scotwells May 26, 2026
1b42ba5
fix: simplify WD status aggregation to pass through single member status
scotwells May 26, 2026
7116d78
refactor: move webhook cert issuer patch to infra repo
scotwells May 26, 2026
cbd64ab
fix: repair dev and single-cluster overlays after certmanager base re…
scotwells May 26, 2026
dfb266a
fix: only run WorkloadReconciler on management cluster
scotwells May 26, 2026
85d5281
fix: reduce cyclomatic complexity of main by extracting loadServerConfig
scotwells May 27, 2026
079a282
fix: forward Extra claims in network SubjectAccessReview
scotwells May 27, 2026
fba4f68
fix: move metricRules from billing to quota in ServiceConfiguration
scotwells May 27, 2026
c27ad39
feat: Route ResourceClaims to Milo project control planes
scotwells May 27, 2026
d184d1a
fix: add missing quota metrics to ServiceConfiguration
scotwells May 27, 2026
33eeb95
fix: default controller flags to false, move clusterName validation t…
scotwells May 27, 2026
15fe991
fix: correctly remove clusterName validation from initializeClusterDi…
scotwells May 27, 2026
ecc0f49
fix: grant compute-manager namespace read/write in Karmada downstream…
scotwells May 28, 2026
586bc4f
feat: switch to LocationBinding for location discovery and availability
scotwells May 28, 2026
cd007e4
fix: add RBAC for networking.datumapis.com and quota.miloapis.com res…
scotwells May 28, 2026
b896d12
refactor: rename downstream-kubeconfig flag to upstream-kubeconfig
scotwells May 28, 2026
7e90f65
fix: bump milo to feat/upgrade-controller-runtime-v0.23-clean for Pro…
scotwells May 28, 2026
1d1760e
fix: repair cell-controller instance reconciler wiring in main.go
scotwells May 28, 2026
5c04e29
feat: add quota client support for single-tenant cell mode
scotwells May 28, 2026
81e73c3
ci: bump Go version to 1.25 to match go.mod requirement
scotwells May 28, 2026
bed3d12
ci: bump golangci-lint to v2.2.2 for Go 1.25 compatibility
scotwells May 28, 2026
0d26598
ci: bump golangci-lint to v2.12.2 (latest, built with Go 1.25)
scotwells May 28, 2026
951d022
fix: write instance back to Karmada on every reconcile, not just on s…
scotwells May 28, 2026
3ac5115
fix: skip upstream write-back when spec, labels, and status are uncha…
scotwells May 28, 2026
c15161e
refactor: rename writeBackToDownstream -> writeBackToUpstream
scotwells May 28, 2026
385b974
refactor: rename DownstreamClient -> UpstreamClient across all contro…
scotwells May 28, 2026
a5916b9
fix: remove clusterName requirement in Milo mode for management plane
scotwells May 28, 2026
9cee4fa
feat: wire cell-mode quota enforcement and remove vendored deps
scotwells May 28, 2026
da63916
fix: treat missing quota kubeconfig file as quota-disabled rather tha…
scotwells May 28, 2026
9b36548
fix: mark compute-quota-credentials secret volume as optional
scotwells May 28, 2026
72a51ad
fix: instance reconcile hangs on cluster-scoped Namespace informer wh…
scotwells May 28, 2026
b662eb9
fix: projectIDForInstance reads wrong label — use upstream-cluster-na…
scotwells May 28, 2026
1a5b05d
fix: wrong claim namespace and resourceRef cause quota POST 403 in si…
scotwells May 28, 2026
aca19a6
fix(lint): resolve gofmt, lll, and reduce gocyclo introduced by quota…
scotwells May 29, 2026
71fc7f2
feat: declare datum-managed supported location class for compute
scotwells May 29, 2026
5486adf
fix(lint): go fully green — migrate webhook to non-deprecated APIs an…
scotwells May 29, 2026
426d547
fix(lint): resolve goconst/prealloc CI failures and add error-returni…
scotwells May 29, 2026
e3e31d0
fix(quota): harden quota failure modes — fail-loud startup, fail-clos…
scotwells May 29, 2026
97f2165
fix(lint): nolint GetEventRecorderFor — new events API has incompatib…
scotwells May 29, 2026
c1c6261
Merge pull request #118 from datum-cloud/fix/instance-namespace-infor…
scotwells May 29, 2026
6ae41d4
fix: fail loud on missing federation kubeconfig; rename to Federation…
scotwells May 29, 2026
553af62
Merge pull request #120 from datum-cloud/fix/mgmt-controller-fail-loud
scotwells May 29, 2026
5024b97
feat: add NetworkingIntegration feature gate to bypass VPC/NSO on edg…
scotwells May 29, 2026
70579e3
Merge pull request #121 from datum-cloud/feat/networking-feature-flag
scotwells May 29, 2026
77610b8
feat(config): wire feature gates through FEATURE_GATES env var in bas…
scotwells May 29, 2026
cd052a6
Merge pull request #122 from datum-cloud/feat/feature-gates-env-var
scotwells May 29, 2026
8059396
fix: Carry workload linking labels through instance write-back
scotwells May 29, 2026
2a97077
feat(api): add Command and Args fields to SandboxContainer
scotwells May 29, 2026
4ed7bb7
fix(lint): replace "d1-standard-2" literals with testInstanceType con…
scotwells May 29, 2026
fa711b9
Merge pull request #125 from datum-cloud/feat/sandbox-container-comma…
scotwells May 29, 2026
dd3421a
feat: stamp self-describing labels and location on instances
scotwells May 29, 2026
3574861
fix: Resolve instance projection owner by workload deployment name
scotwells May 29, 2026
94ad201
fix: backfill controller labels on existing instances
scotwells May 29, 2026
82955e2
fix: remove Quota scheduling gate in the same reconcile pass as statu…
scotwells May 29, 2026
6bcd822
fix: Guard nil Instance controller status when counting replicas
scotwells May 29, 2026
6583d48
fix: Stamp project WorkloadDeployment UID on instance projections
scotwells May 29, 2026
d25bdf3
feat: Sync WorkloadDeployment status to the project plane
scotwells May 29, 2026
3ffe45e
docs: Tighten code comments and add comment conventions
scotwells May 30, 2026
50dc0c0
docs: Trim narration comments in status syncer and wiring
scotwells May 30, 2026
4 changes: 2 additions & 2 deletions .github/workflows/lint.yml
@@ -15,9 +15,9 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v5
         with:
-          go-version: '~1.24.0'
+          go-version: '~1.25.0'

       - name: Run linter
         uses: golangci/golangci-lint-action@v8
         with:
-          version: v2.1.5
+          version: v2.12.2
3 changes: 3 additions & 0 deletions .github/workflows/publish.yaml
@@ -18,6 +18,7 @@ jobs:
     secrets: inherit

   publish-kustomize-bundles:
+    needs: publish-container-image
     permissions:
       id-token: write
       contents: read
@@ -26,4 +27,6 @@ jobs:
     with:
       bundle-name: ghcr.io/datum-cloud/compute-kustomize
       bundle-path: config
+      image-name: ghcr.io/datum-cloud/compute
+      image-overlays: config/base/manager
     secrets: inherit
2 changes: 1 addition & 1 deletion .github/workflows/test-e2e.yml
@@ -15,7 +15,7 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v5
         with:
-          go-version: '~1.24.0'
+          go-version: '~1.25.0'

       - name: Install the latest version of kind
         run: |
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -15,7 +15,7 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v5
         with:
-          go-version: '~1.24.0'
+          go-version: '~1.25.0'

       - name: Running Tests
         run: |
7 changes: 5 additions & 2 deletions .gitignore
@@ -14,8 +14,8 @@
 # Output of the go coverage tool, specifically when used with LiteIDE
 *.out

-# Dependency directories (remove the comment below to include it)
-# vendor/
+# Dependency directories
+vendor/

 # Go workspace file
 go.work
@@ -25,3 +25,6 @@ go.work.sum
 .env

 bin/
+
+# Local e2e environment artefacts (Kind kubeconfigs, etc.)
+tmp/
10 changes: 10 additions & 0 deletions .golangci.yml
@@ -35,6 +35,16 @@ linters:
           - dupl
           - lll
         path: internal/*
+      # field.ErrorList{} is the idiomatic Kubernetes validation init pattern;
+      # preallocating requires knowing the error count in advance which is not
+      # possible in recursive validation helpers.
+      - linters:
+          - prealloc
+        path: internal/validation/
+      # Test helpers that build slices via append are clearer without prealloc.
+      - linters:
+          - prealloc
+        path: internal/controller/instancecontrol/
     paths:
       - third_party$
       - builtin$
221 changes: 221 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,221 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

`compute` defines the APIs and core controllers for the `compute.datumapis.com`
API group (`Workload`, `WorkloadDeployment`, `Instance`). It does **not**
provision infrastructure itself — it expresses intent that infrastructure
providers (e.g. `infra-provider-gcp`) act on, and it federates deployments
across edge/POP cells via Karmada. Networking types (`Network`,
`NetworkBinding`, `SubnetClaim`) come from the separate
`network-services-operator` repo.

## Commands

Build/test/lint use **Make**. The e2e environment is driven by **Task** (`Taskfile.yaml`).

```sh
make build # build bin/manager (runs manifests, generate, fmt, vet first)
make run # run controller locally against ~/.kube/config
make test # envtest-backed unit + integration tests (excludes /e2e)
make lint # golangci-lint (v2.12.2); `make lint-fix` to auto-fix
make manifests # regenerate CRDs/RBAC/webhook config (controller-gen)
make generate # regenerate zz_generated.deepcopy.go + defaults
```

After editing any `*_types.go` in `api/v1alpha/`, run `make manifests generate`
(or just `make build`/`make test`, which depend on both).

**Single test** — `make test` resolves envtest assets via `KUBEBUILDER_ASSETS`.
For integration tests that need an apiserver, set it first:

```sh
export KUBEBUILDER_ASSETS=$(bin/setup-envtest use 1.31.0 --bin-dir bin -p path)
go test ./internal/controller/... -run TestName # testify
go test ./internal/controller/... -args -ginkgo.focus "pattern" # Ginkgo specs
```

Pure unit tests (no apiserver) run with plain `go test ./pkg/...`.

**E2E** (Kind + Karmada, Chainsaw): `task e2e:up` to stand up clusters and join
them to Karmada, `task e2e:test` to run, `task e2e:down` to tear down.

Tool versions are pinned in `Makefile`: envtest K8s `1.31.0`, controller-gen
`v0.16.4`, golangci-lint `v2.12.2`. The boilerplate license header
(AGPL-3.0-only) is in `hack/boilerplate.go.txt`.

## Architecture

### Resource hierarchy

A single `Workload` fans out into many `Instance`s through two levels, each
owned and reconciled by the level above:

```
Workload ──(per placement × city)──▶ WorkloadDeployment ──(per replica)──▶ Instance
```

- **Workload** (`api/v1alpha/workload_types.go`) — user-facing. Holds an
instance template and a list of `Placements` (city codes + min/max replicas).
- **WorkloadDeployment** (`workloaddeployment_types.go`) — one per
city/placement. Status aggregates its instances' replica counts/conditions.
- **Instance** (`instance_types.go`) — a single container (`SandboxRuntime`) or
VM (`VirtualMachineRuntime`). Carries network interfaces, volumes, scheduling
gates, and quota conditions.

Objects are stamped with `compute.datumapis.com/workload-uid` and
`...workload-deployment-uid` labels for indexed lookups (`internal/controller/indexers.go`).
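
As a rough illustration of the fan-out described above, the following Go sketch expands placements into deployments and replicas into instance names. The `Placement` and `Deployment` structs, the `fanOut` and `instanceNames` helpers, the single `Replicas` field (the real type holds min/max replicas), and the naming schemes are all assumptions made for this example, not the types or names the controllers actually use.

```go
package main

import "fmt"

// Placement is a hypothetical, simplified stand-in for a Workload placement.
type Placement struct {
	Name      string
	CityCodes []string
	Replicas  int
}

// Deployment is a hypothetical stand-in for a WorkloadDeployment.
type Deployment struct {
	Name     string
	CityCode string
	Replicas int
}

// fanOut returns one Deployment per placement and city code.
// The "<workload>-<placement>-<city>" naming scheme is an assumption.
func fanOut(workload string, placements []Placement) []Deployment {
	var out []Deployment
	for _, p := range placements {
		for _, city := range p.CityCodes {
			out = append(out, Deployment{
				Name:     fmt.Sprintf("%s-%s-%s", workload, p.Name, city),
				CityCode: city,
				Replicas: p.Replicas,
			})
		}
	}
	return out
}

// instanceNames returns one name per replica, using ordinal suffixes
// (an assumed convention, chosen to mirror StatefulSet-style naming).
func instanceNames(d Deployment) []string {
	names := make([]string, 0, d.Replicas)
	for i := 0; i < d.Replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d", d.Name, i))
	}
	return names
}

func main() {
	placements := []Placement{
		{Name: "us", CityCodes: []string{"DFW", "ORD"}, Replicas: 2},
	}
	for _, d := range fanOut("web", placements) {
		fmt.Println(d.Name, instanceNames(d))
	}
}
```

One placement with two city codes and two replicas yields two deployments and four instances in total.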

### Controllers are split by plane

The binary (`cmd/main.go`) enables controller sets via flags. The two planes run
in different clusters in production:

**Management plane** (`--enable-management-controllers`):
- `WorkloadReconciler` — Workload → desired WorkloadDeployments; aggregates status back.
- `WorkloadDeploymentFederator` — replicates project-namespace WorkloadDeployments
into the downstream Karmada control plane and creates a `PropagationPolicy` per
city code (city-code label selector routes to matching cells). Federation only
(requires `--upstream-kubeconfig`).
- `InstanceProjector` — watches Instances that edge cells wrote back to Karmada
and creates read-only Instance projections in the originating project cluster,
owned by the WorkloadDeployment. Federation only.

**Cell / edge plane** (`--enable-cell-controllers`):
- `WorkloadDeploymentReconciler` — drives instance lifecycle via the instance-control
strategy, reconciles networking (NetworkBinding/SubnetClaim), manages scheduling gates.
- `InstanceReconciler` — manages quota (`ResourceClaim` against Milo project control
planes), clears the quota scheduling gate when granted, and writes the Instance
back upstream/to Karmada for projection.

### Instance control strategy

`internal/controller/instancecontrol/` computes Create/Update/Delete/Wait actions
for a deployment's instances. The `stateful/` implementation behaves like a
StatefulSet: ordered creation (wait for Ready before next), reverse-order updates
and deletes, template-hash tracking for rolling updates. Scheduling gates
(`scheduling_gates.go`) block an instance until networking and quota are ready.
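
The ordered-creation rule can be sketched in a few lines of Go. This is a minimal illustration, not the `stateful/` implementation: the `Instance` struct, the `nextAction` function, and the extra `None` action are hypothetical, and Update handling (template-hash comparison) is left out for brevity.

```go
package main

import "fmt"

// Instance is a hypothetical, minimal view of an instance: its ordinal
// within the deployment and whether it has reported Ready.
type Instance struct {
	Ordinal int
	Ready   bool
}

// Action names follow the Create/Delete/Wait vocabulary; None is added
// here only so the sketch has a value for "nothing to do".
type Action string

const (
	ActionCreate Action = "Create"
	ActionDelete Action = "Delete"
	ActionWait   Action = "Wait"
	ActionNone   Action = "None"
)

// nextAction returns the next single step for a deployment that wants
// `desired` replicas. It assumes instances are sorted by ordinal and
// contiguous from 0.
func nextAction(desired int, instances []Instance) (Action, int) {
	// Scale down in reverse order: remove the highest ordinal first.
	if len(instances) > desired {
		return ActionDelete, instances[len(instances)-1].Ordinal
	}
	// Ordered creation: do not create the next instance until every
	// existing one is Ready.
	for _, in := range instances {
		if !in.Ready {
			return ActionWait, in.Ordinal
		}
	}
	if len(instances) < desired {
		return ActionCreate, len(instances)
	}
	return ActionNone, -1
}

func main() {
	current := []Instance{{Ordinal: 0, Ready: true}, {Ordinal: 1, Ready: false}}
	action, ordinal := nextAction(3, current)
	fmt.Println(action, ordinal)
}
```

Returning one action per call keeps the decision easy to test; a reconciler would call it again on the next pass once state has changed.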

### Multi-cluster (Milo + multicluster-runtime)

Controllers run on `sigs.k8s.io/multicluster-runtime` (`mcmanager.New`) with a
pluggable cluster provider chosen by discovery mode (`internal/config/config.go`):

- **Single** mode — one local cluster (`mcsingle`), no project discovery.
- **Milo** mode (`milomulticluster`) — each Milo project becomes a runtime cluster;
the cluster name doubles as the `projectID`.

Reconcilers receive `mcreconcile.Request` (carries `ClusterName`). `InstanceProjector`
is the exception — it uses a plain single-cluster `manager.Manager` pointed at the
downstream Karmada control plane.

### Quota (Milo)

`internal/quota/` — `ProjectQuotaClientManager` caches per-project clients, each
rewriting the REST host to
`/apis/resourcemanager.miloapis.com/v1alpha1/projects/{projectID}/control-plane`.
`InstanceReconciler` creates a `ResourceClaim` in the project's control plane,
watches it (`PendingEvaluation` → `QuotaAvailable`/`QuotaExceeded`), and removes
the quota scheduling gate on grant. **Claims are immutable — created once, never
updated** (the absent Update path is intentional). Quota is optional: a missing
quota kubeconfig means quota-disabled, not fatal.
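
The per-project client cache and host rewrite might look roughly like the sketch below. Only the URL path comes from the text above; `clientManager`, `quotaClient`, `projectHost`, and the `https://milo.example.com` base are hypothetical stand-ins, and the sketch assumes project IDs are plain DNS-style names that need no escaping.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// controlPlanePath is the per-project path described in this section.
const controlPlanePath = "/apis/resourcemanager.miloapis.com/v1alpha1/projects/%s/control-plane"

// quotaClient is a hypothetical placeholder for a real API client.
type quotaClient struct {
	Host string
}

// clientManager caches one client per project ID (hypothetical type).
type clientManager struct {
	base    string
	mu      sync.Mutex
	clients map[string]*quotaClient
}

func newClientManager(base string) *clientManager {
	return &clientManager{base: base, clients: make(map[string]*quotaClient)}
}

// projectHost replaces the path of base with the project's control-plane path.
func projectHost(base, projectID string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base host %q: %w", base, err)
	}
	u.Path = fmt.Sprintf(controlPlanePath, projectID)
	return u.String(), nil
}

// clientFor returns the cached client for a project, creating it on first use.
func (m *clientManager) clientFor(projectID string) (*quotaClient, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.clients[projectID]; ok {
		return c, nil
	}
	host, err := projectHost(m.base, projectID)
	if err != nil {
		return nil, err
	}
	c := &quotaClient{Host: host}
	m.clients[projectID] = c
	return c, nil
}

func main() {
	m := newClientManager("https://milo.example.com")
	c, err := m.clientFor("proj-a")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.Host)
}
```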

### Federation data flow (when `--upstream-kubeconfig` is set)

1. Federator replicates WorkloadDeployments into Karmada namespace `ns-{project-uid}`
(via Milo's `MappedNamespaceResourceStrategy`) and creates city-code PropagationPolicies.
2. Karmada propagates each WorkloadDeployment to POP cells whose label matches the city code.
3. Edge `WorkloadDeploymentReconciler`/`InstanceReconciler` create Instances + ResourceClaims,
then write Instances back to Karmada (labeled with owner cluster/namespace/deployment-uid).
4. `InstanceProjector` resolves those labels and projects each Instance back into the
project cluster for the user to observe; status flows back up the same chain.
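
Steps 1 and 2 of the flow above can be approximated with plain label matching. In this sketch the `ns-{project-uid}` namespace format is taken from the list; the `Cell` struct, the `matchingCells` helper, and the `example.com/city-code` label key are hypothetical, and real propagation is performed by Karmada rather than by code like this.

```go
package main

import (
	"fmt"
	"sort"
)

// cityCodeLabel is a hypothetical label key used only in this sketch.
const cityCodeLabel = "example.com/city-code"

// Cell is a hypothetical, minimal view of a POP cell registered with Karmada.
type Cell struct {
	Name   string
	Labels map[string]string
}

// federatedNamespace builds the Karmada namespace name for a project UID.
func federatedNamespace(projectUID string) string {
	return "ns-" + projectUID
}

// matchingCells returns the sorted names of cells whose city-code label
// equals the given city code.
func matchingCells(cells []Cell, cityCode string) []string {
	var names []string
	for _, c := range cells {
		if c.Labels[cityCodeLabel] == cityCode {
			names = append(names, c.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	cells := []Cell{
		{Name: "pop-dfw-1", Labels: map[string]string{cityCodeLabel: "DFW"}},
		{Name: "pop-ord-1", Labels: map[string]string{cityCodeLabel: "ORD"}},
		{Name: "pop-dfw-2", Labels: map[string]string{cityCodeLabel: "DFW"}},
	}
	fmt.Println(federatedNamespace("1234"), matchingCells(cells, "DFW"))
}
```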

## Working with subagents

This is a large multi-cluster codebase — controller flows, generated code, and
federation paths span many files. Keep the main thread as an **orchestrator**:
delegate substantive work to subagents and synthesize their results, rather than
loading raw file dumps and command output into the main context.

- **Delegate by default.** Diagnosis, code search, cluster/`kubectl` checks,
multi-file edits, and PR prep should run in subagents. The main thread plans,
dispatches, and integrates the conclusions.
- **Read-only fan-out → `Explore`.** Use it to locate code or trace a flow across
many files when you only need the conclusion, not the file contents.
- **Match the specialized agent to the task** (see their descriptions):
`datum-platform:plan` for design before implementation;
`datum-platform:api-dev` for Go on the API server/controllers;
`datum-platform:test-engineer` for tests; `datum-platform:sre` for
Kustomize/CI/RBAC/manifests; `datum-platform:code-reviewer` as a post-change
gate; `datum-platform:tech-writer` for docs.
- **Parallelize independent work** — dispatch multiple agents in one turn when
their tasks don't depend on each other.
- **Give each agent a self-contained brief.** It doesn't share the main thread's
context: state the goal, the relevant paths, and the exact shape of the result
you want back. Have it return findings/paths, not large verbatim file contents.
- **Don't double-run.** Once a search or task is delegated, wait for the result
instead of also doing it inline.

## GitHub commentary (PRs, issues, comments)

Use the `gh` CLI and invoke the Datum convention skills **before** writing, so
all GitHub content matches the platform's house style.

**Lead with the product, not the implementation.** Frame PRs, issues, and
comments around what changes for users and operators of the platform — the
capability gained, the problem solved, the behavior they'll observe — before the
technical detail. Open with the user/product impact (e.g. "Workloads now
schedule across POP cells by city" rather than "Adds WorkloadDeploymentFederator
and a PropagationPolicy per city code"), then explain the implementation as
supporting context. Issues should describe the user-facing gap or desired
outcome first; PR summaries should answer "what can the platform now do, and why
does it matter" before "how it's wired." Keep the mechanism — it matters for
review — but make product value the headline.

- **Pull requests** — invoke `datum-platform:pr-conventions` before drafting the
title and body. Titles follow the commit format (`<type>(<scope>): <subject>`,
imperative, ≤72 chars); the body uses the skill's required sections (Summary,
etc.). End PR bodies with the Claude Code attribution footer.
- **Commits** — invoke `datum-platform:commit-conventions` for message format
(`<type>: <subject>`; types: feat/fix/docs/refactor/test/chore) and include the
`Co-Authored-By` trailer. Only commit/push when the user asks.
- **Issues and review comments** — apply the same tone and structure: a clear
prose summary first, scannable bullets only where they help, type-prefixed
titles for issues, and concrete file/line references (`path:line`). Keep
comments specific and actionable rather than generic.
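
The mechanical parts of the title rule (type prefix, optional scope, 72-character limit) can be checked with a small Go helper. This is only a sketch: `validateTitle` and its regular expression are hypothetical, the type list is copied from the Commits bullet above, the scope is treated as optional, and the imperative-mood requirement is not something a pattern can verify.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// maxTitleLen is the 72-character limit stated for titles.
const maxTitleLen = 72

// titlePattern accepts "<type>: <subject>" and "<type>(<scope>): <subject>".
// The allowed scope characters are an assumption.
var titlePattern = regexp.MustCompile(
	`^(feat|fix|docs|refactor|test|chore)(\([a-z0-9-]+\))?: \S.*$`)

// validateTitle reports why a title breaks the format, or nil if it passes.
func validateTitle(title string) error {
	if n := utf8.RuneCountInString(title); n > maxTitleLen {
		return fmt.Errorf("title is %d characters, limit is %d", n, maxTitleLen)
	}
	if !titlePattern.MatchString(title) {
		return fmt.Errorf("title %q does not match <type>(<scope>): <subject>", title)
	}
	return nil
}

func main() {
	for _, t := range []string{
		"feat(api): add Command and Args fields to SandboxContainer",
		"Add Command and Args fields",
	} {
		fmt.Printf("%q -> %v\n", t, validateTitle(t))
	}
}
```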

Don't open/merge PRs, push, or post comments unless the user has asked or
durably authorized it — these are outward-facing actions.

## Conventions

This repo follows the shared Datum platform conventions. Relevant skills:
`datum-platform:go-conventions`, `:controller-runtime-patterns`,
`:k8s-apiserver-patterns`, `:capability-quota`, `:commit-conventions`,
`:pr-conventions`. golangci excludes `lll` for `api/*`, `dupl` and `lll` for
`internal/*`, and `prealloc` for `internal/validation/` and
`internal/controller/instancecontrol/`.

### Code comments

Comment to explain **why**, not **what**. The code already shows what it does;
a comment earns its place by capturing the non-obvious reason, constraint, or
consequence a future reader can't infer from the code itself.

- **Why, not what.** Don't narrate the code (`// loop over instances`,
`// set the label`). Explain the rationale: the invariant being upheld, the
edge case being guarded, the external contract being honored.
- **Be concise.** One tight sentence usually beats three. Cut a comment down to
the load-bearing fact; if the code is self-explanatory, write no comment.
- **Go-forward only.** Comments describe the code as it stands now, not the
change that produced it. No diff narration, fix history, or "previously
this…" / "now we…" storytelling — that belongs in the commit message or PR,
not the source. A reader on `main` a year from now has no diff in view.
- **Don't restate identifiers.** If the function/variable name already says it,
the comment adds nothing.

Bad: `// Overwrite the WD UID label with the project-cluster WD UID because the
downstream Instance carries the cell-plane UID assigned by Karmada when it
propagated the WD, which never matches…` (verbose, narrates the change).
Good: `// Cell-plane Instances carry the Karmada WD UID; project-side
label lookups need the project WD UID.`
2 changes: 1 addition & 1 deletion Makefile
@@ -177,7 +177,7 @@ KUSTOMIZE_VERSION ?= v5.5.0
 CONTROLLER_TOOLS_VERSION ?= v0.16.4
 DEFAULTER_GEN_VERSION ?= v0.32.3
 ENVTEST_VERSION ?= release-0.19
-GOLANGCI_LINT_VERSION ?= v2.1.5
+GOLANGCI_LINT_VERSION ?= v2.12.2

 # renovate: datasource=go depName=fybrik.io/crdoc
 CRDOC_VERSION ?= v0.6.4