feat: route workloads to city locations via distributed scheduling (foundation) by scotwells · Pull Request #107 · datum-cloud/compute

scotwells · 2026-05-18T22:41:29Z

Summary

Workloads targeting a city location are automatically routed to the correct physical site, with instance health and readiness surfaced back to the platform in real time. This replaces the single central scheduler with per-site distributed scheduling, so each site operates independently. User-facing behavior is unchanged — city-code targeting, instance visibility, and the existing API all work as before.

This is the complete federation foundation. It was decomposed from one large PR; the genuinely independent pieces landed first and are already merged:

Webhook TLS via cert-manager CSI mount → Simpler, more reliable webhook TLS via a cert-manager CSI mount #141 ✅
API types → Add the API types for federated workload delivery #147 ✅
Quota client + metrics → Add the per-project quota client and metrics #148 ✅
Instance Running → Available condition rename + federation status surface → An Instance is "Available" when it's ready to serve, even when scaled to zero #150 ✅ (this branch builds on it)

What remains here is the full controller layer and the operational-completeness fixes that make it correct on its own:

Quota self-heals when a grant arrives late: a backing-off safety-net requeue runs, and the quota condition is persisted before transient errors return, so granted state can't be lost
Instance restart actually rolls instances (recreate instead of in-place update)
A downstream-WorkloadDeployment status watch, so aggregated status mirrors back immediately instead of on resync
Rollout progress via UpdatedReplicas/ObservedGeneration
Instance blocking reasons
instanceType vCPU/memory quota sizing

These were briefly split into a separate PR and folded back in, since the review showed the foundation is incomplete without them.

Design & docs

Enhancement doc: Federated deployment scheduling (merged via docs: Multi-region workload scheduling design #106)
Integration strategy / feature request: Define integration strategy with federated control plane for workload deployment scheduling #85

What's inside

14 thematic commits, intended to be reviewed commit-by-commit:

Remove the central WorkloadDeployment scheduler
WorkloadDeployment federator (project control plane → federation hub)
InstanceProjector (federation hub → project namespaces)
Distributed WorkloadDeployment and Workload reconciliation
Instance controller for federated scheduling
Webhook validation updates for federation
Cell and management-plane wiring with feature gates
Roll instances by recreate so restart actually rolls them
Rollout progress via UpdatedReplicas + ObservedGeneration
Instance blocking reasons and instanceType vCPU/memory quota claims
CRDs, RBAC, and kustomize overlays for federation
IAM: allow users to patch workloads
Regression tests for replica counting and scheduling-gate clearing
Toolchain: Go 1.25 + golangci-lint v2.12.2

Testing

Covered by unit tests here, including regression coverage for the replica-counting/gate-clearing path and for quota-condition persistence across transient reconcile errors. End-to-end coverage is deferred to #149 — the original harness ran the operators locally (go run) rather than deploying them to the cells, so it didn't exercise RBAC/manifests/image. It'll be rebuilt as a proper in-cluster harness; the deferred suites are preserved on archive/e2e-local-deferred.

Known follow-ups (from review)

Not blockers for review, tracked separately: single-cluster overlay bootability, the status interpreter not being wired into any overlay, management-plane leader-election scoping (the federation manager runs outside leader election), and observability (metrics/Events) on the federation paths.

Closes #85

scotwells · 2026-05-28T20:54:34Z

Setting to draft while I continue to iterate on getting this working in staging.

The base branch was changed.

scotwells · 2026-06-05T01:42:39Z

📦 The federation e2e chainsaw suites (~900 lines of test YAML) have been split out into a dedicated PR so this foundation can be reviewed without them inline. The shared test/e2e/env harness stays here. See the federation-e2e PR (stacked on this branch).

Bump the toolchain to Go 1.25 and golangci-lint v2.12.2 and align the CI workflows and Makefile with the new versions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Delete the central scheduler that placed WorkloadDeployments from a single control plane, and drop its registration from main. Placement now happens through the distributed federator and per-cell controllers introduced in the following commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Introduce the federator that fans a WorkloadDeployment out to the cells selected for its placement, replacing the central scheduler. Add the city-code field indexer it uses to map subnet/location events back to the deployments that depend on them.

Beyond fanning the spec out, the federator watches the downstream Karmada WorkloadDeployment (milosource cluster source with cluster-name-preserving enqueue) so aggregated status mirrors back to the project WorkloadDeployment immediately instead of waiting on an informer resync. Downstream events map back to the bare project cluster name the multicluster provider keys on, dropping events for clusters that are not engaged yet.

The "cluster-<name>" label encoding (project path with "/" -> "_") is centralized in EncodeClusterName/DecodeClusterName so the wire format lives in one place; the federator wraps the shared decoder and trims to the last path segment to recover the provider cluster key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
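
The label encoding described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the function signatures, the `providerClusterKey` helper name, and the example project path are assumptions; only the "cluster-" prefix, the "/" to "_" substitution, and the trim to the last path segment come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterLabelPrefix is the "cluster-" prefix named in the commit message.
const clusterLabelPrefix = "cluster-"

// EncodeClusterName maps a project path to its label form by replacing
// "/" with "_" and adding the prefix. The signature is an assumption.
func EncodeClusterName(projectPath string) string {
	return clusterLabelPrefix + strings.ReplaceAll(projectPath, "/", "_")
}

// DecodeClusterName reverses EncodeClusterName and reports false when the
// prefix is missing. The round trip is lossy for paths that contain "_".
func DecodeClusterName(encoded string) (string, bool) {
	rest, ok := strings.CutPrefix(encoded, clusterLabelPrefix)
	if !ok {
		return "", false
	}
	return strings.ReplaceAll(rest, "_", "/"), true
}

// providerClusterKey is a hypothetical wrapper that decodes the label and
// keeps only the last path segment, as the federator is described as doing.
func providerClusterKey(encoded string) (string, bool) {
	path, ok := DecodeClusterName(encoded)
	if !ok {
		return "", false
	}
	if i := strings.LastIndex(path, "/"); i >= 0 {
		path = path[i+1:]
	}
	return path, path != ""
}

func main() {
	encoded := EncodeClusterName("projects/demo")
	key, ok := providerClusterKey(encoded)
	fmt.Println(encoded, key, ok)
}
```

Keeping both directions in one place means a change to the wire format only has to be made once, which is the motivation the commit message gives.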

Add the projector that mirrors cell-side Instances back to the management plane, writing their status (readiness, placement, blocking reasons) onto the project-scoped Instance so callers see a single view across cells. Include the shared controller test helpers that build the project/Karmada fake clients and multi-cluster manager used by the federation tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…liation

Rework the WorkloadDeployment and Workload controllers to run per cell, resolving networks and Locations locally and driving Instance lifecycle through the stateful instance-control logic rather than a central scheduler. Update the instance-control packages to manage Instances within a cell's control plane.

The reconciler requeues explicitly after adding its finalizer (the metadata-only Update can be dropped by watch-side event filtering, which would otherwise leave a new cell WorkloadDeployment unreconciled), and the scheduling-gate clearing path guards the nilable Spec.Controller that the infra provider populates independently of networking readiness.

A deployment whose city has no Location yet has no other wake-up event (SubnetClaims/Subnets only exist after a Location resolved), so the controller watches Locations to re-reconcile waiting deployments, and surfaces the wait on the Available condition (NoMatchingLocation, naming the city code) instead of only logging it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
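
The Location watch described above amounts to mapping a Location event back to the deployments waiting on that city code. A rough sketch of that mapping in plain Go, without controller-runtime, might look like this. The `Deployment` struct, both function names, and the city codes are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Deployment is a simplified stand-in for a cell WorkloadDeployment.
type Deployment struct {
	Name             string
	CityCode         string
	LocationResolved bool // false while the deployment waits on a Location
}

// waitingByCity groups the names of deployments that still wait on a
// Location by their city code, sorted for deterministic output.
func waitingByCity(deployments []Deployment) map[string][]string {
	index := make(map[string][]string)
	for _, d := range deployments {
		if !d.LocationResolved {
			index[d.CityCode] = append(index[d.CityCode], d.Name)
		}
	}
	for _, names := range index {
		sort.Strings(names)
	}
	return index
}

// requestsForLocation returns the deployments to re-reconcile when a
// Location for cityCode appears or changes.
func requestsForLocation(index map[string][]string, cityCode string) []string {
	return index[cityCode]
}

func main() {
	index := waitingByCity([]Deployment{
		{Name: "web", CityCode: "DFW"},
		{Name: "api", CityCode: "DFW"},
		{Name: "db", CityCode: "LHR", LocationResolved: true},
	})
	fmt.Println(requestsForLocation(index, "DFW"))
}
```

Without such a mapping, a deployment created before its Location would have nothing to wake it up, which is the gap the commit closes.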

Update the Instance controller to compute the Ready/Available conditions and apply the per-project quota gate within a single reconcile pass, so federated placement reflects real allocatable capacity.

Quota flow: the ResourceClaim is named after the Instance (unique within the project control plane, "instance-" prefixed so it cannot collide with other kinds' claims) and carries an instance-namespace label so a grant event maps back to the owning Instance for immediate re-enqueue. Because the grant lives on the project control plane and the watch event can be missed (informer engagement races, relist gaps), a backing-off safety-net requeue runs while QuotaGranted != True. The requeue is anchored on the Instance creation time, computed up front so every return path honors it, and logged for observability. On write conflicts it falls back to the bounded quota interval instead of controller-runtime's error backoff.

The controller also emits Warning events explaining why an Instance is blocked (QuotaNoBudget, NetworkFailedToCreate, ...) so the signal reaches kubectl describe and the activity timeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
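
One way to picture a backoff anchored on creation time is below. It is only a sketch under stated assumptions: the 5 second base, the 5 minute cap, the doubling rule, and every function name are invented for illustration, since the commit message gives no intervals. Only the "instance-" claim-name prefix and the idea of deriving the delay from the Instance's age come from the text.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseInterval = 5 * time.Second // assumed starting delay
	maxInterval  = 5 * time.Minute // assumed upper bound
)

// quotaClaimName applies the "instance-" prefix described in the commit.
func quotaClaimName(instanceName string) string {
	return "instance-" + instanceName
}

// backoffForAge doubles the delay until it covers the Instance's age,
// capped at maxInterval. Older Instances are therefore polled less often.
func backoffForAge(age time.Duration) time.Duration {
	interval := baseInterval
	for interval < age && interval < maxInterval {
		interval *= 2
	}
	if interval > maxInterval {
		interval = maxInterval
	}
	return interval
}

// quotaRequeueAfter derives the delay from the creation timestamp, so it
// can be computed once, up front, and reused on every return path.
func quotaRequeueAfter(created, now time.Time) time.Duration {
	return backoffForAge(now.Sub(created))
}

func main() {
	now := time.Now()
	created := now.Add(-12 * time.Second)
	fmt.Println(quotaClaimName("web-0"), quotaRequeueAfter(created, now))
}
```

Because the delay depends only on the creation time, it needs no extra state on the object and survives controller restarts.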

Update Workload webhook and Instance validation so the API accepts the fields federated scheduling adds and continues to reject invalid placement and runtime specs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire the manager to run in either cell or management-plane mode, gating the federator, projector, and per-cell controllers behind feature flags. Add the feature-gate registry and extend configuration to carry the downstream kubeconfig and discovery settings each mode needs. Single-mode project resolution (decoding edge namespace labels into project identity) lives in the controller package as NewSingleModeProjectID/NewSingleModeProjectNamespace constructors; main.go keeps only the wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… them

A template-hash change (an image update, or a restartedAt annotation from `datumctl compute restart`) previously resolved to an in-place Update of the Instance. The unikraft provider bakes the pod at creation time and never recomputes an existing pod's spec, so the in-place update silently failed to roll the running workload: instances kept their old pod.

Emit a delete (recreate) for drifted Ready instances instead. The next reconcile refills the slot via the create path with the new template, and the provider's finalizer-gated teardown plus create-on-new-Instance roll the pod with no provider changes. Ordered one-at-a-time pacing is preserved by the existing descending-ordinal sort, skip-all-but-first, and the DeletionTimestamp WaitAction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
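
The pacing rules named above (descending-ordinal sort, act on the first candidate only, wait while a deletion is in flight) can be sketched as a small planning function. The types and the `planRoll` name are hypothetical, and skipping drifted instances that are not Ready is an assumption drawn from the phrase "drifted Ready instances".

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// Instance is a simplified stand-in for the real Instance object.
type Instance struct {
	Ordinal      int
	TemplateHash string
	Ready        bool
	Deleting     bool // true when DeletionTimestamp is set
}

// Action is what the planner asks the reconciler to do next.
type Action struct {
	Kind    string // "Delete" or "Wait"
	Ordinal int
}

// planRoll returns at most one action per pass, which keeps the roll
// paced one instance at a time.
func planRoll(instances []Instance, desiredHash string) (Action, bool) {
	ordered := slices.Clone(instances)
	slices.SortFunc(ordered, func(a, b Instance) int {
		return cmp.Compare(b.Ordinal, a.Ordinal) // highest ordinal first
	})
	// A deletion already in flight blocks any further deletes.
	for _, in := range ordered {
		if in.Deleting {
			return Action{Kind: "Wait", Ordinal: in.Ordinal}, true
		}
	}
	// Delete only the first drifted Ready instance; the create path
	// refills the slot with the new template on a later pass.
	for _, in := range ordered {
		if in.Ready && in.TemplateHash != desiredHash {
			return Action{Kind: "Delete", Ordinal: in.Ordinal}, true
		}
	}
	return Action{}, false
}

func main() {
	instances := []Instance{
		{Ordinal: 0, TemplateHash: "v1", Ready: true},
		{Ordinal: 1, TemplateHash: "v1", Ready: true},
	}
	fmt.Println(planRoll(instances, "v2"))
}
```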

…rvedGeneration

A restart/rolling update was invisible from the project plane: there was no status field representing how many instances are on the new template revision.

Add UpdatedReplicas (instances whose observed template hash matches the desired template, regardless of readiness) and ObservedGeneration to both WorkloadDeployment and Workload (plus placement) status. UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas (which is now its Programmed subset), aggregated up into the Workload, and rides the existing status sync to the project plane.

Repoint the "Up-to-date" printcolumn to .status.updatedReplicas to match `kubectl get deployment` semantics, so a roll is visible as the count dips below Replicas and recovers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
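
The counting rule reads naturally as a single pass over the instances. The sketch below uses simplified, hypothetical types rather than the real API structs; the only behavior taken from the commit is that UpdatedReplicas ignores readiness and that CurrentReplicas is the Programmed subset of the updated instances.

```go
package main

import "fmt"

// Instance is a simplified stand-in carrying only what the count needs.
type Instance struct {
	TemplateHash string
	Programmed   bool
}

// ReplicaStatus mirrors a subset of the status fields named in the commit.
type ReplicaStatus struct {
	Replicas        int
	UpdatedReplicas int
	CurrentReplicas int
}

// countReplicas buckets instances against the desired template hash.
func countReplicas(instances []Instance, desiredHash string) ReplicaStatus {
	status := ReplicaStatus{Replicas: len(instances)}
	for _, in := range instances {
		if in.TemplateHash != desiredHash {
			continue // still on an old template revision
		}
		status.UpdatedReplicas++
		if in.Programmed {
			status.CurrentReplicas++ // updated and programmed
		}
	}
	return status
}

func main() {
	status := countReplicas([]Instance{
		{TemplateHash: "v2", Programmed: true},
		{TemplateHash: "v2"},
		{TemplateHash: "v1", Programmed: true},
	}, "v2")
	fmt.Printf("%+v\n", status)
}
```

During a roll, UpdatedReplicas drops when the template changes and climbs back as slots are recreated, which is what makes the "Up-to-date" column useful.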

…emory

Two Instance-controller correctness changes:

- Blocking-reason rollup: surface the most specific provider sub-condition (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its message onto the Instance Ready condition instead of a generic "Instance has not been programmed", so e.g. an image-pull failure reads as ImageUnavailable with the real message. Ranks the API reason constants in the blocking-reason priority.
- Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota ResourceClaim requests vcpus + memory, not just instance count. Explicit container limits / instance requests still take precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
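
Both changes can be illustrated with a short sketch. The types and function names are hypothetical, and treating the order in which the commit lists the reasons as their priority order is an assumption. The catalog entry (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) and the precedence of explicit requests come from the commit message.

```go
package main

import "fmt"

// Condition is a simplified provider sub-condition.
type Condition struct {
	Reason  string
	Message string
}

// reasonPriority ranks reasons from most to least specific (assumed order).
var reasonPriority = []string{
	"ImageUnavailable", "InstanceCrashing", "ConfigurationError", "Provisioning",
}

// mostSpecificReason returns the highest-ranked condition present, or
// false when none of the known reasons applies.
func mostSpecificReason(conds []Condition) (Condition, bool) {
	for _, reason := range reasonPriority {
		for _, c := range conds {
			if c.Reason == reason {
				return c, true
			}
		}
	}
	return Condition{}, false
}

// Resources is the amount a quota claim would request.
type Resources struct {
	VCPUs     int
	MemoryGiB int
}

// instanceTypeCatalog holds the one entry named in the commit message.
var instanceTypeCatalog = map[string]Resources{
	"datumcloud/d1-standard-2": {VCPUs: 1, MemoryGiB: 2},
}

// resolveQuota prefers explicit requests and falls back to the catalog.
func resolveQuota(instanceType string, explicit *Resources) (Resources, bool) {
	if explicit != nil {
		return *explicit, true
	}
	r, ok := instanceTypeCatalog[instanceType]
	return r, ok
}

func main() {
	c, _ := mostSpecificReason([]Condition{
		{Reason: "Provisioning", Message: "starting"},
		{Reason: "ImageUnavailable", Message: "pull failed"},
	})
	r, _ := resolveQuota("datumcloud/d1-standard-2", nil)
	fmt.Println(c.Reason, r.VCPUs, r.MemoryGiB)
}
```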

Regenerate the Instance, Workload, and WorkloadDeployment CRDs for the new API fields and add the kustomize structure that deploys the manager in cell or management-plane mode: federation and downstream RBAC bases, cell/management/quota-credentials components, the WorkloadDeployment status interpreter, and the matching overlays.

The regenerated controller role also grants the event writes the instance controller performs when surfacing blocking reasons (QuotaNoBudget, ImageUnavailable, NetworkFailedToCreate, ...) so those signals reach kubectl describe and the activity timeline instead of being rejected by RBAC.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ate clearing

Adds unit coverage for the WorkloadDeployment controller's replica bucketing (updated/current/ready/quota-blocked), the network scheduling-gate clearing path, the nil Spec.Controller and nil Status.Controller regressions, and the finalizer-add requeue with status publication (ObservedGeneration, DesiredReplicas, ReplicasReady).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

scotwells requested review from JoseSzycho, kevwilliams, mattdjenkinson, privateip and savme May 19, 2026 19:31

scotwells mentioned this pull request May 19, 2026

Launch Datum Compute datum-cloud/enhancements#682

Open

scotwells force-pushed the feat/federated-deployment-scheduling branch from 0c0d8df to 134086f Compare May 19, 2026 21:10

scotwells changed the title ~~feat: federated deployment scheduling across POP cells~~ feat: Route workloads to city locations via distributed scheduling May 20, 2026

scotwells force-pushed the feat/federated-deployment-scheduling branch 3 times, most recently from 6e9a268 to 492eb6c Compare May 20, 2026 22:19

mattdjenkinson approved these changes May 22, 2026

scotwells requested a review from mattdjenkinson May 27, 2026 00:15

mattdjenkinson previously approved these changes May 27, 2026

privateip previously approved these changes May 28, 2026

scotwells closed this May 28, 2026

scotwells reopened this May 28, 2026

scotwells marked this pull request as draft May 28, 2026 20:53

This was referenced May 29, 2026

feat(api): add Command and Args fields to SandboxContainer #125

Merged

feat: federated workload scheduling across POP cells #116

Closed

fix: Report accurate health for federated workloads #127

Open

Base automatically changed from docs/issue-85-karmada-federation-design to main June 1, 2026 22:01

This was referenced Jun 4, 2026

Simpler, more reliable webhook TLS via a cert-manager CSI mount #141

Merged

Instances self-heal, restart, and report status correctly on the federation foundation #142

Merged

scotwells force-pushed the feat/federated-deployment-scheduling branch from 82955e2 to bf73355 Compare June 5, 2026 01:42

scotwells mentioned this pull request Jun 5, 2026

End-to-end coverage for federated workload delivery #146

Closed

Base automatically changed from split/api-rename to main June 5, 2026 19:56

scotwells force-pushed the feat/federated-deployment-scheduling branch 3 times, most recently from 71e388c to 5718fbb Compare June 10, 2026 19:26

scotwells and others added 2 commits June 10, 2026 14:52

chore(deps): upgrade to Go 1.25 and golangci-lint v2.12.2

b6c435f

scotwells force-pushed the feat/federated-deployment-scheduling branch 4 times, most recently from 9542455 to 615ef54 Compare June 10, 2026 23:33

scotwells and others added 2 commits June 10, 2026 18:47

scotwells force-pushed the feat/federated-deployment-scheduling branch from 615ef54 to 5b638bb Compare June 10, 2026 23:51

scotwells and others added 10 commits June 10, 2026 19:46

feat(webhook): validation updates for federation

8717dad

feat: allow users to patch workloads

1cc509f

scotwells force-pushed the feat/federated-deployment-scheduling branch from 5b638bb to 7dc94a0 Compare June 11, 2026 00:56

scotwells requested review from mattdjenkinson and privateip June 11, 2026 01:39

privateip approved these changes Jun 11, 2026

scotwells merged commit 7db1bcb into main Jun 11, 2026
9 checks passed

scotwells deleted the feat/federated-deployment-scheduling branch June 11, 2026 01:46

scotwells mentioned this pull request Jun 11, 2026

Workloads can reference ConfigMaps and Secrets, delivered to every POP cell #129

Open

scotwells merged 14 commits into main from feat/federated-deployment-scheduling
