feat: datumctl compute plugin — deploy and manage workloads from the CLI#113

Draft
scotwells wants to merge 91 commits into main from feat/datumctl-compute-plugin

Conversation

@scotwells

Contributor

Summary

Adds the datumctl compute plugin so developers can deploy and manage containerized workloads on Datum Cloud directly from the CLI.

Commands shipped:

  • deploy — push a container image as a workload with flags or a manifest file; waits for rollout
  • destroy — tear down a workload with a confirmation prompt
  • status — show workload health, per-city placement summary, and the active revision
  • instances — list all running instances across cities, with describe for full detail
  • scale — adjust minimum replica count across all placements
  • rollout — watch live rollout progress, browse revision history, and roll back to any prior revision
  • restart — trigger a rolling restart of a workload or a specific city
  • quota — inspect per-city instance usage and surface quota-exceeded messages

Revision history is stored as a ConfigMap per workload so rollout history and rollout undo work without server-side tracking.

Dependencies

What's not included

  • logs — telemetry service not yet implemented
  • Tests — next step is adding envtest-based integration tests for each command
  • cities / instance-types resource listing commands

Related

Closes #98. Design proposal in #111.

Workloads targeting a city location are now automatically routed to the
correct physical site via a Karmada-based federation layer. Each POP cell
operates independently, instance health is surfaced back to the control
plane in real time, and the platform remains available even when parts of
the control plane are temporarily unreachable.

Controllers added:
- WorkloadDeploymentFederator: replicates WDs into Karmada and manages
  PropagationPolicies per city code
- InstanceProjector: mirrors Instance write-backs from Karmada into the
  project namespace on the control plane

ResourceInterpreterCustomization deployed at config time teaches Karmada
how to aggregate replica counts and conditions across POP cells.

Operator flags --enable-management-controllers and --enable-cell-controllers
allow each deployment to opt into only the controllers it needs.

Includes a 6-test Chainsaw e2e suite covering federation, deletion cascade,
propagation policy lifecycle, instance projection, instance write-back, and
the full end-to-end chain.

Resolves #85

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scotwells and others added 28 commits May 26, 2026 15:04
…edge

Introduces management-plane and cell overlay paths to the compute
OCI artifact so the infra repo can deploy compute-manager in the
correct mode for each tier of the federation architecture.

The management-plane overlay deploys compute-manager with only
WorkloadDeploymentFederator and InstanceProjector enabled, connected
to the Karmada downstream control plane via projected ServiceAccount
token auth. The cell overlay deploys compute-manager with only
WorkloadDeploymentReconciler and InstanceReconciler enabled, with no
downstream connection or webhook server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ts for webhook TLS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the hardcoded datum-control-plane ClusterIssuer from the
csi-webhook-cert component. DNS names stay since they are fixed by the
service name and namespace. Each consuming overlay now supplies the issuer
via a strategic merge patch, allowing different environments to use
different cert issuers without forking the component.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each WorkloadDeployment is routed to exactly one cell cluster via its
PropagationPolicy, so aggregation across multiple members is not needed.
Replace the summing logic with a direct pass-through of the single member's
status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cert issuer name is environment-specific configuration that belongs
in the infra repo, not the compute overlay. The infra repo's base manager
patch already owns the full webhook-server-tls volume definition including
the issuer. Consumers deploying outside infra must patch the issuer in their
own overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…moval

dev: inline self-signed Issuer + Certificate for host.docker.internal,
replace kustomize replacements block with direct annotation patch, remove
Certificate-patching from webhook_patch.yaml, and clear webhookServer
secretRef from config.yaml.

single-cluster: replace cert-manager Certificate approach with the
csi-webhook-cert component, matching the main branch overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The WorkloadReconciler watches networkingv1alpha.Network objects, which
requires the network-services-operator CRDs to be installed. Cell clusters
don't have those CRDs, causing the manager to crash on startup. Gate the
WorkloadReconciler behind enableManagementControllers so it only runs where
the Network CRDs are present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts server config file reading and decoding into a dedicated
loadServerConfig helper, reducing main's cyclomatic complexity from
31 to 29 to satisfy the gocyclo linter limit of 30.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Milo's authorization webhook uses Extra claims on the admission request
(iam.miloapis.com/parent-name, iam.miloapis.com/parent-type, etc.) to
resolve the correct project-scoped policy binding. Dropping them caused
the SAR to return Allowed=false even for users with networks.use, because
the authorizer couldn't locate the binding without the project context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
metricRules belongs under spec.quota, not spec.billing. The field is
not declared in the ServiceBillingConfig schema, causing Flux dry-run
failures in staging with:

  .spec.billing.metricRules: field not declared in schema
Previously, InstanceReconciler wrote ResourceClaim objects against
the local deployment cluster via managementCluster.GetClient(). Those
claims were never seen by the Milo quota system, leaving every Instance
in QuotaGranted=Unknown indefinitely.

This change routes claim creation and deletion to the correct Milo
project control plane for each instance using a new
ProjectQuotaClientManager that builds per-project REST clients by
rewriting the host path — mirroring the URL construction already used
by the milomulticluster provider.

The management-cluster claim watch is replaced with a multicluster
Watches call so that grant/denial status changes in project control
planes re-trigger instance reconciles. Claims are stamped with a
source-cluster label (discovery.clusterName) so each edge controller
only reacts to the claims it created.

Co-Authored-By: Claude <claude@anthropic.com>
The admission webhook requires that all metrics referenced in
spec.quota.limits[].metric and spec.quota.metricRules[].metricCosts
match a name declared in spec.metrics[]. The four quota-tracking
metrics (workloads, instances, vcpus, memory) were missing from
spec.metrics[], causing the webhook to reject the resource.
…o cell setup

Controller flags --enable-management-controllers and --enable-cell-controllers
now default to false so kustomize components must explicitly opt in, rather than
both groups running by default. This prevented the management-plane deployment
from crashing when discovery.clusterName was unset — that field is only required
by the InstanceReconciler (a cell controller), so the validation now lives in
InstanceReconciler.SetupWithManager instead of initializeClusterDiscovery.
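
The opt-in behaviour described above can be sketched with the standard `flag` package. This is a minimal illustration, not the operator's actual code: `controllerFlags`, `parseControllerFlags`, and `validateCellSetup` are hypothetical names, and `validateCellSetup` only stands in for the check that the commit moves into InstanceReconciler.SetupWithManager.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// controllerFlags holds the two opt-in switches named in the commit message.
type controllerFlags struct {
	EnableManagement bool
	EnableCell       bool
}

// parseControllerFlags registers both flags with a default of false, so a
// deployment that passes neither flag runs no controllers at all.
func parseControllerFlags(args []string) (controllerFlags, error) {
	var cf controllerFlags
	fs := flag.NewFlagSet("compute-manager", flag.ContinueOnError)
	fs.BoolVar(&cf.EnableManagement, "enable-management-controllers", false, "run management-plane controllers")
	fs.BoolVar(&cf.EnableCell, "enable-cell-controllers", false, "run cell controllers")
	if err := fs.Parse(args); err != nil {
		return controllerFlags{}, err
	}
	return cf, nil
}

// validateCellSetup is a hypothetical stand-in for the validation that now
// lives in the cell controller's setup: the cluster name is only required
// when cell controllers are enabled.
func validateCellSetup(cf controllerFlags, clusterName string) error {
	if cf.EnableCell && clusterName == "" {
		return errors.New("discovery.clusterName must be set when cell controllers are enabled")
	}
	return nil
}

func main() {
	// Management-plane deployment: no cluster name, and none is needed.
	cf, err := parseControllerFlags([]string{"--enable-management-controllers"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println(cf.EnableManagement, cf.EnableCell, validateCellSetup(cf, ""))
}
```

Scoping the check to the cell path means a management-plane deployment with an empty cluster name starts cleanly, which is the crash the commit describes fixing.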

Also adds cell-controllers and management-controllers components to the
single-cluster overlay, which was silently running with no controllers enabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…scovery

The rebase during cherry-pick propagation introduced a mixed state where
cmd/main.go had the edgeClusterName/projectRestConfig return values partially
reverted. This cleans up the function signature and call sites to be consistent,
while keeping the validation removed from initializeClusterDiscovery (it belongs
in InstanceReconciler.SetupWithManager per the original fix intent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… RBAC

The workload-deployment-federator calls ensureDownstreamNamespace before
federating WorkloadDeployment resources, but the compute-manager ClusterRole
was missing core-group namespace permissions, causing every reconcile to fail
with a forbidden error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Workload scheduling and admission now consult LocationBinding objects
(project-scoped, created by the service catalog) rather than the global
Location list. This ensures consumers only see locations that are both
healthy and available to their specific project.

Also upgrades network-services-operator and milo dependencies to
versions that introduce LocationBinding and address multicluster-runtime
v0.23 API changes (ClusterName type, ProviderRunnable Start lifecycle,
generic webhook builder).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ources

WorkloadDeploymentReconciler creates and owns NetworkBinding and SubnetClaim
resources, and watches Location, NetworkContext, and Subnet. InstanceReconciler
watches ResourceClaim for quota. Neither was granted the necessary ClusterRole
rules, causing watch failures on cell clusters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
From the cell cluster's perspective, Karmada is upstream (the federation
control plane), not downstream. Rename the flag, env var, and related
variables throughout to reflect the actual relationship.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…viderRunnable fix

Points go.miloapis.com/milo to the feature branch commit that implements
multicluster.ProviderRunnable on the Milo provider, enabling the mc manager
to auto-call provider.Start() and set p.mcAware so project clusters can be
registered. Without this, p.mcAware was always nil and every project reconcile
logged "Multicluster manager not yet started" forever.

Also removes the & from ResourceRef in ResourceClaimSpec — the feature branch
has ResourceRef as a value type, not a pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove non-existent QuotaRestConfig() call and fix SetupWithManager argument
count; pass nil quota config to skip quota enforcement for now. Single-tenant
cell mode uses namespace-as-project-id and the fixed 'single' cluster name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires up Milo ResourceClaim-based quota accounting for cells running
in single-cell discovery mode (mode: single), where the multicluster
ClusterName is always "single" rather than the Milo project name.

Key changes:

- Add QuotaKubeconfigPath config field and QuotaRestConfig() method so
  quota REST config can be configured independently of discovery mode.
  Returns (nil, nil) when neither path is set, disabling quota rather
  than silently targeting the local apiserver.

- Add projectIDForInstance and clusterNameForProject func fields to
  InstanceReconciler. In single mode, project ID is derived from
  instance.Namespace; the watch map func always enqueues ClusterName
  "single" rather than the project namespace, avoiding ErrClusterNotFound
  on every quota-grant event.

- Guard ResourceClaim watch map func against claims with empty ResourceRef
  to prevent a nil-dereference panic when a label-matching claim from
  another actor has no ResourceRef set.

- Add TestReconcileQuotaSingleMode covering the full single-mode quota
  flow: project ID from namespace, watch re-enqueue to "single" cluster.
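
A rough sketch of the func-field approach and the empty-ResourceRef guard listed above. All types here are simplified, hypothetical stand-ins (the real reconciler works with Kubernetes and Milo API types), and the way a claim maps back to an instance is an assumption made only for illustration.

```go
package main

import "fmt"

// Instance and ResourceClaim are simplified stand-ins for the real API types.
type Instance struct{ Namespace, Name string }

type ResourceClaim struct {
	RefNamespace string // empty when the claim has no ResourceRef set
	RefName      string
}

// reconcileRequest mirrors the idea of a cluster-qualified reconcile request.
type reconcileRequest struct {
	ClusterName, Namespace, Name string
}

// InstanceReconciler carries the two func fields named in the commit message.
type InstanceReconciler struct {
	projectIDForInstance  func(Instance) string
	clusterNameForProject func(projectID string) string
}

// newSingleModeReconciler wires the single-mode behaviour: the project ID is
// the instance namespace, and the cluster name is always "single".
func newSingleModeReconciler() *InstanceReconciler {
	return &InstanceReconciler{
		projectIDForInstance:  func(i Instance) string { return i.Namespace },
		clusterNameForProject: func(string) string { return "single" },
	}
}

// mapClaimToRequests skips claims without a ResourceRef instead of
// dereferencing missing data, then enqueues against the mapped cluster name.
func (r *InstanceReconciler) mapClaimToRequests(c ResourceClaim) []reconcileRequest {
	if c.RefName == "" || c.RefNamespace == "" {
		return nil
	}
	return []reconcileRequest{{
		ClusterName: r.clusterNameForProject(c.RefNamespace),
		Namespace:   c.RefNamespace,
		Name:        c.RefName,
	}}
}

func main() {
	r := newSingleModeReconciler()
	fmt.Println(r.projectIDForInstance(Instance{Namespace: "proj-a", Name: "web-0"}))
	fmt.Println(r.mapClaimToRequests(ResourceClaim{RefNamespace: "proj-a", RefName: "web-0"}))
	fmt.Println(len(r.mapClaimToRequests(ResourceClaim{})))
}
```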

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v2.1.5 was built with Go 1.24 and refuses to lint Go 1.25 modules.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tatus change

Write-back was only triggered inside the statusChanged||readyChanged block,
so instances stuck in a scheduling gate (no status transitions) were never
replicated to Karmada.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nged

Use apiequality.Semantic.DeepEqual to avoid unnecessary API calls to Karmada
on every reconcile when nothing has actually changed.
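
The compare-before-write pattern can be sketched as follows. To keep the example free of external modules it uses `reflect.DeepEqual` from the standard library as a stand-in; the commit itself uses `apiequality.Semantic.DeepEqual`, whose comparison rules are not identical. The `instanceStatus` type and `writeBackIfChanged` helper are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// instanceStatus is a simplified stand-in for the status that is written back.
type instanceStatus struct {
	Phase      string
	Conditions []string
}

// writeBackIfChanged calls write only when desired differs from current and
// reports whether a write happened.
func writeBackIfChanged(current, desired instanceStatus, write func(instanceStatus) error) (bool, error) {
	if reflect.DeepEqual(current, desired) {
		return false, nil // nothing changed, skip the API call
	}
	if err := write(desired); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	calls := 0
	write := func(instanceStatus) error { calls++; return nil }

	current := instanceStatus{Phase: "Pending", Conditions: []string{"Ready=False"}}
	desired := instanceStatus{Phase: "Pending", Conditions: []string{"Ready=False"}}

	wrote, _ := writeBackIfChanged(current, desired, write)
	fmt.Println(wrote, calls) // no write: the two values are equal

	desired.Phase = "Available"
	wrote, _ = writeBackIfChanged(current, desired, write)
	fmt.Println(wrote, calls) // one write: the phase changed
}
```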

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
From the cell cluster's perspective, Karmada is upstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scotwells and others added 2 commits June 1, 2026 16:23
scotwells and others added 3 commits June 1, 2026 20:10
Consume the server-side status-blocking-reason contract: each resource's
readiness condition (Instance/Ready, WorkloadDeployment/Available,
Workload/Available) now carries a machine-readable reason and human message
when not True.

- Add ReadinessBlock helper in util/conditions.go: given a condition list and
  type, returns (reason, message, blocked) with no per-reason branching —
  the single reusable entry-point for the new contract.
- InstanceStatus (list view): falls through to "Pending (<reason>)" from the
  Ready condition when no specific sub-condition check matches, replacing the
  bare "Pending" for unknown causes like SourceNotFound or ReferencedDataNotReady.
- InstanceStatusDetail (describe view): falls through to "Pending — <reason>"
  with the message as detail, replacing "Unknown" for those same causes.
- WorkloadHealth: surfaces the reason from Available when false, e.g.
  "Unavailable — SourceNotFound" instead of the generic message.
- degradedAnnotation (workloads describe per-city line): rewritten to read the
  WorkloadDeployment's own Available condition; removes the per-instance List
  fetch and the quota/InstanceStatusDetail special-casing that was its only logic.
- printBlockedDetail (rollout watch): rewritten to read the deployment's
  Available condition; removes the per-instance List fetch entirely.
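
For illustration, a helper with the shape described for ReadinessBlock might look like the sketch below. The `Condition` struct is a simplified stand-in for the Kubernetes condition type, the sample message text is invented, and treating a missing condition as "not blocked" is an assumption; the actual helper in util/conditions.go may differ.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// ReadinessBlock looks up the condition of the given type and, when it is not
// "True", returns its reason and message with blocked=true. It does not branch
// on individual reasons.
func ReadinessBlock(conds []Condition, condType string) (reason, message string, blocked bool) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return "", "", false
		}
		return c.Reason, c.Message, true
	}
	// Assumption: a condition that has not been reported yet is not "blocked".
	return "", "", false
}

func main() {
	conds := []Condition{{
		Type:    "Ready",
		Status:  "False",
		Reason:  "SourceNotFound",
		Message: "referenced source could not be found", // illustrative text
	}}
	if reason, msg, blocked := ReadinessBlock(conds, "Ready"); blocked {
		fmt.Printf("Pending (%s): %s\n", reason, msg)
	}
}
```

Because the helper passes the reason through unchanged, new server-side reasons appear in the CLI output without a CLI change.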

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rovisioning status

The Programmed condition starts as Unknown (not False) while programming
is in progress, so the ConditionFalse-only checks were bypassed and the
raw ProgrammingInProgress reason leaked through the Ready condition
fallback. Widen the checks to status != True to cover both Unknown and
False states.
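
The difference between the two checks, sketched with a simplified `Condition` type (a stand-in for the real condition type; both function names are hypothetical):

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type, Status, Reason string
}

// programmedIsFalse is the narrower check: it only matches Status == "False".
func programmedIsFalse(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Programmed" {
			return c.Status == "False"
		}
	}
	return false
}

// programmedNotTrue is the widened check: it matches both "False" and "Unknown".
func programmedNotTrue(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Programmed" {
			return c.Status != "True"
		}
	}
	return false // assumption: an absent condition is not treated as in progress
}

func main() {
	inProgress := []Condition{{Type: "Programmed", Status: "Unknown", Reason: "ProgrammingInProgress"}}
	fmt.Println(programmedIsFalse(inProgress)) // false: the Unknown state slips past
	fmt.Println(programmedNotTrue(inProgress)) // true: the Unknown state is caught
}
```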

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add three provider-emitted reason constants to the API types and map
them to plain-English STATUS strings in the list and describe views:

  ImageUnavailable  → Failed (image unavailable)
  InstanceCrashing  → Failed (crashing)
  ConfigurationError → Failed (configuration error)

Rename the PendingProgramming/ProgrammingInProgress cases from the
misleading "network provisioning" to "Starting", which accurately
describes the transient state without implying network work is involved.

Failed statuses are already counted in the "N Failed" summary line via
the existing strings.HasPrefix(status, "Failed") check.
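
A small sketch of the mapping and the prefix-based failure count. The reason values and STATUS strings come from the commit message above; the constant names, `statusForReason`, `countFailed`, and the fallback format for unmapped reasons are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Reason values as listed in the commit message; the Go constant names are hypothetical.
const (
	reasonImageUnavailable      = "ImageUnavailable"
	reasonInstanceCrashing      = "InstanceCrashing"
	reasonConfigurationError    = "ConfigurationError"
	reasonPendingProgramming    = "PendingProgramming"
	reasonProgrammingInProgress = "ProgrammingInProgress"
)

// statusForReason maps a machine-readable reason to the plain-English STATUS string.
func statusForReason(reason string) string {
	switch reason {
	case reasonImageUnavailable:
		return "Failed (image unavailable)"
	case reasonInstanceCrashing:
		return "Failed (crashing)"
	case reasonConfigurationError:
		return "Failed (configuration error)"
	case reasonPendingProgramming, reasonProgrammingInProgress:
		return "Starting"
	default:
		// Assumption: unmapped reasons fall back to the generic pending form.
		return "Pending (" + reason + ")"
	}
}

// countFailed counts statuses for the "N Failed" summary line using a prefix check.
func countFailed(statuses []string) int {
	n := 0
	for _, s := range statuses {
		if strings.HasPrefix(s, "Failed") {
			n++
		}
	}
	return n
}

func main() {
	reasons := []string{reasonImageUnavailable, reasonProgrammingInProgress, reasonInstanceCrashing}
	statuses := make([]string, 0, len(reasons))
	for _, r := range reasons {
		statuses = append(statuses, statusForReason(r))
	}
	fmt.Println(statuses)
	fmt.Printf("%d Failed\n", countFailed(statuses))
}
```

Because every failure string starts with "Failed", a new failure reason only needs a new case in the mapping for the summary count to pick it up.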

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells

Contributor Author

📋 Real-world UX issue from a user enabling compute

Heads up — we got a user report that surfaces a confusing first-run experience with the enablement flow, and I've traced it end-to-end via the staging audit logs. Sharing here since the fix touches this plugin.

What the user saw:

% datumctl compute instances list
Compute is not enabled for project "personal-project-153fe986".
Would you like to request access? [y/N]: y
Requesting access to compute for project "personal-project-153fe986"...
Error: requesting compute access: serviceentitlements.services.miloapis.com "compute" already exists

From their perspective this looks like a flat-out failure. In reality, their first attempt succeeded — compute was enabled.

What actually happened (from the audit trail):

  1. First run created the entitlement successfully. ✅
  2. But the backend takes a short while (~minutes in this case) to mark it Ready.
  3. During that window, the CLI's "is compute enabled?" check keys off the entitlement's Ready status, not its existence — so it kept reporting "not enabled" and re-offering to request access.
  4. Each retry tried to create the entitlement again and hit a 409 already exists, which we surfaced as a raw, scary error.

Why it matters for the product: the very first thing a new user does is turn compute on, and today that happy path can look broken even when it worked. The error message also leaks an internal resource name (serviceentitlements.services.miloapis.com) that means nothing to a user.

Proposed fix (branch fix/compute-entitlement-pending-state, built off this PR's branch): teach the enablement check to distinguish three states instead of two —

  • not requested → offer to request access (today's behavior)
  • requested but still activating → tell the user it's in progress and to try again in a moment (no re-prompt, no error)
  • active → proceed

…and treat a 409 already exists as "already requested, activation pending" rather than a fatal error. Net result: the user sees a calm "enablement in progress, hang tight" message instead of a stack of confusing failures.
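
A minimal sketch of the three-state check and the 409 handling proposed above. Everything named here is hypothetical (`EnablementState`, `Entitlement`, `classify`, `requestAccess`, and the sentinel errors); in a Kubernetes client the already-exists test would more likely use `apierrors.IsAlreadyExists` from k8s.io/apimachinery, which is replaced with a local sentinel so the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// EnablementState distinguishes the three states from the proposal.
type EnablementState int

const (
	NotRequested EnablementState = iota // no entitlement exists
	Activating                          // entitlement exists but is not Ready yet
	Active                              // entitlement exists and is Ready
)

// Entitlement is a simplified stand-in for the service entitlement resource.
type Entitlement struct{ Ready bool }

// classify keys off existence first and readiness second.
func classify(e *Entitlement) EnablementState {
	switch {
	case e == nil:
		return NotRequested
	case !e.Ready:
		return Activating
	default:
		return Active
	}
}

// Sentinel errors standing in for API responses (409 and an unrelated failure).
var (
	errAlreadyExists = errors.New("already exists")
	errForbidden     = errors.New("forbidden")
)

// requestAccess treats "already exists" as "already requested, activation
// pending" instead of a fatal error.
func requestAccess(create func() error) (EnablementState, error) {
	err := create()
	if err == nil || errors.Is(err, errAlreadyExists) {
		return Activating, nil
	}
	return NotRequested, fmt.Errorf("requesting compute access: %w", err)
}

func main() {
	fmt.Println(classify(nil) == NotRequested, classify(&Entitlement{}) == Activating)

	state, err := requestAccess(func() error { return errAlreadyExists })
	fmt.Println(state == Activating, err)

	_, err = requestAccess(func() error { return errForbidden })
	fmt.Println(err)
}
```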

Happy to fold this into this PR or send it as a follow-up — whichever you prefer. 🙏

scotwells and others added 3 commits June 3, 2026 19:04
`compute restart` stamped the non-canonical kubectl.kubernetes.io/restartedAt
annotation on the workload/deployment template. Use the documented
RestartedAtAnnotation (compute.datumapis.com/restartedAt) instead, matching the
controller's restart contract. Both keys change the template hash, but the
canonical one is the documented trigger for the ordered instance roll.
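
Sketched in isolation, stamping the canonical key could look like this. The annotation key is the one quoted above; the helper name, the copy-on-write map handling, and the RFC 3339 timestamp format are assumptions rather than the plugin's confirmed behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// RestartedAtAnnotation is the canonical key quoted in the commit message.
const RestartedAtAnnotation = "compute.datumapis.com/restartedAt"

// exampleTime is a fixed timestamp used only to make the demo deterministic.
var exampleTime = time.Date(2026, 6, 3, 19, 4, 0, 0, time.UTC)

// stampRestart returns a copy of the template annotations with the restart
// timestamp set. Changing the value changes the template, which is what
// triggers the roll.
func stampRestart(annotations map[string]string, now time.Time) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	// Assumption: RFC 3339 in UTC; the real value format may differ.
	out[RestartedAtAnnotation] = now.UTC().Format(time.RFC3339)
	return out
}

func main() {
	before := map[string]string{"team": "edge"}
	after := stampRestart(before, exampleTime)
	fmt.Println(after[RestartedAtAnnotation])
	fmt.Println(len(before), len(after)) // the input map is left untouched
}
```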

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface rolling-update / restart progress in `datumctl compute workloads` by
showing updated/desired replica counts next to ready. UP-TO-DATE counts
instances on the latest template revision (status.updatedReplicas), so a roll
is visible as the count dips below desired and then recovers.

Includes a byte-identical copy of the UpdatedReplicas/ObservedGeneration
WorkloadDeployment status fields in api/v1alpha so the plugin can read them.
These fields are defined identically on the controller branch (PR #129); the
duplicate resolves cleanly once both land on main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Renames the Instance "Running" status condition to "Available" (wire
value "Available") across the API types, controller, and CLI. An
instance can be available while not actively running a pod (e.g. scaled
to zero), so "Running" was a misleading serving/health signal.

API/controller: same constant renames as the backend branch
(InstanceRunning -> InstanceAvailable, InstanceRunningReason* ->
InstanceAvailableReason*, InstanceReadyReasonRunning ->
InstanceReadyReasonAvailable) plus the kubebuilder default marker and
regenerated Instance CRD.

CLI: the derived status now reports availability, never live runtime
state. `Ready=True` displays "Available" (was "Running"), failure
details read "Not available — …" (was "Not running — …"), and the
Available-condition-derived "Starting"/"Stopping" liveness states are
dropped — the CLI no longer indicates whether a process is actively
running at this instant. IsRunning -> IsAvailable.

BREAKING CHANGE: the on-the-wire Instance condition type changes from
"Running" to "Available".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scotwells

scotwells commented Jun 4, 2026

Contributor Author

Context: should rollout become a verb group?

Capturing the context so we can make this call deliberately — the decision is open, including doing it before this PR merges. Not asserting a deferral.

Today restart and rollout are flat siblings under compute, and rollout is a status-watcher (≈ kubectl rollout status). As we add more rollout lifecycle operations there's a natural pull toward the kubectl mental model, where rollout is a group of verbs:

compute rollout restart    # what `compute restart` does today
compute rollout status     # what `compute rollout` does today
compute rollout undo       # rollback
compute rollout history    # revisions
compute rollout pause / resume

The tradeoff

  • Moving to the grouped form is a breaking change to the surface (compute restart → compute rollout restart). It is cheapest to do before this PR merges or before the plugin has real adoption; every release we wait raises the cost of moving users' muscle memory, scripts, and docs.
  • With only two rollout verbs today (restart + status), the grouping is mostly cosmetic; its clear payoff lands once we ship undo / history (revision tracking) and can design the whole verb set + shared flags (--to-revision, --watch, …) in one coherent pass.
  • So it's a real "now vs later" call: now = pay the design cost early but lock in the clean structure before adoption; later = avoid churn until the verbs justify it, at the cost of a harder migration.

Open design question — where does history come from?
rollout history and undo both depend on revision history of the resource, and we should decide where that lives before committing to these verbs:

  • Compute-specific — compute tracks its own Workload/template revisions (à la Deployment → ReplicaSet), owned and stored by this service; or
  • Generic platform capability — a shared resource-history / revisioning / audit-trail primitive that any resource type plugs into.

If the platform offers (or should offer) generic resource history, compute's rollout {history,undo} should be a thin view over that primitive, not a bespoke revision store — and that also shapes whether these verbs stay under compute or surface more generically across the CLI. This is worth resolving early because it drives both the data model behind rollout and the command surface we'd be locking in.

A signal we're already drifting toward grouping
workloads describe → "Next steps" already advertises datumctl compute rollout undo <wl>, which doesn't exist yet. Our own copy is implicitly assuming the grouped model.

If we do it now, keep restart as a hidden alias of rollout restart for a release or two so we don't break early scripts. If we defer, revisit when we pick up rollout history/undo.

@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch 16 times, most recently from 5b638bb to 7dc94a0 Compare June 11, 2026 00:56
Base automatically changed from feat/federated-deployment-scheduling to main June 11, 2026 01:46

Development

Successfully merging this pull request may close these issues.

Define the UX, DX, and AX for deploying and managing compute workloads

1 participant