Kubernetes Operator #1719

@drew

Problem Statement

OpenShell's Kubernetes deployment path currently uses the gateway as the user-facing control plane and depends on the Kubernetes Agent Sandbox controller for runtime sandbox pods. As Kubernetes users adopt OpenShell, we need to decide whether an OpenShell Kubernetes Operator should exist, what responsibilities it should own, and how it should relate to the existing gateway and Agent Sandbox CRD.

This matters because platform teams may expect Kubernetes-native installation, reconciliation, status, and declarative sandbox workflows. At the same time, the gateway already owns OpenShell API behavior, credentials, inference configuration, policy attachment, sandbox lifecycle APIs, logs, watch streams, and client integrations through the CLI, SDKs, and TUI. An operator could complement that model, overlap with it, or replace parts of it in Kubernetes environments.

We should collect user, platform team, and contributor feedback before committing to a specific operator shape.

Proposed Design

Explore an OpenShell Kubernetes Operator with two possible directions:

OpenShell Deployment Operator

The operator could install and manage an OpenShell gateway inside a Kubernetes cluster. It may handle:

  • Gateway installation and upgrades.
  • Gateway configuration.
  • Kubernetes, Docker, or VM-backed compute driver configuration.
  • TLS, service exposure, ingress, and authentication wiring.
  • Integration with existing secret management workflows.
  • Gateway health and readiness through Kubernetes status conditions.
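As a rough illustration of the last point, a deployment operator could derive a Ready condition for the gateway from observed replica counts. The sketch below follows the field layout of the standard Kubernetes Condition type; the function name, the reason strings, and the idea of keying readiness on replica counts are assumptions for discussion, not an agreed design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Condition:
    """Field layout borrowed from the Kubernetes metav1.Condition type."""
    type: str
    status: str  # "True", "False", or "Unknown"
    reason: str
    message: str
    last_transition_time: str


def gateway_ready_condition(
    desired: int,
    ready: int,
    previous: Optional[Condition] = None,
    now: Optional[datetime] = None,
) -> Condition:
    """Hypothetical helper: compute a Ready condition for the gateway."""
    now = now or datetime.now(timezone.utc)
    if desired > 0 and ready >= desired:
        status, reason = "True", "GatewayAvailable"      # assumed reason string
    else:
        status, reason = "False", "GatewayUnavailable"   # assumed reason string
    message = f"{ready}/{desired} gateway replicas ready"
    # By convention, lastTransitionTime only changes when the status flips.
    if previous is not None and previous.status == status:
        transition = previous.last_transition_time
    else:
        transition = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return Condition("Ready", status, reason, message, transition)
```

Keeping the transition time stable while the status is unchanged lets tooling such as kubectl wait and dashboards show how long the gateway has been in its current state.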

OpenShell Sandbox Custom Resource Operator

The operator could introduce an OpenShell-owned custom resource for declarative sandbox management. This resource may let users or platform teams define:

  • Sandbox runtime requirements.
  • Images, commands, environment variables, resources, and volumes.
  • OpenShell sandbox policies.
  • Lifecycle behavior such as TTLs, restarts, cleanup, and termination.
  • Sandbox status exposed through the resource's .status field.
  • Integration with OpenShell audit logging, policy enforcement, provider configuration, and inference routing.
  • Reusable sandbox classes or templates for platform-managed environments.
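To make the list above concrete for feedback, here is a minimal sketch of what such a resource might carry, modeled as a Python dataclass and rendered to a manifest-shaped dictionary. Everything named here is hypothetical: the API group openshell.example, the kind SandboxRequest, and the field names policyRef and ttlSeconds are placeholders, not a proposed CRD shape.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Placeholder identifiers; the real group, version, and kind are undecided.
API_VERSION = "openshell.example/v1alpha1"
KIND = "SandboxRequest"


@dataclass
class SandboxSpec:
    image: str
    command: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)
    policy_ref: Optional[str] = None   # name of an OpenShell policy (assumed)
    ttl_seconds: Optional[int] = None  # lifetime before cleanup (assumed)


def to_manifest(name: str, namespace: str, spec: SandboxSpec) -> dict:
    """Render the spec as a Kubernetes-style object, omitting unset fields."""
    body: dict = {"image": spec.image}
    if spec.command:
        body["command"] = list(spec.command)
    if spec.env:
        # Sort for a stable rendering so repeated applies produce no diff.
        body["env"] = [{"name": k, "value": v} for k, v in sorted(spec.env.items())]
    if spec.policy_ref:
        body["policyRef"] = {"name": spec.policy_ref}
    if spec.ttl_seconds is not None:
        if spec.ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        body["ttlSeconds"] = spec.ttl_seconds
    return {
        "apiVersion": API_VERSION,
        "kind": KIND,
        "metadata": {"name": name, "namespace": namespace},
        "spec": body,
    }
```

Referencing a policy by name rather than embedding it is only one of the options raised in the feedback questions; the sketch picks it for brevity.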

Existing Agent Sandbox CRD Boundary

OpenShell already depends on the Kubernetes Agent Sandbox CRD in Kubernetes deployments today. That resource is also named Sandbox, but it lives in the agents.x-k8s.io API group and is reconciled by the Agent Sandbox controller into Kubernetes pods.

The existing Agent Sandbox CRD is a runtime-level Kubernetes primitive used by the OpenShell Kubernetes compute driver. A proposed OpenShell sandbox custom resource may be a different API surface: a user- or platform-facing OpenShell object that includes OpenShell policy references, inference configuration, provider wiring, lifecycle semantics, audit expectations, and gateway integration.

The design should disambiguate these two concepts before proposing a CRD shape:

  • Is the OpenShell operator expected to expose the existing Agent Sandbox CRD directly?
  • Should OpenShell introduce a separate CRD with a distinct name or API group, such as an OpenShell-owned sandbox request/resource?
  • If both exist, does the OpenShell CRD own and generate Agent Sandbox resources, or does the gateway continue to do that?
  • Which object is the source of truth for lifecycle, status, policy attachment, and deletion?
  • How should naming avoid confusing agents.x-k8s.io/Sandbox with any OpenShell-specific sandbox resource?
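One possible answer to the ownership question is that an OpenShell-owned resource generates the Agent Sandbox object and marks itself as its controller through a standard ownerReference, so that deleting the OpenShell object garbage-collects the runtime object. The sketch below shows only that wiring. The function name is hypothetical, the Agent Sandbox API version is passed in rather than assumed, and the runtime spec is treated as an opaque pass-through because its shape belongs to the Agent Sandbox project.

```python
def build_owned_agent_sandbox(owner: dict, agent_api_version: str, runtime_spec: dict) -> dict:
    """Hypothetical helper: derive an Agent Sandbox object owned by `owner`.

    `owner` is the (hypothetical) OpenShell resource as returned by the API
    server, so it already carries a UID. `agent_api_version` is expected to
    be in the agents.x-k8s.io group, for example "agents.x-k8s.io/<version>".
    """
    meta = owner["metadata"]
    if not meta.get("uid"):
        raise ValueError("owner must be a persisted object with a UID")
    return {
        "apiVersion": agent_api_version,
        "kind": "Sandbox",
        "metadata": {
            "name": meta["name"],
            # A namespaced owner can only own objects in its own namespace.
            "namespace": meta["namespace"],
            "ownerReferences": [
                {
                    "apiVersion": owner["apiVersion"],
                    "kind": owner["kind"],
                    "name": meta["name"],
                    "uid": meta["uid"],
                    "controller": True,
                    "blockOwnerDeletion": True,
                }
            ],
        },
        "spec": dict(runtime_spec),  # opaque pass-through, shape not assumed
    }
```

Under this model the OpenShell resource would be the source of truth for deletion. If the gateway keeps generating Agent Sandbox resources instead, the ownerReference would have to point somewhere else or be dropped, which is exactly the trade-off the questions above ask about.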

Operator And Gateway Boundary

The operator and the current OpenShell gateway likely have overlapping responsibilities, especially around sandbox lifecycle, configuration, status, policy attachment, and infrastructure integration. Some deployments may reasonably use either the operator or the gateway, but not both.

The design should clarify when users interact with the Kubernetes Operator directly versus when they use the existing gateway API, CLI, SDKs, or TUI. In particular, we should decide whether the operator is primarily:

  • A Kubernetes-native deployment and reconciliation layer for the gateway.
  • A Kubernetes-native frontend for sandbox lifecycle management.
  • A replacement for some gateway responsibilities inside Kubernetes environments.
  • A complementary controller that delegates most runtime behavior to the gateway.
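For the "complementary controller" option, the core of a reconcile pass could be a pure comparison between the sandboxes declared through custom resources and the sandboxes the gateway reports, with the resulting actions delegated to the gateway. The sketch below shows only that comparison. The function name and inputs are hypothetical, and no gateway API is assumed.

```python
from typing import Dict, Iterable, List


def plan_reconcile(desired: Iterable[str], observed: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical planning step for a level-triggered reconcile loop.

    desired:  sandbox names declared through custom resources.
    observed: sandbox names the gateway currently reports.
    """
    desired_set, observed_set = set(desired), set(observed)
    return {
        "create": sorted(desired_set - observed_set),
        "delete": sorted(observed_set - desired_set),
        "keep": sorted(desired_set & observed_set),
    }
```

Note that the "delete" branch treats the custom resources as the source of truth: a sandbox created directly through the CLI or SDK would be removed on the next pass. A design that lets both entry points coexist would need to scope the comparison, for example to sandboxes the operator itself created.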

Questions For Feedback

  • Would you use an operator mainly to deploy OpenShell, to manage sandboxes, or both?
  • What Kubernetes-native workflows should this support?
  • What should the first version of the operator include?
  • Should proposed OpenShell sandbox resources be user-facing, platform-team-facing, or both?
  • How should a proposed OpenShell sandbox resource relate to the existing Agent Sandbox CRD?
  • What fields would you expect on an OpenShell sandbox custom resource?
  • How should policies be referenced or embedded?
  • How should credentials, provider configuration, and inference routing be handled?
  • What status, events, and observability would be required?
  • Are there existing operator patterns or tools we should align with?

Non-Goals For Now

  • Committing to a specific CRD shape.
  • Renaming, replacing, or exposing the existing Agent Sandbox CRD without a clear design.
  • Deciding whether the operator replaces Helm-based deployment.
  • Implementing the operator before collecting operator requirements.
  • Supporting every OpenShell runtime mode in the first version.

Alternatives Considered

  • Keep Helm plus gateway as the only Kubernetes user-facing workflow. This avoids new API surface, but may not satisfy platform teams that expect operator-managed lifecycle, status, upgrades, and reconciliation.
  • Expose the existing agents.x-k8s.io/Sandbox CRD directly as the user-facing sandbox API. This avoids a second sandbox resource, but may leak a runtime-level primitive that does not model OpenShell policy, credentials, inference routing, audit behavior, or gateway integrations.
  • Introduce a separate OpenShell-owned sandbox CRD. This can provide a clearer OpenShell API, but creates another resource to reconcile and requires a firm source-of-truth model with the gateway and Agent Sandbox controller.
  • Build only a deployment operator first. This keeps v1 focused on cluster installation and upgrades, but leaves declarative sandbox lifecycle management to the gateway API, CLI, SDKs, and TUI.

Agent Investigation

The local repo already documents and uses the existing Agent Sandbox CRD:

  • deploy/kube/manifests/agent-sandbox.yaml defines sandboxes.agents.x-k8s.io with kind: Sandbox.
  • deploy/helm/openshell/README.md says the Kubernetes Agent Sandbox CRDs and controller must be installed before deploying OpenShell.
  • architecture/gateway.md describes the gateway as the OpenShell control plane and notes that Kubernetes sandbox authentication verifies the pod's controlling Sandbox ownerReference against the live Sandbox CR UID.
  • crates/openshell-driver-kubernetes/ contains the Kubernetes compute driver that creates and watches Agent Sandbox resources.

These findings make the operator design a boundary and API-shape question, not just a request to add a new controller.
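The ownerReference check mentioned in architecture/gateway.md illustrates how tightly the gateway is already coupled to the Agent Sandbox resource. The sketch below is an illustrative reading of that description, not the gateway's actual code: the function names are hypothetical, and the objects are plain dictionaries shaped like Kubernetes API responses.

```python
from typing import Optional

AGENT_SANDBOX_GROUP = "agents.x-k8s.io"


def controlling_sandbox_ref(pod: dict) -> Optional[dict]:
    """Return the pod's controlling ownerReference if it is an Agent Sandbox."""
    for ref in pod.get("metadata", {}).get("ownerReferences", []):
        # apiVersion is "<group>/<version>"; compare the group only.
        group = ref.get("apiVersion", "").partition("/")[0]
        if (
            ref.get("controller") is True
            and ref.get("kind") == "Sandbox"
            and group == AGENT_SANDBOX_GROUP
        ):
            return ref
    return None


def pod_belongs_to_live_sandbox(pod: dict, live_sandbox: dict) -> bool:
    """Check the pod's controlling Sandbox reference against the live CR."""
    ref = controlling_sandbox_ref(pod)
    if ref is None:
        return False
    live_meta = live_sandbox.get("metadata", {})
    # Matching on UID, not just name, rejects pods left over from a
    # deleted Sandbox that was later recreated under the same name.
    return (
        ref.get("name") == live_meta.get("name")
        and bool(ref.get("uid"))
        and ref.get("uid") == live_meta.get("uid")
    )
```

Any OpenShell-owned sandbox resource that sits above the Agent Sandbox CRD would need to preserve this chain, since the check depends on the pod being directly controlled by an agents.x-k8s.io Sandbox.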

Proposed Outcome

Use this issue to gather requirements and decide whether to create one or more follow-up design issues for:

  • An OpenShell deployment operator.
  • An OpenShell sandbox CRD and sandbox lifecycle controller.
  • Shared operator infrastructure, status reporting, policy integration, and docs.
