feat: establish the Isolation Backend interface

### Problem Statement

Inside every sandbox the supervisor does a lot at once: it runs the proxy, evaluates policy, resolves identity, audits, manages the agent process, and **builds the isolation boundary itself** (it creates the network namespace and routing, which is why the agent's environment is granted elevated privileges). The driver provisions the environment the supervisor runs in, but boundary-building is the supervisor's own inline code, not a separate component it talks to.

Those are really two different concerns: the **policy authority** (decide what the agent may do, mediate it) should be stable across environments, while the **boundary machinery** (create the namespace and routing, or whatever a given environment requires) is privileged and environment-specific, and should be free to vary. RFC 0001 describes the other major subsystems as driver-backed contracts and even names an **Isolation Backend** for this one, but the contract was never defined; it's only a box on the diagram. So changing how the boundary is built means changing the supervisor itself.

That coupling costs us three ways:

- **Deployability.** Restricted, regulated, and multi-tenant clusters often forbid the elevated container privileges the current mechanism needs, so OpenShell can't yet run in some of the environments it's heading toward (#899).
- **Security.** The privilege that builds the boundary sits in the same container as the untrusted agent, so a compromise reaches the setup meant to contain it (#981).
- **Extensibility.** Safer topologies (a separate pod, VM, or node agent) require changing the supervisor itself, not just adding a backend. This is already concrete: the split-pod proposal (#981) has to wire its topology directly into the Kubernetes driver, and the next topology would have to do the same again.

Before anyone turns this into an RFC, this issue asks whether we should define that interface: draw the line between policy and mechanism, starting with the network boundary.

### Proposed Design

One possible shape is to draw that line between policy and mechanism: define an interface so the boundary becomes a pluggable **Isolation Backend** the supervisor drives. The backend could eventually cover more of the containment envelope (filesystem, syscall, process identity); this proposal suggests starting with the network boundary and considering the other dimensions later, but that scoping is one of the questions below.

At a sketch level, the shape looks like this:

- provisioning happens at **sandbox creation**, through the driver (control plane);
- runtime mediation stays with the **supervisor** (data plane), the policy authority for proxy, policy, identity, and audit;
- the **Isolation Backend interface** sits between them; the two hand off through what provisioning leaves behind, rather than calling each other.

```
    Gateway ──► Driver ──┐
                         │ realizes the substrate
                         ▼
  ┌─ mediated sandbox ──────────────────────────┐
  │                                             │
  │   ┌──────────┐                      ┌───────┴──────┐
  │   │  Agent   │ ──── constrained ──► │  Supervisor  │ ──► allowed
  │   │          │      to the proxy    │ the mediator │     egress
  │   └──────────┘                      └───────┬──────┘
  └─────────────────────────────────────────────┘
                                                │ operates via
                                                ▼
                                       ┌──────────────────────────┐
                                       │ Backend runtime          │
                                       │ (location varies)        │
                                       └──────────────────────────┘
```

The goal would be for the supervisor to drive one common runtime interface while the backend's provisioning varies by environment, so the same supervisor-facing interface works whether the boundary is an in-pod network namespace or a separate pod, VM, or node agent. Today's in-container setup would become the *in-pod* backend; a *delegated* backend could build the boundary outside the agent's container, so the untrusted container would no longer need the network-boundary privileges used to build its own containment.
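To make the "one supervisor-facing interface, varying backends" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the names `IsolationBackend`, `BoundaryHandle`, `attach`, and `is_ready`, and the placeholder proxy addresses, are illustrations only and are not taken from RFC 0001 or the codebase.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class BoundaryHandle:
    """Hypothetical: what provisioning leaves behind for the supervisor."""
    sandbox_id: str
    location: str    # "in-pod" or "delegated" in this sketch
    proxy_addr: str  # placeholder for the only listener the agent can reach


class IsolationBackend(Protocol):
    """Hypothetical supervisor-facing interface; method names are illustrative."""

    def attach(self, sandbox_id: str) -> BoundaryHandle: ...
    def is_ready(self, handle: BoundaryHandle) -> bool: ...


class InPodBackend:
    """Stub standing in for today's in-container netns and routing setup."""

    def attach(self, sandbox_id: str) -> BoundaryHandle:
        return BoundaryHandle(sandbox_id, "in-pod", "10.0.0.1:3128")

    def is_ready(self, handle: BoundaryHandle) -> bool:
        return handle.location == "in-pod"


class DelegatedBackend:
    """Stub standing in for a boundary built outside the agent's container."""

    def attach(self, sandbox_id: str) -> BoundaryHandle:
        return BoundaryHandle(sandbox_id, "delegated", "10.0.0.2:3128")

    def is_ready(self, handle: BoundaryHandle) -> bool:
        return handle.location == "delegated"


def start_mediation(backend: IsolationBackend, sandbox_id: str) -> str:
    """Supervisor-side code: identical regardless of which backend is plugged in."""
    handle = backend.attach(sandbox_id)
    if not backend.is_ready(handle):
        raise RuntimeError("boundary not ready; refusing to start")
    return f"{sandbox_id}: mediating via {handle.proxy_addr} ({handle.location})"
```

The point of the sketch is that `start_mediation` never branches on the backend type; only the backend implementations know where the boundary lives.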

Whatever the backend, two candidate invariants would keep the interface grounded in concrete safety properties: **no unguarded agent-workload egress before the boundary is verified and ready**, and **no agent-workload execution before it's ready**.
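One way to read the two invariants is as a startup gate that both agent execution and agent egress must pass. The sketch below is an assumption about how that could look, not a description of the current supervisor; `SandboxGate`, `BoundaryState`, and `BoundaryNotReady` are hypothetical names.

```python
from enum import Enum, auto


class BoundaryState(Enum):
    PENDING = auto()  # provisioned or in progress, not yet verified
    READY = auto()    # verified and ready


class BoundaryNotReady(Exception):
    """Raised when something tries to run before the boundary is ready."""


class SandboxGate:
    """Hypothetical gate enforcing both candidate invariants."""

    def __init__(self) -> None:
        self.state = BoundaryState.PENDING
        self.agent_started = False

    def mark_ready(self) -> None:
        # In a real design this would only be called after verification succeeds.
        self.state = BoundaryState.READY

    def spawn_agent(self) -> None:
        # Invariant: no agent-workload execution before the boundary is ready.
        if self.state is not BoundaryState.READY:
            raise BoundaryNotReady("agent must not execute before the boundary is ready")
        self.agent_started = True

    def egress_permitted(self) -> bool:
        # Invariant: no agent-workload egress before the boundary is verified and ready.
        return self.state is BoundaryState.READY
```

Expressing the invariants as a single gate makes them testable independently of which backend built the boundary.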

Moving the setup out of the agent's container reduces the privilege in that container, but it doesn't make the boundary trustworthy on its own: once the backend builds the boundary somewhere the supervisor can't see, the supervisor still has to *verify* it was realized as admitted and **fail closed** if not, rather than trust an unauthenticated report. The interface has to make that verification possible, which is why this is a contract, not just a relocation.
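As a rough illustration of "verify, then fail closed", the sketch below checks that a backend's report is authentic and that it matches what was admitted, and treats every failure (including unexpected errors) as "not verified". The HMAC-over-JSON scheme and the shared key are assumptions chosen only to keep the example self-contained; the issue does not propose a specific authentication mechanism, and `sign_report` and `verify_boundary` are hypothetical names.

```python
import hashlib
import hmac
import json


def sign_report(key: bytes, report: dict) -> str:
    """Illustrative only: authenticate a report with HMAC-SHA256 over canonical JSON."""
    payload = json.dumps(report, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_boundary(key: bytes, admitted: dict, report: dict, signature: str) -> bool:
    """Return True only if the report is authentic AND matches the admitted spec."""
    try:
        expected = sign_report(key, report)
        if not hmac.compare_digest(expected, signature):
            return False           # unauthenticated report: fail closed
        return report == admitted  # realized boundary must equal what was admitted
    except Exception:
        return False               # any error during verification: fail closed
```

Note that an authentic report describing a *different* boundary than the one admitted is still rejected: authentication and conformance are separate checks, and both must pass.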

The working assumption is that the backend would be chosen by deployment and admission configuration, set by the operator, not by the workload, so an untrusted workload can't select weaker isolation for itself (worth confirming this is the right place for that choice).
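A small sketch of that working assumption: backend selection reads only operator-owned configuration and deliberately ignores anything the workload supplies. The types, the `isolation_backend` field, and the annotation key are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field

KNOWN_BACKENDS = {"in-pod", "delegated"}  # illustrative set, not a real registry


@dataclass(frozen=True)
class OperatorConfig:
    """Hypothetical deployment/admission configuration, set by the operator."""
    isolation_backend: str


@dataclass(frozen=True)
class WorkloadRequest:
    """Hypothetical sandbox request; its contents are untrusted."""
    image: str
    annotations: dict = field(default_factory=dict)


def select_backend(config: OperatorConfig, request: WorkloadRequest) -> str:
    """Choose the backend from operator config only.

    The request is accepted as a parameter to make the point explicit:
    nothing in it is consulted, so a workload cannot ask for weaker isolation.
    """
    if config.isolation_backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown isolation backend: {config.isolation_backend!r}")
    return config.isolation_backend
```

For example, a request carrying an annotation such as `{"isolation-backend": "in-pod"}` would still get `"delegated"` if that is what the operator configured.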

A full design would still need to work out the trust model a delegated backend depends on: an authenticated, verifiable handoff from the control plane to the supervisor (so the supervisor knows the runtime it talks to is the one provisioned for this sandbox), and the control-plane authorization underneath it. This issue only names those as things the RFC would have to cover; it doesn't try to solve them here.

I've sketched a more detailed runtime contract and poked through the codebase to sanity-check feasibility, and I'm happy to share that if it's useful. I've kept this issue light on purpose: it seems worth agreeing this is the right interface to define before anyone invests in a full RFC.

#### What this could enable

| Problem | Today | With the Isolation Backend interface |
|---|---|---|
| **Deployability** (#899) | can't yet run under restricted Pod Security | a later delegated backend could build the network boundary outside the agent's container, so that container no longer needs the network-boundary privileges (one prerequisite for restricted Pod Security; other privileges may need follow-on work) |
| **Security** (#981) | the boundary's privilege sits with the untrusted workload | the network-boundary privilege could move out of the agent container, out of a compromise's reach, assuming the delegated handoff is verified |
| **Extensibility** | each new topology means changing the supervisor | VM, separate-pod, and node-agent backends could land behind one interface, without changing the supervisor |

A natural place to start would be the in-pod backend: wrap today's network setup behind the interface with the same privileges, startup order, behavior, and tests, then add delegated backends later. But the question for this issue is whether that's the right direction to head, not how to sequence it.

#### How this fits with work in flight

This isn't a new track; it names a boundary several efforts already press against from different sides.

- **#1511** scopes the proxy pipeline *above* the boundary and notes the nftables rules "belong to the sandbox network boundary." This issue names and scopes that boundary *beneath* the proxy, the thing that forces traffic into it. The two are closely related (how they relate is question 2 below).
- **#1650** splits the supervisor's process and network responsibilities; this interface sits under the network side of that split.
- **#1680** (platform-managed Kubernetes via Agent Sandbox) is one way a backend's provisioning could be arranged; the interface defines what the supervisor then attaches to.
- **#981** (the split-pod / gVisor proposal) is the closest neighbor, and complementary rather than competing: it designs one concrete delegated topology, and this interface could be what it fits behind. The cross-pod problems it works through (identity across pods, CA/trust handoff, NetworkPolicy/CNI enforcement) are exactly the kind of thing the interface's contract would need to name.
- **#899** is one motivation: moving the network boundary out of the agent's container is one prerequisite for running under restricted Pod Security (other privileges would need follow-on work), and there's no interface today to put such a backend behind.

It also preserves the merged foundations: RFC 0001 (which draws the box and makes the supervisor the policy authority), RFC 0002 (agent-proposed policy stays the runtime authority; the backend is read-only at runtime), and RFC 0004 (typed resources; the backend references them, it doesn't redefine them).

Each of these efforts runs into the same missing interface from a different side. Defining it once could give them a shared contract to build on, instead of each one working around its absence.

This also relates to roadmap issue #1720.

#### Feedback requested

A few things I'd love a read on:

1. Is the policy/mechanism split the right line to draw for the network boundary beneath the proxy?
2. Should this live under #1511 (which scopes the proxy above this boundary), as its own RFC, or somewhere else?
3. Does "the supervisor drives a pluggable backend, in-pod first, delegated later" sound like the right direction?
4. Is network-first the right scope to start with, or should the first RFC also define the filesystem, syscall, and process-identity responsibilities of the Isolation Backend?

I'd especially welcome corrections from folks closer to this code path.

### Alternatives Considered

- **Build a specific delegated topology (e.g. #981) directly, no interface.** Wires one topology into the supervisor and forks it again for the next one (CNI, node, VM). An interface lets each land behind the same calls, and keeps whether to build #981 an open choice rather than a baked-in one.
- **Keep the current in-container model.** It works well today for the single-environment case, but it's the part that blocks restricted clusters (#899) and keeps the privileged boundary setup in the same container as the untrusted agent (#981), so it doesn't carry into the multi-tenant direction the project is heading.
- **Fold it into #1511.** #1511 owns the proxy *above* the boundary and notes the nftables rules "belong to the sandbox network boundary." This proposal covers that boundary, beneath the proxy: related, but a distinct concern. Whether they're one effort or two is part of what I'm asking.

### Agent Investigation

A short summary of what I found in the code; happy to share the longer write-up.

- RFC 0001 specifies four driver-backed subsystems as contracts (only `ComputeDriver` is realized as a gRPC service today), but leaves the Isolation Backend as a box with no interface.
- The supervisor builds the boundary at startup (`openshell-sandbox`: create netns, install rules, then spawn the agent into it). The agent's container is granted `NET_ADMIN`, `SYS_ADMIN`, `SYS_PTRACE`, `SYSLOG`, plus `SETUID`/`SETGID`/`DAC_READ_SEARCH` under user namespaces, for the boundary setup plus the supervisor's process-management and identity-resolution duties there.
- The egress boundary is the netns + routing (the agent's traffic routes to the proxy, the only listener it can reach, given the host does not forward the sandbox subnet); nftables adds fast-fail and bypass logging on top.
- The supervisor already separates a process spec from the boundary handle when it spawns the agent, so wrapping today's path as the in-pod backend looks like a real but bounded refactor, not a rewrite.
- Verified against `main` at `b7ce0be4`.

### Checklist

- [x] I've reviewed existing issues and the architecture docs
- [x] This is a design proposal, not a "please build this" request
