Run untrusted code through a lightweight, isolated, fast primitive. This is the framework's answer to the question: how do I run LLM-generated code or untrusted shell commands without giving them access to the host's filesystem, network, or unbounded resources?

The com.tnsai.quality.sandbox package in tnsai-quality ships the primitives:

Sandbox — interface: start / execute / stop / terminate + AutoCloseable
SandboxSpec — declarative config (image, env, fsMounts, networkPolicy, resourceLimits, workingDir, warmable)
SandboxResult — exit code, stdout/stderr, ResourceUsage, timedOut flag
SandboxFactory — SPI for backend selection (process / container / wasm / future firecracker)
ResourceLimits — cpu / memory / disk / timeout / maxProcesses
NetPolicy — sealed: DenyAll / AllowList(hosts) / Inherit
SandboxPool — warmable-instance reuse for high-concurrency workloads
ObservedSandbox + SandboxExecutionListener — per-execute observability events

Pairs with Accountability (sandbox events correlate with AgentLiabilityRecords on the same correlationId) and the future tnsai-tools refactor (existing PythonExecutionTools / JsExecutionTools move to the SPI in a follow-up).

Why a separate primitive

Three forces motivate sandbox as a first-class layer:

LLM-generated code is untrusted by definition. Even when an agent is benevolent, a hallucinated rm -rf / can still happen. A process boundary alone does not protect the host filesystem; we need an FS jail, env scrubbing, and timeout enforcement.
Per-instance cost matters. Multi-agent fan-out (TNS-294 group), code review (TNS-291 deepsec harness), Agents-of-Chaos benchmarks — all push concurrent sandbox counts into the hundreds. Tencent CDB-style targets (60ms cold start, 5MB RAM) become the budget.
Backend selection is deployment-specific. A laptop dev wants speed (ProcessSandbox); a CI runner wants real isolation (ContainerSandbox); a production microVM host wants Firecracker. The framework ships an SPI so the same calling code targets all three.

Quick start

import com.tnsai.quality.sandbox.*;
import java.time.Duration;

// 1. Pick a backend. preferred() auto-selects the highest-priority
//    one available; explicit selection via byId(...) when you want
//    a specific backend.
SandboxFactory factory = SandboxFactory.preferred();
// or: SandboxFactory factory = SandboxFactory.byId("container");

// 2. Build a spec. Defaults: deny-all network, standard resource
//    limits (1 CPU, 256MB, 30s, 64 maxProcs), workingDir = backend
//    default, warmable = false.
SandboxSpec spec = SandboxSpec.builder()
        .image("python:3.12-slim")          // backend-specific; "" for ProcessSandbox
        .resourceLimits(ResourceLimits.standard())
        .networkPolicy(NetPolicy.denyAll())
        .build();

// 3. Run a command. The sandbox is created lazily by create();
//    each execute() runs to completion or to the timeout budget.
try (Sandbox sb = factory.create(spec)) {
    SandboxResult r = sb.execute(Command.shell("python -c 'print(1+1)'"));
    System.out.println("exit=" + r.exitCode() + " stdout=" + r.stdoutString());
}

Backend choice tree

Backend	Cold start	RAM/instance	Network isolation	Use when
`process`	~50–150ms	~10–30MB	Not enforced (host-shared)	Dev / CI / portable fallback
`container`	~150–400ms	~30–80MB	Real (`--network=none`)	Production code-exec, untrusted shell
`wasm` (v1 stub)	sub-50ms (target)	sub-MB (target)	Capability-gated	Pyodide / Wasmer adapter (follow-up)
`firecracker` (deferred)	~125ms	~5MB	microVM-isolated	Linux production at scale

SandboxFactory.preferred() picks the highest-priority backend whose available() returns true:

Backend	priority
process	10
container	50
wasm	75 (when adapter ships)
firecracker	100 (when adapter ships)
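
The selection rule can be illustrated with a minimal sketch. It is not the real SandboxFactory code: the Backend record and the preferred(List) helper below are hypothetical stand-ins, and only the ids and priorities come from the table above.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PreferredBackendSketch {
    /** Hypothetical stand-in for a registered backend; not the real SandboxFactory. */
    public record Backend(String id, int priority, boolean available) {}

    /** Returns the available backend with the highest priority, if there is one. */
    public static Optional<Backend> preferred(List<Backend> registered) {
        return registered.stream()
                .filter(Backend::available)
                .max(Comparator.comparingInt(Backend::priority));
    }

    public static void main(String[] args) {
        // Priorities from the table; wasm and firecracker are marked unavailable
        // because their adapters have not shipped.
        List<Backend> registered = List.of(
                new Backend("process", 10, true),
                new Backend("container", 50, true),
                new Backend("wasm", 75, false),
                new Backend("firecracker", 100, false));
        System.out.println(preferred(registered).map(Backend::id).orElse("none"));
    }
}
```

With a container runtime present, this picks container; on a machine where only the process backend reports itself as available, the same call falls back to process.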

Resource limits

ResourceLimits limits = new ResourceLimits(
        1.0,                            // cpuShares (1.0 = one full core)
        256,                            // memoryMb
        128,                            // diskMb (scratch space)
        Duration.ofSeconds(30),         // timeout (per-execute wall clock)
        64);                            // maxProcesses (0 = no limit)

Validation rejects:

cpuShares <= 0
memoryMb <= 0
diskMb < 0
timeout zero or negative — sandbox without a deadline is a footgun
maxProcesses < 0
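
As an illustration only, these rules could be expressed in a record's compact constructor. The Limits record below is a hypothetical mirror of ResourceLimits; the exception type, the messages, and the null check on timeout are assumptions, not documented behavior.

```java
import java.time.Duration;

public class LimitsValidationSketch {
    /** Hypothetical mirror of ResourceLimits; validation runs on every construction. */
    public record Limits(double cpuShares, int memoryMb, int diskMb,
                         Duration timeout, int maxProcesses) {
        public Limits {
            if (cpuShares <= 0) throw new IllegalArgumentException("cpuShares must be > 0");
            if (memoryMb <= 0) throw new IllegalArgumentException("memoryMb must be > 0");
            if (diskMb < 0) throw new IllegalArgumentException("diskMb must be >= 0");
            // Assumption: a null timeout is rejected the same way as a zero or negative one.
            if (timeout == null || timeout.isZero() || timeout.isNegative())
                throw new IllegalArgumentException("timeout must be positive");
            if (maxProcesses < 0) throw new IllegalArgumentException("maxProcesses must be >= 0");
        }
    }

    /** Convenience wrapper: true when the values pass every rule. */
    public static boolean isValid(double cpu, int memMb, int diskMb,
                                  Duration timeout, int maxProcs) {
        try {
            new Limits(cpu, memMb, diskMb, timeout, maxProcs);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(1.0, 256, 128, Duration.ofSeconds(30), 64));
        System.out.println(isValid(1.0, 256, 128, Duration.ZERO, 64));
    }
}
```

Validating in the constructor means an invalid limit set can never reach a backend, so a missing deadline fails when the spec is built rather than at execute time.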

Presets:

ResourceLimits.minimal() — 0.25 CPU / 64MB / 16MB disk / 5s / 32 procs (policy checks, regex eval)
ResourceLimits.standard() — 1 CPU / 256MB / 256MB disk / 30s / 64 procs (typical code-exec)
ResourceLimits.of(cpu, memMb, timeoutSec) — convenience for the common shape

Network policy

NetPolicy.denyAll();                                  // recommended default
NetPolicy.allow(List.of("github.com", "api.openai.com:443"));
NetPolicy.inherit();                                  // sandbox inherits host network

Policy	ProcessSandbox	ContainerSandbox
`DenyAll`	Logged, NOT enforced (JVM child inherits host network)	`--network=none` — real
`AllowList`	Logged, NOT enforced	Rejected at create time (deferred to follow-up)
`Inherit`	Matches the backend's native behavior, no warning	`--network=host`

ProcessSandbox is honest about its limits — it logs WARN at create time when the requested policy isn't enforceable, rather than silently downgrading. Real network isolation requires container or firecracker.
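
The policy table can be sketched as a sealed hierarchy plus two small mapping functions. This is an assumption-laden illustration, not the library's NetPolicy: the nested type names, containerNetworkFlag, and processBackendWarns are all hypothetical, and only the flag values and the enforce/reject behavior come from the table.

```java
import java.util.List;

public class NetPolicySketch {
    /** Hypothetical sealed hierarchy mirroring DenyAll / AllowList(hosts) / Inherit. */
    public sealed interface Policy {}
    public record DenyAll() implements Policy {}
    public record AllowList(List<String> hosts) implements Policy {}
    public record Inherit() implements Policy {}

    /** Container column of the table: map a policy to a network flag, or reject it. */
    public static String containerNetworkFlag(Policy policy) {
        if (policy instanceof DenyAll) return "--network=none";
        if (policy instanceof Inherit) return "--network=host";
        // AllowList enforcement is deferred, so it is rejected at create time.
        throw new UnsupportedOperationException("AllowList is not supported by this backend yet");
    }

    /** Process column of the table: every policy except Inherit is logged but not enforced. */
    public static boolean processBackendWarns(Policy policy) {
        return !(policy instanceof Inherit);
    }

    public static void main(String[] args) {
        System.out.println(containerNetworkFlag(new DenyAll()));
        System.out.println(processBackendWarns(new AllowList(List.of("github.com"))));
    }
}
```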

Filesystem mounts

SandboxSpec.builder()
        .fsMount(FsMount.readOnly(Path.of("./inputs"), Path.of("/data")))
        .fsMount(FsMount.readWrite(Path.of("./outputs"), Path.of("/work")))
        // …
        .build();

Backend	Read-only	Read-write
ProcessSandbox	Copy-in (host file → jail dir)	Rejected at create (copy-in can't propagate writes back)
ContainerSandbox	`--mount type=bind,readonly`	`--mount type=bind`

Sandbox path MUST be absolute — the sandbox sees its filesystem rooted at /.
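
To make the container column concrete, here is a rough sketch of how a mount might be turned into arguments for a container CLI, including the absolute-path check. The Mount record and toMountArgs are hypothetical; the source= and target= keys follow the Docker bind-mount syntax, and whether ContainerSandbox builds its arguments this way is an assumption.

```java
import java.nio.file.Path;
import java.util.List;

public class MountArgsSketch {
    /** Hypothetical stand-in for FsMount: host path, path inside the sandbox, mode. */
    public record Mount(Path hostPath, String sandboxPath, boolean readOnly) {
        public Mount {
            // The sandbox sees its filesystem rooted at /, so relative targets are rejected.
            if (!sandboxPath.startsWith("/"))
                throw new IllegalArgumentException("sandbox path must be absolute: " + sandboxPath);
        }
    }

    /** Builds the two CLI tokens for one bind mount. */
    public static List<String> toMountArgs(Mount m) {
        String value = "type=bind"
                + ",source=" + m.hostPath().toAbsolutePath().normalize()
                + ",target=" + m.sandboxPath()
                + (m.readOnly() ? ",readonly" : "");
        return List.of("--mount", value);
    }

    public static void main(String[] args) {
        System.out.println(toMountArgs(new Mount(Path.of("./inputs"), "/data", true)));
        System.out.println(toMountArgs(new Mount(Path.of("./outputs"), "/work", false)));
    }
}
```

The host path is resolved against the current working directory before it is passed on, because bind mounts need an absolute source.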

Pool reuse

For high-concurrency workloads, reuse warm instances through a pool:

SandboxPool pool = new SandboxPool(
        SandboxFactory.preferred(),
        spec.toBuilder().warmable(true).build(),
        /* maxSize */ 16,
        /* borrow timeout */ Duration.ofSeconds(2));

try (SandboxPool.Lease lease = pool.borrow()) {
    SandboxResult r = lease.execute(Command.of("python", "task.py"));
    // ... lease.close() returns sandbox to the pool
}

pool.close();   // drains every idle sandbox

When the pool is full and the borrow timeout elapses, it degrades gracefully by creating a non-pooled sandbox, so callers never block forever. An explicit Lease.terminate() evicts a sandbox that the caller has reason to consider unhealthy.
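
The overflow behavior can be sketched with a semaphore-bounded pool. This is a generic illustration under stated assumptions, not SandboxPool itself: OverflowPoolSketch, its Lease, and the pooled() accessor are hypothetical names, and the real pool may track instances differently.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class OverflowPoolSketch<T> {
    private final Supplier<T> factory;
    private final Semaphore permits;                 // one permit per pooled slot
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Duration borrowTimeout;

    public OverflowPoolSketch(Supplier<T> factory, int maxSize, Duration borrowTimeout) {
        this.factory = factory;
        this.permits = new Semaphore(maxSize);
        this.borrowTimeout = borrowTimeout;
    }

    public final class Lease implements AutoCloseable {
        private final T resource;
        private final boolean pooled;
        private boolean closed;

        private Lease(T resource, boolean pooled) {
            this.resource = resource;
            this.pooled = pooled;
        }

        public T resource() { return resource; }
        public boolean pooled() { return pooled; }

        @Override
        public void close() {
            if (closed) return;
            closed = true;
            if (pooled) {                            // pooled resources go back for reuse
                idle.offer(resource);
                permits.release();
            }                                        // overflow resources are simply dropped
        }
    }

    /** Waits up to borrowTimeout for a pooled slot, then falls back to a one-off resource. */
    public Lease borrow() {
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(borrowTimeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();      // keep the interrupt visible to callers
        }
        if (!acquired) return new Lease(factory.get(), false);
        T reused = idle.poll();
        return new Lease(reused != null ? reused : factory.get(), true);
    }

    public static void main(String[] args) {
        AtomicInteger ids = new AtomicInteger();
        var pool = new OverflowPoolSketch<Integer>(ids::incrementAndGet, 1, Duration.ofMillis(50));
        var a = pool.borrow();                       // takes the only pooled slot
        var b = pool.borrow();                       // times out, gets an overflow resource
        System.out.println("a pooled=" + a.pooled() + ", b pooled=" + b.pooled());
        a.close();
        b.close();
        var c = pool.borrow();                       // reuses the resource returned by a
        System.out.println("c reuses a: " + c.resource().equals(a.resource()));
    }
}
```

The trade-off in this design is that the hard cap applies to warm instances only: under sustained overload the number of live sandboxes can exceed maxSize, in exchange for bounded caller latency.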

Observability

Every execute() emits a SandboxExecutionEvent to the wired listener:

SandboxExecutionListener listener = event ->
        log.info("[sandbox] backend={} image={} exit={} wallMs={} cpuMs={}",
                event.backend(), event.image(), event.exitCode(),
                event.resourceUsage().wallClockMs(), event.resourceUsage().cpuMillis());

try (Sandbox sb = new ObservedSandbox(
        factory.create(spec),
        listener,
        factory.id())) {
    sb.execute(Command.shell("python -c 'print(1)'"));
}

The event carries the sandboxId, backend, image, argv, exit code, timedOut flag, ResourceUsage, and the network-policy class name (DenyAll / Inherit / AllowList). Listener exceptions are caught and logged, so an observability outage never breaks execution.
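
The listener-isolation guarantee can be sketched as a small guard around the callback. The guarded helper below is hypothetical and uses a plain Consumer in place of SandboxExecutionListener; how ObservedSandbox actually catches and logs failures is not shown in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SafeListenerSketch {
    /**
     * Wraps a listener so that a runtime failure inside it is reported to onFailure
     * instead of propagating to the code that emitted the event.
     */
    public static <E> Consumer<E> guarded(Consumer<E> listener,
                                          Consumer<RuntimeException> onFailure) {
        return event -> {
            try {
                listener.accept(event);
            } catch (RuntimeException e) {
                onFailure.accept(e);
            }
        };
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        Consumer<String> listener = guarded(
                event -> { throw new IllegalStateException("log sink unavailable"); },
                failure -> failures.add(failure.getMessage()));

        listener.accept("execute finished");   // does not throw
        System.out.println("execution continued, failures=" + failures);
    }
}
```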

Pairs with accountability

Sandbox events correlate with AgentLiabilityRecord entries on the same correlation id — a single audit timeline for "agent X attempted action Y inside sandbox Z, used N CPU-ms, exited with code C". Operators wire both listeners on the same agent and downstream consumers join on correlationId.
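
A downstream join on correlationId might look like the sketch below. Both record types and all of their fields are hypothetical simplifications chosen for illustration; the real SandboxExecutionEvent and AgentLiabilityRecord shapes may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorrelationJoinSketch {
    /** Hypothetical, trimmed-down views of the two record kinds. */
    public record SandboxEvent(String correlationId, String sandboxId, int exitCode, long cpuMillis) {}
    public record LiabilityRecord(String correlationId, String agent, String action) {}

    /** Joins events to liability records on correlationId and renders one line per match. */
    public static List<String> timeline(List<LiabilityRecord> records, List<SandboxEvent> events) {
        Map<String, LiabilityRecord> byId = new HashMap<>();
        for (LiabilityRecord r : records) byId.put(r.correlationId(), r);

        List<String> lines = new ArrayList<>();
        for (SandboxEvent e : events) {
            LiabilityRecord r = byId.get(e.correlationId());
            if (r == null) continue;               // no matching record, skip the event
            lines.add(String.format(
                    "agent %s attempted %s inside sandbox %s, used %d CPU-ms, exited with code %d",
                    r.agent(), r.action(), e.sandboxId(), e.cpuMillis(), e.exitCode()));
        }
        return lines;
    }

    public static void main(String[] args) {
        var records = List.of(new LiabilityRecord("c-1", "reviewer", "run-tests"));
        var events = List.of(
                new SandboxEvent("c-1", "sb-7", 0, 120),
                new SandboxEvent("c-2", "sb-8", 1, 15));
        timeline(records, events).forEach(System.out::println);
    }
}
```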

What's not in v1 (deferred to follow-ups)

FirecrackerSandbox — Linux microVM backend; child issue
WASM runtime adapters — Pyodide / Wasmer / Bun WASM; child issues per language
AllowList enforcement on ContainerSandbox — needs custom network + iptables; v2
GPU sandbox — model inference inside sandbox; v3
Multi-tenant resource quota — per-tenant aggregate limits; v2
Snapshot/restore — running sandbox state save; v2
tnsai-tools refactor — PythonExecutionTools / JsExecutionTools move to the SPI; child issue (current implementations document the gap explicitly via "WARNING: not a sandbox" headers)
