Multi-hour and multi-day agent executions need a categorically different runtime than minute-scale interactive sessions. Process crashes, runtime upgrades, transient API failures, and operator-initiated pauses all happen — the framework's reliability layer makes them survivable rather than catastrophic.

The com.tnsai.reliability package in tnsai-core ships the primitives:

Checkpoint + CheckpointStore — durable state at step boundaries
ResumableRun + DefaultResumableRun — execution driver that consumes the stores
Progress + ProgressSink — observable mid-run state for live dashboards
RunConfig — duration / retries / cost / progress-timeout policy bundle
Outcome + AbortReason — terminal-result sealed type

Pairs with the idempotency layer (tnsai-core.idempotency) so side-effecting steps don't double-execute on retry, and with cost governance so a runaway loop hits a configured ceiling instead of burning unbounded spend.

When to use

You probably want this if your agent run:

takes longer than the longest interactive session you'd let the user wait on (rule of thumb: \> 5 minutes)
calls external systems with non-trivial side effects (creates issues, posts messages, runs payments, kicks off CI builds)
has a per-run budget worth enforcing during execution rather than after the fact
needs to be inspectable from a separate process / dashboard while it's running

You probably don't need it for:

single-LLM-call request handlers
read-only research agents that always finish in seconds
one-shot CLI utilities without resumption needs

Quick start

import com.tnsai.reliability.*;
import com.tnsai.idempotency.InMemoryIdempotencyStore;
import java.time.Duration;
import java.math.BigDecimal;

// 1. Pick stores. InMemory for tests; Filesystem / SQLite for persistence.
CheckpointStore checkpoints  = new FilesystemCheckpointStore(Path.of("/var/tns/checkpoints"));
var idempotency               = new InMemoryIdempotencyStore();
ProgressSink progress         = new InMemoryProgressSink();

// 2. Define a state codec — encode/decode whatever your run threads
//    through its steps. JSON is the recommended default; here we use UTF-8
//    bytes for a string state.
DefaultResumableRun.StateCodec<String> codec =
    new DefaultResumableRun.StateCodec<>() {
        public byte[] encode(String s) { return s.getBytes(StandardCharsets.UTF_8); }
        public String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
    };

// 3. Compose the run. Each step is a UnaryOperator<S> with a name and a
//    side-effecting flag. Side-effecting steps consult the idempotency
//    store so a retry doesn't double-execute.
DefaultResumableRun<String> run = DefaultResumableRun.<String>builder()
    .codec(codec)
    .checkpointStore(checkpoints)
    .idempotencyStore(idempotency)
    .progressSink(progress)
    .step(DefaultResumableRun.Step.pure("plan", state -> state + "\n[plan]"))
    .step(DefaultResumableRun.Step.sideEffect("post-pr",
            state -> { githubClient.createPr(...); return state + "\n[posted]"; }))
    .step(DefaultResumableRun.Step.pure("verify", state -> state + "\n[verified]"))
    .build();

// 4. Execute with a config bound by duration / retries / cost / progress timeout.
RunConfig config = RunConfig.builder()
    .maxDuration(Duration.ofHours(8))
    .costCeilingUSD(new BigDecimal("5.00"))
    .checkpointInterval(Duration.ofMinutes(5))
    .maxRetries(3)
    .progressTimeout(Duration.ofMinutes(10))
    .build();

Outcome<String> result = run.execute("start", config);

The orchestrator returns one of four Outcome variants when the run terminates.

Outcome shape

Outcome<T> is a sealed interface — pattern-match exhaustively:

Variant	When
`Outcome.Completed<T>`	Run reached the end of its step list
`Outcome.Failed<T>`	A step exhausted its retry budget; carries the throwable
`Outcome.Aborted<T>`	Policy stopped the run (`AbortReason`); user-requested, budget exhausted, max-duration exceeded, progress timeout, retry-limit exceeded
`Outcome.Suspended<T>`	Run yielded cleanly at a step boundary; resumable via the carried `checkpointId`

Outcome<String> result = run.execute(input, config);
String final = switch (result) {
    case Outcome.Completed<String> c -> c.result();
    case Outcome.Failed<String> f -> { logger.error("Failed: {}", f.reason()); yield ""; }
    case Outcome.Aborted<String> a -> { alert("Aborted: " + a.reason()); yield ""; }
    case Outcome.Suspended<String> s -> { /* resume later via run.resume(s.runId()) */ yield ""; }
};

Resume after a crash

After a process crash, runtime upgrade, or OOMError, the run is recoverable as long as the configured CheckpointStore survives:

// New process, same store backend.
DefaultResumableRun<String> run = /* ... same builder, same store paths ... */;

Outcome<String> result = run.resume(savedRunId);
// Resume loads the latest checkpoint, decodes the state via the codec,
// and continues from the next step.

If run.resume(runId) finds no checkpoint for the id, it returns Outcome.Failed with a clear "no such run" message rather than silently starting a fresh run — confusing one for the other was the bug pattern this primitive prevents.

Side-effect safety: the kill-and-resume invariant

The unit test that defines the contract:

A side-effecting step that runs once and crashes mid-execution must not run a second time on resume — its cached result is replayed instead.

Implementation: DefaultResumableRun keys side-effecting steps by runId:stepIndex:stepName against the configured IdempotencyStore. On the original run the orchestrator records the step's return value via IdempotencyEntry.forSuccess(...); on resume the orchestrator checks the store before invoking the body. Cached hit → replay; miss → execute + record.

Concretely: a step that posts a GitHub PR, captured the PR id in its return state, then crashed before saving its checkpoint, will on resume:

find the cached PR id in the idempotency store under <runId>:<stepIndex>:post-pr
skip the body (no second POST /pulls to GitHub)
thread the cached state forward into the next step

This is what makes hour-scale runs against rate-limited / cost-bearing external APIs feasible.

CheckpointStore selection

Store	When
`InMemoryCheckpointStore`	Tests, single-process workloads, "checkpointing-on" baseline. Doesn't survive restarts.
`FilesystemCheckpointStore`	Single-process production. JSON files under `<root>/<runId>/<stepIndex>.json`. Atomic write-tmp-then-rename with `fsync`.
`SqliteCheckpointStore`	Personal-fleet workloads (one binary on a laptop / NAS). Single SQLite file, durable across restarts. Optional `xerial:sqlite-jdbc` dep — pull only when used.
`RedisCheckpointStore`	Multi-process fleet sharing a Redis instance. Per-checkpoint blob + per-run sorted set for O(log N) latest / range queries. Reuses the existing Jedis dep (no new artefact).
`S3CheckpointStore`	Distributed deployments where checkpoints must outlive a single host — runs started on box A may resume on box B / region 2. JSON object per checkpoint at `s3://<bucket>/<keyPrefix><runId>/<stepIndex>.json`. Optional `software.amazon.awssdk:s3` dep.

Picking between distributed stores

Concern	Redis	S3
Latency	sub-ms intra-region	tens of ms (eventually-consistent regions: more)
Durability	RDB / AOF — operator-tunable	11 nines — managed
Cost	proportional to memory + ops	proportional to storage + requests
Operations	needs a Redis cluster	managed by AWS
Region affinity	typically per-region cluster	cross-region replication available

Use Redis for fast resume cycles (operator-paused runs picked up within minutes); use S3 for archival / cross-region durability. The two are not mutually exclusive — a layered stack with a Redis hot path and an S3 cold path is a reasonable production shape, but ships as a consumer-side composite rather than a built-in.

Cost ceiling enforcement

DefaultResumableRun.Builder#costMonitor is optional. Provide one and pair it with RunConfig.costCeilingUSD to make the orchestrator check spend before every step:

// Wire to the cost-governance store from tnsai-quality.
CostBudgetStore costStore = ...; // your spend ledger
DefaultResumableRun.CostMonitor monitor = () ->
    Optional.of(costStore.currentSpend(
        CostScope.tenant(tenantId), Duration.ofHours(24)));

DefaultResumableRun<S> run = DefaultResumableRun.<S>builder()
    // ...
    .costMonitor(monitor)
    .build();

run.execute(input, RunConfig.builder()
    .costCeilingUSD(new BigDecimal("10.00"))   // $10 cap
    .build());

When currentSpend ≥ ceiling, the orchestrator:

Saves a final checkpoint at the current state
Publishes Progress.RunAborted(BUDGET_EXHAUSTED)
Returns Outcome.Aborted

The consumer can resume the run on a higher ceiling once it's available — the saved checkpoint preserves work-done-so-far.

Progress events

Progress is a sealed interface with eight variants. Subscribe to a run's events for live dashboards or progress-timeout enforcement:

ProgressSink.Subscription sub = sink.subscribe(runId, event -> {
    switch (event) {
        case Progress.StepStarted s   -> dashboard.markStepRunning(s.stepIndex());
        case Progress.StepCompleted s -> dashboard.markStepDone(s.stepIndex(), s.took());
        case Progress.CheckpointSaved c -> dashboard.checkpointAt(c.checkpointId());
        case Progress.CostUpdate c    -> dashboard.spendTick(c.spentUSD());
        case Progress.Heartbeat h     -> dashboard.heartbeat(h.currentStep());
        case Progress.RunCompleted r  -> dashboard.markDone();
        case Progress.RunFailed r     -> dashboard.markFailed(r.reason());
        case Progress.RunAborted r    -> dashboard.markAborted(r.reason());
    }
});
// Cancel when the dashboard tab closes.
sub.cancel();

InMemoryProgressSink is the default. KafkaProgressSink and similar fan-out adapters are tracked as follow-ups; the SPI accepts third-party impls today.

RunConfig defaults

Field	Default	Notes
`maxDuration`	8h	Hard wall-clock cap; `MAX_DURATION_EXCEEDED` on breach
`costCeilingUSD`	empty	Optional; without it, runtime cost-ceiling checks are skipped
`checkpointInterval`	5 min	Caps worst-case replay window after a crash
`maxRetries`	3	Per-step retry budget before `Outcome.Failed`
`progressTimeout`	10 min	"stuck" detector — abort if no `Progress` event in this window

Use RunConfig.defaults() for an overnight-run-friendly starting point, or RunConfig.builder() for finer control.

Tool annotations

Tools that participate in resumable runs benefit from declaring their effect classification + idempotency expectation. Two annotations were added to @Tool for this:

import com.tnsai.annotations.*;

public class GithubTools {

    @Tool(name = "github.create_pr",
          description = "Open a pull request on a GitHub repo",
          sideEffect = SideEffect.EXTERNAL,
          idempotencyHint = IdempotencyHint.REQUIRED)
    public PullRequest createPr(@ToolParam("title") String title, ...) { ... }

    @Tool(name = "github.list_branches",
          description = "List branches on a GitHub repo",
          sideEffect = SideEffect.READ)            // idempotencyHint defaults to NONE
    public List<Branch> listBranches(...) { ... }
}

SideEffect:

Value	Meaning
`NONE`	Pure function. Default.
`READ`	Reads external state, doesn't mutate.
`WRITE`	Mutates state inside the framework or a system the framework controls.
`EXTERNAL`	Calls a third-party system whose effect we can't reverse.

IdempotencyHint:

Value	Meaning
`NONE`	No tracking. Default.
`OPTIONAL`	Track when caller supplies a key, otherwise dispatch unguarded.
`REQUIRED`	Refuse to dispatch without an idempotency key. Reserve for payments / public posts / irreversible deletes.

Both fields default to safe values so existing tools retain their current behaviour without modification — the annotations are purely additive metadata.

What's not in this layer

Out of scope for the framework primitives, tracked separately:

Distributed transaction coordination (sagas) — checkpointing is local; cross-system consistency is a different problem.
Run inspection UI — the ProgressSink API is enough for now; UI is downstream.
Backfill for runs without checkpointing — additive feature, not retroactive.
Harness execution-loop refactor — the existing AgentExecutor will move onto ResumableRun as part of the harness work (TNS-291).

Long-Running Runs