Sampling
Production agents emit thousands of events per second — chat chunks, tool calls, memory writes, hook invocations. Sending all of them to a log aggregator buys a bigger Elasticsearch cluster every quarter; dropping everything to ERROR-only makes debugging impossible. The sampling SPI in tnsai-quality.sampling lets operators decide per-event whether to capture, drop, or down-sample, with policy choices that scale from "dev — capture everything" to "production — errors always, INFO 1%".
The SPI is EventSamplingPolicy. Default implementations (AlwaysCapturePolicy, HeadBasedSamplingPolicy, PriorityBasedSamplingPolicy, ErrorAlwaysPolicy) ship with the framework. Operators wire a policy into their event publisher via the SamplingAgentEventPublisher decorator.
Quick start
import com.tnsai.quality.sampling.*;
import com.tnsai.observability.events.AgentEventPublisher;
// 1. Pick a named policy shape (or build a custom one).
EventSamplingPolicy policy = SamplingPolicies.production();
// 2. Wrap your publisher.
AgentEventPublisher slf4j = new Slf4jAgentEventPublisher();
AgentEventPublisher sampled = new SamplingAgentEventPublisher(slf4j, policy);
// 3. From this point on, every publish call goes through the policy.
agent.setEventPublisher(sampled);Decision shape
SamplingDecision is a sealed interface with three variants. Java 21 exhaustive switch makes the publisher's dispatch compile-checked:
| Decision | Meaning |
|---|---|
Capture | Capture at full fidelity. No downstream rate adjustment needed. |
Drop | Discard the event. Never reaches the underlying publisher. |
CaptureDownsampled(rate) | Capture, but record the configured rate so downstream aggregators (Prometheus, Datadog) can extrapolate volume (count += 1 / rate). |
Capture.INSTANCE and Drop.INSTANCE are reusable singletons — most policies don't allocate a fresh decision per call.
Default policies
AlwaysCapturePolicy
The dev / single-tenant default — every event captured. Use the singleton: AlwaysCapturePolicy.INSTANCE.
EventSamplingPolicy verbose = AlwaysCapturePolicy.INSTANCE;HeadBasedSamplingPolicy
Coin-flip per event at a fixed rate. Stateless, lock-free, the simplest production policy. Returns CaptureDownsampled(rate) when captured.
EventSamplingPolicy p = new HeadBasedSamplingPolicy(0.10); // 10%Trade-off: no per-session affinity. A captured chat may have some chunks captured and others not. Use TailBasedSamplingPolicy (forthcoming) when you need every event from a captured session or none.
PriorityBasedSamplingPolicy
Per-EventLevel rate. Default rates encode the production observability rule of thumb — never drop anything that's already a problem:
| Level | Default rate |
|---|---|
CRITICAL | 1.0 (always) |
ERROR | 1.0 (always) |
WARN | 1.0 (always) |
INFO | 0.1 (10%) |
DEBUG | 0.01 (1%) |
// Defaults
EventSamplingPolicy p = PriorityBasedSamplingPolicy.withDefaults();
// Custom rates per level
Map<EventLevel, Double> overrides = Map.of(
EventLevel.INFO, 0.5,
EventLevel.DEBUG, 0.0 // hard-drop DEBUG
);
EventSamplingPolicy p = new PriorityBasedSamplingPolicy(overrides);ErrorAlwaysPolicy
Simpler than PriorityBasedSamplingPolicy — one switch on level, no per-level map:
WARN/ERROR/CRITICAL→ alwaysCaptureINFO/DEBUG→ coin-flip atinfoDebugRate
EventSamplingPolicy p = new ErrorAlwaysPolicy(0.05); // 5% INFO/DEBUG
EventSamplingPolicy d = ErrorAlwaysPolicy.withDefault(); // 10% INFO/DEBUGUse when "is this a problem?" is the only distinction that matters for capture.
RuleBasedSamplingPolicy
Ordered list of declarative rules, first match wins. Each rule combines an event-kind glob, a minimum level, an optional tenant filter, and an outcome (CAPTURE / DROP / SAMPLE_AT_RATE).
Glob shapes:
"*"— matches any kind"chat.*"— matches one component after the prefix (chat.chunk✓,chat.tool.invoked✗)"tool.invoked"— exact match
EventSamplingPolicy p = RuleBasedSamplingPolicy.builder()
.capture("tool.invoked") // always capture
.drop("internal.heartbeat") // hard-drop noise
.sample("chat.chunk", 0.001) // 0.1% of chunks
.capture("*", EventLevel.WARN, null) // WARN+ everywhere
.defaultDrop() // anything unmatched
.build();The programmatic builder is the canonical API; a YAML loader can produce the same Rule list at startup as a future enhancement.
AdaptiveSamplingPolicy
Wraps a delegate policy and degrades gracefully under measured backpressure. The consumer supplies a BackpressureSignal — a double in [0.0, 1.0] representing "fraction of capacity in use" (queue depth, ack lag, Resilience4j RateLimiter saturation, …). The policy reads it and:
- below
lowWaterMark(default 0.5) → delegate runs unchanged - between low and high → linearly downsample the delegate's
Capturedecisions - at or above
highWaterMark(default 0.9) → drop everything belowcriticalLevel(defaultERROR)
Critical-and-above events always survive, regardless of pressure.
BackpressureSignal queueSignal = () -> queueDepth.get() / (double) MAX_DEPTH;
EventSamplingPolicy p = new AdaptiveSamplingPolicy(
SamplingPolicies.production(), // delegate
queueSignal); // sane defaults: 0.5 / 0.9 / ERROR
// Or with custom thresholds:
EventSamplingPolicy p2 = new AdaptiveSamplingPolicy(
delegate, queueSignal,
0.4, 0.85, EventLevel.WARN);Tracks droppedDueToPressureCount() for monitoring how often pressure forced a drop above what the delegate would have done.
Named factories — SamplingPolicies
Three shapes for the most common operator intents:
| Factory | Returns | When to use |
|---|---|---|
SamplingPolicies.verbose() | AlwaysCapturePolicy.INSTANCE | dev / staging / single tenant |
SamplingPolicies.defaults() | PriorityBasedSamplingPolicy.withDefaults() | production with INFO/DEBUG distinction |
SamplingPolicies.production() | ErrorAlwaysPolicy.withDefault() | production with strict "errors-always" |
EventSamplingPolicy p = SamplingPolicies.production();Per-tenant policies
TenantPolicySamplingPolicy dispatches each call to a per-tenant policy based on EventContext.tenantId. A multi-tenant agent group can apply different rates per tenant without the publisher knowing tenant boundaries.
Resolution order:
EventContext.tenantIdfromSamplingInput.context- Configured default policy — fallback for null / blank / unmatched
EventSamplingPolicy dispatcher = TenantPolicySamplingPolicy.builder()
.tenant("acme", SamplingPolicies.verbose()) // power user — keep everything
.tenant("globex", new ErrorAlwaysPolicy(0.001)) // chatty — aggressive sampling
.defaultPolicy(SamplingPolicies.production()) // everyone else
.build();The dispatch happens on every call (no per-thread cache) so a single agent group serving multiple tenants doesn't leak rates across boundaries.
Wiring into the publisher
SamplingAgentEventPublisher is a decorator over the AgentEventPublisher SPI in tnsai-core. Each publish call builds a SamplingInput from event.eventType() + EventLevel + EventContext, asks the policy, and publishes only when SamplingDecision.shouldPublish() is true.
AgentEventPublisher base = new Slf4jAgentEventPublisher();
AgentEventPublisher sampled = new SamplingAgentEventPublisher(base, policy);Constructor variants:
| Constructor | Use when |
|---|---|
(delegate, policy) | bare publish(event) calls fall back to EventContext.empty() |
(delegate, policy, contextSupplier) | per-call tenant / agent context flows in via the supplier |
Both publish overloads (level-aware and bare) route through the policy. Bare publish(event) defaults to EventLevel.INFO. Null event passes straight through to the delegate without consulting the policy.
Composition with redaction
SamplingAgentEventPublisher and RedactingAgentEventPublisher are independent decorators — both implement the same SPI. Stack order chooses which runs first:
// Order A: sample first, redact what survives.
// Cheaper when drop rate is high — redactor only runs on captured events.
AgentEventPublisher pub = new RedactingAgentEventPublisher(
new SamplingAgentEventPublisher(base, policy), redactor);
// Order B: redact every event, then sample.
// Uniform redaction cost; slightly easier to debug ("what would have been emitted?").
AgentEventPublisher pub = new SamplingAgentEventPublisher(
new RedactingAgentEventPublisher(base, redactor), policy);Pick order A in production (where drop rate is meaningful) and order B when debugging redaction patterns.
Rate limiting (last-resort backstop)
Sampling makes per-event capture/drop decisions; a rate limiter is a per-tenant volume cap that protects the aggregator when sampling alone isn't enough. The two compose: sampling reduces volume cheaply, the limiter bounds the survivors.
import com.tnsai.quality.ratelimit.*;
// Per-tenant token bucket (Resilience4j-backed). Defaults are generous;
// configure tighter limits per noisy tenant.
EventRateLimiter limiter = Resilience4jEventRateLimiter.builder()
.defaultLimit(1000, Duration.ofSeconds(1)) // 1000 events/sec
.perTenant("noisy-tenant", 100, Duration.ofSeconds(1))
.build();
AgentEventPublisher slf4j = new Slf4jAgentEventPublisher();
AgentEventPublisher rate = new RateLimitedAgentEventPublisher(slf4j, limiter);
AgentEventPublisher sampled = new SamplingAgentEventPublisher(rate, policy);Stack order matters. sampled → rate → sink (above) lets the cheap sampling decision run first; the rate limiter only sees what survived. Reverse the stack (rate → sampled → sink) when the limiter's per-tenant cap matters more than sampling cost.
The limiter returns a Decision:
ACCEPT→ publishDROP→ silently swallow (with a counter increment for monitoring)DEFER→ opt-in deferred path; the default decorator drops, overrideonDeferto queue
Resilience4jEventRateLimiter#flushDropCounters() returns a per-tenant snapshot (and clears) so a periodic flusher can emit one ratelimit.dropped aggregate event per minute instead of suppressing the signal entirely.
SPI summary
| Type | Role |
|---|---|
EventSamplingPolicy | The SPI — single method sample(SamplingInput) → SamplingDecision. |
SamplingDecision | Sealed interface — Capture, Drop, CaptureDownsampled(rate). |
SamplingInput | Per-call frozen view — eventKind, level, context. |
AlwaysCapturePolicy | Capture every event; the dev default. |
HeadBasedSamplingPolicy | Coin-flip at fixed rate. |
PriorityBasedSamplingPolicy | Per-EventLevel rates with sensible defaults. |
ErrorAlwaysPolicy | WARN+ always; INFO/DEBUG at one rate. |
RuleBasedSamplingPolicy | Declarative rule list (glob + level + tenant + outcome), first match wins. |
AdaptiveSamplingPolicy | Wraps a delegate; downsamples linearly under backpressure. |
SamplingPolicies | Named factories — verbose, defaults, production. |
TenantPolicySamplingPolicy | Per-tenant dispatcher. |
SamplingAgentEventPublisher | AgentEventPublisher decorator that consults the policy. |
EventRateLimiter | Per-tenant token-bucket SPI returning ACCEPT / DROP / DEFER. |
Resilience4jEventRateLimiter | Default limiter — per-tenant configs, drop counters. |
RateLimitedAgentEventPublisher | Decorator that consults the limiter before delegating. |
Trade-offs
- No per-session affinity —
HeadBasedandPrioritydecide per event in isolation. A captured session may have some chunks dropped.TailBasedSamplingPolicy(forthcoming) buffers per session and decides at session end based on outcome (error / slow / expensive → capture; boring → drop). CaptureDownsampledextrapolation cost — downstream aggregators must know to scale counters by1 / rate. If your aggregator doesn't support per-event rates, useCapture(no rate) policies only.- Statistical variance — at low rates over short windows, observed capture counts vary widely. Plan dashboards over windows large enough that the law of large numbers settles.
- Cost vs. fidelity — aggressive sampling reduces aggregator cost but loses signal for rare bugs. Production agents typically run
SamplingPolicies.production()(10% INFO/DEBUG) until cost forces tighter rates.
Roadmap
The sampling + rate-limiting SPIs are stable in tnsai-quality.sampling and tnsai-quality.ratelimit. Cost governance is documented in Cost Governance. Pending work under issue #82:
TailBasedSamplingPolicy— buffer per session, decide at session end based on outcome (matches OTel's tail-based trace sampling).- YAML-driven rule loader for
RuleBasedSamplingPolicy(programmatic builder is the canonical API; YAML is a producer). - Grafana dashboard JSON + Prometheus Alertmanager rules for rate-limit drop counts and budget utilisation.