Advanced capabilities in TnsAI.LLM for observability, structured output, resilience, caching, intelligent routing, and cost management.

Providers

TnsAI.LLM ships with 13 concrete provider implementations (plus the AbstractLLMClient base), all implementing the LLMClient interface from tnsai-core. API keys are configured via environment variables.

Provider	Class	Notes
Anthropic	`AnthropicClient`	Claude models, prompt caching, extended thinking
OpenAI	`OpenAIClient`	GPT models, JSON mode, function calling
Azure OpenAI	`AzureOpenAIClient`	Azure-hosted OpenAI models
Google Gemini	`GeminiClient`	Gemini models, vision, structured output
AWS Bedrock	`BedrockClient`	Claude/Titan via AWS
Mistral	`MistralClient`	Mistral/Mixtral models
Cohere	`CohereClient`	Command models
Groq	`GroqClient`	Ultra-low latency inference
HuggingFace	`HuggingFaceClient`	Inference API models
Ollama	`OllamaClient`	Local model serving
OpenRouter	`OpenRouterClient`	Multi-provider gateway
MiniMax	`MiniMaxClient`	MiniMax models
ZhipuAI	`ZhipuAIClient`	GLM models
Whisper	`WhisperClient`	Audio transcription (audio package)

All providers support chat(), streamChat(), streamChatWithSpec(), and streamChatWithHandler() methods. Provider capabilities are exposed via getCapabilities() which returns LLMCapabilities with fields like supportsVision(), supportsFunctionCalling(), supportsStructuredOutput(), getMaxInputTokens(), getInputCostPer1KTokens(), etc.

Cross-reference: For provider setup and basic usage, see Providers.

Observability

ObservableLLMClient

ObservableLLMClient wraps any LLMClient and notifies registered observers about all requests, responses, and errors. This enables non-invasive monitoring of LLM operations without modifying existing code.

// Create base client
LLMClient baseClient = new OpenAIClient("gpt-4o");

// Create metrics collector
LLMMetrics metrics = new LLMMetrics();

// Wrap with observability
LLMClient observedClient = new ObservableLLMClient(baseClient, metrics);

// Use normally -- all calls are tracked
observedClient.chat("Hello!");

// Multiple observers (reuses the metrics collector created above)
PromptLogger logger = new PromptLogger();
LLMClient client = new ObservableLLMClient(baseClient, metrics, logger);

// Access internals
LLMClient delegate = ((ObservableLLMClient) client).getDelegate();
LLMObserver observer = ((ObservableLLMClient) client).getObserver();

LLMObserver Interface

Implement LLMObserver for custom monitoring. All methods have default no-op implementations.

public interface LLMObserver {
    // Every callback is a default no-op; override only the ones you need.
    default void onRequest(LLMClient client, String message, Optional<String> systemPrompt,
                           Optional<List<Map<String, Object>>> history,
                           Optional<List<Map<String, Object>>> tools) {}
    default void onResponse(LLMClient client, ChatResponse response, long latencyMs) {}
    default void onError(LLMClient client, Exception error, long latencyMs) {}
    default void onStreamChunk(LLMClient client, String chunk, int chunkIndex) {}
    default void onStreamComplete(LLMClient client, int totalChunks, long latencyMs) {}
    default void onStreamError(LLMClient client, Exception error, int chunksReceived, long latencyMs) {}
}

Compose multiple observers with ObservableLLMClient.CompositeObserver.of(observer1, observer2).

LLMMetrics

LLMMetrics implements LLMObserver and collects comprehensive metrics:

Request/response/error counts (global and per-provider)
Token usage estimates (input and output)
Latency statistics (average, p50, p95, p99)
Cost estimates based on provider pricing
Stream chunk counts

LLMMetrics metrics = new LLMMetrics();
LLMClient client = new ObservableLLMClient(baseClient, metrics);

// After some usage
LLMMetrics.Report report = metrics.getReport();
System.out.println("Total requests: " + report.totalRequests());
System.out.println("Total responses: " + report.totalResponses());
System.out.println("Total errors: " + report.totalErrors());
System.out.println("Success rate: " + report.successRate() + "%");
System.out.println("Error rate: " + report.errorRate() + "%");
System.out.println("Input tokens: " + report.totalInputTokens());
System.out.println("Output tokens: " + report.totalOutputTokens());
System.out.println("Estimated cost: $" + report.totalEstimatedCost());
System.out.println("Avg latency: " + report.avgLatencyMs() + "ms");
System.out.println("P95 latency: " + report.p95LatencyMs() + "ms");
System.out.println("P99 latency: " + report.p99LatencyMs() + "ms");

// Per-provider breakdown
Map<String, LLMMetrics.ProviderMetrics> byProvider = metrics.getMetricsByProvider();
for (var entry : byProvider.entrySet()) {
    LLMMetrics.ProviderMetrics pm = entry.getValue();
    System.out.println(entry.getKey() + ": " + pm.requests() + " requests, "
        + pm.avgLatencyMs() + "ms avg, $" + pm.estimatedCost());
}

metrics.reset();   // clear all metrics
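
The latency percentiles in the report can be illustrated with a small stand-alone calculation. The sketch below uses the nearest-rank method; this is an assumption for illustration, since the source does not say which percentile definition LLMMetrics uses, and `LatencyPercentiles` is a hypothetical helper, not a TnsAI.LLM class.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of TnsAI.LLM): latency statistics over raw samples. */
final class LatencyPercentiles {

    /** Returns the p-th percentile (0 < p <= 100) using the nearest-rank method. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Nearest rank: the smallest value with at least p% of the samples at or below it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples, or 0.0 when there are none. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        long[] samples = {120, 80, 300, 95, 1500, 110, 105, 90, 100, 115};
        System.out.println("avg=" + average(samples) + "ms");
        System.out.println("p50=" + percentile(samples, 50) + "ms");
        System.out.println("p95=" + percentile(samples, 95) + "ms");
    }
}
```

A single slow outlier barely moves the median but dominates p95 and p99, which is why the report exposes percentiles alongside the average.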

Structured Output (JSON Mode)

JsonModeClient

JsonModeClient wraps any LLMClient to enforce JSON output. It uses provider-native JSON mode when available and falls back to prompt engineering for providers that lack native support.

// Simple wrap
LLMClient baseClient = new OpenAIClient("gpt-4o");
JsonModeClient client = JsonModeClient.wrap(baseClient);

// Get JSON response
ChatResponse response = client.chat("List 3 programming languages");
// Response: {"languages": ["Python", "Java", "JavaScript"]}

// Parse to a specific type
LanguageList list = client.chatAs(LanguageList.class, "List 3 programming languages");

// With system prompt
Person person = client.chatAs(Person.class, "Generate a person",
    Optional.of("You are a test data generator."));

With JSON Schema

ResponseFormat format = ResponseFormat.jsonSchema("Person", Map.of(
    "type", "object",
    "properties", Map.of(
        "name", Map.of("type", "string"),
        "age", Map.of("type", "integer")
    ),
    "required", List.of("name", "age")
));

JsonModeClient client = JsonModeClient.builder()
    .client(baseClient)
    .responseFormat(format)
    .build();

ChatResponse response = client.chat("Generate a person");
// Response: {"name": "Alice", "age": 30}

ResponseFormat

ResponseFormat represents the desired output format. There are three types:

Type	Factory	Behavior
`TEXT`	`ResponseFormat.text()`	Default text output
`JSON_OBJECT`	`ResponseFormat.jsonObject()`	Valid JSON, structure not enforced
`JSON_SCHEMA`	`ResponseFormat.jsonSchema(name, schema)`	JSON conforming to provided schema

Generate schema from a class: ResponseFormat.jsonSchema("Person", Person.class).

Convert to provider-specific formats: format.toOpenAIFormat(), format.toGeminiFormat(), format.toOllamaFormat().

Key methods: isJson(), hasSchema(), isStrict(), getSchema(), getSchemaName().

Advanced Options

JsonModeClient client = JsonModeClient.builder()
    .client(baseClient)
    .responseFormat(format)
    .objectMapper(customMapper)              // custom Jackson ObjectMapper
    .forcePromptEngineering(true)            // skip native JSON mode, always use prompt engineering
    .schemaFromClass("Person", Person.class) // generate schema from class
    .build();

// Check native support
boolean nativeSupport = client.supportsNativeJsonMode();

// Parse raw JSON
Person p = client.parseResponse("{\"name\":\"Alice\",\"age\":30}", Person.class);

On parse failure, JsonModeClient.JsonParseException is thrown, which contains getRawContent() for debugging.
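
When JSON is obtained through prompt engineering rather than a native JSON mode, models sometimes wrap the object in Markdown fences or surrounding prose, which is one common cause of parse failures. The sketch below shows one heuristic for recovering the object before parsing. It is an illustration only: `JsonExtractor` is a hypothetical helper, and the source does not describe how JsonModeClient cleans responses internally.

```java
/** Hypothetical helper (not part of TnsAI.LLM): pulls a JSON object out of a chatty reply. */
final class JsonExtractor {

    /**
     * Returns the substring from the first '{' to the last '}' of the reply.
     * Heuristic only: it does not validate the JSON and assumes one top-level object.
     */
    static String extractObject(String reply) {
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("No JSON object found in reply");
        }
        return reply.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String reply = "Sure! Here it is:\n```json\n{\"name\":\"Alice\",\"age\":30}\n```";
        System.out.println(extractObject(reply));
    }
}
```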

Resilience

CircuitBreakerClient

CircuitBreakerClient prevents cascading failures by fast-failing when a provider is consistently down. Implements the standard three-state circuit breaker pattern.

State transitions: CLOSED (normal, counting failures) -> OPEN (fast-fail after N consecutive failures) -> HALF_OPEN (after recovery timeout, allows one probe request) -> CLOSED (if probe succeeds) or back to OPEN (if probe fails).
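
To make the transitions concrete, here is a minimal stand-alone state machine that follows the description above. It is a sketch, not the CircuitBreakerClient source: `CircuitBreakerSketch` and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```java
import java.time.Duration;

/** Hypothetical, simplified circuit breaker; not the CircuitBreakerClient implementation. */
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long recoveryTimeoutMs;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    CircuitBreakerSketch(int failureThreshold, Duration recoveryTimeout) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeoutMs = recoveryTimeout.toMillis();
    }

    /** Returns true if a request may proceed at time nowMs. */
    synchronized boolean allowRequest(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= recoveryTimeoutMs) {
            state = State.HALF_OPEN; // recovery timeout elapsed: let exactly one probe through
            return true;
        }
        return state == State.CLOSED; // OPEN fast-fails; HALF_OPEN already has a probe in flight
    }

    /** A success (including a successful probe) closes the circuit. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed probe, or the N-th consecutive failure, opens the circuit. */
    synchronized void recordFailure(long nowMs) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = nowMs;
        }
    }

    synchronized State state() {
        return state;
    }

    public static void main(String[] args) {
        CircuitBreakerSketch cb = new CircuitBreakerSketch(3, Duration.ofSeconds(30));
        for (int i = 0; i < 3; i++) {
            cb.recordFailure(i);
        }
        System.out.println("after 3 failures: " + cb.state());
        System.out.println("allowed at t=1s: " + cb.allowRequest(1_000));
        System.out.println("allowed at t=31s: " + cb.allowRequest(31_000) + ", state " + cb.state());
    }
}
```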

// Simple wrap (5 failures, 30s recovery)
LLMClient resilient = CircuitBreakerClient.wrap(openaiClient);

// Custom settings
LLMClient tuned = CircuitBreakerClient.builder()
    .client(openaiClient)
    .failureThreshold(3)
    .recoveryTimeout(Duration.ofSeconds(60))
    .build();

// Inspect state
CircuitBreakerClient cb = (CircuitBreakerClient) tuned;
CircuitBreakerClient.State state = cb.getState();        // CLOSED, OPEN, HALF_OPEN
int failures = cb.getConsecutiveFailures();

// Metrics
CircuitBreakerClient.CircuitBreakerMetrics metrics = cb.getMetrics();
System.out.println("Success rate: " + metrics.successRate() + "%");
System.out.println("Total requests: " + metrics.totalRequests());
System.out.println("Rejected (fast-fail): " + metrics.rejectedCount());
System.out.println("State transitions: " + metrics.stateTransitions());

// Manual reset
cb.reset();

When the circuit is open, every request throws CircuitOpenException, which carries the model name, failure count, recovery timeout, and trip time. Compose with FallbackRouter for automatic failover:

FallbackRouter router = FallbackRouter.of(
    CircuitBreakerClient.wrap(primary),
    CircuitBreakerClient.wrap(fallback)
);

Caching

PromptCachingClient

PromptCachingClient wraps any LLMClient and adds Anthropic-style prompt caching support. It automatically adds cache control markers to system prompts, tools, and conversation history breakpoints.

PromptCachingClient client = PromptCachingClient.builder()
    .client(anthropicClient)
    .cacheSystemPrompt(true)          // cache system prompt (default: true)
    .cacheTools(true)                 // cache tool definitions (default: true)
    .cacheHistoryBreakpoints(2)       // cache breakpoints in history (max 4)
    .minTokensForCaching(1024)        // minimum tokens to trigger caching (default: 1024)
    .build();

// Use normally -- caching is automatic
ChatResponse response = client.chat("Hello", systemPrompt, history, tools);

// Check cache statistics
System.out.println("Cache read tokens: " + client.getTotalCacheReadTokens());
System.out.println("Cache creation tokens: " + client.getTotalCacheCreationTokens());
System.out.println("Hit rate: " + client.getCacheHitRate());
System.out.println("Estimated savings: " + (client.getEstimatedSavings() * 100) + "%");
System.out.println("Requests: " + client.getRequestCount());

client.resetStats();

Cost savings: Cache reads are 90% cheaper than regular input tokens. Cache writes are 25% more expensive (one-time cost). TTL is 5 minutes, refreshed on each use.
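
The pricing figures above imply a simple break-even calculation. The sketch below computes the cost of a cached prefix over several requests relative to sending it uncached every time, assuming the first request pays the write surcharge and every later request arrives within the TTL and reads from cache. `PromptCacheMath` is a hypothetical helper, and the result covers only the cached prefix, not the rest of the request.

```java
/** Hypothetical helper (not part of TnsAI.LLM): relative cost of a cached prompt prefix. */
final class PromptCacheMath {
    static final double WRITE_MULTIPLIER = 1.25; // cache write: 25% more than normal input
    static final double READ_MULTIPLIER = 0.10;  // cache read: 90% cheaper than normal input

    /**
     * Cost of the cached prefix across n requests, relative to sending it uncached n times.
     * Assumes one cache write followed by (n - 1) cache reads within the TTL.
     */
    static double relativePrefixCost(int requests) {
        if (requests < 1) {
            throw new IllegalArgumentException("requests must be >= 1");
        }
        double cached = WRITE_MULTIPLIER + READ_MULTIPLIER * (requests - 1);
        return cached / requests;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 10}) {
            System.out.printf("n=%d -> %.3f of uncached cost%n", n, relativePrefixCost(n));
        }
    }
}
```

Under these assumptions a single request costs more with caching than without (1.25x), the second request already brings the total below the uncached cost, and ten requests pay roughly 21.5% of the uncached prefix cost.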

Cross-reference: For more on caching strategies, see Caching.

SemanticCache

The SemanticCache interface provides similarity-based caching for LLM responses. Unlike exact-match caching, it matches semantically equivalent prompts using embedding vectors.

SemanticCache cache = InMemorySemanticCache.builder()
    .embeddingProvider(new OpenAIEmbeddingProvider())
    .highThreshold(0.95)       // direct hit threshold
    .lowThreshold(0.70)        // below this, skip cache
    .ttlSeconds(3600)          // 1-hour TTL
    .maxEntries(10000)
    .build();

// Check cache
Optional<CacheEntry> hit = cache.findSimilar("What is Python?", 0.90);
if (hit.isPresent()) {
    return hit.get().response();   // cache hit
}

// Cache miss -- call LLM and store
String response = llm.chat("What is Python?");
cache.put("What is Python?", response);

// With system prompt consideration
cache.findSimilar("What is Python?", Optional.of("Be concise"), 0.90);
cache.put("What is Python?", Optional.of("Be concise"), response);

// Find multiple similar entries
List<SemanticCache.SimilarityResult> results = cache.findAllSimilar("Python language", 0.70, 5);

// Statistics
SemanticCache.CacheStats stats = cache.getStats();
System.out.println("Hits: " + stats.hits());
System.out.println("Misses: " + stats.misses());
System.out.println("Hit rate: " + stats.hitRate());
System.out.println("Size: " + stats.currentSize());
System.out.println("Evictions: " + stats.evictions());
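
Similarity-based matching typically compares embedding vectors with cosine similarity and then applies thresholds like the ones configured above. The sketch below shows that idea in isolation. It is an assumption that the cache uses cosine similarity, and the source does not say what happens between the low and high thresholds, so the `BORDERLINE` outcome and the `SimilaritySketch` class are hypothetical.

```java
/** Hypothetical helper (not part of TnsAI.LLM): cosine similarity plus threshold check. */
final class SimilaritySketch {
    enum Decision { HIT, BORDERLINE, MISS }

    /** Cosine similarity of two equal-length vectors; 0.0 if either is a zero vector. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** HIT at or above high, MISS below low, BORDERLINE in between (assumed middle zone). */
    static Decision classify(double similarity, double high, double low) {
        if (similarity >= high) {
            return Decision.HIT;
        }
        if (similarity < low) {
            return Decision.MISS;
        }
        return Decision.BORDERLINE;
    }

    public static void main(String[] args) {
        double sim = cosine(new double[] {1, 1}, new double[] {1, 0});
        System.out.println("similarity=" + sim + " -> " + classify(sim, 0.95, 0.70));
    }
}
```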

Routing

TnsAI.LLM provides multiple routing strategies that implement LLMRouter (which extends LLMClient). All routers can be used as drop-in replacements for a single client.

CapabilityRouter

Routes requests based on required capabilities (vision, function calling, structured output, context window size). Selects the first eligible client matching the capability filter.

CapabilityRouter router = CapabilityRouter.builder()
    .addClient(new OpenAIClient("gpt-4o"))         // vision + tools
    .addClient(new GroqClient("llama-3.3-70b"))    // tools only
    .addClient(new OllamaClient("llama3.2"))       // basic text
    .defaultRequirement(cap -> cap.supportsFunctionCalling())
    .build();

// Use as a normal LLMClient
router.chat("Use the search tool");

// Select specific capability on demand
Optional<LLMClient> visionClient = router.selectVisionCapable();
Optional<LLMClient> toolClient = router.selectToolCapable();
Optional<LLMClient> jsonClient = router.selectStructuredOutputCapable();
Optional<LLMClient> bigContext = router.selectWithMinContext(128_000);

// Generic capability filter
Optional<LLMClient> custom = router.selectByCapability(
    cap -> cap.supportsVision() && cap.supportsFunctionCalling());

// Statistics
LLMRouter.RoutingStats stats = router.getStats();
router.resetStats();

CostBasedRouter

Routes to the cheapest viable provider: it sorts clients by input cost and tries the cheapest first, falling back to more expensive options on failure. This can reduce costs by up to 85% for simple queries.

CostBasedRouter router = CostBasedRouter.builder()
    .addClient(new OpenAIClient("gpt-4o-mini"))         // $0.15/1M input
    .addClient(new GroqClient("llama-3.3-70b"))         // $0.59/1M input
    .addClient(new OpenAIClient("gpt-4o"))              // $2.50/1M input
    .addClient(new AnthropicClient("claude-sonnet-4"))   // $3.00/1M input
    .build();

// Simple queries go to cheapest model
router.chat("What is 2+2?");

// With capability requirement
CostBasedRouter visionRouter = CostBasedRouter.builder()
    .addClient(new OpenAIClient("gpt-4o-mini"))
    .addClient(new OpenAIClient("gpt-4o"))
    .requireCapability(cap -> cap.supportsVision())
    .build();

// Cost tracking
CostBasedRouter.CostStats stats = router.getCostStats();
System.out.println("Total estimated cost: $" + stats.totalEstimatedCost());
System.out.println("Cost per provider: " + stats.costPerProvider());
System.out.println("Input tokens: " + stats.totalInputTokens());
System.out.println("Output tokens: " + stats.totalOutputTokens());
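
The cheapest-first strategy with fallback can be sketched without the library. In the example below, `Candidate` is a hypothetical stand-in for a client (a name, an input price, and a call function), and `CheapestFirstSketch` is not the CostBasedRouter implementation. It requires Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of cheapest-first routing; not the CostBasedRouter source. */
final class CheapestFirstSketch {

    /** Stand-in for a client: display name, input price per 1M tokens, and a call function. */
    record Candidate(String name, double inputCostPer1M, Function<String, String> call) {}

    /** Tries candidates from cheapest to most expensive and returns the first success. */
    static String route(List<Candidate> candidates, String prompt) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Candidate::inputCostPer1M));
        RuntimeException last = null;
        for (Candidate c : sorted) {
            try {
                return c.name() + ": " + c.call().apply(prompt);
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall back to the next cheapest
            }
        }
        throw new IllegalStateException("all candidates failed", last);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("gpt-4o", 2.50, p -> "expensive answer"),
            new Candidate("gpt-4o-mini", 0.15, p -> { throw new RuntimeException("down"); }),
            new Candidate("llama-3.3-70b", 0.59, p -> "ok"));
        System.out.println(route(candidates, "What is 2+2?"));
    }
}
```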

LatencyBasedRouter

Routes to the fastest available provider. Learns from actual response times and adapts routing decisions using a moving average window of the last 20 measurements.

LatencyBasedRouter router = LatencyBasedRouter.builder()
    .addClient(new GroqClient("llama-3.3-70b"))          // ~100ms TTFT
    .addClient(new OpenAIClient("gpt-4o-mini"))          // ~300ms TTFT
    .addClient(new AnthropicClient("claude-sonnet-4"))    // ~600ms TTFT
    .maxLatencyMs(500)       // exclude providers slower than 500ms
    .build();

// Routes to fastest (Groq) automatically, adapts over time
router.chat("Quick question");

// Latency statistics
LatencyBasedRouter.LatencyStats stats = router.getLatencyStats();
System.out.println("Fastest: " + stats.fastestProvider() + " (" + stats.fastestLatencyMs() + "ms)");
System.out.println("Per provider: " + stats.avgLatencyPerProvider());

The router initially uses the estimated latency from LLMCapabilities.getEstimatedLatencyMs(). As actual measurements accumulate, routing decisions shift to measured performance. Failed requests are penalized with +5000ms latency to deprioritize unreliable providers.
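
The adaptive behavior described above can be sketched as a fixed-size window per provider. The class below is a hypothetical illustration, not the LatencyBasedRouter source. It assumes a failure is recorded as its elapsed time plus the 5000ms penalty, and it is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-provider latency tracker; not the LatencyBasedRouter implementation. */
final class LatencyWindowSketch {
    static final int WINDOW = 20;                // keep the last 20 measurements
    static final long FAILURE_PENALTY_MS = 5000; // added to failed requests

    private final Deque<Long> samples = new ArrayDeque<>();
    private final long estimatedMs;

    /** estimatedMs is the initial guess used until real measurements arrive. */
    LatencyWindowSketch(long estimatedMs) {
        this.estimatedMs = estimatedMs;
    }

    void recordSuccess(long latencyMs) {
        add(latencyMs);
    }

    /** Assumption: a failure counts as its elapsed time plus the penalty. */
    void recordFailure(long latencyMs) {
        add(latencyMs + FAILURE_PENALTY_MS);
    }

    private void add(long valueMs) {
        if (samples.size() == WINDOW) {
            samples.removeFirst(); // drop the oldest measurement
        }
        samples.addLast(valueMs);
    }

    /** Moving average over the window, or the initial estimate when no samples exist. */
    double currentMs() {
        if (samples.isEmpty()) {
            return estimatedMs;
        }
        return samples.stream().mapToLong(Long::longValue).average().orElse(estimatedMs);
    }

    public static void main(String[] args) {
        LatencyWindowSketch groq = new LatencyWindowSketch(100);
        System.out.println("before measurements: " + groq.currentMs() + "ms");
        groq.recordSuccess(200);
        groq.recordSuccess(400);
        System.out.println("after two successes: " + groq.currentMs() + "ms");
        groq.recordFailure(0);
        System.out.println("after one failure: " + groq.currentMs() + "ms");
    }
}
```

A router built on this would pick the provider with the lowest `currentMs()` among those under the configured maximum.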

Cross-reference: For routing basics and FallbackRouter, see Routing.

Cost Management

CostTracker

The CostTracker interface provides a unified API for recording and analyzing LLM usage costs. The InMemoryCostTracker is the default implementation.

CostTracker tracker = new InMemoryCostTracker();

// Record usage
UsageRecord record = UsageRecord.builder()
    .modelId("gpt-4o")
    .inputTokens(1000)
    .outputTokens(500)
    .build();
tracker.record(record);

// Query
List<UsageRecord> all = tracker.getRecords();
List<UsageRecord> byTime = tracker.getRecords(Instant.now().minus(Duration.ofHours(1)), Instant.now());
List<UsageRecord> byModel = tracker.getRecordsByModel("gpt-4o");
List<UsageRecord> byProvider = tracker.getRecordsByProvider("openai");

// Costs
BigDecimal total = tracker.getTotalCost();
BigDecimal periodCost = tracker.getTotalCost(periodStart, periodEnd);
Map<String, BigDecimal> byModelCost = tracker.getCostByModel();
Map<String, BigDecimal> byProviderCost = tracker.getCostByProvider();

// Statistics
CostTracker.CostStatistics stats = tracker.getStatistics();
System.out.println("Records: " + stats.recordCount());
System.out.println("Total cost: $" + stats.totalCost());
System.out.println("Avg cost/request: $" + stats.averageCostPerRequest());
System.out.println("Input tokens: " + stats.totalInputTokens());
System.out.println("Output tokens: " + stats.totalOutputTokens());
System.out.println("Cached tokens: " + stats.totalCachedTokens());
System.out.println("Avg latency: " + stats.averageLatencyMs() + "ms");
stats.mostExpensiveRequest().ifPresent(r ->
    System.out.println("Most expensive: " + r.modelId() + " $" + r.cost()));
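
For intuition about how a cost figure might be derived from token counts, the sketch below applies per-million-token prices with BigDecimal arithmetic. It is not the CostTracker implementation: `TokenCostSketch` is hypothetical, the $2.50/1M input price is taken from the CostBasedRouter example comment, and the $10.00/1M output price is a placeholder chosen for illustration.

```java
import java.math.BigDecimal;

/** Hypothetical helper (not part of TnsAI.LLM): cost of one request from token counts. */
final class TokenCostSketch {

    /** Prices are expressed in dollars per one million tokens. */
    static BigDecimal cost(long inputTokens, long outputTokens,
                           BigDecimal inputPricePer1M, BigDecimal outputPricePer1M) {
        BigDecimal input = inputPricePer1M.multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal output = outputPricePer1M.multiply(BigDecimal.valueOf(outputTokens));
        // Dividing by 1,000,000 is an exact decimal shift, so no rounding mode is needed.
        return input.add(output).movePointLeft(6);
    }

    public static void main(String[] args) {
        // 1000 input and 500 output tokens, matching the UsageRecord example above.
        BigDecimal c = cost(1000, 500, new BigDecimal("2.50"), new BigDecimal("10.00"));
        System.out.println("Estimated cost: $" + c.stripTrailingZeros().toPlainString());
    }
}
```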

BudgetManager

BudgetManager provides configurable spending limits with automatic enforcement, alert thresholds, and time-based budget periods. Thread-safe for concurrent use.

BudgetManager budget = BudgetManager.builder()
    .limit(100.00)                             // $100 budget
    .monthly()                                 // or .daily() or .period(Duration.ofDays(7))
    .alertThresholds(0.50, 0.80, 0.90, 0.95)  // or .defaultAlertThresholds()
    .hardLimit(true)                           // hard limit (default) vs .softLimit()
    .costTracker(tracker)                      // optional: sync from CostTracker
    .onAlert(alert -> log.warn("Budget alert: {}", alert))
    .onLimitExceeded(cost -> stopRequests())
    .build();

// Atomic check-and-spend (prevents TOCTOU race conditions)
if (budget.trySpend(new BigDecimal("0.05"))) {
    // make API call
} else {
    // budget exceeded
}

// Or separate check/spend
if (budget.canSpend(estimatedCost)) {
    // make API call
    budget.recordSpend(actualCost);    // returns false if limit exceeded
}

// Query status
BigDecimal remaining = budget.getRemainingBudget();
double usage = budget.getUsagePercent();               // 0.0 to 1.0+
Duration timeLeft = budget.getRemainingTime();

// Comprehensive status
BudgetManager.BudgetStatus status = budget.getStatus();
System.out.println("State: " + status.state());       // OK, WARNING, CRITICAL, EXCEEDED, UNLIMITED
System.out.println("Spend: $" + status.currentSpend() + " / $" + status.limit());
System.out.println("Remaining: $" + status.remaining());

// Manual reset
budget.reset();
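
The reason trySpend is preferable to a separate canSpend/recordSpend pair is that the check and the update happen as one step. One way to get that property is a compare-and-set loop, sketched below. This is an illustration of the technique only; `AtomicBudgetSketch` is hypothetical and the source does not say how BudgetManager implements trySpend.

```java
import java.math.BigDecimal;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of an atomic check-and-spend; not the BudgetManager source. */
final class AtomicBudgetSketch {
    private final BigDecimal limit;
    private final AtomicReference<BigDecimal> spent = new AtomicReference<>(BigDecimal.ZERO);

    AtomicBudgetSketch(BigDecimal limit) {
        this.limit = limit;
    }

    /** Records the spend and returns true only if it keeps the total within the limit. */
    boolean trySpend(BigDecimal amount) {
        while (true) {
            BigDecimal current = spent.get();
            BigDecimal next = current.add(amount);
            if (next.compareTo(limit) > 0) {
                return false; // would exceed the limit: reject without changing state
            }
            // Succeeds only if no other thread changed the total since we read it.
            if (spent.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    BigDecimal remaining() {
        return limit.subtract(spent.get());
    }

    public static void main(String[] args) {
        AtomicBudgetSketch budget = new AtomicBudgetSketch(new BigDecimal("0.10"));
        System.out.println(budget.trySpend(new BigDecimal("0.05")));
        System.out.println(budget.trySpend(new BigDecimal("0.05")));
        System.out.println(budget.trySpend(new BigDecimal("0.01")));
        System.out.println("remaining: $" + budget.remaining());
    }
}
```

With a separate check and spend, two threads can both pass the check before either records its spend, so the total can overshoot the limit; the loop above retries instead.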

BudgetState values: OK (< 70%), WARNING (70-90%), CRITICAL (90-100%), EXCEEDED (> 100%), UNLIMITED (no limit set).
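
The state bands can be expressed as a small classification function. The sketch below is hypothetical (`BudgetStateSketch` is not a library class) and makes two assumptions the source leaves open: the lower bound of each band is inclusive, so exactly 70% is WARNING and exactly 90% is CRITICAL, and the limit is positive when present.

```java
import java.math.BigDecimal;

/** Hypothetical classifier mirroring the BudgetState bands; not the BudgetManager source. */
final class BudgetStateSketch {
    enum State { OK, WARNING, CRITICAL, EXCEEDED, UNLIMITED }

    private static final BigDecimal WARNING_AT = new BigDecimal("0.70");
    private static final BigDecimal CRITICAL_AT = new BigDecimal("0.90");

    /** A null limit means no limit is set. Assumes a positive limit otherwise. */
    static State classify(BigDecimal spend, BigDecimal limit) {
        if (limit == null) {
            return State.UNLIMITED;
        }
        if (spend.compareTo(limit) > 0) {
            return State.EXCEEDED; // strictly more than 100% of the limit
        }
        if (spend.compareTo(limit.multiply(CRITICAL_AT)) >= 0) {
            return State.CRITICAL;
        }
        if (spend.compareTo(limit.multiply(WARNING_AT)) >= 0) {
            return State.WARNING;
        }
        return State.OK;
    }

    public static void main(String[] args) {
        BigDecimal limit = new BigDecimal("100.00");
        for (String spend : new String[] {"10.00", "75.00", "95.00", "100.01"}) {
            System.out.println("$" + spend + " -> " + classify(new BigDecimal(spend), limit));
        }
    }
}
```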

BudgetAlertType values: THRESHOLD_REACHED, LIMIT_EXCEEDED, PERIOD_RESET.

Budgets automatically reset when the period elapses. If a CostTracker is provided, the budget syncs spend from tracked records at each period reset.

Cross-reference: For cost tracking basics, see Cost Tracking.

Advanced LLM Patterns