This guide covers the specialized evaluator families introduced in 0.3.0: RAG evaluators for retrieval-augmented generation pipelines, multi-turn evaluators for conversation quality, safety evaluators for harmful content detection, and the trace-eval bridge that connects observability with evaluation.

All evaluators in this guide use the LLM-as-judge pattern via the LLMJudge functional interface. See Evaluation overview for the base evaluator framework, benchmark runner, quality gates, and auto-harness.

LLMJudge Interface

Every advanced evaluator takes an LLMJudge instance, which is a @FunctionalInterface that sends a prompt and returns the LLM's text response. This decouples evaluation from any specific LLM client.

@FunctionalInterface
public interface LLMJudge {
    String judge(String prompt);
}

// Plug in any LLM client
LLMJudge judge = prompt -> myLlmClient.chat(prompt);
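
Because LLMJudge has a single method, a deterministic stub can stand in for a real model in unit tests. The sketch below redeclares the interface locally so that it compiles on its own; CannedJudge and its keyword-matching rule are hypothetical helpers, not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local copy of the interface shown above so the sketch compiles on its own.
@FunctionalInterface
interface LLMJudge {
    String judge(String prompt);
}

// Hypothetical test double: returns the canned reply of the first keyword found in the prompt.
final class CannedJudge implements LLMJudge {
    private final Map<String, String> replies = new LinkedHashMap<>();
    private final String fallback;

    CannedJudge(String fallback) {
        this.fallback = fallback;
    }

    // Registers a reply for prompts containing the keyword; keywords are checked in insertion order.
    CannedJudge when(String keyword, String reply) {
        replies.put(keyword, reply);
        return this;
    }

    @Override
    public String judge(String prompt) {
        for (Map.Entry<String, String> entry : replies.entrySet()) {
            if (prompt.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return fallback;
    }
}
```

A stub like this keeps evaluator tests fast and repeatable, at the cost of not exercising the real prompt wording.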

Evaluator SPI

All evaluators implement com.tnsai.evaluation.spi.Evaluator:

public interface Evaluator {
    String name();
    EvaluationResult evaluate(EvaluationInput context);
}

EvaluationInput is a record carrying the full evaluation context:

Field	Type	Description
`userInput`	`String`	The user's query
`agentResponse`	`String`	The agent's response to evaluate
`expectedOutput`	`String`	Ground-truth expected answer
`expectedToolSequence`	`List<String>`	Expected tool call order
`actualToolSequence`	`List<String>`	Actual tool calls made
`instructions`	`String`	Instructions the agent was given
`latencyMs`	`long`	Response latency in milliseconds
`costUsd`	`double`	Cost of the LLM call in USD
`inputTokens`	`int`	Input token count
`outputTokens`	`int`	Output token count
`metadata`	`Map<String, Object>`	Arbitrary metadata (retrieved docs, conversation history, etc.)

Build inputs with the fluent builder:

Evaluator.EvaluationInput input = Evaluator.EvaluationInput.builder()
    .userInput("What causes tides?")
    .agentResponse("Tides are caused by gravitational pull of the Moon.")
    .expectedOutput("Tides are caused by the gravitational pull of the Moon and Sun.")
    .metadata("retrieved_documents", List.of(doc1, doc2))
    .build();

EvaluationResult

Every evaluator returns an EvaluationResult record with a normalized score in [0.0, 1.0]:

public record EvaluationResult(
    String evaluatorName,
    double score,
    String details,
    Map<String, Double> metrics,
    Instant timestamp
) {
    // Factory methods
    static EvaluationResult of(String name, double score, String details, Map<String, Double> metrics);
    static EvaluationResult pass(String name, String details);   // score = 1.0
    static EvaluationResult fail(String name, String details);   // score = 0.0

    boolean passed(double threshold);
}

RAG Evaluators

Package: com.tnsai.evaluation.evaluators.rag

RAG evaluators measure retrieval-augmented generation quality across four dimensions: faithfulness, contextual precision, contextual recall, and answer relevancy. All except AnswerRelevancyEvaluator require retrieved_documents in the metadata as a List<String>.

FaithfulnessEvaluator

Measures whether the agent's response is grounded in the retrieved documents. Uses a 2-step LLM-as-judge process:

1. Extract factual claims from the response
2. Verify each claim against the retrieved context

Score: supported_claims / total_claims (1.0 = fully faithful, 0.0 = fully hallucinated)

var evaluator = new FaithfulnessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("Paris is the capital of France and has 2.1 million people.")
    .metadata("retrieved_documents", List.of(
        "Paris is the capital and most populous city of France.",
        "The population of Paris is approximately 2.1 million."
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.score() -> 1.0 (both claims supported)
// result.metrics(): supported_claims, total_claims, hallucinated_claims

Metrics returned:

Metric	Description
`supported_claims`	Number of claims verified against context
`total_claims`	Total factual claims extracted
`hallucinated_claims`	Claims not supported by context
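
The ratio and the three metrics can be derived from the per-claim verdicts as in the following sketch. The class name is hypothetical, and the handling of a response with zero claims (scored 1.0 here) is an assumption; the library may treat that case differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FaithfulnessScoreSketch {
    // verdicts.get(i) is true when claim i was judged as supported by the context.
    static Map<String, Double> metrics(List<Boolean> verdicts) {
        double total = verdicts.size();
        double supported = verdicts.stream().filter(v -> v).count();
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("supported_claims", supported);
        m.put("total_claims", total);
        m.put("hallucinated_claims", total - supported);
        return m;
    }

    // Assumption: a response with no factual claims scores 1.0 (nothing to hallucinate).
    static double score(List<Boolean> verdicts) {
        if (verdicts.isEmpty()) {
            return 1.0;
        }
        return metrics(verdicts).get("supported_claims") / verdicts.size();
    }
}
```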

ContextualPrecisionEvaluator

Measures whether the retrieved documents are relevant to the query and whether the relevant ones are ranked first. Uses rank-weighted precision: an irrelevant document near the top of the list costs more than one near the bottom.

For each document, the LLM judges relevance (YES/NO). The score is the sum of precision@k over every position k that holds a relevant document, divided by the number of relevant documents.

Score: Weighted precision (1.0 = all relevant docs ranked first, 0.0 = no relevant docs)

var evaluator = new ContextualPrecisionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What causes tides?")
    .expectedOutput("Gravitational pull of the Moon and Sun causes tides.")
    .metadata("retrieved_documents", List.of(
        "Tides are caused by gravitational forces of the Moon and Sun.",
        "The Pacific Ocean is the largest ocean on Earth.",
        "Spring tides occur when the Moon and Sun are aligned."
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): relevant_docs, total_docs, naive_precision

Metrics returned:

Metric	Description
`relevant_docs`	Number of documents judged relevant
`total_docs`	Total documents evaluated
`naive_precision`	Simple `relevant / total` ratio (without ranking weight)
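
To make the difference between the weighted score and naive_precision concrete, here is a minimal sketch of both calculations over a list of YES/NO verdicts in rank order. The class and method names are hypothetical, and returning 0.0 for an empty list is an assumption.

```java
import java.util.List;

final class WeightedPrecisionSketch {
    // relevance.get(i) is the judge's verdict for the document at rank i + 1.
    static double weighted(List<Boolean> relevance) {
        int relevantSoFar = 0;
        double sum = 0.0;
        for (int k = 1; k <= relevance.size(); k++) {
            if (relevance.get(k - 1)) {
                relevantSoFar++;
                sum += (double) relevantSoFar / k; // precision@k at a relevant position
            }
        }
        return relevantSoFar == 0 ? 0.0 : sum / relevantSoFar;
    }

    // Plain relevant / total ratio, ignoring rank.
    static double naive(List<Boolean> relevance) {
        if (relevance.isEmpty()) {
            return 0.0;
        }
        long relevant = relevance.stream().filter(r -> r).count();
        return (double) relevant / relevance.size();
    }
}
```

If the judge marks the first and third documents of the tides example as relevant, the weighted score is (1/1 + 2/3) / 2 ≈ 0.833 and the naive precision is 2/3 ≈ 0.667. Moving the irrelevant document to rank 1 lowers the weighted score to (1/2 + 2/3) / 2 ≈ 0.583 while the naive precision stays the same.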

ContextualRecallEvaluator

Measures whether all relevant information needed for the expected answer was actually retrieved. Extracts key facts from the expected output and checks how many are attributable to the retrieved documents.

Score: attributed_facts / total_facts (1.0 = all facts covered, 0.0 = none covered)

Requires: Both retrieved_documents in metadata and a non-empty expectedOutput.

var evaluator = new ContextualRecallEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .expectedOutput("Tides are caused by the Moon's gravity. Spring tides happen during full and new moons.")
    .metadata("retrieved_documents", List.of(
        "The Moon's gravitational pull is the primary cause of ocean tides."
        // Missing: spring tide information -> recall will be less than 1.0
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): attributed_facts, total_facts, missing_facts

Metrics returned:

Metric	Description
`attributed_facts`	Facts from expected output found in retrieved docs
`total_facts`	Total key facts extracted from expected output
`missing_facts`	Facts not covered by any retrieved document

AnswerRelevancyEvaluator

Measures whether the agent's response actually addresses the user's query. Scores on three normalized dimensions:

Directness: Does the response directly answer the question?
Completeness: Does it cover all aspects of the query?
Focus: Does it avoid irrelevant tangents?

Each dimension is scored 1-5 by the LLM, then normalized to [0.0, 1.0] and averaged.

Score: Average of normalized directness, completeness, and focus.

var evaluator = new AnswerRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What is the boiling point of water?")
    .agentResponse("Water boils at 100 degrees Celsius at standard atmospheric pressure.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): directness, completeness, focus
// result.details() -> "directness=5/5 completeness=5/5 focus=5/5 score=1.00"
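
The guide does not spell out how a 1-5 rating is mapped onto [0.0, 1.0]. The sketch below assumes a min-max mapping (1 becomes 0.0, 5 becomes 1.0), which is consistent with the 5/5 example above; the class name and the clamping of out-of-range ratings are also assumptions.

```java
final class LikertSketch {
    // Assumption: min-max mapping, so 1 -> 0.0, 3 -> 0.5, 5 -> 1.0.
    static double normalize(int raw) {
        int clamped = Math.max(1, Math.min(5, raw));
        return (clamped - 1) / 4.0;
    }

    // Averages the normalized ratings; returns 0.0 when no ratings are given.
    static double average(int... rawScores) {
        if (rawScores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int raw : rawScores) {
            sum += normalize(raw);
        }
        return sum / rawScores.length;
    }
}
```

Under this assumption, ratings of directness=5, completeness=4, focus=3 would average to 0.75.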

Using RAG Evaluators Together

For a comprehensive RAG evaluation, combine all four evaluators:

LLMJudge judge = prompt -> llmClient.chat(prompt);

var evaluators = List.of(
    new FaithfulnessEvaluator(judge),
    new ContextualPrecisionEvaluator(judge),
    new ContextualRecallEvaluator(judge),
    new AnswerRelevancyEvaluator(judge)
);

BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluators(evaluators)
    .agentFunction(testCase -> ragAgent.query(testCase.getInput()))
    .build();

Multi-Turn Evaluators

Package: com.tnsai.evaluation.evaluators.multiturn

Multi-turn evaluators assess conversation quality across multiple exchanges. All require conversation_history in metadata as a List<Map<String, String>> with "role" and "content" keys.

KnowledgeRetentionEvaluator

Measures whether the agent retains information from earlier conversation turns. Uses a 2-step process:

1. Extract key facts established in earlier turns
2. Check if the agent recalls those facts in later turns

Score: retained_facts / total_facts (1.0 = perfect retention, 0.0 = no retention)

Requires: At least 2 turns in conversation_history.

var evaluator = new KnowledgeRetentionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "My name is Alice and I work at Acme Corp."),
        Map.of("role", "assistant", "content", "Nice to meet you, Alice! How can I help?"),
        Map.of("role", "user", "content", "Can you summarize what you know about me?"),
        Map.of("role", "assistant", "content", "You're Alice and you work at Acme Corp.")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): retained_facts, total_facts, forgotten_facts

ConversationCompletenessEvaluator

Measures whether a multi-turn conversation achieved its stated goal. Uses a 1-5 scale:

Score	Meaning
1	Goal not addressed at all
2	Goal partially acknowledged but not resolved
3	Goal partially resolved
4	Goal mostly resolved with minor gaps
5	Goal fully achieved

Score: Normalized to [0.0, 1.0] from the raw 1-5 scale.

Requires: Both conversation_history and conversation_goal (a String) in metadata.

var evaluator = new ConversationCompletenessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_goal", "Help the user book a flight to Paris")
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "I need to fly to Paris next week"),
        Map.of("role", "assistant", "content", "I found flights on Tuesday and Thursday. Which do you prefer?"),
        Map.of("role", "user", "content", "Tuesday please"),
        Map.of("role", "assistant", "content", "Booked! Your flight departs Tuesday at 10am. Confirmation: ABC123.")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_score, normalized_score
// result.details() -> "completeness=5/5 score=1.00"

TurnRelevancyEvaluator

Measures whether the last assistant turn is relevant to the preceding conversation context. Scores on three dimensions:

Context alignment: Does the response align with the conversation so far?
Query addressing: Does it address the most recent user message?
Coherence: Is it logically consistent with prior turns?

Each dimension is scored 1-5, then normalized to [0.0, 1.0] and averaged.

Requires: At least 2 turns with at least one assistant turn.

var evaluator = new TurnRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "Tell me about quantum computing"),
        Map.of("role", "assistant", "content", "Quantum computing uses qubits..."),
        Map.of("role", "user", "content", "How does that compare to classical computing?"),
        Map.of("role", "assistant", "content", "Unlike classical bits that are 0 or 1, qubits can be in superposition...")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): context_alignment, query_addressing, coherence
// result.details() -> "context_alignment=5/5 query_addressing=5/5 coherence=5/5 score=1.00"
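
The input requirements of the multi-turn evaluators can be checked before any judge call is made. The helper below is a hypothetical pre-flight check, not a library API: it verifies the shape described in this section (at least two turns, "role" and "content" keys, at least one assistant turn).

```java
import java.util.List;
import java.util.Map;

final class ConversationHistoryCheck {
    // Returns true when the history has at least two turns, every turn has
    // "role" and "content" keys, and at least one turn comes from the assistant.
    static boolean isUsable(List<Map<String, String>> history) {
        if (history == null || history.size() < 2) {
            return false;
        }
        boolean hasAssistant = false;
        for (Map<String, String> turn : history) {
            if (!turn.containsKey("role") || !turn.containsKey("content")) {
                return false;
            }
            if ("assistant".equals(turn.get("role"))) {
                hasAssistant = true;
            }
        }
        return hasAssistant;
    }
}
```

Failing fast on malformed history avoids spending judge calls on inputs that cannot be scored.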

Safety Evaluators

Package: com.tnsai.evaluation.evaluators.safety

Safety evaluators detect harmful content in agent responses. All use inverted scoring: 1.0 = safe, 0.0 = unsafe.

ToxicityEvaluator

Detects toxic, harmful, or offensive content across five categories:

Category	What it detects
`INSULT`	Demeaning or belittling language
`THREAT`	Threatening or intimidating language
`OBSCENITY`	Profane or vulgar language
`HATE_SPEECH`	Language targeting protected groups
`HARASSMENT`	Persistent targeting or bullying

The LLM scores overall toxicity from 0 (none) to 5 (severe), and classifies each category as PRESENT or ABSENT. The evaluation is context-sensitive: quoting toxic content in order to explain it is treated differently from generating it.

Score: Inverted: 1.0 - (toxicity_score / 5.0) (1.0 = clean, 0.0 = severely toxic)

var evaluator = new ToxicityEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("Here is the information you requested about climate change.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.score() -> ~1.0 (clean)
// result.metrics(): raw_toxicity_score, insult, threat, obscenity, hate_speech, harassment

Metrics returned: raw_toxicity_score (0-5) plus per-category flags (1.0 = present, 0.0 = absent).
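
The inversion formula can be sketched in a few lines. The class name is hypothetical, and clamping out-of-range ratings to 0-5 is an assumption about how malformed judge output might be handled.

```java
final class InvertedScoreSketch {
    // rawSeverity is the judge's 0-5 rating; out-of-range values are clamped (an assumption).
    static double invert(double rawSeverity) {
        double clamped = Math.max(0.0, Math.min(5.0, rawSeverity));
        return 1.0 - (clamped / 5.0);
    }
}
```

A raw rating of 2 maps to a score of 0.6, so even moderate toxicity falls well below a threshold such as 0.8.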

BiasEvaluator

Detects demographic, cultural, or ideological bias across six categories:

Category	What it detects
`GENDER`	Stereotypes or assumptions based on gender
`RACE`	Racial or ethnic stereotypes
`AGE`	Age-based stereotypes or assumptions
`RELIGION`	Religious bias or assumptions
`POLITICAL`	Political ideology presented as fact
`SOCIOECONOMIC`	Class-based assumptions or stereotypes

Score: Inverted: 1.0 - (bias_score / 5.0) (1.0 = no bias, 0.0 = severely biased)

The evaluator also considers the user's query for context: a response that describes or quotes biased views may be appropriate when the question itself is about bias.

var evaluator = new BiasEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What are common career paths?")
    .agentResponse("Common career paths include engineering, medicine, law, and education.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_bias_score, gender_bias, race_bias, age_bias, religion_bias, political_bias, socioeconomic_bias

HallucinationEvaluator

Detects hallucinated content by checking factual claims against provided context. Unlike FaithfulnessEvaluator (which is RAG-specific), this evaluator works with any context source and classifies claims into three categories:

Classification	Meaning
`SUPPORTED`	Claim is backed by the provided context
`CONTRADICTED`	Claim conflicts with the provided context
`FABRICATED`	Claim has no basis in the context at all

Context sources (checked in order): metadata.get("context") as String or List<String>, then metadata.get("retrieved_documents"). If no context is provided, the evaluator checks for internal contradictions and invented references.
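
The lookup order can be illustrated with a small resolver. This is a sketch under assumptions: the class name is hypothetical, and joining list entries with newlines is a guess at how multiple documents are combined.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

final class ContextResolverSketch {
    // Checks "context" first, then "retrieved_documents"; accepts a String or a List.
    static Optional<String> resolve(Map<String, Object> metadata) {
        for (String key : List.of("context", "retrieved_documents")) {
            Object value = metadata.get(key);
            if (value instanceof String) {
                String text = (String) value;
                if (!text.isBlank()) {
                    return Optional.of(text);
                }
            } else if (value instanceof List) {
                List<?> docs = (List<?>) value;
                if (!docs.isEmpty()) {
                    // Assumption: documents are joined with newlines.
                    return Optional.of(docs.stream().map(String::valueOf).collect(Collectors.joining("\n")));
                }
            }
        }
        return Optional.empty(); // caller falls back to the internal-consistency check
    }
}
```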

Score: supported / total (1.0 = no hallucination, 0.0 = fully hallucinated)

var evaluator = new HallucinationEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("The product costs $99 and ships in 2 days.")
    .metadata("context", "Product price: $99. Shipping: 5-7 business days.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// "2 days" contradicts "5-7 business days" -> score < 1.0
// result.metrics(): supported, contradicted, fabricated, total_claims

Without context (internal consistency check):

var input = Evaluator.EvaluationInput.builder()
    .agentResponse("The study by Smith et al. (2024) in Nature found that...")
    .build();

// Checks for invented citations, self-contradictions, fabricated claims

Combining Safety Evaluators

Run all safety evaluators as a guardrail in production:

var safetyEvaluators = List.of(
    new ToxicityEvaluator(judge),
    new BiasEvaluator(judge),
    new HallucinationEvaluator(judge)
);

double safetyThreshold = 0.8;
for (Evaluator eval : safetyEvaluators) {
    EvaluationResult result = eval.evaluate(input);
    if (!result.passed(safetyThreshold)) {
        log.warn("Safety check failed: {} scored {}", eval.name(), result.score());
    }
}

Trace-Eval Bridge

Package: com.tnsai.evaluation.bridge

The trace-eval bridge connects the observability layer (TnsAI.Quality traces) with the evaluation layer. It adapts completed AgentTrace spans into EvaluationInput records, runs evaluators, annotates the trace with scores, and reports failures.

Architecture

AgentTrace ──> TraceToEvalAdapter ──> EvaluationInput
                                          │
                                     Evaluator[] ──> EvaluationResult[]
                                          │
                               EvalSpanAnnotator ──> Score on trace
                                          │
                               onFailure callback ──> AutoHarness / alerting

TraceEvalBridge

The main entry point. Orchestrates the full pipeline: adapt, evaluate, annotate, report.

public final class TraceEvalBridge {

    // Process a completed trace through all evaluators
    public List<EvaluationResult> process(AgentTrace trace);

    // Builder pattern
    public static Builder builder();

    // Failed evaluation record for downstream processing
    public record FailedEvaluation(
        String traceId,
        String agentId,
        String evaluatorName,
        double score,
        String details,
        Evaluator.EvaluationInput input
    ) {}
}

Builder API:

Method	Description
`evaluator(Evaluator)`	Add an evaluator to the pipeline
`evaluators(List<Evaluator>)`	Add multiple evaluators
`failureThreshold(double)`	Score below this triggers the failure callback (default: 0.5)
`onFailure(Consumer<FailedEvaluation>)`	Callback for scores below the threshold

Usage:

var bridge = TraceEvalBridge.builder()
    .evaluator(new FaithfulnessEvaluator(judge))
    .evaluator(new ToxicityEvaluator(judge))
    .evaluator(new HallucinationEvaluator(judge))
    .failureThreshold(0.5)
    .onFailure(failure -> {
        log.warn("Low score on trace {}: {} = {}",
            failure.traceId(), failure.evaluatorName(), failure.score());
        alertingService.notify(failure);
    })
    .build();

// Process a completed trace
List<EvaluationResult> results = bridge.process(completedTrace);
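
To show the control flow of process() without the real trace types, the sketch below reduces each evaluator to an "input to score" function and returns the annotations that would be written to the trace. BridgeFlowSketch is a hypothetical stand-in, and skipping evaluation for a null input is an assumption about how the bridge handles traces the adapter cannot convert.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.ToDoubleFunction;

// Simplified stand-in for the bridge: I is the adapted evaluation input.
final class BridgeFlowSketch<I> {
    private final Map<String, ToDoubleFunction<I>> evaluators = new LinkedHashMap<>();
    private final double failureThreshold;
    private final BiConsumer<String, Double> onFailure;

    BridgeFlowSketch(double failureThreshold, BiConsumer<String, Double> onFailure) {
        this.failureThreshold = failureThreshold;
        this.onFailure = onFailure;
    }

    BridgeFlowSketch<I> evaluator(String name, ToDoubleFunction<I> evaluator) {
        evaluators.put(name, evaluator);
        return this;
    }

    // Returns the scores that would be annotated on the trace, keyed "eval.<name>".
    Map<String, Double> process(I input) {
        Map<String, Double> annotations = new LinkedHashMap<>();
        if (input == null) {
            return annotations; // assumption: nothing to evaluate, nothing to annotate
        }
        evaluators.forEach((name, evaluator) -> {
            double score = evaluator.applyAsDouble(input);
            annotations.put("eval." + name, score);
            if (score < failureThreshold) {
                onFailure.accept(name, score); // only scores strictly below the threshold
            }
        });
        return annotations;
    }
}
```

Every evaluator runs and is annotated regardless of the others' scores; the threshold only decides which results are also reported through the callback.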

TraceToEvalAdapter

Converts an AgentTrace into an EvaluationInput by extracting the last user message, assistant response, tool call sequences, and latency from trace observations.

public final class TraceToEvalAdapter {
    public Evaluator.EvaluationInput adapt(AgentTrace trace);
}

Extraction logic:

User input: Extracted from GENERATION observation input
Agent response: Extracted from GENERATION observation output
Tool sequence: Collected from SPAN observation names
Latency: Computed from GENERATION observation start/end times
Metadata: Includes trace_id, agent_id, session_id, plus all trace metadata

Returns null if the trace has no chat observations.

EvalSpanAnnotator

Writes evaluation scores back onto the AgentTrace as Score objects for observability dashboards.

public final class EvalSpanAnnotator {
    public void annotate(AgentTrace trace, List<EvaluationResult> results);
}

Each evaluation result is written as a numeric score with the key eval.<evaluatorName> and source ScoreSource.HEURISTIC:

// Internally calls:
trace.addScore(Score.numeric("eval.faithfulness", 0.95, ScoreSource.HEURISTIC));
trace.addScore(Score.numeric("eval.toxicity", 1.0, ScoreSource.HEURISTIC));

Production Pipeline Example

Wire the bridge into your agent's trace completion hook for continuous evaluation:

// Set up once
LLMJudge judge = prompt -> evaluationLlm.chat(prompt);

var bridge = TraceEvalBridge.builder()
    .evaluator(new FaithfulnessEvaluator(judge))
    .evaluator(new ContextualRecallEvaluator(judge))
    .evaluator(new ToxicityEvaluator(judge))
    .evaluator(new BiasEvaluator(judge))
    .evaluator(new HallucinationEvaluator(judge))
    .failureThreshold(0.6)
    .onFailure(failure -> autoHarness.recordFailure(failure))
    .build();

// On every completed trace
agent.setTraceCompletionHook(trace -> {
    List<EvaluationResult> results = bridge.process(trace);
    // Scores are now on the trace for dashboards
    // Failures trigger auto-harness test generation
});

Evaluator Summary

Evaluator	Package	Score Meaning	Required Metadata
`FaithfulnessEvaluator`	`rag`	1.0 = grounded	`retrieved_documents`
`ContextualPrecisionEvaluator`	`rag`	1.0 = relevant docs ranked high	`retrieved_documents`
`ContextualRecallEvaluator`	`rag`	1.0 = all facts retrieved	`retrieved_documents` + `expectedOutput`
`AnswerRelevancyEvaluator`	`rag`	1.0 = directly addresses query	(none, uses `userInput` + `agentResponse`)
`KnowledgeRetentionEvaluator`	`multiturn`	1.0 = perfect recall	`conversation_history`
`ConversationCompletenessEvaluator`	`multiturn`	1.0 = goal achieved	`conversation_history` + `conversation_goal`
`TurnRelevancyEvaluator`	`multiturn`	1.0 = perfectly relevant	`conversation_history`
`ToxicityEvaluator`	`safety`	1.0 = clean	(none, uses `agentResponse`)
`BiasEvaluator`	`safety`	1.0 = no bias	(none, uses `agentResponse`, plus `userInput` for context)
`HallucinationEvaluator`	`safety`	1.0 = no hallucination	`context` or `retrieved_documents` (optional)

Advanced: Evaluation Hooks

The evaluation hook system provides lifecycle callbacks during agent execution for metric collection without modifying agent code. The contracts live in tnsai-core (com.tnsai.eval.hooks); the implementation lives in tnsai-quality.

EvalHook Interface

EvalHook defines callback methods invoked at key points during agent execution. All methods have default no-op implementations, so you only override what you need.

public interface EvalHook {
    // Agent lifecycle
    default void onAgentStart(EvalContext ctx, String agentId, String sessionId) {}
    default void onAgentStop(EvalContext ctx, String reason) {}
    default void onError(EvalContext ctx, Throwable error, String phase) {}

    // Chat lifecycle
    default void onBeforeChat(EvalContext ctx, String message) {}
    default void onAfterChat(EvalContext ctx, String response, long latencyMs) {}

    // Tool lifecycle
    default void onBeforeToolCall(EvalContext ctx, String toolName, Map<String, Object> arguments) {}
    default void onAfterToolCall(EvalContext ctx, String toolName, Object result,
                                  boolean success, long latencyMs) {}

    // Goal tracking
    default void onGoalCompleted(EvalContext ctx, String goalId, boolean success,
                                  Map<String, Object> details) {}

    // Memory access
    default void onMemoryAccess(EvalContext ctx, String operation, String key,
                                 int resultCount, long latencyMs) {}

    // Inter-agent communication
    default void onAgentCommunication(EvalContext ctx, String fromAgent, String toAgent,
                                       String messageType, long latencyMs) {}

    // Planning events
    default void onPlanGenerated(EvalContext ctx, String goalId,
                                  List<PlanStep> steps, long latencyMs) {}
    default void onPlanStepExecuted(EvalContext ctx, String actionName,
                                     boolean success, long latencyMs) {}
    default void onPlanCompleted(EvalContext ctx, boolean success,
                                  int totalSteps, int executedSteps, long totalLatencyMs) {}
    default void onPlanFailed(EvalContext ctx, String goalId, String reason) {}
}

Lifecycle flow:

onAgentStart()
    |
onBeforeChat() ----+
    |               | (loop)
onBeforeToolCall()  |
    |               |
onAfterToolCall()   |
    |               |
onAfterChat() <----+
    |
onPlanGenerated()
    |
onPlanStepExecuted() --+
    |                   | (loop)
onPlanCompleted() <----+
    |
onGoalCompleted()
    |
onAgentStop()
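
A hook usually overrides only one or two callbacks. The sketch below uses trimmed-down stand-ins (MiniEvalContext, MiniEvalHook) so that it compiles without tnsai-core; SlowChatHook and the metric name chat_latency_ms are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Trimmed-down stand-ins for EvalContext and EvalHook, limited to what the sketch needs.
interface MiniEvalContext {
    void recordMetric(String name, double value);
}

interface MiniEvalHook {
    default void onBeforeChat(MiniEvalContext ctx, String message) {}
    default void onAfterChat(MiniEvalContext ctx, String response, long latencyMs) {}
}

// Hypothetical hook: records every chat latency and remembers the slow ones.
final class SlowChatHook implements MiniEvalHook {
    private final long slowThresholdMs;
    private final List<Long> slowCalls = new CopyOnWriteArrayList<>();

    SlowChatHook(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    @Override
    public void onAfterChat(MiniEvalContext ctx, String response, long latencyMs) {
        ctx.recordMetric("chat_latency_ms", latencyMs);
        if (latencyMs > slowThresholdMs) {
            slowCalls.add(latencyMs);
        }
    }

    // Snapshot of the latencies that exceeded the threshold.
    List<Long> slowCalls() {
        return List.copyOf(slowCalls);
    }
}
```

Since onBeforeChat keeps its default no-op body, the hook stays small and unaffected by callbacks it does not care about.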

EvalHookManager

EvalHookManager (com.tnsai.eval.hooks in tnsai-quality) is the concrete implementation of EvalHandle. It maintains a CopyOnWriteArrayList of hooks and dispatches events to all registered hooks. Errors in one hook do not affect others.

EvalHookManager manager = new EvalHookManager();
manager.addHook(new LatencyHook());
manager.addHook(new QualityHook());

EvalContext ctx = EvalContext.create("session-1", "agent-1");
manager.fireOnAgentStart(ctx, "agent-1", "session-1");
manager.fireOnBeforeChat(ctx, "Hello");
// ... agent execution ...
manager.fireOnAfterChat(ctx, "Hi there!", 150);
manager.fireOnAgentStop(ctx, "COMPLETED");

Key methods:

Method	Description
`addHook(EvalHook)`	Register a hook
`removeHook(EvalHook)`	Remove a hook, returns `true` if found
`clearHooks()`	Remove all hooks
`hookCount()`	Number of registered hooks
`setEnabled(boolean)`	Enable/disable all hook execution
`isEnabled()`	Check if hooks are enabled
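
The dispatch behaviour described above (copy-on-write hook list, error isolation, global enable switch) can be sketched generically. HookDispatcherSketch is a hypothetical stand-in, not the EvalHookManager source; catching only RuntimeException and logging to stderr are assumptions.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Generic stand-in for the manager's dispatch loop; H is the hook type.
final class HookDispatcherSketch<H> {
    private final List<H> hooks = new CopyOnWriteArrayList<>();
    private volatile boolean enabled = true;

    void addHook(H hook) { hooks.add(hook); }
    boolean removeHook(H hook) { return hooks.remove(hook); }
    int hookCount() { return hooks.size(); }
    void setEnabled(boolean enabled) { this.enabled = enabled; }

    // Invokes the callback on every hook; a throwing hook is reported and skipped.
    void fire(Consumer<H> callback) {
        if (!enabled) {
            return;
        }
        for (H hook : hooks) {
            try {
                callback.accept(hook);
            } catch (RuntimeException e) {
                System.err.println("Hook failed: " + e.getMessage());
            }
        }
    }
}
```

Iterating a CopyOnWriteArrayList works on a snapshot, so hooks can be added or removed from another thread while an event is being dispatched.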

EvalHandle SPI

EvalHandle is the SPI interface in tnsai-core. The Agent class uses EvalHandle without depending on tnsai-quality directly. When tnsai-quality is on the classpath, DefaultEvalHandleFactory is discovered via ServiceLoader and returns an EvalHookManager. When absent, EvalHandle.NOOP silently ignores all operations.

// Factory discovery (handled internally by Agent)
EvalHandle.Factory factory = EvalHandle.Factory.discover();
EvalHandle handle = (factory != null) ? factory.create() : EvalHandle.NOOP;
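
One plausible shape for discover() is a thin wrapper around java.util.ServiceLoader. The sketch below uses a stand-in interface (HandleSketch) with a single operation; the real EvalHandle has more methods, and its actual discovery code may differ.

```java
import java.util.ServiceLoader;

// Stand-in SPI with one operation; nested Factory mirrors the EvalHandle.Factory idea.
interface HandleSketch {
    void fireOnBeforeChat(String message);

    // No-op handle used when no provider is on the classpath.
    HandleSketch NOOP = message -> {};

    interface Factory {
        HandleSketch create();

        // Returns the first provider registered under META-INF/services, or null if there is none.
        static Factory discover() {
            return ServiceLoader.load(Factory.class).findFirst().orElse(null);
        }
    }

    // Same fallback logic as the snippet above.
    static HandleSketch resolve() {
        Factory factory = Factory.discover();
        return (factory != null) ? factory.create() : NOOP;
    }
}
```

With no provider registered, resolve() returns NOOP, so callers never need a null check on the handle itself.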

EvalContext

Thread-safe container for collecting evaluation metrics during agent execution. Supports numeric metrics with statistical aggregation, counters, metadata, and real-time event streaming.

EvalContext ctx = EvalContext.create("session-123", "research-agent");

// Record metrics
ctx.recordMetric("latency", 150);
ctx.recordMetric("accuracy", 0.95);
ctx.recordMetric("latency", 200, Map.of("phase", "toolcall"));
ctx.incrementCounter("tool_calls");
ctx.incrementCounter("tokens", 1500);

// Metadata
ctx.setSpec("model", "claude-sonnet-4-20250514");
ctx.getSpec("model"); // -> "claude-sonnet-4-20250514"

// Statistics
EvalContext.MetricStats stats = ctx.getStats("latency");
// stats.count(), stats.min(), stats.max(), stats.average()
// stats.p50(), stats.p90(), stats.p99()

// Real-time streaming
ctx.addListener(event ->
    System.out.println(event.name() + " = " + event.value()));

// Export
Map<String, Object> report = ctx.toMap();
ctx.complete("SUCCESS");

MetricStats is a record with fields: name, count, min, max, sum, average, p50, p90, p99.
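
The guide does not say which percentile definition MetricStats uses. As an illustration only, the sketch below computes percentiles with the nearest-rank method; the class name and the choice of method are assumptions.

```java
import java.util.Arrays;

final class PercentileSketch {
    // Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(1, rank) - 1];
    }

    // Arithmetic mean; 0.0 for an empty array.
    static double average(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }
}
```

For latency samples of 150, 200, 100, and 250 ms, this gives p50 = 150, p90 = 250, and an average of 175. Interpolating definitions would report different p50 and p90 values for the same data.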
