TnsAI

Advanced Evaluation

This guide covers the specialized evaluator families introduced in 0.3.0: RAG evaluators for retrieval-augmented generation pipelines, multi-turn evaluators for conversation quality, safety evaluators for harmful content detection, and the trace-eval bridge that connects observability with evaluation.

All evaluators in this guide use the LLM-as-judge pattern via the LLMJudge functional interface. See Evaluation overview for the base evaluator framework, benchmark runner, quality gates, and auto-harness.

LLMJudge Interface

Every advanced evaluator takes an LLMJudge instance, which is a @FunctionalInterface that sends a prompt and returns the LLM's text response. This decouples evaluation from any specific LLM client.

@FunctionalInterface
public interface LLMJudge {
    String judge(String prompt);
}

// Plug in any LLM client
LLMJudge judge = prompt -> myLlmClient.chat(prompt);
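
Because the judge is just a function from prompt to text, unit tests can swap the real client for a canned one. The sketch below is a minimal example under assumptions: LLMJudge is redeclared locally so the snippet compiles on its own, and ScriptedJudge with its keyword matching is a hypothetical helper, not part of TnsAI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local copy of the interface shown above so the sketch compiles standalone.
@FunctionalInterface
interface LLMJudge {
    String judge(String prompt);
}

// Hypothetical test helper: returns a canned reply for the first matching keyword.
final class ScriptedJudge implements LLMJudge {
    private final Map<String, String> replies = new LinkedHashMap<>();
    private final String fallback;

    ScriptedJudge(String fallback) {
        this.fallback = fallback;
    }

    // Registers a reply for prompts containing the keyword; checked in insertion order.
    ScriptedJudge when(String keyword, String reply) {
        replies.put(keyword, reply);
        return this;
    }

    @Override
    public String judge(String prompt) {
        for (Map.Entry<String, String> entry : replies.entrySet()) {
            if (prompt.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return fallback;
    }
}
```

A judge built as new ScriptedJudge("NO").when("relevant", "YES") can then be passed to any evaluator constructor in place of a live client, which keeps evaluator tests deterministic.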

Evaluator SPI

All evaluators implement com.tnsai.evaluation.spi.Evaluator:

public interface Evaluator {
    String name();
    EvaluationResult evaluate(EvaluationInput context);
}

EvaluationInput is a record carrying the full evaluation context:

| Field | Type | Description |
|---|---|---|
| userInput | String | The user's query |
| agentResponse | String | The agent's response to evaluate |
| expectedOutput | String | Ground-truth expected answer |
| expectedToolSequence | List<String> | Expected tool call order |
| actualToolSequence | List<String> | Actual tool calls made |
| instructions | String | Instructions the agent was given |
| latencyMs | long | Response latency in milliseconds |
| costUsd | double | Cost of the LLM call in USD |
| inputTokens | int | Input token count |
| outputTokens | int | Output token count |
| metadata | Map<String, Object> | Arbitrary metadata (retrieved docs, conversation history, etc.) |

Build inputs with the fluent builder:

Evaluator.EvaluationInput input = Evaluator.EvaluationInput.builder()
    .userInput("What causes tides?")
    .agentResponse("Tides are caused by gravitational pull of the Moon.")
    .expectedOutput("Tides are caused by the gravitational pull of the Moon and Sun.")
    .metadata("retrieved_documents", List.of(doc1, doc2))
    .build();

EvaluationResult

Every evaluator returns an EvaluationResult record with a normalized score in [0.0, 1.0]:

public record EvaluationResult(
    String evaluatorName,
    double score,
    String details,
    Map<String, Double> metrics,
    Instant timestamp
) {
    // Factory methods (signatures only; bodies omitted)
    static EvaluationResult of(String name, double score, String details, Map<String, Double> metrics);
    static EvaluationResult pass(String name, String details);   // score = 1.0
    static EvaluationResult fail(String name, String details);   // score = 0.0

    boolean passed(double threshold);
}
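
The listing above shows signatures only. As a rough illustration of how the factory methods and passed(threshold) could fit together, here is a simplified stand-in. The method bodies are assumptions, in particular the inclusive threshold comparison and the range check, and are not taken from the library source.

```java
import java.time.Instant;
import java.util.Map;

// Simplified stand-in for the record above; bodies are assumptions, not the library source.
record EvaluationResult(String evaluatorName, double score, String details,
                        Map<String, Double> metrics, Instant timestamp) {

    // Compact constructor: reject scores outside the normalized range (also rejects NaN).
    EvaluationResult {
        if (!(score >= 0.0 && score <= 1.0)) {
            throw new IllegalArgumentException("score must be in [0.0, 1.0]: " + score);
        }
        metrics = Map.copyOf(metrics);
    }

    static EvaluationResult of(String name, double score, String details,
                               Map<String, Double> metrics) {
        return new EvaluationResult(name, score, details, metrics, Instant.now());
    }

    static EvaluationResult pass(String name, String details) {
        return of(name, 1.0, details, Map.of());
    }

    static EvaluationResult fail(String name, String details) {
        return of(name, 0.0, details, Map.of());
    }

    // Assumption: the threshold is inclusive, so a score equal to it passes.
    boolean passed(double threshold) {
        return score >= threshold;
    }
}
```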

RAG Evaluators

Package: com.tnsai.evaluation.evaluators.rag

RAG evaluators measure retrieval-augmented generation quality across four dimensions: faithfulness, contextual precision, contextual recall, and answer relevancy. All except AnswerRelevancyEvaluator require retrieved_documents in the metadata as a List<String>; AnswerRelevancyEvaluator uses only userInput and agentResponse.

FaithfulnessEvaluator

Measures whether the agent's response is grounded in the retrieved documents. Uses a 2-step LLM-as-judge process:

  1. Extract factual claims from the response
  2. Verify each claim against the retrieved context

Score: supported_claims / total_claims (1.0 = fully faithful, 0.0 = fully hallucinated)

var evaluator = new FaithfulnessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("Paris is the capital of France and has 2.1 million people.")
    .metadata("retrieved_documents", List.of(
        "Paris is the capital and most populous city of France.",
        "The population of Paris is approximately 2.1 million."
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.score() -> 1.0 (both claims supported)
// result.metrics(): supported_claims, total_claims, hallucinated_claims

Metrics returned:

| Metric | Description |
|---|---|
| supported_claims | Number of claims verified against context |
| total_claims | Total factual claims extracted |
| hallucinated_claims | Claims not supported by context |
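
The two-step process described for FaithfulnessEvaluator can be sketched roughly as follows. This is not the library's implementation: the prompt wording, the one-claim-per-line reply format, the YES/NO parsing, and the choice to score 1.0 when no claims are extracted are all assumptions made for illustration. LLMJudge is redeclared locally so the snippet compiles on its own.

```java
import java.util.Arrays;
import java.util.List;

@FunctionalInterface
interface LLMJudge {
    String judge(String prompt);
}

final class FaithfulnessSketch {
    // Step 1: ask the judge for one claim per line (prompt wording is hypothetical).
    static List<String> extractClaims(LLMJudge judge, String response) {
        String reply = judge.judge("List each factual claim on its own line:\n" + response);
        return Arrays.stream(reply.split("\n"))
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .toList();
    }

    // Step 2: verify each claim against the context, then score supported / total.
    static double score(LLMJudge judge, String response, List<String> documents) {
        List<String> claims = extractClaims(judge, response);
        if (claims.isEmpty()) {
            return 1.0; // Assumption: no claims means nothing can be hallucinated.
        }
        String context = String.join("\n", documents);
        long supported = claims.stream()
                .filter(claim -> judge.judge("Context:\n" + context
                        + "\nIs this claim supported? Answer YES or NO.\nClaim: " + claim)
                        .strip().toUpperCase().startsWith("YES"))
                .count();
        return (double) supported / claims.size();
    }
}
```

Note that this shape costs one judge call for extraction plus one per claim, which is worth keeping in mind when budgeting evaluation runs.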

ContextualPrecisionEvaluator

Measures whether the retrieved documents are relevant to the query. Uses weighted precision -- irrelevant documents ranked higher are penalized more heavily.

For each document, the LLM judges relevance (YES/NO). The score uses the formula: sum of precision@k for each relevant document at position k, divided by total relevant count.

Score: Weighted precision (1.0 = all relevant docs ranked first, 0.0 = no relevant docs)
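
To make the formula concrete, here is a small sketch that computes the weighted score from a list of per-document relevance verdicts in rank order. The class and method names are hypothetical, and returning 0.0 when nothing is relevant is inferred from the score description above.

```java
import java.util.List;

final class WeightedPrecision {
    // relevance.get(i) is the judge's YES/NO verdict for the document at rank i + 1.
    static double score(List<Boolean> relevance) {
        int relevantSoFar = 0;
        double sumOfPrecisions = 0.0;
        for (int k = 1; k <= relevance.size(); k++) {
            if (relevance.get(k - 1)) {
                relevantSoFar++;
                sumOfPrecisions += (double) relevantSoFar / k; // precision@k
            }
        }
        // Divide by the number of relevant documents; no relevant documents scores 0.0.
        return relevantSoFar == 0 ? 0.0 : sumOfPrecisions / relevantSoFar;
    }
}
```

For verdicts [YES, NO, YES] the score is (1/1 + 2/3) / 2, about 0.83. Moving the irrelevant document to the top, [NO, YES, YES], gives (1/2 + 2/3) / 2, about 0.58, even though the simple relevant / total ratio is 2/3 in both cases.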

var evaluator = new ContextualPrecisionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What causes tides?")
    .expectedOutput("Gravitational pull of the Moon and Sun causes tides.")
    .metadata("retrieved_documents", List.of(
        "Tides are caused by gravitational forces of the Moon and Sun.",
        "The Pacific Ocean is the largest ocean on Earth.",
        "Spring tides occur when the Moon and Sun are aligned."
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): relevant_docs, total_docs, naive_precision

Metrics returned:

| Metric | Description |
|---|---|
| relevant_docs | Number of documents judged relevant |
| total_docs | Total documents evaluated |
| naive_precision | Simple relevant / total ratio (without ranking weight) |

ContextualRecallEvaluator

Measures whether all relevant information needed for the expected answer was actually retrieved. Extracts key facts from the expected output and checks how many are attributable to the retrieved documents.

Score: attributed_facts / total_facts (1.0 = all facts covered, 0.0 = none covered)

Requires: Both retrieved_documents in metadata and a non-empty expectedOutput.

var evaluator = new ContextualRecallEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .expectedOutput("Tides are caused by the Moon's gravity. Spring tides happen during full and new moons.")
    .metadata("retrieved_documents", List.of(
        "The Moon's gravitational pull is the primary cause of ocean tides."
        // Missing: spring tide information -> recall will be less than 1.0
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): attributed_facts, total_facts, missing_facts

Metrics returned:

| Metric | Description |
|---|---|
| attributed_facts | Facts from expected output found in retrieved docs |
| total_facts | Total key facts extracted from expected output |
| missing_facts | Facts not covered by any retrieved document |

AnswerRelevancyEvaluator

Measures whether the agent's response actually addresses the user's query. Scores on three normalized dimensions:

  • Directness: Does the response directly answer the question?
  • Completeness: Does it cover all aspects of the query?
  • Focus: Does it avoid irrelevant tangents?

Each dimension is scored 1-5 by the LLM, then normalized to [0.0, 1.0] and averaged.

Score: Average of normalized directness, completeness, and focus.
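
The guide does not spell out how a 1-5 rating maps to [0.0, 1.0]. One plausible reading is a min-max mapping, (raw - 1) / 4, so that 1 becomes 0.0 and 5 becomes 1.0. The sketch below uses that assumption; raw / 5 would be another possibility and would give different mid-range values. The class name is hypothetical.

```java
final class RelevancyScore {
    // Assumption: min-max mapping of the 1-5 rubric, so 1 -> 0.0 and 5 -> 1.0.
    static double normalize(int raw) {
        if (raw < 1 || raw > 5) {
            throw new IllegalArgumentException("rating must be in 1..5: " + raw);
        }
        return (raw - 1) / 4.0;
    }

    // Final score: plain average of the three normalized dimensions.
    static double score(int directness, int completeness, int focus) {
        return (normalize(directness) + normalize(completeness) + normalize(focus)) / 3.0;
    }
}
```

Under this mapping, ratings of 5, 3, and 4 normalize to 1.0, 0.5, and 0.75 and average to 0.75.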

var evaluator = new AnswerRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What is the boiling point of water?")
    .agentResponse("Water boils at 100 degrees Celsius at standard atmospheric pressure.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): directness, completeness, focus
// result.details() -> "directness=5/5 completeness=5/5 focus=5/5 score=1.00"

Using RAG Evaluators Together

For a comprehensive RAG evaluation, combine all four evaluators:

LLMJudge judge = prompt -> llmClient.chat(prompt);

var evaluators = List.of(
    new FaithfulnessEvaluator(judge),
    new ContextualPrecisionEvaluator(judge),
    new ContextualRecallEvaluator(judge),
    new AnswerRelevancyEvaluator(judge)
);

BenchmarkRunner runner = BenchmarkRunner.builder()
    .evaluators(evaluators)
    .agentFunction(testCase -> ragAgent.query(testCase.getInput()))
    .build();

Multi-Turn Evaluators

Package: com.tnsai.evaluation.evaluators.multiturn

Multi-turn evaluators assess conversation quality across multiple exchanges. All require conversation_history in metadata as a List<Map<String, String>> with "role" and "content" keys.

KnowledgeRetentionEvaluator

Measures whether the agent retains information from earlier conversation turns. Uses a 2-step process:

  1. Extract key facts established in earlier turns
  2. Check if the agent recalls those facts in later turns

Score: retained_facts / total_facts (1.0 = perfect retention, 0.0 = no retention)

Requires: At least 2 turns in conversation_history.

var evaluator = new KnowledgeRetentionEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "My name is Alice and I work at Acme Corp."),
        Map.of("role", "assistant", "content", "Nice to meet you, Alice! How can I help?"),
        Map.of("role", "user", "content", "Can you summarize what you know about me?"),
        Map.of("role", "assistant", "content", "You're Alice and you work at Acme Corp.")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): retained_facts, total_facts, forgotten_facts

ConversationCompletenessEvaluator

Measures whether a multi-turn conversation achieved its stated goal. Uses a 1-5 scale:

| Score | Meaning |
|---|---|
| 1 | Goal not addressed at all |
| 2 | Goal partially acknowledged but not resolved |
| 3 | Goal partially resolved |
| 4 | Goal mostly resolved with minor gaps |
| 5 | Goal fully achieved |

Score: Normalized to [0.0, 1.0] from the raw 1-5 scale.

Requires: Both conversation_history and conversation_goal (a String) in metadata.

var evaluator = new ConversationCompletenessEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_goal", "Help the user book a flight to Paris")
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "I need to fly to Paris next week"),
        Map.of("role", "assistant", "content", "I found flights on Tuesday and Thursday. Which do you prefer?"),
        Map.of("role", "user", "content", "Tuesday please"),
        Map.of("role", "assistant", "content", "Booked! Your flight departs Tuesday at 10am. Confirmation: ABC123.")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_score, normalized_score
// result.details() -> "completeness=5/5 score=1.00"

TurnRelevancyEvaluator

Measures whether the last assistant turn is relevant to the preceding conversation context. Scores on three dimensions:

  • Context alignment: Does the response align with the conversation so far?
  • Query addressing: Does it address the most recent user message?
  • Coherence: Is it logically consistent with prior turns?

Each dimension is scored 1-5, normalized and averaged.

Requires: At least 2 turns with at least one assistant turn.

var evaluator = new TurnRelevancyEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .metadata("conversation_history", List.of(
        Map.of("role", "user", "content", "Tell me about quantum computing"),
        Map.of("role", "assistant", "content", "Quantum computing uses qubits..."),
        Map.of("role", "user", "content", "How does that compare to classical computing?"),
        Map.of("role", "assistant", "content", "Unlike classical bits that are 0 or 1, qubits can be in superposition...")
    ))
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): context_alignment, query_addressing, coherence
// result.details() -> "context_alignment=5/5 query_addressing=5/5 coherence=5/5 score=1.00"

Safety Evaluators

Package: com.tnsai.evaluation.evaluators.safety

Safety evaluators detect harmful content in agent responses. All use inverted scoring: 1.0 = safe, 0.0 = unsafe.

ToxicityEvaluator

Detects toxic, harmful, or offensive content across five categories:

| Category | What it detects |
|---|---|
| INSULT | Demeaning or belittling language |
| THREAT | Threatening or intimidating language |
| OBSCENITY | Profane or vulgar language |
| HATE_SPEECH | Language targeting protected groups |
| HARASSMENT | Persistent targeting or bullying |

The LLM scores overall toxicity from 0 (none) to 5 (severe), and classifies each category as PRESENT or ABSENT. The evaluator understands context -- quoting toxic content to explain it is treated differently from generating it.

Score: Inverted: 1.0 - (toxicity_score / 5.0) (1.0 = clean, 0.0 = severely toxic)
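
As a quick worked version of the inversion formula, the helper below maps a raw 0-5 severity rating to the safety score. The class name and the range check are illustrative assumptions; only the formula comes from this guide.

```java
final class InvertedSafetyScore {
    // Maps a raw 0-5 severity rating to a safety score where higher is safer.
    static double fromRaw(double rawSeverity) {
        if (rawSeverity < 0.0 || rawSeverity > 5.0) {
            throw new IllegalArgumentException("raw severity must be in [0, 5]: " + rawSeverity);
        }
        return 1.0 - (rawSeverity / 5.0);
    }
}
```

A raw rating of 0 maps to 1.0, a rating of 2 to 0.6, and a rating of 5 to 0.0, so each rubric point costs 0.2 of the final score.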

var evaluator = new ToxicityEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("Here is the information you requested about climate change.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.score() -> ~1.0 (clean)
// result.metrics(): raw_toxicity_score, insult, threat, obscenity, hate_speech, harassment

Metrics returned: raw_toxicity_score (0-5) plus per-category flags (1.0 = present, 0.0 = absent).

BiasEvaluator

Detects demographic, cultural, or ideological bias across six categories:

| Category | What it detects |
|---|---|
| GENDER | Stereotypes or assumptions based on gender |
| RACE | Racial or ethnic stereotypes |
| AGE | Age-based stereotypes or assumptions |
| RELIGION | Religious bias or assumptions |
| POLITICAL | Political ideology presented as fact |
| SOCIOECONOMIC | Class-based assumptions or stereotypes |

Score: Inverted: 1.0 - (bias_score / 5.0) (1.0 = no bias, 0.0 = severely biased)

The evaluator also considers the user's query for context: a response containing biased statements may be appropriate when the user asked about bias itself.

var evaluator = new BiasEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .userInput("What are common career paths?")
    .agentResponse("Common career paths include engineering, medicine, law, and education.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// result.metrics(): raw_bias_score, gender_bias, race_bias, age_bias, religion_bias, political_bias, socioeconomic_bias

HallucinationEvaluator

Detects hallucinated content by checking factual claims against provided context. Unlike FaithfulnessEvaluator (which is RAG-specific), this evaluator works with any context source and classifies claims into three categories:

| Classification | Meaning |
|---|---|
| SUPPORTED | Claim is backed by the provided context |
| CONTRADICTED | Claim conflicts with the provided context |
| FABRICATED | Claim has no basis in the context at all |

Context sources (checked in order): metadata.get("context") as String or List<String>, then metadata.get("retrieved_documents"). If no context is provided, the evaluator checks for internal contradictions and invented references.
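
The lookup order described above could be implemented along these lines. This is a sketch under assumptions: the class name is hypothetical, list entries are joined with newlines, and a blank string or empty list is treated as absent so the next source is tried. An empty result corresponds to the no-context mode.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

final class ContextResolver {
    // Checks "context" first, then "retrieved_documents"; accepts a String or a List.
    static Optional<String> resolve(Map<String, Object> metadata) {
        for (String key : List.of("context", "retrieved_documents")) {
            Object value = metadata.get(key);
            if (value instanceof String text && !text.isBlank()) {
                return Optional.of(text);
            }
            if (value instanceof List<?> docs && !docs.isEmpty()) {
                return Optional.of(docs.stream()
                        .map(doc -> String.valueOf(doc))
                        .collect(Collectors.joining("\n")));
            }
        }
        return Optional.empty(); // No context: fall back to the internal consistency check.
    }
}
```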

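Score: supported / total (1.0 = no hallucination, 0.0 = fully hallucinated). As with the other safety evaluators, higher is safer; CONTRADICTED and FABRICATED claims both lower the score.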

var evaluator = new HallucinationEvaluator(judge);
var input = Evaluator.EvaluationInput.builder()
    .agentResponse("The product costs $99 and ships in 2 days.")
    .metadata("context", "Product price: $99. Shipping: 5-7 business days.")
    .build();

EvaluationResult result = evaluator.evaluate(input);
// "2 days" contradicts "5-7 business days" -> score < 1.0
// result.metrics(): supported, contradicted, fabricated, total_claims

Without context (internal consistency check):

var input = Evaluator.EvaluationInput.builder()
    .agentResponse("The study by Smith et al. (2024) in Nature found that...")
    .build();

// Checks for invented citations, self-contradictions, fabricated claims

Combining Safety Evaluators

Run all safety evaluators as a guard rail in production:

var safetyEvaluators = List.of(
    new ToxicityEvaluator(judge),
    new BiasEvaluator(judge),
    new HallucinationEvaluator(judge)
);

double safetyThreshold = 0.8;
for (Evaluator eval : safetyEvaluators) {
    EvaluationResult result = eval.evaluate(input);
    if (!result.passed(safetyThreshold)) {
        log.warn("Safety check failed: {} scored {}", eval.name(), result.score());
    }
}

Trace-Eval Bridge

Package: com.tnsai.evaluation.bridge

The trace-eval bridge connects the observability layer (TnsAI.Quality traces) with the evaluation layer. It adapts completed AgentTrace spans into EvaluationInput records, runs evaluators, annotates the trace with scores, and reports failures.

Architecture

AgentTrace ──> TraceToEvalAdapter ──> EvaluationInput
                                          │
                                     Evaluator[] ──> EvaluationResult[]
                                          │
                               EvalSpanAnnotator ──> Score on trace
                                          │
                               onFailure callback ──> AutoHarness / alerting

TraceEvalBridge

The main entry point. Orchestrates the full pipeline: adapt, evaluate, annotate, report.

public final class TraceEvalBridge {

    // Process a completed trace through all evaluators
    public List<EvaluationResult> process(AgentTrace trace);

    // Builder pattern
    public static Builder builder();

    // Failed evaluation record for downstream processing
    public record FailedEvaluation(
        String traceId,
        String agentId,
        String evaluatorName,
        double score,
        String details,
        Evaluator.EvaluationInput input
    ) {}
}

Builder API:

| Method | Description |
|---|---|
| evaluator(Evaluator) | Add an evaluator to the pipeline |
| evaluators(List<Evaluator>) | Add multiple evaluators |
| failureThreshold(double) | Score below this triggers the failure callback (default: 0.5) |
| onFailure(Consumer<FailedEvaluation>) | Callback for scores below the threshold |
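
To illustrate how the threshold and callback interact, here is a stripped-down pipeline that runs each scorer and reports any score strictly below the threshold. It is a sketch, not the bridge's implementation: ThresholdPipeline, Named, and Failure are hypothetical stand-ins, and the adapt and annotate stages are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.ToDoubleFunction;

final class ThresholdPipeline<I> {
    // Hypothetical stand-ins for an evaluator and a failed-evaluation record.
    record Named<T>(String name, ToDoubleFunction<T> scorer) {}
    record Failure(String evaluatorName, double score) {}

    private final List<Named<I>> evaluators;
    private final double failureThreshold;
    private final Consumer<Failure> onFailure;

    ThresholdPipeline(List<Named<I>> evaluators, double failureThreshold,
                      Consumer<Failure> onFailure) {
        this.evaluators = List.copyOf(evaluators);
        this.failureThreshold = failureThreshold;
        this.onFailure = onFailure;
    }

    // Runs every evaluator; scores strictly below the threshold go to the callback.
    List<Double> process(I input) {
        List<Double> scores = new ArrayList<>();
        for (Named<I> evaluator : evaluators) {
            double score = evaluator.scorer().applyAsDouble(input);
            scores.add(score);
            if (score < failureThreshold) {
                onFailure.accept(new Failure(evaluator.name(), score));
            }
        }
        return scores;
    }
}
```

In this sketch every evaluator still runs after a failure, so one low score does not hide results from the others.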

Usage:

var bridge = TraceEvalBridge.builder()
    .evaluator(new FaithfulnessEvaluator(judge))
    .evaluator(new ToxicityEvaluator(judge))
    .evaluator(new HallucinationEvaluator(judge))
    .failureThreshold(0.5)
    .onFailure(failure -> {
        log.warn("Low score on trace {}: {} = {}",
            failure.traceId(), failure.evaluatorName(), failure.score());
        alertingService.notify(failure);
    })
    .build();

// Process a completed trace
List<EvaluationResult> results = bridge.process(completedTrace);

TraceToEvalAdapter

Converts an AgentTrace into an EvaluationInput by extracting the last user message, assistant response, tool call sequences, and latency from trace observations.

public final class TraceToEvalAdapter {
    public Evaluator.EvaluationInput adapt(AgentTrace trace);
}

Extraction logic:

  • User input: Extracted from GENERATION observation input
  • Agent response: Extracted from GENERATION observation output
  • Tool sequence: Collected from SPAN observation names
  • Latency: Computed from GENERATION observation start/end times
  • Metadata: Includes trace_id, agent_id, session_id, plus all trace metadata

Returns null if the trace has no GENERATION (chat) observations.

EvalSpanAnnotator

Writes evaluation scores back onto the AgentTrace as Score objects for observability dashboards.

public final class EvalSpanAnnotator {
    public void annotate(AgentTrace trace, List<EvaluationResult> results);
}

Each evaluation result is written as a numeric score with the key eval.<evaluatorName> and source ScoreSource.HEURISTIC:

// Internally calls:
trace.addScore(Score.numeric("eval.faithfulness", 0.95, ScoreSource.HEURISTIC));
trace.addScore(Score.numeric("eval.toxicity", 1.0, ScoreSource.HEURISTIC));

Production Pipeline Example

Wire the bridge into your agent's trace completion hook for continuous evaluation:

// Set up once
LLMJudge judge = prompt -> evaluationLlm.chat(prompt);

var bridge = TraceEvalBridge.builder()
    .evaluator(new FaithfulnessEvaluator(judge))
    .evaluator(new ContextualRecallEvaluator(judge))
    .evaluator(new ToxicityEvaluator(judge))
    .evaluator(new BiasEvaluator(judge))
    .evaluator(new HallucinationEvaluator(judge))
    .failureThreshold(0.6)
    .onFailure(failure -> autoHarness.recordFailure(failure))
    .build();

// On every completed trace
agent.setTraceCompletionHook(trace -> {
    List<EvaluationResult> results = bridge.process(trace);
    // Scores are now on the trace for dashboards
    // Failures trigger auto-harness test generation
});

Evaluator Summary

| Evaluator | Package | Score Meaning | Required Metadata |
|---|---|---|---|
| FaithfulnessEvaluator | rag | 1.0 = grounded | retrieved_documents |
| ContextualPrecisionEvaluator | rag | 1.0 = relevant docs ranked high | retrieved_documents |
| ContextualRecallEvaluator | rag | 1.0 = all facts retrieved | retrieved_documents + expectedOutput |
| AnswerRelevancyEvaluator | rag | 1.0 = directly addresses query | (none, uses userInput + agentResponse) |
| KnowledgeRetentionEvaluator | multiturn | 1.0 = perfect recall | conversation_history |
| ConversationCompletenessEvaluator | multiturn | 1.0 = goal achieved | conversation_history + conversation_goal |
| TurnRelevancyEvaluator | multiturn | 1.0 = perfectly relevant | conversation_history |
| ToxicityEvaluator | safety | 1.0 = clean | (none, uses agentResponse) |
| BiasEvaluator | safety | 1.0 = no bias | (none, uses agentResponse) |
| HallucinationEvaluator | safety | 1.0 = no hallucination | context or retrieved_documents (optional) |

Advanced: Evaluation Hooks

The evaluation hook system provides lifecycle callbacks during agent execution for metric collection without modifying agent code. The contracts live in tnsai-core (com.tnsai.eval.hooks); the implementation lives in tnsai-quality.

EvalHook Interface

EvalHook defines callback methods invoked at key points during agent execution. All methods have default no-op implementations, so you only override what you need.

public interface EvalHook {
    // Agent lifecycle
    default void onAgentStart(EvalContext ctx, String agentId, String sessionId) {}
    default void onAgentStop(EvalContext ctx, String reason) {}
    default void onError(EvalContext ctx, Throwable error, String phase) {}

    // Chat lifecycle
    default void onBeforeChat(EvalContext ctx, String message) {}
    default void onAfterChat(EvalContext ctx, String response, long latencyMs) {}

    // Tool lifecycle
    default void onBeforeToolCall(EvalContext ctx, String toolName, Map<String, Object> arguments) {}
    default void onAfterToolCall(EvalContext ctx, String toolName, Object result,
                                  boolean success, long latencyMs) {}

    // Goal tracking
    default void onGoalCompleted(EvalContext ctx, String goalId, boolean success,
                                  Map<String, Object> details) {}

    // Memory access
    default void onMemoryAccess(EvalContext ctx, String operation, String key,
                                 int resultCount, long latencyMs) {}

    // Inter-agent communication
    default void onAgentCommunication(EvalContext ctx, String fromAgent, String toAgent,
                                       String messageType, long latencyMs) {}

    // Planning events
    default void onPlanGenerated(EvalContext ctx, String goalId,
                                  List<PlanStep> steps, long latencyMs) {}
    default void onPlanStepExecuted(EvalContext ctx, String actionName,
                                     boolean success, long latencyMs) {}
    default void onPlanCompleted(EvalContext ctx, boolean success,
                                  int totalSteps, int executedSteps, long totalLatencyMs) {}
    default void onPlanFailed(EvalContext ctx, String goalId, String reason) {}
}
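
Since every callback has a no-op default, a hook can stay small. The sketch below shows what a latency-recording hook might look like. EvalContext and EvalHook are trimmed local stand-ins so the snippet compiles alone, and the metric names are made up for illustration.

```java
// Trimmed local stand-ins; the real types live in com.tnsai.eval.hooks.
interface EvalContext {
    void recordMetric(String name, double value);
}

interface EvalHook {
    default void onAfterChat(EvalContext ctx, String response, long latencyMs) {}
    default void onAfterToolCall(EvalContext ctx, String toolName, Object result,
                                 boolean success, long latencyMs) {}
}

// Hypothetical hook: overrides only the two callbacks it needs.
final class LatencyHook implements EvalHook {
    @Override
    public void onAfterChat(EvalContext ctx, String response, long latencyMs) {
        ctx.recordMetric("chat_latency_ms", latencyMs);
    }

    @Override
    public void onAfterToolCall(EvalContext ctx, String toolName, Object result,
                                boolean success, long latencyMs) {
        // One metric per tool so slow tools stand out in the statistics.
        ctx.recordMetric("tool_latency_ms." + toolName, latencyMs);
    }
}
```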

Lifecycle flow:

onAgentStart()
    |
onBeforeChat() ----+
    |               | (loop)
onBeforeToolCall()  |
    |               |
onAfterToolCall()   |
    |               |
onAfterChat() <----+
    |
onPlanGenerated()
    |
onPlanStepExecuted() --+
    |                   | (loop)
onPlanCompleted() <----+
    |
onGoalCompleted()
    |
onAgentStop()

EvalHookManager

EvalHookManager (com.tnsai.eval.hooks in tnsai-quality) is the concrete implementation of EvalHandle. It maintains a CopyOnWriteArrayList of hooks and dispatches events to all registered hooks. Errors in one hook do not affect others.

EvalHookManager manager = new EvalHookManager();
manager.addHook(new LatencyHook());
manager.addHook(new QualityHook());

EvalContext ctx = EvalContext.create("session-1", "agent-1");
manager.fireOnAgentStart(ctx, "agent-1", "session-1");
manager.fireOnBeforeChat(ctx, "Hello");
// ... agent execution ...
manager.fireOnAfterChat(ctx, "Hi there!", 150);
manager.fireOnAgentStop(ctx, "COMPLETED");

Key methods:

| Method | Description |
|---|---|
| addHook(EvalHook) | Register a hook |
| removeHook(EvalHook) | Remove a hook, returns true if found |
| clearHooks() | Remove all hooks |
| hookCount() | Number of registered hooks |
| setEnabled(boolean) | Enable/disable all hook execution |
| isEnabled() | Check if hooks are enabled |
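
The isolation guarantee (an error in one hook does not affect the others) is typically achieved by wrapping each call in its own try/catch. The generic dispatcher below sketches that pattern over a CopyOnWriteArrayList. It is an illustration only: IsolatedDispatcher is a hypothetical class, and counting failures instead of logging them is a simplification.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class IsolatedDispatcher<H> {
    // Copy-on-write list: hooks can be added or removed while an event is being fired.
    private final List<H> hooks = new CopyOnWriteArrayList<>();
    private volatile boolean enabled = true;

    void addHook(H hook) { hooks.add(hook); }
    boolean removeHook(H hook) { return hooks.remove(hook); }
    void setEnabled(boolean enabled) { this.enabled = enabled; }

    // Invokes the callback on every hook and returns how many hooks threw.
    int fire(Consumer<H> callback) {
        if (!enabled) {
            return 0;
        }
        int failures = 0;
        for (H hook : hooks) {
            try {
                callback.accept(hook);
            } catch (RuntimeException e) {
                failures++; // A real manager would probably log the exception here.
            }
        }
        return failures;
    }
}
```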

EvalHandle SPI

EvalHandle is the SPI interface in tnsai-core. The Agent class uses EvalHandle without depending on tnsai-quality directly. When tnsai-quality is on the classpath, DefaultEvalHandleFactory is discovered via ServiceLoader and returns an EvalHookManager. When absent, EvalHandle.NOOP silently ignores all operations.

// Factory discovery (handled internally by Agent)
EvalHandle.Factory factory = EvalHandle.Factory.discover();
EvalHandle handle = (factory != null) ? factory.create() : EvalHandle.NOOP;
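
For readers unfamiliar with ServiceLoader, the discovery step might look roughly like this. The interface here is a trimmed stand-in: the single fireOnBeforeChat method and the body of discover() are assumptions, while ServiceLoader.load and findFirst are standard JDK calls (Java 9+).

```java
import java.util.ServiceLoader;

// Trimmed stand-in for the SPI; the real interface has more methods.
interface EvalHandle {
    // Fallback used when no provider is on the classpath: every call is a no-op.
    EvalHandle NOOP = new EvalHandle() {};

    default void fireOnBeforeChat(String message) {} // simplified, hypothetical signature

    interface Factory {
        EvalHandle create();

        // Returns the first provider registered for Factory, or null if there is none.
        static Factory discover() {
            return ServiceLoader.load(Factory.class).findFirst().orElse(null);
        }
    }
}
```

On the classpath, ServiceLoader finds providers through a META-INF/services file named after the fully qualified factory interface, so a module that ships such a file is picked up without any code change in the caller.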

EvalContext

Thread-safe container for collecting evaluation metrics during agent execution. Supports numeric metrics with statistical aggregation, counters, metadata, and real-time event streaming.

EvalContext ctx = EvalContext.create("session-123", "research-agent");

// Record metrics
ctx.recordMetric("latency", 150);
ctx.recordMetric("accuracy", 0.95);
ctx.recordMetric("latency", 200, Map.of("phase", "toolcall"));
ctx.incrementCounter("tool_calls");
ctx.incrementCounter("tokens", 1500);

// Metadata
ctx.setSpec("model", "claude-sonnet-4-20250514");
ctx.getSpec("model"); // -> "claude-sonnet-4-20250514"

// Statistics
EvalContext.MetricStats stats = ctx.getStats("latency");
// stats.count(), stats.min(), stats.max(), stats.average()
// stats.p50(), stats.p90(), stats.p99()

// Real-time streaming
ctx.addListener(event ->
    System.out.println(event.name() + " = " + event.value()));

// Export
Map<String, Object> report = ctx.toMap();
ctx.complete("SUCCESS");

MetricStats is a record with fields: name, count, min, max, sum, average, p50, p90, p99.
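
The guide does not say how the percentiles are computed. As one possible reading, the sketch below derives the same nine fields from raw samples using the nearest-rank method; the record name, the factory method, and the percentile method are all assumptions.

```java
import java.util.Arrays;

record MetricStatsSketch(String name, long count, double min, double max, double sum,
                         double average, double p50, double p90, double p99) {

    // Builds the statistics from raw samples; requires at least one value.
    static MetricStatsSketch from(String name, double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples for " + name);
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double sum = Arrays.stream(sorted).sum();
        return new MetricStatsSketch(name, sorted.length, sorted[0], sorted[sorted.length - 1],
                sum, sum / sorted.length,
                percentile(sorted, 50), percentile(sorted, 90), percentile(sorted, 99));
    }

    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    private static double percentile(double[] sorted, int p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With the two latency samples recorded above (150 and 200), this method yields an average of 175, a p50 of 150, and a p90 and p99 of 200. An interpolating percentile method would report different values for such small sample sets.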
