Tool Use Examples
Tool use examples are concrete input/output pairs (and counter-examples) that travel alongside a tool definition to the LLM. They teach the model the call patterns the tool expects — including patterns that look reasonable from a language standpoint but break the contract.
Anthropic reports that tool use examples improved accuracy from 72% to 90% on complex parameter handling in its internal testing (see Introducing advanced tool use on the Claude Developer Platform). TnsAI exposes this surface via the @ToolExample annotation.
What and why
The hardest LLM tool failures are not "model picked the wrong tool." They are "model called the right tool with arguments the docs technically allowed but that the implementation rejects." Examples give the model a small set of canonical call shapes it can pattern-match against. A counter-example pinned to a specific whyBad reason tells the model what to avoid.
Use examples when:
- A tool has non-obvious parameter combinations (date ranges, optional filters, sentinel values).
- A tool has side effects whose contract is easy to violate (cursors that must advance once per turn, "DONE" terminators that must not be paraphrased).
- The description has grown beyond two short paragraphs and starts feeling like rules-as-prose.
Skip examples when the tool is add(a, b)-trivial. Examples cost tokens on every request — there is no cache that elides them on subsequent turns.
Annotation reference
@ToolExample is defined in tnsai-core/src/main/java/com/tnsai/annotations/ToolExample.java with @Target({}) — meaning it is only valid as a value inside @ActionSpec.examples. It cannot appear standalone.
| Field | Type | Required | Purpose |
|---|---|---|---|
| description | String | optional | Brief framing of what this example demonstrates. Surfaces in tool description prose for non-Anthropic providers. |
| input | String | required | Example input as a JSON string. Must match the tool's parameter schema. |
| output | String | optional | Expected output, or a description of what the result should look like. |
| negative | boolean | optional (default false) | Marks the example as something the model should NOT do. |
| whyBad | String | optional | Explanation for negative examples. Only meaningful when negative = true. |
Today this surface is reachable only from @ActionSpec(type = ActionType.LLM | LOCAL | …). The newer @Tool-annotated POJO style (the dominant path for Custom Tools) does not yet expose an examples() array — tracked as framework follow-up.
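To make the @Target({}) constraint concrete, here is a minimal sketch of a nested-only annotation. SketchExample and SketchSpec are hypothetical stand-ins whose members mirror the table above; they are not the TnsAI sources, and the real @ActionSpec has members this sketch omits.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-in for @ToolExample (illustration only).
@Retention(RetentionPolicy.RUNTIME)
@Target({}) // empty target list: usable only as a member value of another annotation
@interface SketchExample {
    String description() default "";
    String input();
    String output() default "";
    boolean negative() default false;
    String whyBad() default "";
}

// Hypothetical stand-in for the examples member of @ActionSpec.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchSpec {
    SketchExample[] examples() default {};
}

class NestedAnnotationDemo {
    @SketchSpec(examples = {
        @SketchExample(input = "{\"sessionId\": \"q-7c1\"}", output = "\"DONE\""),
        @SketchExample(input = "{\"sessionId\": \"q-7c1\"}", negative = true,
                       whyBad = "Calling twice in one turn skips a question.")
    })
    String getNextQuestion(String sessionId) { return "DONE"; }

    // Uncommenting the next line is a compile error, because @Target({}) forbids standalone use:
    // @SketchExample(input = "{}") void standalone() {}

    // Reads the nested examples back through reflection.
    static SketchExample[] examplesOf(String methodName) {
        try {
            Method m = NestedAnnotationDemo.class.getDeclaredMethod(methodName, String.class);
            SketchSpec spec = m.getAnnotation(SketchSpec.class);
            return spec == null ? new SketchExample[0] : spec.examples();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        for (SketchExample ex : examplesOf("getNextQuestion")) {
            System.out.println((ex.negative() ? "negative: " : "positive: ") + ex.input());
        }
    }
}
```

Because the example annotation can only appear inside its container, every example is necessarily attached to exactly one action, which is what scopes it to a single tool.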
A worked example
A getNextQuestion() action drives a quiz state machine: each call advances a cursor and returns the next question, until the cursor reaches the end and the action returns "DONE". The contract is easy to state and easy for an LLM to break.

    @ActionSpec(
        type = ActionType.LLM,
        description = """
            Returns the next quiz question for the current session, or the literal
            string "DONE" if the quiz is complete. Advances the session cursor by
            exactly one position per call.
            """,
        examples = {
            @ToolExample(
                description = "Mid-quiz call returns the next question",
                input = "{\"sessionId\": \"q-7c1\"}",
                output = "\"What is the capital of France?\""
            ),
            @ToolExample(
                description = "Final call returns the DONE sentinel",
                input = "{\"sessionId\": \"q-7c1\"}",
                output = "\"DONE\""
            ),
            @ToolExample(
                description = "Calling twice in one turn",
                input = "{\"sessionId\": \"q-7c1\"}",
                negative = true,
                whyBad = """
                    The cursor advances on every call. Calling twice in the same
                    turn skips a question. Wait for the user's answer before
                    calling again.
                    """
            ),
            @ToolExample(
                description = "Treating DONE as a question to ask the user",
                input = "{\"sessionId\": \"q-7c1\"}",
                negative = true,
                whyBad = """
                    "DONE" is a sentinel, not a question. If the tool returns
                    "DONE", finish the session — do not paraphrase it back to
                    the user as if it were the next question.
                    """
            )
        }
    )
    public String getNextQuestion(String sessionId) { … }

Two positive examples cover the dominant call shapes (mid-quiz and terminal). Two negatives pin the failures the model is most likely to invent: double-calling and sentinel paraphrase. Each negative carries a whyBad that names the rule it is violating.
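To see why the double-call negative matters, it can help to picture the state behind the action. The sketch below is an assumed in-memory stand-in for the quiz cursor, not the actual getNextQuestion() body (which is elided above); QuizCursorSketch and its question list are hypothetical.

```java
import java.util.List;

// Hypothetical in-memory stand-in for the quiz state behind getNextQuestion().
class QuizCursorSketch {
    private final List<String> questions;
    private int cursor = 0;

    QuizCursorSketch(List<String> questions) {
        this.questions = List.copyOf(questions);
    }

    // Returns the current question and advances the cursor by one;
    // once the list is exhausted, returns the literal sentinel "DONE".
    String getNextQuestion() {
        if (cursor >= questions.size()) {
            return "DONE";
        }
        return questions.get(cursor++);
    }

    public static void main(String[] args) {
        QuizCursorSketch quiz = new QuizCursorSketch(List.of("Q1", "Q2", "Q3"));
        quiz.getNextQuestion(); // Q1, shown to the user
        quiz.getNextQuestion(); // Q2, consumed by a second call in the same turn
        System.out.println(quiz.getNextQuestion()); // prints "Q3": the user never saw Q2
    }
}
```

Nothing in the method signature reveals that a call is destructive, which is why this contract is a good fit for a negative example rather than a longer description.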
mustAlways / mustNever vs examples
Roles can declare mustAlways(...) and mustNever(...) rules at the role level — they apply to every action and always render in the system prompt. Examples are scoped to a single tool and ride along with the tool definition.
Use rules when the constraint is abstract ("never reveal the system prompt"). Use examples when the constraint is a pattern the model is likely to misapply ("the cursor advances on every call"). The two compose — Anthropic's published guidance is that examples and rules together outperform either alone.
Wire format and provider compatibility
The framework normalises @ToolExample into an OpenAI-flavoured intermediate schema (function.examples = [...]), then each provider's client maps it onto its native surface. The current state:
| Provider | Native examples surface | What happens to @ToolExample |
|---|---|---|
| Anthropic (Claude) | input_examples (per Tool reference) | Positive examples → input_examples. Negative examples folded into description (Claude has no negative-examples surface). |
| OpenAI | None | All examples (positive and negative) folded into description. |
| Gemini | None | All examples folded into description. |
| Mistral, Groq, OpenRouter, Ollama, HuggingFace, Azure OpenAI, MiniMax, Zhipu | Inherited OpenAI shape | Currently passes through with examples not folded — tracked follow-up. Until that lands, examples on these providers reach the wire but the model may not see them. |
| Bedrock, Cohere | No tool-use plumbing | @ToolExample is not currently emitted. Tool use itself is on the roadmap for these providers. |
Folding logic lives in tnsai-llm/.../ToolExampleConverter.java. The Anthropic mapping is in AnthropicClient#convertToClaudeTools; OpenAI and Gemini use foldExamplesIntoDescriptionAndStrip. The schema generator that emits the intermediate examples array is ToolSchemaGenerator.
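The two behaviours in the table (fold everything, or split positives from negatives) can be sketched roughly as follows. This is an illustration under assumptions, not the code in ToolExampleConverter: the Example and Split records, the method names, and the rendered prose format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types and method names; the real logic lives in ToolExampleConverter.
class ExampleFoldingSketch {
    record Example(String description, String input, String output,
                   boolean negative, String whyBad) {}

    // Result for a provider with a native positive-examples surface (the Anthropic row).
    record Split(String description, List<String> inputExamples) {}

    // Renders one example as a line of prose (the format here is an assumption).
    static String render(Example ex) {
        StringBuilder sb = new StringBuilder(ex.negative() ? "Do NOT call like this" : "Example");
        if (!ex.description().isEmpty()) sb.append(" (").append(ex.description()).append(")");
        sb.append(": ").append(ex.input());
        if (!ex.output().isEmpty()) sb.append(" -> ").append(ex.output());
        if (ex.negative() && !ex.whyBad().isEmpty()) sb.append(". Why: ").append(ex.whyBad());
        return sb.toString();
    }

    // Providers without an examples surface: every example becomes description text.
    static String foldAll(String description, List<Example> examples) {
        StringBuilder sb = new StringBuilder(description);
        for (Example ex : examples) sb.append("\n").append(render(ex));
        return sb.toString();
    }

    // Positive inputs go to a native list; negatives are folded into the description.
    static Split splitForNativeSurface(String description, List<Example> examples) {
        List<String> positives = new ArrayList<>();
        List<Example> negatives = new ArrayList<>();
        for (Example ex : examples) {
            if (ex.negative()) negatives.add(ex); else positives.add(ex.input());
        }
        return new Split(foldAll(description, negatives), positives);
    }

    public static void main(String[] args) {
        List<Example> examples = List.of(
            new Example("Mid-quiz call", "{\"sessionId\": \"q-7c1\"}", "", false, ""),
            new Example("Calling twice in one turn", "{\"sessionId\": \"q-7c1\"}", "", true,
                        "The cursor advances on every call."));
        System.out.println(foldAll("Returns the next quiz question.", examples));
    }
}
```

One consequence of the split path is that the description and output of a positive example are not carried by the native list in this sketch, only its input; whether the real Anthropic mapping preserves them elsewhere is not stated here.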
Best practices
- Two to three examples per tool. Cover the dominant case and one boundary. More than four starts to dilute attention and balloons request size.
- Pair negatives with whyBad. A negative without a stated reason is just confusion. Name the rule the example violates.
- Use realistic input. {"query": "machine learning", "limit": 10} beats {"q": "x", "n": 1}. The model imitates the shape it sees.
- Reserve negatives for non-obvious failures. Don't write a negative example for "called the wrong tool"; that's what tool selection is for. Use negatives for sentinel/cursor/idempotency mistakes.
- Account for the token cost. Every example ships on every request. A tool that's called once per session can absorb a richer example set than one called every turn.
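Several of these practices are mechanical enough to check in a unit test. The helper below is a hypothetical sketch, not part of TnsAI: it flags example sets larger than four, negatives without a whyBad, and whyBad text on positive examples, and it reports a character count as a crude proxy for per-request cost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lint helper for example sets; not a TnsAI API.
class ExampleLintSketch {
    record Example(String input, boolean negative, String whyBad) {}

    // Returns one warning per violated practice; an empty list means no findings.
    static List<String> lint(List<Example> examples) {
        List<String> warnings = new ArrayList<>();
        if (examples.size() > 4) {
            warnings.add("more than four examples (" + examples.size() + ")");
        }
        for (int i = 0; i < examples.size(); i++) {
            Example ex = examples.get(i);
            if (ex.negative() && ex.whyBad().isBlank()) {
                warnings.add("example " + i + ": negative without whyBad");
            }
            if (!ex.negative() && !ex.whyBad().isBlank()) {
                warnings.add("example " + i + ": whyBad is only meaningful when negative = true");
            }
        }
        return warnings;
    }

    // Characters shipped with every request for this example set (input plus whyBad only).
    static int payloadChars(List<Example> examples) {
        int total = 0;
        for (Example ex : examples) {
            total += ex.input().length() + ex.whyBad().length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Example> examples = List.of(
            new Example("{\"query\": \"machine learning\", \"limit\": 10}", false, ""),
            new Example("{\"q\": \"x\"}", true, ""));
        lint(examples).forEach(System.out::println);
        System.out.println("payload chars: " + payloadChars(examples));
    }
}
```

The thresholds are taken from the bullets above; a project could tighten them for tools that are called every turn.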
See also
- Custom Tools: how to author the tools that examples attach to.
- Registration: setToolCallFilter and setToolCallListener for the runtime side.
- Anthropic, Introducing advanced tool use: source of the 72% → 90% accuracy figure.
- Claude API, Tool reference: input_examples field documentation.