Tool Calling Fails Silently When Your Agent Schema Doesn't Match Reality

Why LLM tool calling fails silently in production — a practitioner's deep dive into the four mismatch modes, hallucinated parameters, tool-chain failures, idempotency, and the testing strategy that catches schema drift before it becomes a production incident.

Meritshot, 13 min read

An e-commerce platform ships an AI-powered customer-support agent. Among other things, it can issue refunds via a process_refund tool. In staging, it handles hundreds of test cases perfectly. In production, over one weekend, the agent issues the same refund twice on 47 orders — once when the customer asks politely, and again when they follow up asking for status. Total loss: roughly $34,000.

The tool call succeeds every time. The schema is technically correct. The logs show nothing unusual. The bug is not in the code that processes refunds. It is in the gap between what the schema says the tool does and what the tool actually does when called twice for the same order.

Three more scenarios every practitioner will recognise immediately:

  • A DevOps assistant asked to "clean up temp files" generates delete_file(path="/tmp") — which the tool happily accepts — and takes out half the system's running state in a single call.
  • A calendar agent asked to "book a meeting with John" invents an attendee ID like "user_john_001", gets a 404 from the backend, and reports to the user "I could not find John" — while John was perfectly reachable through a lookup tool the agent never called.
  • A customer-support agent creates seventeen identical tickets because the LLM API timed out mid-response and the framework retried without an idempotency key.

None of these look like bugs in any of the individual pieces. Every tool did exactly what it was designed to do. The failures live in the contract between the model's understanding of the tools and the tools' actual behaviour.


Why Schemas Fail Silently: The Four Mismatch Modes

Tool-calling failures do not look like exceptions. They look like the agent doing something reasonable that turns out to be wrong. Four distinct classes:

  • Description mismatch. The description tells the model what the tool does; what it actually does has drifted.
  • Parameter mismatch. The parameter definitions do not reflect what the tool really expects or accepts.
  • Selection mismatch. The model picks the wrong tool among several because descriptions are too similar.
  • Response mismatch. The tool returns data in a shape the agent was not prepared to handle.

Each class fails differently. Each requires a different fix. Most teams only discover which one bit them after the postmortem.

Real scenario: A logistics platform had three tools: get_shipment_status, get_order_status, and get_tracking_details. An agent given the query "when will my package arrive" called get_tracking_details. The tool returned internal warehouse-routing data that had nothing to do with customer-facing delivery estimates. The schema said "returns tracking information" — technically true, catastrophically vague. The agent confidently told the customer their package was in a forklift bay in Indianapolis. That was the correct warehouse location. It was not the answer the customer wanted.

[Figure: four-quadrant diagram of tool-calling mismatch modes]

Four different failure classes, four different fixes. Knowing which one you are hitting is half the fight.


The Description Field Is Behavioural Programming, Not Documentation

The single most misunderstood fact about tool calling: the model reads the description field as instructions, not as documentation. Every word changes behaviour.

"Fetch user information," "Retrieve user information," and "Look up user information" produce measurably different selection rates in A/B tests. Adding "Use this tool when the user explicitly asks for their account details" can raise correct selection from 60% to 95% — with no code change.

Real scenario: A customer-support agent had two tools: search_knowledge_base (internal docs) and search_web (public web). The model kept picking search_web for company policy questions. The original descriptions:

  • search_knowledge_base: "Search the company knowledge base."
  • search_web: "Search the web for information."

Both technically accurate. The model picked search_web because "information" sounded more general than "knowledge base."

The rewrite:

  • search_knowledge_base: "Search our internal company policies, SOPs, and product documentation. Use this FIRST for anything about our company, products, or policies."
  • search_web: "Search the public web. Use ONLY when the question is about external information (news, general facts, competitor data)."

Selection accuracy jumped from roughly 55% to 93% overnight. Zero code changed.
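As a minimal sketch, the rewritten descriptions could live in tool definitions like the ones below. The dictionary layout follows the JSON-Schema style that most function-calling APIs accept, but the exact envelope varies by provider. The `query` parameter and the `describe` helper are assumptions added for illustration; the original tools' parameters are not shown in this article.

```python
# Hypothetical tool definitions. Only the names and descriptions come from the
# scenario above; the `query` parameter is assumed for illustration.
def _query_only_parameters():
    return {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    }

TOOLS = [
    {
        "name": "search_knowledge_base",
        "description": (
            "Search our internal company policies, SOPs, and product documentation. "
            "Use this FIRST for anything about our company, products, or policies."
        ),
        "parameters": _query_only_parameters(),
    },
    {
        "name": "search_web",
        "description": (
            "Search the public web. Use ONLY when the question is about external "
            "information (news, general facts, competitor data)."
        ),
        "parameters": _query_only_parameters(),
    },
]

def describe(tools):
    """Render the name/description pairs the model sees, one tool per line."""
    return "\n".join(f"{tool['name']}: {tool['description']}" for tool in tools)

print(describe(TOOLS))
```

Keeping the descriptions in one reviewable structure like this makes it practical to diff and version them the same way as code.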

A second scenario: A finance agent had transfer_funds and create_payment. The model kept picking transfer_funds for customer-facing invoices, because "transfer" sounded closer to "payment" than "create" did. Rewriting create_payment to begin with "Issue an outgoing payment to a vendor or customer — use for invoices, refunds, and reimbursements" flipped selection to correct in over 95% of cases.

Principles that actually work in description fields:

  • State the positive use case ("use this when...")
  • State the negative use case ("do not use this when...")
  • Add priority signals ("FIRST", "ONLY", "PREFERRED") when multiple similar tools exist
  • Include a short input example for ambiguous parameters
  • Avoid internal jargon the model might misread as a keyword
  • Put the most important guidance at the start, because models tend to weight the opening of a description more heavily

Pros of treating descriptions as behavioural programming: major improvements without touching code; tool routing becomes predictable; easier debugging because you can read exactly what the model is seeing.

Cons: descriptions need to be version-controlled and tested like code; updating one tool's description can affect selection rates for neighbouring tools.


Parameter Design: Where Most Bugs Actually Live

The types in your schema are not what you think. Unless your provider enforces schema-constrained decoding, the model treats them as hints, not constraints. It will pass a string where an integer is expected and trust the tool to handle it. It will omit required fields if the user did not provide them. It will invent enum values that are not in the allowed set.

The real failure modes:

  • Required fields with no default behaviour. The model will invent values or skip the field entirely if the user did not specify.
  • Free-form strings where an enum should be. The model will occasionally drift to plausible-sounding values that do not exist: "priority": "high_priority" when the valid enum was ["low", "medium", "high"].
  • Numeric fields silently accepting strings. The tool does string-to-int conversion internally; the model never learns it sent the wrong type.
  • Optional fields that behave differently when omitted vs. null vs. empty string. The model cannot distinguish these without explicit description.
  • Timestamps without format specification. The model generates "2024-01-15" when the tool expected "2024-01-15T00:00:00Z" — and the tool silently accepts the broken value or misinterprets the timezone.
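One way to turn two of these silent failures into explicit ones is to validate strictly at the tool boundary instead of coercing. The sketch below is illustrative only: `parse_timestamp` and `parse_count` are hypothetical helper names, and the policy of rejecting date-only values is an assumption about what the tool should do.

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Accept only full ISO 8601 timestamps that carry an explicit UTC offset."""
    if not isinstance(value, str) or "T" not in value:
        raise ValueError(
            f"expected a timestamp like 2024-01-15T00:00:00Z, got {value!r}"
        )
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit offset first.
    normalised = value[:-1] + "+00:00" if value.endswith("Z") else value
    parsed = datetime.fromisoformat(normalised)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp {value!r} has no timezone offset")
    return parsed.astimezone(timezone.utc)

def parse_count(value):
    """Reject "3" instead of silently converting it to 3."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected an integer, got {type(value).__name__}: {value!r}")
    return value
```

Returning the error text to the model, rather than swallowing it, gives the agent a chance to correct the call on the next turn.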

Real scenario: A scheduling agent had a create_meeting tool with a timezone parameter marked optional. When the user did not specify, the tool defaulted to UTC. The model, not knowing this default, frequently omitted the parameter for users in IST or PST, and meetings were silently created hours off. The fix was twofold: make timezone required, and add to the description: "Required. Use the user's stated timezone, or their location timezone if inferable from context, or ask the user if unclear." Scheduling errors dropped to near zero within a week.

A second scenario: A deal-tracking CRM agent had an update_deal(deal_id, status) tool where status was a free-form string. The underlying system actually accepted only ["prospecting", "qualified", "proposal", "closed_won", "closed_lost"]. The model generated statuses like "in_progress", "negotiating", and "waiting". All were rejected silently and never reflected in the CRM.
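A hedged sketch of how the CRM case could be tightened: declare the enum in the schema so the model can see it, and validate again inside the tool so an invalid value produces an error the model can act on. The return shape (`ok` / `error`) is an assumption, and the actual CRM write is omitted.

```python
ALLOWED_STATUSES = ("prospecting", "qualified", "proposal", "closed_won", "closed_lost")

# JSON-Schema style parameter definition: the enum is visible to the model.
UPDATE_DEAL_PARAMETERS = {
    "type": "object",
    "properties": {
        "deal_id": {"type": "string"},
        "status": {"type": "string", "enum": list(ALLOWED_STATUSES)},
    },
    "required": ["deal_id", "status"],
}

def update_deal(deal_id, status):
    """Validate before touching the CRM and return an explicit, actionable error."""
    if status not in ALLOWED_STATUSES:
        allowed = ", ".join(ALLOWED_STATUSES)
        return {
            "ok": False,
            "error": f"Invalid status {status!r}. Allowed values: {allowed}.",
        }
    # The real CRM write would happen here (omitted in this sketch).
    return {"ok": True, "deal_id": deal_id, "status": status}
```

Listing the allowed values in the error message matters: it lets the agent retry with a valid status instead of reporting a vague failure to the user.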

Pros of tight parameter design: failures become explicit (validation errors) rather than silent; the model generates more correct calls on the first try.

Cons: adds rigidity; requires upfront schema-design discipline that teams under deadline pressure routinely skip.


The Hallucinated Parameter Problem

The model will invent parameter values. Not occasionally — consistently, under specific conditions.

Conditions under which hallucination spikes:

  • Required fields for which the user has not provided a value
  • ID-style parameters (customer IDs, ticket IDs, SKUs) — the model generates plausible-looking identifiers
  • Enums with poorly-named values (numeric codes, internal abbreviations)
  • Multi-step tasks where the second call needs context from the first that was not explicitly returned

Real scenario: A calendar agent had a schedule_meeting(attendee_id, start_time, duration) tool. When users said "schedule a meeting with John tomorrow at 3pm," the agent often invented plausible-looking attendee IDs like "john_001" or "user_john". None of those existed. The tool failed with a 404. The agent apologised to the user and moved on — hiding the failure as "couldn't find John" when the real problem was its own hallucination.

The fix was a two-step flow: add a lookup_attendee(name) tool that returned actual IDs, and update the system prompt to instruct the agent to always call lookup before schedule. Hallucinated IDs dropped to zero.
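The two-step flow can be sketched as follows. The in-memory `DIRECTORY`, the `ATT-` ID format, and the return shapes are all hypothetical stand-ins for the real attendee backend; the point is that `schedule_meeting` answers an unknown ID with an instruction to use the lookup tool rather than a bare 404.

```python
# Hypothetical in-memory directory standing in for the real attendee backend.
DIRECTORY = {"ATT-1001": "John Carter", "ATT-1002": "Joan Smith"}

def lookup_attendee(name):
    """Return every attendee whose name contains the query (case-insensitive)."""
    needle = name.strip().lower()
    return [
        {"attendee_id": attendee_id, "name": full_name}
        for attendee_id, full_name in DIRECTORY.items()
        if needle in full_name.lower()
    ]

def schedule_meeting(attendee_id, start_time, duration):
    """Refuse unknown IDs with an error that points the model at the lookup tool."""
    if attendee_id not in DIRECTORY:
        return {
            "ok": False,
            "error": (
                f"Unknown attendee_id {attendee_id!r}. Do not guess IDs: "
                "call lookup_attendee(name) and use an ID it returns."
            ),
        }
    return {
        "ok": True,
        "attendee_id": attendee_id,
        "start_time": start_time,
        "duration": duration,
    }
```

If the lookup returns several matches, the agent has something concrete to ask the user about instead of inventing an identifier.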

Patterns that reduce hallucination:

  • Add a "lookup" or "search" tool for anything that needs an ID
  • Use enum constraints with readable human names, not codes
  • Default to asking the user when required context is missing
  • Include example values in descriptions: "customer_id: Unique customer identifier, format CUST-XXXXXX (e.g., CUST-849201). Never guess — use lookup_customer if you do not have this."

[Figure: before/after flow showing how a lookup tool eliminates hallucinated parameter IDs]

Hallucination is cheap to prevent with a lookup tool. Most teams discover this only after it costs them production incidents.


The Multi-Tool Confusion Problem

When an agent has ten tools and two of them have similar descriptions, the model does not make a careful decision. It makes a rough similarity-based choice.

Real scenario: A research agent had two tools:

  • fetch_document(doc_id) — pull a specific internal document by ID
  • search_documents(query) — search across the document corpus

A user said "get the Q3 budget report." The model called fetch_document(doc_id="Q3_budget_report") with a hallucinated ID. The document did not exist at that identifier. The tool returned a 404. The agent gave up and told the user "document not found" — never falling back to the search tool.

The fix did not touch the tools at all. It rewrote the descriptions:

  • fetch_document: "Pull a document when the user gives you a specific ID (format: DOC-XXXXX). Do NOT use this if the user is describing what they want — use search_documents instead."
  • search_documents: "Search the document corpus when the user describes what they want (by title, topic, or content). Use this BEFORE fetch_document unless a specific ID is given."

Tool selection is effectively a classification task, and like any classification task, the wrong features produce systematic errors.

Things that reduce multi-tool confusion:

  • Keep tool counts small — 5 to 7 is a sweet spot; more than 15 becomes trouble
  • Make descriptions explicitly contrastive — tell the model when to use each AND when not to
  • For tools that must coexist, add a "tool selection rule" section to the system prompt
  • Group related tools with similar name prefixes (customer_search, customer_update, customer_delete)

Idempotency: The Safety Property Nobody Designs For Until It Is Too Late

Idempotency means that calling a tool twice with the same parameters has the same effect as calling it once. For read-only tools, this is trivially true. For write tools (creating records, sending messages, charging payment methods, issuing refunds) it has to be designed in; it cannot be assumed.

The $34,000 refund-doubling incident at the start of this article was an idempotency failure. The schema described process_refund(order_id, amount) as if it were safe to call multiple times. It was not. The underlying payment processor was happy to process a second refund on the same order if called again.

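The practical requirement: Every write tool must accept an idempotency key, and the key must be identical every time the same intended action is attempted. Derive it from what identifies the action (for example session ID + action type + the target record, such as the order ID), ideally in the orchestrator rather than leaving it to the model. A key that embeds a fresh timestamp changes on every retry and deduplicates nothing. The backend must return the same response for duplicate calls with the same key rather than executing the action twice.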
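A minimal sketch of the backend side, under stated assumptions: the response store is an in-memory dict (a real system would need durable, shared storage), the `executed` list stands in for the call to the payment processor, and the key is derived from session ID, action type, and order ID so that repeating the same intended action yields the same key.

```python
import hashlib

_responses = {}  # idempotency key -> response already returned (in-memory sketch)
executed = []    # records real executions so the effect is observable

def idempotency_key(session_id, action, target_id):
    """Derive a stable key from what identifies the intended action."""
    raw = f"{session_id}:{action}:{target_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def process_refund(order_id, amount, key):
    """Execute the refund once per key; replay the stored response afterwards."""
    if key in _responses:
        return _responses[key]
    executed.append((order_id, amount))  # stands in for the payment processor call
    response = {"order_id": order_id, "amount": amount, "status": "refunded"}
    _responses[key] = response
    return response

key = idempotency_key("session-42", "refund", "ORD-1001")
first = process_refund("ORD-1001", 25.0, key)
second = process_refund("ORD-1001", 25.0, key)  # e.g. the customer's follow-up
```

Because this key includes the session ID, a follow-up in a different session would still produce a new key. For refunds, keying on the order alone may be the safer choice.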

The retry problem: When an LLM API call times out and the framework retries, the tool call is retried too. Without idempotency keys, every retry is a potential duplicate action. The seventeen-ticket scenario at the start of this article was caused by exactly this: a timeout, a framework retry, and a create_ticket tool with no idempotency protection.
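The retry case can be sketched like this. `ToolTimeout`, `call_with_retry`, and the demo `create_ticket` are hypothetical names rather than any framework's API. The detail that matters is that the key is generated once, outside the retry loop, so every attempt carries the same key.

```python
import uuid

class ToolTimeout(Exception):
    """Raised by the hypothetical transport when a tool call times out."""

def call_with_retry(tool, args, max_attempts=3):
    """Retry a write tool, reusing one idempotency key across all attempts."""
    key = str(uuid.uuid4())  # generated once, outside the retry loop
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(**args, idempotency_key=key)
        except ToolTimeout as exc:
            last_error = exc
    raise last_error

# Demo tool: the first call writes the ticket and then "times out",
# simulating a response that is lost after the write has succeeded.
tickets = {}         # idempotency key -> ticket
calls = {"count": 0}

def create_ticket(subject, idempotency_key):
    calls["count"] += 1
    if idempotency_key not in tickets:
        tickets[idempotency_key] = {"id": f"T-{len(tickets) + 1}", "subject": subject}
    if calls["count"] == 1:
        raise ToolTimeout("response lost after the write succeeded")
    return tickets[idempotency_key]

result = call_with_retry(create_ticket, {"subject": "Refund status"})
```

Generating the key inside the loop would give each attempt a new key and reproduce the duplicate-ticket behaviour.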


Testing Tool Schemas Before They Ship

Most teams test tools in isolation: does create_meeting correctly schedule a meeting when called with valid parameters? This is necessary but insufficient. What also needs testing:

Schema-to-reality drift tests. For every tool, periodically call it with the exact parameter set a naive model would generate for five realistic user queries. Check whether the output matches what the model would expect based on the description.

Boundary condition tests. What happens when the model omits an optional parameter that has a surprising default? What happens when it passes a string to a numeric field? What happens when the user provides ambiguous input for an enum field?

Multi-tool selection tests. For every pair of semantically similar tools, construct ten user queries that are ambiguous between them and verify that the model selects the intended tool.
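A selection test can be sketched as a small harness that takes any callable mapping a query to a tool name. In a real suite that callable would wrap the model with the production tool definitions. Here `keyword_selector` is a deliberately simple stand-in so the example runs on its own, and the assumption that document IDs are `DOC-` followed by five digits is mine.

```python
import re

def keyword_selector(query):
    """Stand-in for the model: picks fetch only when a DOC-12345 style ID appears."""
    if re.search(r"\bDOC-\d{5}\b", query):
        return "fetch_document"
    return "search_documents"

# Ambiguous queries paired with the tool the team intends the model to pick.
CASES = [
    ("get the Q3 budget report", "search_documents"),
    ("open DOC-48213", "fetch_document"),
    ("find the onboarding checklist", "search_documents"),
]

def selection_accuracy(select_tool, cases):
    """Return (accuracy, failures) for a tool-selection callable."""
    failures = []
    for query, expected in cases:
        chosen = select_tool(query)
        if chosen != expected:
            failures.append((query, expected, chosen))
    return 1 - len(failures) / len(cases), failures

accuracy, failures = selection_accuracy(keyword_selector, CASES)
```

Running the same cases before and after every description change shows whether an edit to one tool shifted selection for its neighbours.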

Idempotency tests. For every write tool, call it twice with identical parameters and verify the result is identical to a single call.

Retry simulation. Simulate a timeout on every tool call and verify that the retry does not produce a duplicate action.


The Permission Layer: Preventing Destructive Tool Calls

The DevOps incident, in which the assistant called delete_file(path="/tmp"), was not a model failure. It was a permission design failure. The tool accepted any path. The model had no signal that /tmp was off-limits.

Three-layer permission design:

Schema-level: Restrict the valid values in the schema itself. Ideally, path is an enum of allowed deletion targets rather than a free-form string. If path must be a string, constrain it with a pattern and state the rule in the description: "Valid paths are within /tmp/user_uploads/ only. Never use /tmp directly."
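A description only tells the model what to avoid, so the same rule is worth enforcing in code. The sketch below is one possible server-side check, assuming the allowed root is /tmp/user_uploads as in the description above; `validate_delete_path` is a hypothetical helper name. It resolves the path first, so `..` segments cannot escape the allowed directory.

```python
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/user_uploads")

def validate_delete_path(path, allowed_root=ALLOWED_ROOT):
    """Return the resolved path if it is strictly inside allowed_root, else raise."""
    root = Path(allowed_root).resolve()
    candidate = Path(path).resolve()  # collapses ".." and follows symlinks
    if candidate == root or root not in candidate.parents:
        raise PermissionError(f"{path!r} is outside {str(root)!r}; refusing to delete")
    return candidate
```

A plain prefix or regex check on the raw string would accept something like /tmp/user_uploads/../other, which is why the comparison happens after resolution.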

Tool-level: Implement a dry-run mode. A delete_file tool should accept dry_run=True which returns what would be deleted without deleting it. The agent runs dry-run first, summarises the result to the user, and only proceeds with the actual deletion after user confirmation.
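A minimal sketch of a dry-run mode, under a few assumptions of my own: `dry_run` defaults to True so that an omitted argument is the safe case, only regular files are removed (directories are left in place), and the return shape is illustrative.

```python
from pathlib import Path

def delete_file(path, dry_run=True):
    """Report what would be deleted; delete only when dry_run is False."""
    target = Path(path)
    if target.is_dir():
        files = sorted(str(p) for p in target.rglob("*") if p.is_file())
    elif target.is_file():
        files = [str(target)]
    else:
        files = []
    if not dry_run:
        for file_path in files:
            Path(file_path).unlink()
    return {"dry_run": dry_run, "files": files}
```

The agent would call the tool with the default first, summarise `files` for the user, and call it again with `dry_run=False` only after the user confirms.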

Orchestrator-level: Classify tools by destructiveness. Read tools can run automatically. Write tools require an explicit action intent from the user. Destructive tools (delete, overwrite, send, charge) require user confirmation before execution, regardless of what the model generates.
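The orchestrator gate can be sketched as a lookup plus a small decision function. The registry contents, the three return values, and the choice to treat unknown tools and `process_refund` as destructive are assumptions for illustration, not a prescribed classification.

```python
from enum import Enum

class Risk(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"

# Hypothetical registry: each tool is classified once, outside the model's control.
TOOL_RISK = {
    "get_order_status": Risk.READ,
    "create_ticket": Risk.WRITE,
    "delete_file": Risk.DESTRUCTIVE,
    "process_refund": Risk.DESTRUCTIVE,
}

def authorize(tool_name, user_requested_action=False, user_confirmed=False):
    """Decide whether the orchestrator may run a tool call the model proposed."""
    # Tools missing from the registry fall into the strictest tier.
    risk = TOOL_RISK.get(tool_name, Risk.DESTRUCTIVE)
    if risk is Risk.READ:
        return "execute"
    if risk is Risk.WRITE:
        return "execute" if user_requested_action else "deny"
    return "execute" if user_confirmed else "ask_user"
```

Because the decision depends only on the tool's classification and on flags the orchestrator sets, nothing the model generates can skip the confirmation step.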

[Figure: software architecture diagram showing tool permission layers and validation]

The model is not the last line of defence. Schema constraints, tool-level validation, and orchestrator permission gates each catch a different class of failure.


Where You Learn to Build Tool-Safe Agents

At Meritshot, our AI Engineering programs treat tool-schema design as a core production engineering discipline — not an afterthought to LLM prompting. You design tool schemas, run adversarial tests against them, implement idempotency patterns, and build the permission layers that keep destructive tools from taking down production.

The engineers who build reliable agent systems in 2026 are the ones who think about tool contracts the way backend engineers think about API contracts: rigorously, adversarially, and before the first production deployment. That discipline is what Meritshot builds — with practitioners who have felt the cost of getting it wrong.
