Data Science

The Token Loop Nobody Detects Until the API Bill Arrives

Your LLM product works perfectly. Then the invoice lands. Token loops — conversation history bloat, tool-call stacking, RAG redundancy, broken caching — are architectural failures that only reveal themselves at production scale. Here's how to find and fix them.

Meritshot14 min read
LLMAI EngineeringToken OptimizationData ScienceProduction AI
Back to Blog

The Token Loop Nobody Detects Until the API Bill Arrives

You Built Something That Works. Then the Invoice Landed.

The assistant responded correctly. The demo ran clean. The client signed off. You pushed to production on a Tuesday.

By Thursday, your billing dashboard showed a number that made you refresh the page twice.

You didn't build a bad product. You built a perfectly functional one — with a token loop quietly running in the background, billing you for the same context, over and over, in ways no unit test would ever catch.

This isn't a beginner mistake. It happens to engineers at funded startups, to data science teams at Fortune 500 companies, and to developers who absolutely know what tokens are. It happens to people who've read the documentation, attended the webinars, and watched every prompt engineering course available.

The trap isn't ignorance. It's the gap between how LLM APIs seem to work and how they actually bill — a gap that only reveals itself at production scale, under real user behaviour, with real session durations.


The Mental Model Most People Are Using Is Wrong

Ask any developer how LLM billing works, and they'll say something accurate but dangerously incomplete: "You pay per token, input and output."

That's true. What's missing is the implication: every API call includes every token in the input, regardless of how many times you've already sent those tokens.

The API has no memory between calls. It doesn't remember that you sent the system prompt last time. Each call is stateless at the infrastructure level. Your application is responsible for assembling the input — and if your application assembles it naively, you pay for the assembly.

The mental model that actually holds at scale:

Cost = (tokens you needed) + (tokens you sent because your architecture didn't prune)

The second term is invisible in logs unless you instrument for it. It doesn't show up as an error. The model processes it correctly and charges you for the privilege.


What a Token Loop Actually Is (And Why It's Not One Thing)

A token loop isn't a single bug. It's a class of architectural patterns where your application causes token counts to compound across turns, calls, or pipeline stages — silently, structurally, and at scale.

There are five distinct forms, each with its own fingerprint in billing data:

1. Conversation history bloat: Every message in a multi-turn chat gets resent on each new call. Your 50-token user message arrives riding a 3,200-token history payload that's been growing since session start. At turn 20, you're essentially re-reading the entire conversation from the beginning — every single call.

2. Agentic tool-call stacking: In a ReAct-style or function-calling agent, each tool result gets appended to the context before the next model call. Five tool steps means the model sees turn 1's tool output again on turn 5, and again on turn 6. The context snowballs with each reasoning step.

3. System prompt duplication: In pipelines where prompts are assembled programmatically, the same instruction block gets injected at multiple stages. Nobody catches it in code review because it renders fine in testing and the output is correct — it's just expensive.

4. Retrieval-Augmented Generation (RAG) redundancy: When your vector search retrieves overlapping chunks, the model receives near-identical passages multiple times. If your top-5 retrieved chunks share 60% of their tokens, you're billing for redundant context on every RAG call.

5. Streaming retry loops: In poorly handled streaming implementations, a network timeout mid-stream triggers a full retry — resending the complete input context. If your retry logic doesn't deduplicate, you pay for the same input 2–3 times per session without generating any additional useful output.

Each pattern is harmless in isolation. In a live production system serving concurrent users across long sessions, they compound — and they compound simultaneously.


The Real-World Scenario That Gets Teams Every Time

A team builds an internal HR assistant. The system prompt is detailed — around 800 tokens of instruction, persona definition, and policy context. In testing, sessions averaged 4 turns. Costs looked reasonable. The team launched.

In production, usage patterns shifted. Employees opened the assistant at the start of their workday and left the tab open. HR staff ran long diagnostic sessions when handling edge cases. A benefits manager might have an open session for six hours.

What happened in the billing:

  • The 800-token system prompt was being resent on every call — 40 times in a long session
  • Conversation history grew unbounded within the session
  • By turn 20, each API call was carrying approximately 6,000 tokens of context for a new user message that was, on average, 12 words long
  • The 12-word question was costing 500 times what it cost in testing
  • Concurrent sessions meant this wasn't one anomalous user — it was the average

The product worked flawlessly. Employees loved it. Response quality was excellent. The bill was the only signal that anything was structurally wrong — and it arrived 30 days after the loop started running.


Why Standard Monitoring Doesn't Catch This

The first instinct is usually to look at error rates, latency, or response quality. Token loops don't affect any of these. The model reads the full context and responds correctly.

Most teams track total tokens per month, which tells you volume but nothing about structure. You need a different set of metrics entirely:

What you need to track:

  • Tokens per call (not just per month) — plot this as a time series; a healthy system is roughly flat; a looping system trends upward
  • Token count growth rate within a session — if turn 10 costs 8x what turn 1 cost, something is compounding
  • Input-to-output token ratio — a ratio of 20:1 consistently usually means history bloat is the primary suspect
  • Unique system prompt tokens per call — if caching is active, this should register near zero after the first call; if it doesn't, your caching layer is silently broken
  • P95 and P99 token counts per call — averages hide the long tail
  • Cost per session, not just cost per call — multiply by average session length to understand true unit economics

What you're probably tracking instead:

  • Total monthly spend (tells you the damage, not the cause)
  • Error rate (unaffected by token loops)
  • Response latency (affected, but not enough to alert)
  • Response quality ratings (CSAT stays high even as costs spiral)

Agentic Systems Are the High-Risk Zone

If multi-turn chat is a slow leak, agentic pipelines are a burst pipe.

A financial operations team builds an agent that pulls transaction data from an ERP system, cross-references it against bank feeds, formats a reconciliation summary, identifies discrepancies, and drafts a narrative explanation. Seven tool steps. Each step appends to the running context:

StepNew Tokens AddedTotal Input Tokens to Model
1400900 (with system prompt)
24501,750
33802,530
45003,430
54204,250
63905,040
74105,850

The task generates approximately 2,950 tokens of genuinely new information. The model processes 5,850 input tokens on the final step — nearly double what the task actually required.

The fix sounds simple: prune the context between steps — summarise tool outputs or extract only relevant fields before passing to the next step.

In practice, it requires explicit decisions:

  • What constitutes "relevant" varies by task type and can't always be predetermined
  • Aggressive pruning causes the model to lose reasoning threads on complex tasks
  • Summarisation adds a model call (and tokens) to produce the summary
  • Over-pruning fails silently — the agent completes the task but with degraded accuracy

Pros of full-context agent chains: Coherent multi-step reasoning, complete audit trail, lower probability of hallucination on complex cross-referencing tasks.

Cons: Context costs compound non-linearly. At production scale with concurrent users, cost becomes a business blocker before a technical one.


RAG Pipelines: The Hidden Redundancy Problem

Consider a legal team's contract review assistant. They've chunked a 200-page contract into 400-token segments. When a user asks about indemnification clauses, the vector search returns the top 5 most similar chunks. Indemnification is referenced in multiple places throughout the document.

The top 5 retrieved chunks might share 40–60% of their unique token content. The model receives all 5 chunks — approximately 2,000 tokens — but the unique informational content might be 900–1,000 tokens. You're paying for 1,000 tokens of redundancy on every RAG call.

More importantly, it degrades output quality — the model attends to the repeated content disproportionately, skewing responses toward whichever clause is most over-represented in retrieval.

The fixes:

  • Semantic deduplication before injection — if two chunks exceed a similarity threshold (typically 0.92+), drop the lower-ranked one
  • Better chunk boundaries — chunk on logical document structure rather than fixed token counts
  • Contextual compression — pass each retrieved chunk through a smaller, cheaper model that extracts only the sentences relevant to the query
  • Maximal Marginal Relevance (MMR) retrieval — a retrieval strategy that balances relevance against diversity

Prompt Caching: The Fix That Isn't Always Working

Anthropic and OpenAI both offer prompt caching — where repeated, static portions of a prompt are cached server-side and not re-billed at full price on subsequent calls. In theory, this is the structural fix for system prompt bloat.

In practice, caching silently fails in more situations than most teams realise:

1. Dynamic variable injection anywhere in the static block: Cache keys are computed on the exact byte sequence of the prompt prefix. If your system prompt includes a session ID, a timestamp, or any dynamically generated content — the prefix changes on every call. Every call is a cache miss.

2. Prefix-only caching doesn't survive mid-block dynamics: If your 800-token system prompt has 600 tokens of static content followed by 200 tokens of dynamic context, only the first 600 tokens cache. If anything in those first 600 tokens changes between calls, nothing caches.

3. Short TTL on cache entries: Cache TTL is short — typically minutes, not hours. A user who returns to a session after a 20-minute gap may trigger a full cache miss.

4. SDK wrapper reordering: Some LangChain configurations and orchestration frameworks reorder prompt components before sending. If your static content ends up after dynamic content in the assembled request, caching never activates.

5. Minimum token threshold: Caching typically requires a minimum prompt length to activate — usually around 1,024 tokens. Short system prompts don't benefit.

How to verify caching is actually working: Pull raw API response objects and check the usage.cache_read_input_tokens field. If caching is active and hitting, this number should be non-zero and roughly equal to your static prompt token count. If it's zero on calls after the first, caching is broken somewhere in your pipeline.


The Cost Model That Changes Everything

Most teams think about cost as: (tokens per call) × (price per token) × (call volume).

The actual model in a looping system is: (tokens per call at turn N) × (price per token) × (call volume) × (average session length in turns).

That last multiplier — average session length — is the one that gets teams. It's not in the initial cost projection because it wasn't in the test environment.

Avg. Session LengthTokens per Final TurnMonthly Cost at 10K sessions/day
4 turns~2,400~$2,100
10 turns~5,800~$5,100
20 turns~11,200~$10,200
40 turns~22,000~$19,800

The product didn't get 10x more expensive because pricing changed. It got more expensive because users liked it more and stayed longer. Without context management, your most engaged users are your most expensive infrastructure load.


The Structural Fix: Four Layers You Need

Layer 1: Observability First

Don't touch your prompt or architecture until you have instrumentation. Log at the call level: prompt tokens, completion tokens, estimated cost, session ID, turn number. Build a lightweight dashboard. Run it for a week. Only then do you know which loop is actually costing you.

Layer 2: Context Window Management

Replace unbounded conversation history with a rolling window — typically the last 6–8 turns — plus a structured summary of earlier context generated by a smaller, cheaper model. For most conversational use cases, this cuts input tokens by 60–80% in long sessions.

What goes wrong: Summarisation adds latency and introduces failure modes. If the summariser loses a critical piece of context, the main model gives a wrong answer. Test extensively with your specific task type.

Layer 3: Agent Context Pruning

Between tool steps in an agentic chain, extract only the relevant output rather than passing raw tool responses. A tool that returns a 600-token JSON payload often contains 3 fields that matter to the next step. Parse, extract, send 40–60 tokens instead.

Layer 4: RAG Pipeline Hygiene

Implement semantic deduplication on retrieved chunks before injection. Consider contextual compression for document-heavy applications. Switch from fixed-token chunking to structure-aware chunking where possible.


What Actually Breaks When You Over-Optimise

Over-trimmed context causes reasoning failures. If you set your rolling window too short for your task type, the model loses back-references. A user who says "go back to the approach you suggested in step 3" gets a confused response.

Aggressive RAG compression degrades answer quality on edge cases. Contextual compression works well on factual queries with clear relevance signals. On ambiguous queries — 20–30% of real production traffic — the compressor may discard context the main model needed.

Summarisation introduces hallucination into the context itself. When you summarise earlier turns, you're asking a model to produce the summary. That summary can introduce subtle inaccuracies. These errors compound and are extremely difficult to trace.

Cost optimisation creates a latency trade-off. Every technique that reduces tokens either adds a model call or adds application-layer computation. In latency-sensitive applications, aggressive token optimisation can increase perceived response time.

The right architecture isn't the most optimised one. It's the one calibrated to your specific cost/quality/latency constraints — and that calibration is empirical, not theoretical.


Reading Your API Bill Like an Engineer, Not an Accountant

What high input-to-output token ratios tell you: If your average call has 4,000 input tokens and 200 output tokens (20:1 ratio), conversation history is dominating your context.

What month-over-month cost growth tells you: If costs grow faster than user growth, your per-session cost is increasing. Either sessions are getting longer, a new feature introduced a new loop, or a recent deployment broke your caching configuration.

What a sudden cost spike with no user spike tells you: A deployment introduced a new loop. Check what changed: system prompt length, new tool added to an agent, a new RAG retrieval path.

The bill always has a story. The teams that find the story in the first week have a materially different trajectory than the ones that find it after the second invoice.


Token loops are one failure mode in one layer of the LLM engineering stack. The practitioners navigating this well understand the full AI engineering stack: how context windows interact with vector stores, how cost models shift between hosted APIs and self-deployed open-source models, how to instrument AI systems with the same rigour you'd apply to a distributed backend service.

At Meritshot, these aren't whiteboard topics — they're live, case-study-driven sessions inside the Data Science and Full Stack Development programs, led by practitioners who've debugged exactly the kind of billing surprises and architectural failures this article described.

Recommended