The Slack message arrives on a Monday morning, almost always.
"Hey, do you know why our OpenAI bill last week was $84,000? It's been averaging $4,200."
What follows is a forensic investigation that nobody planned for. By the end of the day, the team has identified what happened. A loop ran somewhere it shouldn't have. Maybe it was a self-reflection cycle that didn't terminate. Maybe it was two agents passing context back and forth in a conversation with no timeout. Maybe a retry mechanism kept resending the entire history of a long thread instead of just the last failed step.
The technical details vary. The economic shape is identical: a program that should have used a few hundred tokens used a few hundred million, and nothing in the system noticed until billing did.

Why Token Loops Are an Emergent Property, Not a Bug
Token loops are rarely the result of a coding mistake. They emerge from interactions between components that, individually, are correct.
A retry policy that retries on transient failures: correct. A conversation history that includes all prior turns: correct. An agent that calls a tool when it needs information: correct. A tool whose output gets fed back into the agent's context: correct.
Each piece works as designed. The combination produces behavior that no single component is responsible for and that no test would have caught. The system isn't broken. It's doing exactly what it was told to do — just faster, longer, and more expensively than anyone modeled.
The Anatomy of a Loop That Burned $40K
A B2B SaaS company built a customer onboarding agent that helped new users configure their account. The agent had access to documentation retrieval, account-state reading, and a "verify" tool.
The loop:
- User asks a configuration question
- Agent retrieves documentation, makes a recommendation
- Agent calls the verify tool to check current state
- Verify tool returns: "configuration not yet applied" (because the user hasn't acted yet)
- Agent reasons that maybe its recommendation was wrong, retrieves more documentation
- Agent calls verify again
- Loop continues
Without an iteration cap, the agent ran in this cycle for the duration of the user's session. Across a weekend with a few hundred concurrent onboarding sessions, the bill reached $40,000 against a baseline expectation of maybe $1,500.
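A minimal sketch of the missing safeguard: an iteration cap that the calling code enforces, so the agent cannot keep cycling no matter what the verify tool returns. The names `run_with_cap`, `StepResult`, and `IterationCapExceeded` are hypothetical, and the default cap of 8 is an arbitrary illustrative value, not a figure from this incident.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class IterationCapExceeded(RuntimeError):
    """Raised when an agent loop hits its hard cap without finishing."""


@dataclass
class StepResult:
    done: bool                     # True when the agent considers the task finished
    output: Optional[str] = None   # final answer, if any


def run_with_cap(step: Callable[[int], StepResult], max_iterations: int = 8) -> str:
    """Call `step` until it reports done, but never more than max_iterations times."""
    for i in range(max_iterations):
        result = step(i)
        if result.done:
            return result.output or ""
    # The cap lives in the caller, so a confused agent cannot extend its own loop.
    raise IterationCapExceeded(f"stopped after {max_iterations} iterations")


def stuck_agent_step(i: int) -> StepResult:
    """Hypothetical stand-in for the onboarding agent: verify never succeeds
    because the user has not applied the configuration yet."""
    return StepResult(done=False)
```

Calling `run_with_cap(stuck_agent_step)` raises `IterationCapExceeded` after eight steps instead of running for the rest of the session. What to do on that exception (hand off to a human, return the last recommendation) is a product decision this sketch leaves open.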
The Self-Reflection Trap
Self-reflection improves quality. In specific cases, it doesn't terminate.
The trap shape:
- Model generates draft answer (input + output tokens)
- Model is given the draft and asked to critique (input includes draft; output is critique)
- Model is given the original prompt + draft + critique and asked to regenerate
- Each iteration includes everything from the previous iterations
- Per-iteration token usage grows roughly linearly, so cumulative usage grows roughly quadratically with iteration count
Briefcraft case: A legal-tech startup's brief drafting tool had a self-reflection loop for citation errors. Usually it ran 4-6 iterations. For one user who uploaded an unusually complex case file, the reflection loop ran for 89 iterations before someone manually killed the process. Token usage per iteration grew from ~3K (early) to ~24K (late) because the iteration history was accumulating. Cost for that single session: ~$1,800.
The fix: a hard iteration cap (maximum 8) plus novel-edit detection that broke the loop if successive iterations were making increasingly minor edits.
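One way such a fix could look, as a sketch rather than Briefcraft's actual implementation: a capped refinement loop that also stops when a revision barely changes the draft. It simplifies "increasingly minor edits" to a single threshold on the share of text changed, measured with the standard library's `difflib.SequenceMatcher`. The function name `refine`, the `revise` callback, and the 2% threshold are assumptions for illustration.

```python
import difflib
from typing import Callable, Tuple


def refine(draft: str, revise: Callable[[str], str],
           max_iterations: int = 8, min_change: float = 0.02) -> Tuple[str, int]:
    """Revise a draft until edits become negligible or the cap is reached.

    Returns the final draft and the number of revision calls made.
    """
    for i in range(1, max_iterations + 1):
        new_draft = revise(draft)
        # ratio() is 1.0 for identical strings, so 1 - ratio approximates
        # the share of the text that changed in this revision.
        matcher = difflib.SequenceMatcher(None, draft, new_draft, autojunk=False)
        change = 1.0 - matcher.ratio()
        draft = new_draft
        if change < min_change:
            return draft, i       # edits have become too small to be worth another pass
    return draft, max_iterations  # hard cap reached
```

The cap bounds the worst case. The change threshold handles the more common case where the loop is still running but no longer improving anything.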
Multi-Agent Conversation Loops
Two agents talking to each other is useful. The economic problem: each agent typically receives the full conversation history as context. Conversation context grows linearly. Token cost per turn grows linearly. Total cost grows quadratically with turn count.
Surveyline case: A market research firm's three-agent synthesis system (retrieval, critique, synthesis) normally exchanged 8-12 messages before producing output. For complex queries, agents got stuck in disagreement loops and ran 60-90 turns before timing out.
Token math:
- Average context per turn at turn 8: ~4K tokens
- Average context per turn at turn 60: ~32K tokens
The pathological cases were less than 2% of queries but accounted for over 40% of monthly cost.
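To make the quadratic growth concrete, here is a small worked calculation. It assumes each turn adds a fixed 500 tokens to the shared context, which roughly matches the ~4K figure at turn 8 and lands near the ~32K figure at turn 60. Real conversations will not grow this evenly, and the function name is hypothetical.

```python
def total_input_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Sum of context sizes over all turns when each turn resends the full history."""
    # Context at turn n is n * tokens_per_turn, so the total is an arithmetic
    # series: tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


for turns in (10, 60, 90):
    print(turns, total_input_tokens(turns))
# 10 27500
# 60 915000
# 90 2047500
```

Under this assumption, a 60-turn conversation has six times the turns of a 10-turn one but consumes about 33 times the input tokens, which is how a small share of queries can dominate the bill.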

The Retry-With-Full-Context Pattern
Retry policies are good engineering practice. In LLM systems, retry policies often fail by retrying the entire request — including all conversation history, all tool call outputs — when only the last step actually failed.
During a 90-minute provider-side incident with 30% error rates, one SaaS company's retry policy produced this result:
- Normal load: ~12K average input tokens per successful request
- During incident: ~32K input tokens per successful request (because most successful requests were preceded by 1-3 failed attempts, each of which resent the full context)
- Bill for the 90-minute period: roughly 4x the normal rate
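The alternative is to checkpoint completed steps and retry only the one that failed. The sketch below assumes a request can be split into independent step callables and uses `ConnectionError` as a stand-in for a transient provider error. `run_steps` and its parameters are hypothetical names, not part of any provider SDK.

```python
import time
from typing import Callable, List, Sequence


def run_steps(steps: Sequence[Callable[[], str]], max_attempts: int = 3,
              base_delay: float = 0.5) -> List[str]:
    """Run steps in order; on a transient failure, retry only the step that failed."""
    results: List[str] = []  # outputs of completed steps are kept, never recomputed
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(step())
                break
            except ConnectionError:  # stand-in for a transient provider error
                if attempt == max_attempts:
                    raise            # give up on this step after the last attempt
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return results
```

With this shape, a failure in step five costs one more call to step five, not a replay of steps one through four and their accumulated context.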
Background Agent Loops
Background agents — cron-triggered, webhook-triggered, queue-triggered — have no natural human backstop. They run when triggered, finish when their logic says they're done, and don't notice if "done" never comes.
Reedwell case: A customer service platform deployed a background agent for email triage. A misconfigured upstream integration started sending the same email's webhook repeatedly — once every few seconds. Each duplicate webhook triggered a fresh agent run. The agent ran for 36 hours before the team noticed. Total cost: ~$14,000 of duplicate work, none of which produced any new value.
The fix: idempotency at the trigger level. Every webhook included a unique event ID; if the agent had already processed that ID in the last hour, it returned a cached response without re-running.
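A minimal in-memory sketch of that deduplication, assuming each webhook carries an `event_id` string. The class name `DedupingTrigger` and the injectable `clock` are illustrative choices. A real deployment would likely need a shared store so every worker sees the same IDs, plus eviction of expired entries, both of which this sketch omits.

```python
import time
from typing import Callable, Dict, Tuple


class DedupingTrigger:
    """Runs a handler at most once per event ID within a TTL window."""

    def __init__(self, handler: Callable[[dict], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._handler = handler
        self._ttl = ttl_seconds
        self._clock = clock
        # event_id -> (time first processed, cached response)
        self._seen: Dict[str, Tuple[float, str]] = {}

    def handle(self, event_id: str, payload: dict) -> str:
        now = self._clock()
        cached = self._seen.get(event_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # duplicate within the window: skip the agent run
        response = self._handler(payload)
        self._seen[event_id] = (now, response)
        return response
```

The check happens before any model call, so a duplicate webhook costs a dictionary lookup and nothing more.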
Detection: What You Actually Need to Monitor
The standard observability stack — request rate, latency, error rate — does not catch token loops. The system isn't erroring. The cost is the only signal, and cost typically lags by hours or days.
What actually catches token loops:
- Token-per-outcome tracking: tokens consumed per successful business outcome. When this number drifts upward without explanation, you have a loop.
- Iteration distributions: for any component that can iterate, track the distribution of iteration counts. Alert on p99, not p50.
- Context window distributions: steadily rising p99 input tokens signals conversation memory bloat.
- Per-identity budget enforcement: every agent identity, every conversation, every customer gets a token budget.
- Real-time cost signals: 15-minute granularity catches loop incidents in their first hour.
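Two of these signals can be sketched in a few lines: a nearest-rank percentile over iteration counts and a tokens-per-outcome ratio. The function names and the alert limit are hypothetical, and a production system would compute these in its metrics pipeline instead of application code.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tokens_per_outcome(total_tokens: int, successful_outcomes: int) -> float:
    """Tokens spent per successful business outcome in a time window."""
    return total_tokens / max(successful_outcomes, 1)


def should_alert(iteration_counts: Sequence[int], p99_limit: int) -> bool:
    """Alert on the tail of the distribution, not the median."""
    return percentile(iteration_counts, 99) > p99_limit
```

For example, with 98 runs at 5 iterations and two runs at 89 and 90, the median is still 5 while the p99 is 89. A median-based alert stays silent, and `should_alert(counts, p99_limit=12)` fires.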
Bracksaw case: A fintech built a detection pipeline after a $30K loop incident, monitoring per-customer token usage, per-agent iteration counts, and aggregate hourly cost. Three months later, a new deployment introduced a self-reflection loop with a flawed termination condition. Within 47 minutes, all three signals fired. The team rolled back. Total cost impact: ~$1,200, versus what would likely have been $20K+ without detection.
Mitigation: The Architectural Patterns That Survive
- Hard iteration caps everywhere — enforced by the orchestrator, not the component itself
- Per-request token budgets — the single highest-leverage defense against runaway costs
- Circuit breakers on cost anomalies — when per-customer cost exceeds expected ranges, automatically degrade
- Idempotency at trigger points — for background and webhook-triggered agents, every trigger carries a deduplication key
- Context pruning disciplines — sliding window of last N turns, periodic summarization
- Loop detection at the orchestrator — refuses re-entry of agents already in the current chain
- Cost attribution tags on every call — which feature, which user, which agent, which session
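Two of these patterns, per-request token budgets and re-entry refusal, can share one piece of orchestrator-owned state. The sketch below is one possible shape under that assumption. `RequestContext` and its methods are hypothetical names, and `charge` expects the caller to supply a token count (an estimate before the call or actual usage after it).

```python
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when a request would spend more tokens than it was allotted."""


class ReentryBlocked(RuntimeError):
    """Raised when an agent is invoked while already active in the same chain."""


class RequestContext:
    """Per-request state owned by the orchestrator, not by individual agents."""

    def __init__(self, token_budget: int) -> None:
        self.remaining = token_budget
        self.chain: List[str] = []  # agents currently active in this call chain

    def charge(self, tokens: int) -> None:
        """Deduct tokens for a model call; refuse once the budget would be exceeded."""
        if tokens > self.remaining:
            raise BudgetExceeded(f"need {tokens}, only {self.remaining} left")
        self.remaining -= tokens

    def enter(self, agent: str) -> None:
        """Refuse to re-enter an agent that is already in the current chain."""
        if agent in self.chain:
            raise ReentryBlocked(" -> ".join(self.chain + [agent]))
        self.chain.append(agent)

    def leave(self, agent: str) -> None:
        """Mark an agent as finished so it may be invoked again later."""
        self.chain.remove(agent)
```

Because the context is created and checked by the orchestrator, a component stuck in a loop cannot raise its own budget or reset its own chain.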
The token loop nobody detects until the API bill arrives is not an exotic failure. It's the default outcome of building agentic systems without explicit cost-and-loop discipline. The systems that survive at scale are the ones built by teams who learned this — sometimes through the bill, sometimes before it.