The dashboards look healthy. P99 latency is in target. Retrieval hit rate is steady. The LLM bill is roughly what it was last quarter. By every metric anyone is watching, the RAG system is fine.
It isn't.
Underneath the green dashboards, a slow, compounding degradation is in progress. Embeddings drift. Documents go stale. Indices fragment. Query patterns shift while the retriever stays static. Six months in, the system is meaningfully worse than it was at launch. Twelve months in, users have started routing around it.
This is the silent rot inside every production RAG pipeline that isn't deliberately monitored for it. It doesn't show up in the metrics most teams track, because latency, hit rate, and cost were never designed to capture how RAG degrades in production.

Failure Mode 1: The Eval Set Trap
A typical RAG launch follows a clean script. The team writes a representative set of evaluation queries with known-correct answers — maybe 200, maybe 2,000. They benchmark retrieval recall and answer quality on this set. They commit it to a config file. They wire it into a CI job.
For about six months, the metrics tell them something useful. After that, the eval set grows increasingly misleading, because it no longer represents the queries real users send.
User behavior shifts. Topics emerge that weren't in the original distribution. The eval set captures none of this because it was frozen at launch.
Kelvinpath case: A SaaS company launched a customer support RAG bot with a 1,200-query eval set built from historical support tickets. Eval set accuracy stayed at 78%. Real user satisfaction dropped 14 percentage points. Escalation to human agents rose 22%. The eval set didn't include any queries about features shipped after launch.
The fix: a rolling sample of real queries, refreshed weekly, with a defined process for adding new queries when product capabilities changed.
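A minimal sketch of what a weekly refresh could look like, assuming queries are plain strings and the eval set is stored oldest-first. The function name `refresh_eval_set` and its parameters are illustrative and not taken from the system described above. Sampled queries still need reference answers or human grading before they work as eval cases.

```python
import random
from collections import deque


def refresh_eval_set(current_eval, new_queries, sample_size, max_size, seed=None):
    """Add a random sample of this week's real queries and age out the oldest ones."""
    rng = random.Random(seed)
    already_present = set(current_eval)
    # dict.fromkeys de-duplicates this week's queries while keeping their order.
    fresh = [q for q in dict.fromkeys(new_queries) if q not in already_present]
    sampled = rng.sample(fresh, min(sample_size, len(fresh)))
    # A deque with maxlen drops the oldest entries from the left when it overflows.
    rolled = deque(current_eval, maxlen=max_size)
    rolled.extend(sampled)
    return list(rolled)
```

Capping the set with `max_size` keeps CI cost flat while the contents follow real traffic. Queries tied to a newly shipped feature can be appended the same way, outside the weekly sample.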

Failure Mode 2: Embedding Drift
Your embedding model hasn't changed. Your documents haven't changed. And yet what the vectors in your index encode is slowly drifting away from how people use those words today.
Language evolves. Industry jargon shifts. Internal terminology changes after reorganizations. New concepts emerge that the embedding model — frozen at training time — doesn't know about. The embedding still produces a vector. The vector is plausible. It's just not aligned with how users now think about the topic.
Norshore Health case: A health system found that 11% of the queries sent to its clinical RAG system used terminology that had emerged after the embedding model's training cutoff. For those queries, retrieval recall was substantially below average. Clinicians weren't reporting the issue because they had adapted: when a first attempt failed, they reformulated the query using older terminology.
The user adaptation pattern is the most insidious: users learn what works and unconsciously rephrase. The system stops seeing failures because the failures get translated into successes by humans willing to do the work.
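One way to make that hidden work visible is to measure how often a query is quickly followed by a different query from the same user. The sketch below assumes a hypothetical log of `(user_id, unix_timestamp, query)` tuples, and the 60-second window is an arbitrary starting point rather than a recommended value.

```python
from collections import defaultdict


def reformulation_rate(events, window_seconds=60):
    """Fraction of queries followed by a different query from the same user
    within window_seconds. A rough proxy for silent retries."""
    by_user = defaultdict(list)
    for user_id, ts, query in events:
        by_user[user_id].append((ts, query.strip().lower()))

    total = 0
    retried = 0
    for items in by_user.values():
        items.sort()  # chronological order per user
        total += len(items)
        for (ts, query), (next_ts, next_query) in zip(items, items[1:]):
            if next_ts - ts <= window_seconds and next_query != query:
                retried += 1
    return retried / total if total else 0.0
```

A rising rate on one topic while the aggregate stays flat is the pattern worth alerting on. This heuristic cannot tell a reformulation from a genuine follow-up question, so treat it as a prompt for manual review.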

Failure Mode 3: Corpus Rot
RAG was sold as the architecture that grounds answers in current documentation. In production, this contract breaks within months.
What happens:
- Old policy documents stay in the index after new ones supersede them
- Product documentation for deprecated features remains alongside current docs
- Regulatory documents predating recent rule changes keep showing up in compliance queries
Lochbridge Bank case: External auditors identified customer responses citing compliance procedures from a 2022 version of internal policy. The 2024 policy was also in the same index. Both were retrieved with high relevance scores. The LLM sometimes generated answers that mixed them. Cleanup involved auditing 340 documents identified as superseded.
The deeper lesson: the index is a living system that needs continuous reconciliation against the source of truth.
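Reconciliation can start as a plain set comparison. The sketch below assumes both the index and the source of truth can be exported as a mapping from document ID to a version string or content hash, and that superseded documents are removed from the source of truth. The names are illustrative.

```python
def reconcile(index_docs, source_docs):
    """Compare what the index holds against what the source of truth says is current.

    Both arguments map doc_id -> version (or content hash).
    """
    index_ids = set(index_docs)
    source_ids = set(source_docs)
    return {
        # Indexed but no longer current: superseded or deleted upstream.
        "orphaned": sorted(index_ids - source_ids),
        # Current upstream but never ingested.
        "missing": sorted(source_ids - index_ids),
        # Present in both, but the index holds an older version.
        "stale": sorted(d for d in index_ids & source_ids if index_docs[d] != source_docs[d]),
    }
```

Run on a schedule, the `orphaned` list is where a superseded policy sitting next to its replacement would surface.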

Failure Mode 4: Index Fragmentation
Vector indices weren't designed for the operational pattern most production RAG systems put them through: ingest the original corpus once, then incrementally add, update, and delete documents over time.
Each incremental operation is fast and cheap. The aggregate effect, over months and years, is degraded index quality. Graph-based approximate-nearest-neighbor structures can lose recall as their topology drifts from optimal under repeated updates. Deleted documents leave tombstone entries. Documents re-ingested with slightly different chunking parameters create heterogeneous retrieval behavior.
Precepts case: A legal research startup grew from 480K to 920K documents over 18 months with only incremental ingestion. Index recall on their eval set dropped from 91% to 84% — a 7-point drop accumulated across small ingestion operations, none of which individually showed signal. An offline rebuild returned recall to 90%.
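A gradual drop like this can be tracked by comparing the production index against exact search on a sample of queries. The sketch below assumes vectors are available as plain lists and that brute force is affordable on the sample. `ann_ids` stands in for whatever the production index returns, since the actual vector store is not specified here.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors: vectors maps doc_id -> embedding."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]


def recall_at_k(ann_ids, exact_ids):
    """Share of the true top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
```

Averaging `recall_at_k` over the sample after each ingestion batch gives a trend line, and a sustained decline is a reasonable trigger for an offline rebuild.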

Failure Mode 5: Query Distribution Shift
The queries users send in month 18 are not the same queries they sent in month 1. The system was tuned for the early distribution. Most teams don't retune.
Goldframe case: A fintech deployed a customer-facing financial advisory RAG in early 2024. By 2026, tax-related queries had grown from 4% to 22% of volume, crypto questions from 1% to 15%, and cross-border transfer questions from 3% to 11%, driven by new product offerings. The system's retrieval parameters had been tuned against the early distribution. On the newly dominant topics, retrieval relevance was meaningfully lower.
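A shift of this size is easy to quantify once queries carry topic labels. The sketch below uses Jensen-Shannon divergence, one common choice among several, on the topic shares quoted above. The `other` bucket is an assumption, computed as the remainder, and any alert threshold would have to be chosen empirically.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two topic-share dicts."""
    topics = sorted(set(p) | set(q))
    p_full = {t: p.get(t, 0.0) for t in topics}
    q_full = {t: q.get(t, 0.0) for t in topics}
    mid = {t: 0.5 * (p_full[t] + q_full[t]) for t in topics}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[t] * math.log2(a[t] / mid[t]) for t in topics if a[t] > 0)

    return 0.5 * kl(p_full) + 0.5 * kl(q_full)


# Topic shares from the case above; "other" is the assumed remainder.
early = {"tax": 0.04, "crypto": 0.01, "cross_border": 0.03, "other": 0.92}
late = {"tax": 0.22, "crypto": 0.15, "cross_border": 0.11, "other": 0.52}
shift = js_divergence(early, late)
```

Tracking this value month over month against the distribution the retriever was tuned on turns "most teams don't retune" into an explicit decision.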

Failure Mode 6: Duplicate Accumulation
In any long-running RAG system, slightly different versions of the same underlying document accumulate in the index. The duplicates aren't byte-identical, so deduplication scripts miss them. They are semantically near-identical, so retrieval finds all of them and feeds them to the LLM together.
Pinerow case: A developer tools company found that for a substantial fraction of queries, retrieval was returning multiple chunks from effectively the same content. Token spend per query was roughly 30% higher than necessary. After the team deployed near-duplicate detection, average token usage per query dropped 20% and answer quality on diverse-evidence queries improved measurably.
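The case above doesn't say which detection method was used. One standard approach is word shingling with Jaccard similarity, sketched below. The 0.8 threshold and three-word shingles are illustrative defaults.

```python
def shingles(text, n=3):
    """Set of overlapping n-word sequences, case-insensitive."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(chunks, threshold=0.8):
    """Keep the first of each group of chunks whose shingle sets overlap heavily."""
    kept = []
    kept_shingles = []
    for chunk in chunks:
        current = shingles(chunk)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(chunk)
        kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic, which is fine for filtering the handful of chunks retrieved for one query. Scanning a whole corpus usually calls for an approximate scheme such as MinHash with locality-sensitive hashing.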

Failure Mode 7: Generation-Retrieval Mismatch
Teams upgrade the LLM but don't revisit the prompt template, retrieval parameters, or chunking strategy that worked for the previous LLM. The new LLM is better on benchmarks. In production, RAG quality drops.
Ridgepine case: A SaaS company upgraded their RAG system's LLM to the next major version. Customer-reported satisfaction dropped 9 percentage points. The new model interpreted the existing prompt template differently — it was more willing to extrapolate beyond retrieved evidence. Hallucinations weren't loud; they were confident and grounded-looking. A prompt template overhaul tuned for the new model's behavior restored satisfaction.
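Extrapolation of this kind can be screened for before an upgrade ships by running the old and new model over the same retrieved evidence and counting answer sentences that the evidence does not support. The sketch below is a deliberately crude lexical proxy, not a faithful groundedness metric. The 0.6 overlap threshold and the rule that ignores short words are arbitrary assumptions.

```python
import re


def unsupported_sentences(answer, evidence_chunks, min_overlap=0.6):
    """Return answer sentences whose longer words mostly do not appear in the evidence."""
    evidence_words = set(re.findall(r"[a-z0-9]+", " ".join(evidence_chunks).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Ignore short words, which are mostly function words.
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in evidence_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

If the new model's flagged-sentence rate on a fixed query set is clearly higher than the old model's, the prompt template needs attention before rollout. Paraphrased but correct sentences will also be flagged, so the comparison between models matters more than the absolute count.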

Failure Mode 8: The Feedback Loop Trap
RAG systems that train on user feedback can quietly select for users who have adapted to the system's limitations rather than users who got what they actually wanted.
The mechanism: mediocre answers drive some users away. The remaining users are those for whom the mediocre answers are good enough. Their positive feedback becomes the training signal. The re-ranker learns to optimize for this self-selected population. Usage metrics improve as the real audience shrinks.
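Aggregate usage hides this, but retention split by signup cohort does not. A minimal sketch, assuming a hypothetical mapping from each user to the period in which they first appeared, plus the set of active users per period:

```python
from collections import defaultdict


def retention_by_cohort(first_seen, active_by_period):
    """first_seen: user_id -> cohort label.
    active_by_period: period label -> set of active user_ids.
    Returns cohort -> {period: fraction of that cohort still active}."""
    cohorts = defaultdict(set)
    for user_id, cohort in first_seen.items():
        cohorts[cohort].add(user_id)
    return {
        cohort: {
            period: len(members & active) / len(members)
            for period, active in active_by_period.items()
        }
        for cohort, members in cohorts.items()
    }
```

Satisfaction scores that climb while cohort retention falls suggest the feedback is coming from a shrinking, self-selected population.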

Building a Rot Detection Pipeline
Each failure mode above is independently catchable with the right monitoring:
- Rolling real-query eval refreshed continuously, not frozen at launch
- Per-topic quality tracking to surface topic-level rot that aggregates hide
- Corpus freshness audits reconciling the index against source-of-truth repositories
- Embedding drift detection tracking query embedding distributions over time
- Near-duplicate detection in ingestion plus periodic offline scans
- Generation-retrieval validation on every LLM upgrade, not just model benchmark checks
- Abandonment tracking by cohort, not just aggregate
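As one concrete example from the list, the embedding drift check can begin with something as simple as comparing the centroid of query embeddings in a baseline window against the current window. The sketch below assumes embeddings are plain lists of floats. Centroid shift is a coarse signal that can miss changes which cancel out on average, so it is a starting point rather than a complete detector.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def centroid_drift(baseline_vectors, current_vectors):
    """Cosine distance between the mean query embedding of two time windows."""
    return cosine_distance(centroid(baseline_vectors), centroid(current_vectors))
```

Computing this per topic rather than over all traffic matches the per-topic tracking item above, since aggregate centroids tend to move slowly.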
The dashboards stay green either way. The difference is whether the system underneath them is still doing its job.





