Most engineering writing about retrieval-augmented generation treats deployment as a destination. You shipped, the evals passed, your customer is happy. That's not what production looks like. Production is a slow, often invisible drift away from the day you shipped. If you aren't watching the right signal, you won't notice until a customer does.
This is a walk-through of a failure mode we design against: what index drift looks like three weeks after a handoff, how it shows up in the metrics, and the changes it forced in our standard evals harness so this category of failure surfaces on the first day, not in the third week.
The setup
Take a representative deployment: a customer-support stack for a mid-size fintech, roughly 8,000 employees, 1.2M support-document chunks, English and French, peak load around 240 QPS. Seven agents in the orchestration graph (research, classifier, planner, executor, two specialists, and a guardrail node). Retrieval is hybrid (BM25 plus dense) against a Qdrant cluster, with a re-ranker on top.
On day 0, we ran the full evals harness: 412 questions across 14 categories, scored against frozen golden answers by an LLM-as-judge with human spot checks. Customer satisfaction (CSAT) from the live pilot tracked our offline scores within 1.5 points. Everyone agreed it was working.
The signal
Three weeks in, the customer's analytics dashboard flagged a CSAT dip from 88 to 84. It looked like noise at first (a 4-point swing over 1,200 conversations is within a reasonable confidence interval), but the next day's number was 83. The day after, 82. That's not noise. That's a trend.
We pulled the orchestration traces for the lowest-CSAT conversations from the last 48 hours. The agents weren't confused. The planner picked the right path. The executor called the right tools. The generations were fluent, on-brand, polite. They were also wrong: citing the wrong policies, quoting outdated pricing, missing a product line that had launched on day 14.
In every case the failure was upstream. The retriever returned stale or irrelevant chunks, so the generator either repeated outdated content or filled the gap with something plausible from its priors. The agents themselves were fine. The corpus had moved.
Diagnosis
The customer's support team had been quietly heroic. In the three weeks since handover, they had added 2,401 new support documents, retired 188 old ones, and rewritten policy on two product lines. Much of that never reached the index the retriever queried.
Our pipeline had an ingestion job. It was just watching the wrong signal.
We had configured nightly re-embedding for delta documents, but the ingestion query depended on a `last_modified` column that the customer's CMS updated only on creation, not on edit. New docs got in. Edited docs didn't. Retired docs were still in the index. Quietly, every day, the index got more wrong.
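To make the gap concrete, here is a minimal sketch of a timestamp-based delta selection. The `Doc` class, `select_delta`, the document IDs, and the dates are all hypothetical; only the behaviour of `last_modified` (set on creation, never on edit) comes from the diagnosis above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Doc:
    doc_id: str
    body: str
    last_modified: datetime  # hypothetical CMS field: set on creation only


def select_delta(docs: list[Doc], last_run: datetime) -> list[str]:
    """Timestamp-based delta: pick docs the CMS claims changed since last_run."""
    return [d.doc_id for d in docs if d.last_modified > last_run]


last_run = datetime(2024, 1, 10)  # illustrative date
docs = [
    # Created before the last run and edited afterwards; the timestamp never moved.
    Doc("refund-policy", "edited text", datetime(2024, 1, 2)),
    # Created after the last run.
    Doc("new-product-faq", "new text", datetime(2024, 1, 12)),
]

delta = select_delta(docs, last_run)
print(delta)  # ['new-product-faq']
```

The edited document is skipped because its timestamp predates the last run. A retired document is invisible to this kind of selection as well: it simply stops appearing in the CMS export, so nothing ever tells the index to delete it.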
What the metric should have caught
Our evals harness measured generation quality against frozen golden answers. If the retriever returned bad chunks, the generation would still be scored against what we expected on day 0. The judge would happily mark a generation "good" if it matched the day-0 expected answer, even when the answer was outdated.
| metric | day 0 | day 21 | day 35 | change (day 0 to day 35) |
|---|---|---|---|---|
| generation quality (judge) | 0.91 | 0.90 | 0.88 | −0.03 (flat) |
| retrieval recall @ k=10 | 0.87 | 0.71 | 0.58 | −0.29 |
| chunk freshness (mean age) | 4d | 18d | 29d | +25d |
| citation match rate | 0.94 | 0.74 | 0.61 | −0.33 |
| live CSAT | 88.4 | 83.9 | 81.2 | −7.2 |
The judge-based metric was the last to move. Retrieval recall, chunk freshness, and citation match rate all collapsed weeks earlier. We just weren't watching them.
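A quick way to see the lead/lag relationship is to express each metric's day-21 value as a percentage change from day 0. The sketch below uses only the numbers from the table; the helper name `relative_change_pct` is ours for illustration.

```python
# Values copied from the table above: (day 0, day 21).
metrics = {
    "generation quality (judge)": (0.91, 0.90),
    "retrieval recall @ k=10": (0.87, 0.71),
    "citation match rate": (0.94, 0.74),
    "live CSAT": (88.4, 83.9),
}


def relative_change_pct(baseline: float, current: float) -> float:
    """Percentage change from baseline, rounded to one decimal place."""
    return round(100 * (current - baseline) / baseline, 1)


changes = {name: relative_change_pct(d0, d21) for name, (d0, d21) in metrics.items()}

# Print the largest drops first.
for name, pct in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {pct:+.1f}%")
```

By day 21 the two retrieval-side metrics had each lost roughly a fifth of their day-0 value, while the judge score had moved by about one percent.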
The fix
The patch was small. The cultural change wasn't. We rewrote three things:
- Ingestion trigger. Instead of trusting a CMS column, we compute a content hash on every document each night and re-embed any chunk whose hash changed. Slower; correct.
- Retrieval evals. Recall@k against a held-out ground-truth set of citations is now a first-class metric, run daily against the live index. Drift past a threshold pages the on-call.
- Chunk-freshness gauge. Mean age of cited chunks is now in the dashboard alongside latency and cost. If the index isn't moving but the source corpus is, something is wrong.
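The ingestion trigger in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not our production job: it hashes whole documents rather than individual chunks, the whitespace normalisation is a choice made for the example, and `content_hash` and `plan_reindex` are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document body (whitespace-normalised)."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],    # doc_id -> body, as exported from the CMS tonight
    indexed_hashes: dict[str, str],  # doc_id -> hash stored when last embedded
) -> tuple[list[str], list[str]]:
    """Return (ids to re-embed, ids to delete from the index)."""
    to_embed = [
        doc_id
        for doc_id, body in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(body)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_embed, to_delete


# One unchanged doc, one edited doc, one retired doc, one new doc.
indexed = {
    "a": content_hash("old policy"),
    "b": content_hash("pricing v1"),
    "c": content_hash("retired article"),
}
current = {"a": "old policy", "b": "pricing v2", "d": "new product"}

to_embed, to_delete = plan_reindex(current, indexed)
print(to_embed, to_delete)  # ['b', 'd'] ['c']
```

Because the comparison never consults a CMS timestamp, edits and retirements are picked up the same way as new documents. The cost is hashing every document on every run.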
Below is roughly what the recall harness looks like. It runs as a cron job on the same compute that handles the index, takes about 90 seconds against our typical corpus size, and writes a row per run to a small Postgres table.
```python
from dataclasses import dataclass

from retriever import Retriever
from ground_truth import load_eval_set


@dataclass
class RecallResult:
    recall: float
    mean_age_days: float
    n_queries: int


def eval_recall(retriever: Retriever, k: int = 10) -> RecallResult:
    # Held-out set: ~600 (query, expected_chunk_ids) pairs.
    eval_set = load_eval_set("recall_v3")
    recall_sum, ages = 0.0, []
    for q, expected_ids in eval_set:
        expected = set(expected_ids)  # accept a list or a set of IDs
        retrieved = retriever.search(q, k=k)
        retrieved_ids = {c.id for c in retrieved}
        # Per-query recall: fraction of expected chunks found in the top k.
        recall_sum += len(retrieved_ids & expected) / len(expected)
        ages.extend(c.age_days for c in retrieved)
    return RecallResult(
        recall=recall_sum / len(eval_set),  # mean recall across queries
        # NaN rather than a crash if no query returned any chunks.
        mean_age_days=sum(ages) / len(ages) if ages else float("nan"),
        n_queries=len(eval_set),
    )
```
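The paging decision on top of each run can be as simple as a few comparisons. The sketch below is hypothetical throughout: the `Thresholds` values are placeholders rather than our real settings, and `find_breaches` is a name made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All three values are placeholders for illustration, not real settings.
    min_recall: float = 0.80
    max_recall_drop: float = 0.05
    max_mean_age_days: float = 10.0


def find_breaches(
    recall: float,
    mean_age_days: float,
    baseline_recall: float,
    limits: Thresholds = Thresholds(),
) -> list[str]:
    """Return a human-readable reason for each breached threshold."""
    reasons = []
    if recall < limits.min_recall:
        reasons.append(f"recall {recall:.2f} below floor {limits.min_recall:.2f}")
    if baseline_recall - recall > limits.max_recall_drop:
        reasons.append(f"recall down {baseline_recall - recall:.2f} from baseline")
    if mean_age_days > limits.max_mean_age_days:
        reasons.append(
            f"mean chunk age {mean_age_days:.0f}d above {limits.max_mean_age_days:.0f}d"
        )
    return reasons


# Day-0 and day-21 values from the table earlier in the post.
day0 = find_breaches(0.87, 4, baseline_recall=0.87)
day21 = find_breaches(0.71, 18, baseline_recall=0.87)
print(day0)        # []
print(len(day21))  # 3
```

With any thresholds in this neighbourhood, the day-21 numbers would have paged someone well before the CSAT trend became visible.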
Two days after the rewrite, retrieval recall was back above 0.85, mean chunk age was under 6 days, and CSAT recovered to 89. In a scenario like this the end users never see the diagnosis. They just see the system get better.
What goes in evals, exactly
We've now standardized this. Every agentic system we ship runs at least these five metrics daily, and each has a documented threshold and an on-call page if it breaches.
- Retrieval recall@k against a held-out ground-truth set, refreshed quarterly.
- Citation match rate. Does the cited chunk actually contain the answer?
- Chunk freshness. Age of retrieved chunks: mean, p50, and p99.
- Generation quality via judge-LLM, but only meaningful when retrieval is healthy.
- Drift detector. KL divergence of query-embedding distribution vs. a baseline.
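The drift detector in the last bullet needs a discrete distribution before KL divergence can be computed. One common way to get there, assumed here for illustration, is to assign each query embedding to its nearest centroid (for example, centroids fitted once on baseline traffic) and compare the resulting histograms. The function names and the smoothing constant are hypothetical.

```python
import math
from collections import Counter

Vector = list[float]


def nearest(vec: Vector, centroids: list[Vector]) -> int:
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])),
    )


def histogram(vectors: list[Vector], centroids: list[Vector], eps: float = 1e-6) -> list[float]:
    """Smoothed share of vectors assigned to each centroid (sums to 1)."""
    counts = Counter(nearest(v, centroids) for v in vectors)
    total = len(vectors) + eps * len(centroids)
    return [(counts.get(i, 0) + eps) / total for i in range(len(centroids))]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def query_drift(baseline: list[Vector], recent: list[Vector], centroids: list[Vector]) -> float:
    """KL(recent || baseline) over centroid-assignment histograms."""
    return kl_divergence(histogram(recent, centroids), histogram(baseline, centroids))


# Toy data: two topic clusters; recent traffic shifts toward the second one.
centroids = [[0.0, 0.0], [10.0, 10.0]]
baseline = [[0.0, 1.0]] * 8 + [[9.0, 10.0]] * 2
shifted = [[0.0, 1.0]] * 2 + [[9.0, 10.0]] * 8

print(query_drift(baseline, baseline, centroids))  # 0.0
print(query_drift(baseline, shifted, centroids) > 0.5)  # True
```

The smoothing term keeps the logarithm finite when a bin is empty in one window. The number of centroids and the alert threshold both need tuning against real traffic.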
None of these are exotic. None of them require research papers. The reason most teams don't run them isn't capability. It's that the dashboard the deploy team built on day 0 only showed generation quality, and once the system was "live" no one revisited the dashboard.
Takeaways
If you're running a RAG system in production, three asks:
- Measure retrieval recall daily against a held-out set. Generation quality is downstream; if you're only watching generation, you're watching a lagging indicator of a lagging indicator.
- Compute a chunk-freshness gauge and put it on the wall next to latency and cost. A system whose retrieved chunks are getting older every week is a system that is silently breaking.
- Don't trust your CMS's `last_modified`. Compute content hashes. The extra cycles are cheap. The week of degraded production is not.
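For the freshness gauge, a mean alone can hide a stale tail, so it helps to report percentiles alongside it. A minimal sketch, assuming the ages of retrieved chunks are already collected as a list of days; `freshness_summary` is a hypothetical helper and the nearest-rank percentile is one of several reasonable definitions.

```python
import statistics


def freshness_summary(ages_days: list[float]) -> dict[str, float]:
    """Mean, median, and 99th-percentile age of retrieved chunks, in days."""
    if not ages_days:
        raise ValueError("no retrieved chunks to summarise")
    ordered = sorted(ages_days)
    # Nearest-rank p99: ceil(0.99 * n), computed with integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Toy sample: mostly fresh chunks plus a small stale tail.
summary = freshness_summary([1] * 98 + [30, 400])
print(summary["p50"], summary["p99"])  # 1.0 30
```

Tracking the same three numbers over time is what turns this into a gauge: a rising p99 with a flat median suggests a pocket of the corpus that has stopped being refreshed.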
Silent retrieval drift on what looks like a stable deployment is one of the most common ways a healthy-looking RAG system goes wrong. Run RAG in production long enough and you will meet this failure mode. The point of a worked example is to recognize it before your CSAT does.
If you'd like us to look at your eval suite, the contact form is the fastest way. We do free 30-minute reviews for production systems.