Why your RAG pipeline silently degrades, and how we caught it in eval week 3.

Most engineering writing about retrieval-augmented generation treats deployment as a destination. You shipped, the evals passed, your customer is happy. That's not what production looks like. Production is a slow, often invisible drift away from the day you shipped. If you aren't watching the right signal, you won't notice until a customer does.

This is a walk-through of a failure mode we design against: what index drift looks like three weeks after a handoff, how it shows up in the metrics, and the changes it forced in our standard evals harness so this category of failure surfaces on the first day, not in the third week.

↳ tl;dr: Index staleness is the silent killer. A RAG system's quality is bounded by the quality of its retrieval, and retrieval quality decays as the corpus changes without re-embedding. Our standard evals didn't measure recall against a fresh corpus. They measured generation quality against frozen golden answers. We changed that.

The setup

Take a representative deployment: a customer-support stack for a mid-size fintech, roughly 8,000 employees, 1.2M support-document chunks, English and French, peak load around 240 QPS. Seven agents in the orchestration graph (research, classifier, planner, executor, two specialists, and a guardrail node). Retrieval is hybrid (BM25 plus dense) against a Qdrant cluster, with a re-ranker on top.

On day 0, we ran the full evals harness: 412 questions across 14 categories, scored against frozen golden answers by an LLM-as-judge with human spot checks. Customer-satisfaction (CSAT) from the live pilot tracked our offline scores within 1.5 points. Everyone agreed it was working.

corpus size: 1.2M chunks indexed at deploy time

eval coverage: 412 golden Qs across 14 categories

deploy-day csat: 88.4% (live pilot, n=1,840 conversations)

The signal

Three weeks in, the customer's analytics dashboard flagged a CSAT dip from 88 to 84. It was easy to read as noise at first (a daily figure over 1,200 conversations had already been moving a couple of points either side of target), but the next day's number was 83. The day after, 82. That's not noise. That's a trend.

fig · 01 · daily csat, rolling 1d, t-21d → t+0; drift detected on day 21. The system tracked target ±2 points through day 19, then began a steady decline. The slope correlates almost exactly with the rate of new support documents added to the source corpus that were not yet re-embedded.

We pulled the orchestration traces for the lowest-CSAT conversations from the last 48 hours. The agents weren't confused. The planner picked the right path. The executor called the right tools. The generations were fluent, on-brand, polite. They were also wrong: citing the wrong policies, quoting outdated pricing, missing a product line that had launched on day 14.

In every case the failure was upstream. The retriever returned stale or irrelevant chunks, so the generator either repeated outdated content or filled the gap with something plausible from its priors. The agents themselves were fine. The corpus had moved.

The system tracked target for 19 days. On day 20, the index hadn't moved, but the world had.

Diagnosis

The customer's support team had been quietly heroic. In the three weeks since handover, they had added 2,401 new support documents, retired 188 old ones, and rewritten policy on two product lines. The edits had not been re-embedded. The retired documents had not been removed from the index the retriever queried.

Our pipeline had an ingestion job. It was just keying on the wrong signal. We had configured nightly re-embedding for delta documents, but the ingestion query depended on a last_modified column that the customer's CMS set only on creation, not on edit. New docs got in. Edited docs didn't. Retired docs were still in the index. Quietly, every day, the index got more wrong.
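To make the failure concrete, here is a minimal sketch of a timestamp-based delta filter run against a CMS that stamps last_modified only at creation. The `Doc` record, its field names, the dates, and `select_delta` are hypothetical stand-ins, not the actual ingestion query; only the last_modified behaviour comes from the incident.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Doc:
    doc_id: str
    body: str
    last_modified: datetime  # assumption: the CMS sets this once, at creation


def select_delta(docs: list[Doc], since: datetime) -> list[str]:
    """Timestamp-based delta: ids of documents stamped after the last run."""
    return [d.doc_id for d in docs if d.last_modified > since]


last_run = datetime(2024, 1, 10)  # hypothetical date of the previous nightly run
cms = [
    # Created before the last run, edited afterwards: the timestamp never moved.
    Doc("pricing", "edited pricing text", datetime(2024, 1, 2)),
    # Created after the last run: picked up as expected.
    Doc("new-product", "launch notes", datetime(2024, 1, 12)),
]
# A document retired since the last run is simply absent from `cms`,
# so this filter never mentions it and its chunks stay in the index.

delta = select_delta(cms, last_run)
print(delta)  # ['new-product']
```

The edited pricing document is skipped even though its body changed, which is the whole bug in four lines.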

What the metric should have caught

Our evals harness measured generation quality against frozen golden answers. If the retriever returned bad chunks, the generation would still be scored against what we expected on day 0. The judge would happily mark a generation "good" if it matched the day-0 expected answer, even when the answer was outdated.

metric	day 0	day 21	day 35	change, day 0 → 35
generation quality (judge)	0.91	0.90	0.88	−0.03 (flat)
retrieval recall @ k=10	0.87	0.71	0.58	−0.29
chunk freshness (mean age)	4d	18d	29d	+25d
citation match rate	0.94	0.74	0.61	−0.33
live csat	88.4	83.9	81.2	−7.2

The judge-based metric was the last to move. Retrieval recall, chunk freshness, and citation match rate all collapsed weeks earlier. We just weren't watching them.
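One quick way to see which metrics lead is to compute each one's relative change between day 0 and day 21, using the figures from the table above. This is only an arithmetic sketch; the dictionary layout and the `relative_change` helper are illustrative, and chunk freshness is left out because it is an age rather than a score.

```python
# (day 0, day 21) values copied from the table above.
metrics = {
    "generation quality (judge)": (0.91, 0.90),
    "retrieval recall @ k=10": (0.87, 0.71),
    "citation match rate": (0.94, 0.74),
    "live csat": (88.4, 83.9),
}


def relative_change(before: float, after: float) -> float:
    """Signed change as a fraction of the starting value."""
    return (after - before) / before


changes = {name: relative_change(b, a) for name, (b, a) in metrics.items()}

# Print the largest drops first.
for name, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {change:+.1%}")
```

By day 21 the judge score had moved about 1%, while recall and citation match rate were down roughly 18% and 21%. The retrieval-side metrics were loud long before the generation-side one whispered.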

The fix

The patch was small. The cultural change wasn't. We rewrote three things:

Ingestion trigger. Instead of trusting a CMS column, we compute a content hash on every document each night and re-embed the chunks of any document whose hash changed. Slower; correct.
Retrieval evals. Recall@k against a held-out ground-truth set of citations is now a first-class metric, run daily against the live index. Drift past a threshold pages the on-call.
Chunk-freshness gauge. Mean age of cited chunks is now in the dashboard alongside latency and cost. If the index isn't moving but the source corpus is, something is wrong.
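The hash-based ingestion trigger can be sketched as a plain diff between the current source corpus and the hashes recorded at last embed. Everything named here (`content_hash`, `plan_reindex`, the two dictionaries, the whitespace normalization) is an assumption for illustration, not the production job. The deletion list is also an addition: it covers the retired documents from the diagnosis, which a re-embed-on-change rule alone would leave in the index.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text, so reflowed copies hash the same."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_reindex(
    source: dict[str, str], indexed: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Diff the source corpus against the hashes stored at last embed.

    source:  doc_id -> current document text
    indexed: doc_id -> content hash recorded when the doc was last embedded
    Returns (ids to re-embed, ids to delete from the index).
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in source.items()
        if indexed.get(doc_id) != content_hash(text)  # new or edited
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in source)
    return to_embed, to_delete


# Hypothetical state: "a" was edited, "b" is unchanged, "c" was retired, "d" is new.
indexed = {
    "a": content_hash("old policy"),
    "b": content_hash("fees"),
    "c": content_hash("retired"),
}
source = {"a": "new policy", "b": "fees", "d": "brand new"}

to_embed, to_delete = plan_reindex(source, indexed)
print(to_embed, to_delete)  # ['a', 'd'] ['c']
```

The design choice is that the job never asks the CMS what changed. It looks at the content and decides for itself.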

Below is roughly what the recall harness looks like. It runs as a cron job on the same compute that handles the index, takes about 90 seconds against our typical corpus size, and writes a row per run to a small Postgres table.

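eval/retrieval_recall.py · python 3.11

from dataclasses import dataclass

from ground_truth import load_eval_set  # project module (not shown)
from retriever import Retriever  # project module (not shown)


@dataclass
class RecallResult:
    recall: float  # mean per-query recall@k
    mean_age_days: float  # mean age of all retrieved chunks
    n_queries: int


def eval_recall(retriever: Retriever, k: int = 10) -> RecallResult:
    """Score recall@k and retrieved-chunk age over the held-out set."""
    # held-out: ~600 (query, expected_chunk_ids) pairs
    eval_set = list(load_eval_set("recall_v3"))
    if not eval_set:
        raise ValueError("recall_v3 eval set is empty")

    recall_sum, ages = 0.0, []
    for query, expected_ids in eval_set:
        expected = set(expected_ids)  # accept lists as well as sets
        if not expected:
            raise ValueError(f"no expected chunks for query: {query!r}")
        retrieved = retriever.search(query, k=k)
        retrieved_ids = {chunk.id for chunk in retrieved}
        # Fraction of this query's expected chunks found in the top k.
        recall_sum += len(retrieved_ids & expected) / len(expected)
        ages.extend(chunk.age_days for chunk in retrieved)

    return RecallResult(
        recall=recall_sum / len(eval_set),
        # NaN rather than a crash if the index returned no chunks at all.
        mean_age_days=sum(ages) / len(ages) if ages else float("nan"),
        n_queries=len(eval_set),
    )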
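The paging step that sits on top of the harness can be as small as a threshold check over each run's result. A minimal sketch, assuming a `RecallSnapshot` stand-in that mirrors the harness result's fields; the `breaches` helper and both threshold values are hypothetical placeholders rather than the ones we run with, and the actual pager integration is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class RecallSnapshot:
    # Stand-in for the harness result; same three fields (assumption).
    recall: float
    mean_age_days: float
    n_queries: int


def breaches(
    result: RecallSnapshot,
    min_recall: float = 0.80,  # illustrative threshold
    max_mean_age_days: float = 10.0,  # illustrative threshold
) -> list[str]:
    """Return human-readable breach messages; an empty list means healthy."""
    problems = []
    if result.recall < min_recall:
        problems.append(f"recall@k {result.recall:.2f} below {min_recall:.2f}")
    if result.mean_age_days > max_mean_age_days:
        problems.append(
            f"mean chunk age {result.mean_age_days:.0f}d "
            f"above {max_mean_age_days:.0f}d"
        )
    return problems


# Day-0 and day-21 figures from the metrics table.
day_0 = RecallSnapshot(recall=0.87, mean_age_days=4, n_queries=600)
day_21 = RecallSnapshot(recall=0.71, mean_age_days=18, n_queries=600)

print(breaches(day_0))   # []
print(breaches(day_21))  # ['recall@k 0.71 below 0.80', 'mean chunk age 18d above 10d']
```

With placeholder thresholds like these, both gauges would have fired well before CSAT visibly moved.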

Two days after the rewrite, retrieval recall was back above 0.85, mean chunk age was under 6 days, and CSAT recovered to 89. In a scenario like this the end users never see the diagnosis. They just see the system get better.

What goes in evals, exactly

We've now standardized this. Every agentic system we ship runs at least these five metrics daily, and each has a documented threshold and an on-call page if it breaches.

Retrieval recall@k against a held-out ground-truth set, refreshed quarterly.
Citation match rate. Does the cited chunk actually contain the answer?
Chunk freshness. Age of retrieved chunks: mean, p50, and p99.
Generation quality via judge-LLM, but only meaningful when retrieval is healthy.
Drift detector. KL divergence of query-embedding distribution vs. a baseline.
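The drift detector is the least obvious of the five, so here is one way it might be computed. This sketch assumes each query embedding has already been mapped to its nearest of a fixed set of centroids (for example, k-means fitted on baseline traffic), and compares the resulting histograms with KL(current || baseline). The binning step, the smoothing constant, and the direction of the divergence are all assumptions; the metric list above only says "KL divergence vs. a baseline".

```python
import math
from collections import Counter


def histogram(assignments: list[int], n_bins: int, alpha: float = 1.0) -> list[float]:
    """Smoothed probability per bin; additive smoothing avoids log(0)."""
    counts = Counter(assignments)
    total = len(assignments) + alpha * n_bins
    return [(counts.get(b, 0) + alpha) / total for b in range(n_bins)]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical centroid assignments for 100 queries each.
baseline = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
same_mix = [0] * 48 + [1] * 32 + [2] * 15 + [3] * 5
shifted = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 40  # a new topic dominates

base_hist = histogram(baseline, n_bins=4)
kl_same = kl_divergence(histogram(same_mix, n_bins=4), base_hist)
kl_shifted = kl_divergence(histogram(shifted, n_bins=4), base_hist)
print(f"same mix: {kl_same:.4f}  shifted: {kl_shifted:.4f}")
```

A near-zero value means users are asking roughly what they asked at baseline. A large value means the question mix has moved, which is a prompt to check whether the corpus and the held-out set have moved with it.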

None of these are exotic. None of them require research papers. The reason most teams don't run them isn't capability. It's that the dashboard the deploy team built on day 0 only showed generation quality, and once the system was "live" no one revisited the dashboard.

↳ takeaway · for anyone shipping: Evals are a product surface. Treat the dashboard the same way you'd treat the UI: who looks at it, how often, what action they take when a number moves. A dashboard nobody reads is worse than no dashboard, because it gives you a false sense that you'd notice if something broke.

Takeaways

If you're running a RAG system in production, three asks:

Measure retrieval recall daily against a held-out set. Generation quality is downstream; if you're only watching generation, you're watching a lagging indicator of a lagging indicator.
Compute a chunk-freshness gauge and put it on the wall next to latency and cost. A system whose retrieved chunks are getting older every week is a system that is silently breaking.
Don't trust your CMS's last_modified. Compute content hashes. The extra cycles are cheap. The week of degraded production is not.
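For the chunk-freshness gauge in the second ask, the standard library is enough. A minimal sketch, assuming each retrieved chunk carries the timestamp at which it was last embedded; `freshness_gauge` and that `embedded_at` input are hypothetical names, and the sample data is made up.

```python
import statistics
from datetime import datetime, timedelta, timezone


def freshness_gauge(embedded_at: list[datetime], now: datetime) -> dict[str, float]:
    """Age in days of retrieved chunks: mean, median, and 99th percentile.

    Needs at least two timestamps on Python versions before 3.13.
    """
    ages = sorted((now - ts).total_seconds() / 86400 for ts in embedded_at)
    # quantiles(n=100) returns 99 cut points; the last one is p99.
    p99 = statistics.quantiles(ages, n=100, method="inclusive")[98]
    return {
        "mean_days": statistics.fmean(ages),
        "p50_days": statistics.median(ages),
        "p99_days": p99,
    }


# Made-up sample: nine chunks embedded 2 days ago, one embedded 30 days ago.
now = datetime(2024, 2, 1, tzinfo=timezone.utc)
stamps = [now - timedelta(days=2)] * 9 + [now - timedelta(days=30)]

gauge = freshness_gauge(stamps, now)
print(gauge)
```

In this sample the single 30-day-old chunk leaves the median at 2 days but pulls p99 above 27, which is why the tail is worth plotting next to the mean.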

Silent retrieval drift on what looks like a stable deployment is one of the most common ways a healthy-looking RAG system goes wrong. Run RAG in production long enough and you will meet this failure mode. The point of a worked example is to recognize it before your CSAT does.

If you'd like us to look at your eval suite, the contact form is the fastest way. We do free 30-minute reviews for production systems.

· end · tx 018 ·

Jean Pierre Levac

Founder of Acceleratech, the AI and workflow automation services arm of JPL Digital Growth Group. Writes and edits the field notes published here.

