Customer satisfaction scores are collected after the interaction. Customers rate how they felt, which is a proxy for whether their problem was solved, which is itself a proxy for whether the agent performed correctly. At each remove, signal degrades. By the time a hallucination (a fabricated return policy, an invented shipping timeframe, a made-up product spec) shows up in your CSAT, it has already appeared in dozens of interactions you'll never trace back to the root cause.
The more insidious problem: CSAT scores are systematically biased toward interactions that resolved. Customers whose issue wasn't resolved often don't complete the survey. Customers who were confidently given wrong information and acted on it don't always realize the error in time to rate the interaction poorly. A CS agent can be factually wrong at a meaningful rate and still maintain a 4-star rating.
We've seen this pattern repeatedly. Teams deploy a CS agent, watch CSAT hold steady or improve (customers appreciate the speed), and conclude the agent is performing well. Then a policy change happens, the agent's training data or retrieval corpus doesn't reflect it, and the agent continues confidently giving outdated answers. CSAT doesn't move for weeks. Returns spike. Support volume spikes. By then the causal chain is buried.
CSAT is a lagging indicator of a lagging indicator
The damage from a CS hallucination is downstream of the interaction that caused it. The customer says thank you and hangs up. The problem surfaces days later, in a return, a chargeback, or a regulatory complaint. CSAT, collected at the moment the customer feels heard, cannot see any of that.
The numbers in the next section come from an illustrative audit scenario for a CS agent. Customers giving 4-star ratings are, in several cases, rating interactions where the agent gave them incorrect information they had not yet acted on. The satisfaction score is real. The accuracy is not.
The three tiers. Only one is optional.
CS agent quality decomposes into three measurement tiers. The first tier, vanity metrics, is what most teams report. The second tier, operational metrics, is what most teams can instrument with reasonable effort. The third tier, trust metrics, is what most teams skip and what matters most.
| Metric | Typical value | Benchmark range | Signal type |
|---|---|---|---|
| **Tier 1 · Vanity (measure, don't optimize)** | | | |
| CSAT Score (post-interaction satisfaction rating) | 4.2 / 5 | 3.8 – 4.5 typical | lagging / biased |
| Deflection Rate (% of contacts resolved without human) | 68% | 60 – 80% claimed | vanity / gameable |
| **Tier 2 · Operational (instrument these first)** | | | |
| First Response Latency (p50 / p99 time to first substantive reply) | 4s / 18s | <8s p50 target | leading / actionable |
| Escalation Rate (% of sessions transferred to human) | 23% | 10 – 20% healthy | watch / trending |
| Repeat Contact Rate (% returning within 24h for same issue) | 14% | <8% target | leading for quality |
| Refusal Rate (% of valid queries refused or deflected) | 2.1% | <3% healthy | leading / testable |
| **Tier 3 · Trust (the ones that matter most)** | | | |
| Hallucination Rate (% of responses with ungrounded factual claims) | 8.1% | <1% required | alarm / CSAT-invisible |
| Policy Accuracy (% of policy-citing responses citing correctly) | 91.4% | >99% required | alarm / liability |
| Grounding Citation Rate (% of factual claims traceable to source) | 78% | >95% target | proxy for hallucination rate |
| Unwarranted Confidence Rate (definitive answers on genuinely uncertain things) | 5.3% | <2% target | alarm / hard to detect |
The CSAT score was 4.2. The hallucination rate was 8.1%. Both were true simultaneously. The first tier looked fine. The third tier was on fire. No dashboard built only on tier 1 would have caught it.
What hallucination looks like in customer service. Specifically.
CS hallucinations have a different character from the hallucinations people worry about in research or coding contexts. They tend to be plausible, confident, and about things the customer has no immediate way to verify: return windows, shipping estimates, warranty terms, feature availability in specific plans. The customer says thank you and hangs up. The problem surfaces days later.
Agent said: "Absolutely, our return policy covers 60 days from purchase, so you're well within the window."
Actual policy: 30-day return window. The customer was 2 days outside it.
Root cause: The policy changed from 60 to 30 days 6 weeks prior. The agent's retrieval corpus hadn't been re-indexed. The old policy document was still the top result for "return window."
Three things made this case typical. First, the agent answered confidently without hedging: no "I believe" or "let me verify." Second, the error was grounded in a real document. It wasn't fabricated from nothing, it was the wrong version of a real thing. Third, CSAT for that interaction was 5 stars. The customer was delighted. The damage was downstream, when they showed up with the product and were told no.
The measurement implication: detecting this class of hallucination requires checking responses against the version of the source material that was in effect at the time of the interaction, not just against whether a source exists. A response can be fully cited and still be wrong. Citation coverage is necessary but not sufficient. You need version-aware grounding checks.
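To make "version-aware" concrete, here is a minimal sketch of one way such a check could work. It assumes each citation records a document ID and version number, and that a registry of effective-dated versions exists; `DocVersion`, `version_in_effect`, `stale_citations`, the registry shape, and the dates are all hypothetical names and values chosen to mirror the return-policy example, not part of any real stack.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DocVersion:
    version: int
    effective_from: datetime  # when this version became the policy in force


def version_in_effect(history: list, as_of: datetime) -> Optional[DocVersion]:
    """Return the version in force at `as_of`, or None if none had started yet."""
    ordered = sorted(history, key=lambda v: v.effective_from)
    starts = [v.effective_from for v in ordered]
    idx = bisect_right(starts, as_of)  # count of versions that started on or before as_of
    return ordered[idx - 1] if idx else None


def stale_citations(citations: list, registry: dict, as_of: datetime) -> list:
    """Return doc IDs whose cited version was not the one in force at `as_of`.

    Documents missing from the registry are flagged too, since they cannot be verified.
    """
    stale = []
    for c in citations:
        current = version_in_effect(registry.get(c["doc_id"], []), as_of)
        if current is None or c["version"] != current.version:
            stale.append(c["doc_id"])
    return stale


# Hypothetical registry: version 1 is the 60-day policy, version 2 the 30-day policy.
registry = {
    "return-policy": [
        DocVersion(1, datetime(2024, 1, 1)),
        DocVersion(2, datetime(2025, 3, 1)),
    ]
}

# The agent cited the old version about six weeks after the change.
citations = [{"doc_id": "return-policy", "version": 1}]
flagged = stale_citations(citations, registry, as_of=datetime(2025, 4, 12))
print(flagged)
```

The response in this scenario is fully cited, so a plain citation-coverage check would pass it. Only the comparison against the version in force exposes the problem.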
How to measure what actually matters
The tier-3 trust metrics require active measurement. They won't appear in any default analytics. Here's a practical approach, ordered by implementation cost. It starts with two tier-2 metrics that are cheap to instrument and works up to the trust metrics.
- Escalation rate: trivial to instrument, rich in signal

  `escalations / total sessions` · trended daily by intent category

  Escalation rate by intent category is more valuable than overall escalation rate. A spike in escalations on "billing dispute" queries is a specific, actionable signal. A spike in overall escalation rate is a starting point for investigation. The category breakdown is the signal; the aggregate is noise reduction.

  Target: <15% overall · alert on >25% in any single intent category

- Repeat contact rate: the resolution quality proxy

  `sessions where same customer reopens within 24h / total sessions`

  If a customer comes back within 24 hours with the same issue, the agent didn't resolve it, or resolved it incorrectly. Repeat contact rate is the closest CSAT-free proxy for resolution quality. It's measurable from session logs with no additional model calls. An 8% repeat contact rate means roughly 1 in 12 resolutions fails within a day.

  Target: <8% · anything above 12% indicates a systematic resolution failure

- Grounding citation rate: the hallucination leading indicator

  `responses with ≥1 traceable source citation / total factual responses`

  Require the agent to cite sources for factual claims. This doesn't prevent hallucination (a model can cite the wrong document) but it creates an auditable trail and raises the friction of hallucinating. Responses without citations, on queries that require factual grounding, are the highest-risk responses. Flag and sample-review them.

  Target: >95% citation rate on factual queries · sample-review uncited responses weekly

- Hallucination rate: requires active evaluation

  LLM-as-judge: `responses where agent claim ≠ source document / sampled responses`

  You cannot measure hallucination rate passively. It requires sampling responses, retrieving the source documents the agent cited or should have cited, and checking whether the agent's claim is consistent with those documents. A lightweight LLM-as-judge pass ("does this response contradict the cited source?") on 2–5% of interactions is tractable and sufficient for trend detection. Version-aware: the source must be the version current at the time of the interaction.

  Target: <1% · anything above 3% requires immediate retrieval corpus audit

- Policy accuracy: the liability metric

  `policy-citing responses verified correct against current policy / policy-citing responses`

  Policy claims are the highest-stakes responses in CS: return windows, warranty terms, pricing, entitlements. Maintain a policy golden set, a structured list of current policy facts with their correct values, and run every policy-citing response against it. This is not LLM-judged; it's a deterministic lookup. Wrong policy claim = 0, regardless of how confidently it was stated.

  Target: >99% · non-negotiable for legal and regulatory exposure
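The two log-only metrics in this list, repeat contact rate and grounding citation rate, can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the field names (`customer_id`, `intent`, `started_at`, `is_factual`, `citations`) are hypothetical, and "same issue" is approximated by a matching intent label.

```python
from datetime import datetime, timedelta


def repeat_contact_rate(sessions: list, window: timedelta = timedelta(hours=24)) -> float:
    """Share of sessions followed by another session from the same customer,
    with the same intent label, starting within `window`."""
    if not sessions:
        return 0.0
    by_key = {}
    for s in sessions:
        by_key.setdefault((s["customer_id"], s["intent"]), []).append(s["started_at"])
    reopened = 0
    for starts in by_key.values():
        starts.sort()
        # Compare each session with the next one for the same customer and intent.
        for earlier, later in zip(starts, starts[1:]):
            if later - earlier <= window:
                reopened += 1
    return reopened / len(sessions)


def grounding_citation_rate(responses: list) -> float:
    """Share of factual responses that carry at least one source citation."""
    factual = [r for r in responses if r["is_factual"]]
    if not factual:
        return 0.0
    return sum(1 for r in factual if r["citations"]) / len(factual)


# Hypothetical session log: c1 reopens after 6 hours, c2 after 48 hours.
sessions = [
    {"customer_id": "c1", "intent": "returns", "started_at": datetime(2025, 4, 1, 9, 0)},
    {"customer_id": "c1", "intent": "returns", "started_at": datetime(2025, 4, 1, 15, 0)},
    {"customer_id": "c2", "intent": "billing", "started_at": datetime(2025, 4, 1, 10, 0)},
    {"customer_id": "c2", "intent": "billing", "started_at": datetime(2025, 4, 3, 10, 0)},
    {"customer_id": "c3", "intent": "shipping", "started_at": datetime(2025, 4, 2, 8, 0)},
]

# Hypothetical response log: three factual responses, two of them cited.
responses = [
    {"is_factual": True, "citations": ["return-policy"]},
    {"is_factual": True, "citations": []},
    {"is_factual": True, "citations": ["shipping-faq"]},
    {"is_factual": False, "citations": []},
]

print(repeat_contact_rate(sessions))
print(grounding_citation_rate(responses))
```

Neither function needs a model call, which is why these two sit at the cheap end of the list. The uncited factual responses are the ones worth routing into the weekly sample review.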
The hallucination sampler. What it actually runs.
This is the lightweight version: a 2% sampling pass that runs asynchronously after each interaction and logs results to your quality dashboard. It adds zero latency to the customer-facing path.
```python
import random
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()

# Helpers assumed to be defined elsewhere in your stack (not shown here):
# fetch_source_version, format_sources, parse_json,
# log_hallucination_event, classify_severity, update_hallucination_rate_metric


# Runs async after interaction, zero customer-facing latency impact
def sample_for_hallucination(interaction: dict, sample_rate: float = 0.02):
    if random.random() > sample_rate:
        return  # not sampled this interaction

    agent_response = interaction["agent_response"]
    cited_sources = interaction["retrieved_chunks"]  # from retrieval layer
    interaction_ts = interaction["timestamp"]

    # ── Version-aware source fetch ────────────────────────────────
    # Retrieve the source as it existed at interaction time
    sources_at_time = [
        fetch_source_version(src["id"], as_of=interaction_ts)
        for src in cited_sources
    ]

    # ── LLM-as-judge: is the response consistent with sources? ────
    judge_prompt = f"""You are auditing a customer service response for factual accuracy.

Agent response:
{agent_response}

Source documents (current at time of interaction):
{format_sources(sources_at_time)}

For each factual claim in the agent response:
1. Identify the claim
2. Find the supporting source passage (if any)
3. Assess: CONSISTENT, INCONSISTENT, or UNVERIFIABLE

Respond in JSON: {{"claims": [{{"text": ..., "verdict": ..., "reason": ...}}]}}"""

    result = client.messages.create(
        model="claude-haiku-4-5",  # cheap judge, not the production model
        max_tokens=1000,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    claims = parse_json(result.content[0].text)["claims"]

    # ── Score and emit ────────────────────────────────────────────
    inconsistent = [c for c in claims if c["verdict"] == "INCONSISTENT"]
    if inconsistent:
        log_hallucination_event({
            "interaction_id": interaction["id"],
            "inconsistent_claims": inconsistent,
            "severity": classify_severity(inconsistent),
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "sampled_at": datetime.now(timezone.utc).isoformat(),
        })

    # Compute running rate, alert if 30-day rate exceeds threshold
    update_hallucination_rate_metric(
        has_hallucination=bool(inconsistent),
        interaction_id=interaction["id"],
    )
```
Two design decisions worth calling out. First: the judge uses claude-haiku-4-5, not the production model. The task ("does this claim contradict this passage?") is a classification problem, not a generation problem. Haiku handles it well at a fraction of Sonnet's cost. Second: version-aware source fetch. Without it, you'd be checking responses against the current source document, not the one that existed when the interaction happened. A policy change last week would make pre-change interactions look like hallucinations when they weren't.
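The sampler's last call, `update_hallucination_rate_metric`, is not shown in this post. One plausible shape for it is a rolling 30-day window over sampled interactions, sketched here. The `RollingRate` class, the `min_samples` guard, and the assumption that timestamps arrive in order are all illustrative; the 3% alert level mirrors the audit trigger given earlier for hallucination rate.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingRate:
    """Rolling hallucination rate over a time window of sampled interactions."""

    def __init__(self, window: timedelta = timedelta(days=30), alert_above: float = 0.03):
        self.window = window
        self.alert_above = alert_above
        self._events = deque()  # (timestamp, has_hallucination) pairs, oldest first
        self._flagged = 0

    def record(self, ts: datetime, has_hallucination: bool) -> None:
        """Add one sampled interaction. Assumes timestamps arrive in order."""
        self._events.append((ts, has_hallucination))
        self._flagged += int(has_hallucination)
        # Evict samples that have fallen out of the window.
        cutoff = ts - self.window
        while self._events and self._events[0][0] < cutoff:
            _, old_flag = self._events.popleft()
            self._flagged -= int(old_flag)

    @property
    def rate(self) -> float:
        return self._flagged / len(self._events) if self._events else 0.0

    def should_alert(self, min_samples: int = 100) -> bool:
        # Require a minimum sample count so a handful of early samples cannot trigger an alert.
        return len(self._events) >= min_samples and self.rate > self.alert_above


# Hypothetical stream: 200 sampled interactions, one flagged in every 20.
tracker = RollingRate()
start = datetime(2025, 1, 1)
for i in range(200):
    tracker.record(start + timedelta(hours=i), has_hallucination=(i % 20 == 0))

print(tracker.rate, tracker.should_alert())
```

Because only 2% of traffic is sampled, the window holds few data points at low volume. A minimum-sample guard of some kind keeps the alert from firing on noise.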
What to put on the wall. And what to leave off.
The hardest part of this isn't the instrumentation, it's the reporting. Quality dashboards that show everything create alert fatigue; dashboards that show only CSAT create false confidence. The three-tier structure suggests a three-panel layout, and the tier labels tell you how much attention to give each one.
The escalation rate breakdown by intent category deserves its own mention. When escalation rate spikes, you want to know immediately whether it's "billing disputes" (a policy change, a billing system issue) or "product questions" (a knowledge gap, a retrieval failure) or "all categories equally" (a model update, a system prompt regression). The overall number tells you something is wrong. The breakdown tells you where to look.
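A minimal sketch of that breakdown, assuming each session record carries an `intent` label and an `escalated` flag (both hypothetical field names). The default thresholds reuse the targets given earlier: under 15% overall, alert above 25% in any single intent category.

```python
from collections import Counter


def escalation_by_intent(sessions: list) -> dict:
    """Escalation rate per intent category."""
    totals, escalated = Counter(), Counter()
    for s in sessions:
        totals[s["intent"]] += 1
        escalated[s["intent"]] += int(bool(s["escalated"]))
    return {intent: escalated[intent] / n for intent, n in totals.items()}


def escalation_report(sessions: list, overall_target: float = 0.15,
                      category_alert: float = 0.25) -> dict:
    """Overall rate plus the intent categories that breach the per-category alert level."""
    per_intent = escalation_by_intent(sessions)
    overall = sum(bool(s["escalated"]) for s in sessions) / len(sessions) if sessions else 0.0
    hot = sorted(i for i, rate in per_intent.items() if rate > category_alert)
    return {"overall": overall, "over_target": overall > overall_target, "hot_intents": hot}


# Hypothetical day of traffic: billing disputes escalate at 40%, everything else is quiet.
sessions = (
    [{"intent": "billing_dispute", "escalated": i < 4} for i in range(10)]
    + [{"intent": "product_question", "escalated": i < 1} for i in range(10)]
    + [{"intent": "shipping", "escalated": False} for _ in range(20)]
)

report = escalation_report(sessions)
print(report)
```

In this made-up sample the overall rate is 12.5%, under target, while billing disputes alone sit at 40%. The aggregate would have stayed quiet; only the per-category view raises the alert.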
One instrument that takes an afternoon to build and pays back permanently: a policy golden set. Maintain a JSON file of current policy facts (return window, warranty duration, shipping timelines, plan-level feature availability) with their correct values and effective dates. Run every policy-citing response against it automatically. Wrong policy answer flagged, regardless of tone or confidence. This is the only deterministic check in an otherwise probabilistic quality stack, and it covers the highest-liability category.
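As a rough illustration of how small this instrument can be, here is a sketch of a golden set and its deterministic check. The JSON schema, the fact names, and the values are hypothetical. It also assumes policy claims have already been extracted from the response into structured `fact: value` pairs by an upstream step that is not shown.

```python
import json
from datetime import date

# Hypothetical golden set: each entry is a policy fact, its value, and its effective date.
GOLDEN_SET_JSON = """
[
  {"fact": "return_window_days", "value": 60, "effective_from": "2024-01-01"},
  {"fact": "return_window_days", "value": 30, "effective_from": "2025-03-01"},
  {"fact": "warranty_months", "value": 12, "effective_from": "2024-01-01"}
]
"""


def load_golden_set(raw: str) -> list:
    entries = json.loads(raw)
    for entry in entries:
        entry["effective_from"] = date.fromisoformat(entry["effective_from"])
    return entries


def expected_value(golden: list, fact: str, as_of: date):
    """Value of `fact` in force on `as_of`, or None if the fact is not in the golden set."""
    candidates = [e for e in golden if e["fact"] == fact and e["effective_from"] <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["effective_from"])["value"]


def check_policy_claims(golden: list, claims: dict, as_of: date) -> dict:
    """Deterministic lookup: a claim passes only if it equals the golden value.

    Facts missing from the golden set fail, so gaps get flagged rather than silently passed.
    """
    return {fact: expected_value(golden, fact, as_of) == claimed
            for fact, claimed in claims.items()}


golden = load_golden_set(GOLDEN_SET_JSON)
# Claims extracted from one response, checked as of the interaction date.
result = check_policy_claims(
    golden,
    {"return_window_days": 60, "warranty_months": 12},
    as_of=date(2025, 4, 12),
)
policy_accuracy = sum(result.values()) / len(result)
print(result, policy_accuracy)
```

Keeping effective dates in the file means older interactions can be scored against the policy that applied when they happened, not against today's values.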
If you would like us to help wire one of these into your CS stack, the contact form is the fastest way. We do 30-minute reviews for production agent stacks, free.