The category of failure that evals exist to catch isn't "the model returns an error." It is "the model returns something plausible that is subtly wrong in a way that only becomes visible three steps downstream." Providers update models, rate-limit behaviors change, context window handling shifts between minor versions. None of it announces itself.
The reflex response is to build a comprehensive eval suite: golden datasets, human raters, automated LLM-as-judge pipelines, regression dashboards. All of that is valuable. None of it is what a 12-person team should spend their first sprint on. The 80% of value comes from a much smaller surface area than people expect.
What you actually need is a harness that runs on every model-touching PR, finishes in under two minutes, and produces a binary pass/fail that a CI gate can act on. The six assert patterns below are what we have converged on after running this across a dozen agent builds. They do not cover everything. They cover the things that break most often.
Model swaps break things quietly
Every production incident we have traced back to a model regression since 2024 had the same shape: the model did not error. It returned plausible output that violated a structural property the downstream system assumed. Field renamed in the JSON. Citation pointing to a nonexistent context document. Tool invocation skipped. A reply twice as long as the system prompt expects. The failure surface is structural, not semantic, and that is the only reason a fast eval is even possible.
Semantic correctness is hard. Structural conformance is cheap. A six-pattern harness aims squarely at the cheap layer, accepts that it will miss meaning drift, and uses the saved time to actually run on every PR. The trade is worth it because a slow tripwire that gets disabled because it is annoying is worse than no tripwire at all.
All six, annotated
Each "line" is a pytest test function that wraps a model call and asserts a structural property of the output. The model call itself is real, with no mocking. The corpus of test cases is minimal: three to five per pattern, chosen to be maximally diagnostic rather than maximally comprehensive.
    import json
    import re

    import pytest  # supplies the parametrized `case` argument (setup not shown)

    # run_with_trace and similarity are assumed to live next to run in agent
    from agent import run, run_with_trace, similarity, MODEL

    # Three of the tokens; the full list has eight more variants
    REFUSAL_TOKENS = ("I can't", "I'm unable", "I cannot help")

    # ── 1. FORMAT LOCK ──────────────────────
    def test_json_schema(case):
        out = run(case.prompt)
        # Compare top-level keys only; invalid JSON raises and fails the test
        assert set(json.loads(out)) == set(case.schema)

    # ── 2. REFUSAL SURFACE ──────────────────
    def test_no_refusal(case):
        out = run(case.prompt)
        assert not any(t in out for t in REFUSAL_TOKENS)

    # ── 3. CITATION INTEGRITY ───────────────
    def test_citations_grounded(case):
        out = run(case.prompt, ctx=case.context)
        refs = re.findall(r'\[(\d+)\]', out)
        # Citations are 1-indexed, so [0] is as invalid as an out-of-range number
        assert all(1 <= int(r) <= len(case.context) for r in refs)

    # ── 4. LENGTH CONTRACT ──────────────────
    def test_length_bounds(case):
        out = run(case.prompt)
        # Whitespace split counts words, used here as a cheap proxy for tokens
        assert case.min_tokens <= len(out.split()) <= case.max_tokens

    # ── 5. TOOL CALL SEQUENCE ───────────────
    def test_tool_sequence(case):
        trace = run_with_trace(case.prompt)
        assert [t.name for t in trace.calls] == case.expected_tools

    # ── 6. REGRESSION DELTA ─────────────────
    def test_regression_delta(case):
        score = similarity(run(case.prompt), case.golden)
        assert score >= 0.82  # cosine vs. pinned output
REFUSAL_TOKENS includes "I can't", "I'm unable", "I cannot help" and eight variants. False positives are rare on task-specific prompts.
The citation check asserts that every [N] reference points to a real context document. It catches models that hallucinate citation numbers when the context window is near-full or the retrieval is poor.
The test cases are the hard part, not the harness. Five cases per pattern, chosen to be maximally stressful for that specific failure mode, outperform fifty generic cases every time. For format lock: a prompt that historically produced valid JSON right at the boundary of the model's instruction-following. For refusal surface: a legitimate but edge-adjacent query that past model versions occasionally mishandled.
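The harness takes a `case` argument without showing where it comes from. The sketch below is one possible shape for it, not the original setup: the `Case` dataclass, the `name` field, the `case_ids` helper and the example prompts are all hypothetical. Only the attribute names (`prompt`, `schema`, `context`, `min_tokens`, `max_tokens`, `expected_tools`, `golden`) are taken from the harness code.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One eval case; field names mirror the attributes the harness reads."""
    name: str                                           # hypothetical: only used to build test ids
    prompt: str
    schema: set = field(default_factory=set)            # expected top-level JSON keys
    context: list = field(default_factory=list)         # documents cited as [1]..[n]
    min_tokens: int = 0
    max_tokens: int = 10_000
    expected_tools: list = field(default_factory=list)  # ordered tool names
    golden: str = ""                                    # pinned reference output


def case_ids(cases):
    """Build ids like 'write-task-0', numbering cases that share a name."""
    seen = {}
    ids = []
    for case in cases:
        index = seen.get(case.name, 0)
        ids.append(f"{case.name}-{index}")
        seen[case.name] = index + 1
    return ids


# Illustrative cases only; real prompts should come from your own incident history.
FORMAT_CASES = [
    Case("write-task", "Create a ticket for the login bug.", schema={"action", "target"}),
    Case("write-task", "Archive project Alpha.", schema={"action", "target"}),
    Case("lookup", "Who owns the billing service?", schema={"answer", "source"}),
]

# With pytest installed, these would feed a test roughly like:
#   @pytest.mark.parametrize("case", FORMAT_CASES, ids=case_ids(FORMAT_CASES))
#   def test_json_schema(case): ...
```

Keeping cases as plain data makes the hard part, choosing them, reviewable in a diff without touching the assertions.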
Why these six, not others
We arrived at this list by postmortem. Every production incident involving a model regression since 2024 was traced back to a root cause. These six patterns cover 80% of those root causes. The remaining 20% were domain-specific and required bespoke evals: interesting, but not generalizable.
| pattern | catches | cost / run |
|---|---|---|
| format_lock | Schema drift, JSON fence injection, field rename, unexpected nesting after model update | |
| no_refusal | Increased refusal rate on in-distribution queries, new safety filter collisions, policy-change regressions | |
| citations_grounded | Hallucinated reference IDs, out-of-bounds citations, citation duplication on full context windows | |
| length_bounds | Verbosity regressions, truncation on long outputs, runaway preamble / postamble patterns | |
| tool_sequence | Skipped verification steps, tool call reordering, missing tool calls on ambiguous queries | |
| regression_delta | Meaning drift on golden cases, tone shift, hallucinated facts that weren't previously present | |
Cost is relative: one pip means a token-cheap assertion (a structural check on the output string), two pips mean a second model call or retrieval lookup, and three pips mean an embedding comparison. The full suite at five cases per pattern runs in about 90 seconds on a warm connection and costs under $0.04 in API spend.
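A bare json.loads assertion fails on fence injection and on a renamed field alike, without saying which one happened. As a hedged illustration of the format_lock row, the sketch below separates those cases. `format_violations` is a hypothetical helper, not part of the harness above, and it inspects top-level keys only.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def format_violations(out: str, expected_keys: set) -> list:
    """Return human-readable structural violations; an empty list means pass."""
    text = out.strip()
    if text.startswith(FENCE):
        return ["output is wrapped in a Markdown code fence"]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return [f"top-level value is {type(data).__name__}, expected object"]

    keys = set(data)
    problems = []
    missing = expected_keys - keys
    extra = keys - expected_keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

In a test this would read `assert format_violations(out, case.schema) == []`, so the CI log names the violation instead of printing a bare AssertionError.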
The 0.82 threshold in regression_delta is not principled; it is empirical. We calibrated it against eighteen months of golden outputs and found it had a false-positive rate under 3% while catching every meaning regression we cared about. Yours will be different. Calibrate it, then pin it.
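One way to approach that calibration, sketched under assumptions: collect similarity scores for outputs you judged acceptable and for outputs you judged regressed, then take the highest threshold that keeps false positives under your budget. `calibrate_threshold` and the sample scores are invented for illustration and are not the procedure or data behind the 0.82 figure.

```python
def calibrate_threshold(good_scores, regressed_scores, max_fpr=0.03):
    """Pick the highest threshold whose false-positive rate is <= max_fpr.

    A false positive is a known-good output scoring below the threshold.
    Returns (threshold, false_positive_rate, regressions_missed).
    """
    if not good_scores:
        raise ValueError("need at least one known-good score")
    candidates = sorted(set(good_scores) | set(regressed_scores), reverse=True)
    for threshold in candidates:  # highest first, so the first hit is the strictest
        fpr = sum(s < threshold for s in good_scores) / len(good_scores)
        if fpr <= max_fpr:
            missed = sum(s >= threshold for s in regressed_scores)
            return threshold, fpr, missed
    raise ValueError("no candidate threshold satisfies max_fpr")


# Made-up scores, purely to show the call shape.
good = [0.95, 0.91, 0.88, 0.90, 0.93]
regressed = [0.70, 0.78, 0.81]
threshold, fpr, missed = calibrate_threshold(good, regressed)
```

If `missed` comes back above zero, the two score distributions overlap and no single floor separates them; that is a sign the similarity measure, not the threshold, needs attention.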
A real run: one failure
This is the output from a run triggered by a dependency bump that silently upgraded the model client from claude-sonnet-4-5 to a newer version. The suite caught it in 94 seconds. The failure was in tool_sequence: the updated model skipped the verify_permissions call that the older version reliably made before writing to a resource.
    collected 30 items

    PASSED test_json_schema[write-task-0]
    PASSED test_json_schema[write-task-1]
    PASSED test_json_schema[write-task-2]
    PASSED test_json_schema[lookup-0]
    PASSED test_json_schema[lookup-1]
    PASSED test_no_refusal[edge-query-0]
    PASSED test_no_refusal[edge-query-1]
    PASSED test_no_refusal[edge-query-2]
    PASSED test_no_refusal[edge-query-3]
    PASSED test_no_refusal[edge-query-4]
    PASSED test_citations_grounded[rag-fullctx-0]
    PASSED test_citations_grounded[rag-fullctx-1]
    PASSED test_citations_grounded[rag-fullctx-2]
    PASSED test_citations_grounded[rag-sparse-0]
    PASSED test_citations_grounded[rag-sparse-1]
    PASSED test_length_bounds[summary-short-0]
    PASSED test_length_bounds[summary-short-1]
    PASSED test_length_bounds[summary-long-0]
    PASSED test_length_bounds[summary-long-1]
    PASSED test_length_bounds[summary-long-2]
    FAILED test_tool_sequence[write-with-verify-0]
    FAILED test_tool_sequence[write-with-verify-1]
    FAILED test_tool_sequence[write-with-verify-2]
    PASSED test_tool_sequence[read-only-0]
    PASSED test_tool_sequence[read-only-1]
    PASSED test_regression_delta[golden-0]
    PASSED test_regression_delta[golden-1]
    PASSED test_regression_delta[golden-2]
    PASSED test_regression_delta[golden-3]
    PASSED test_regression_delta[golden-4]
    ─────────────────────────────────────────────────
    FAILED test_tool_sequence[write-with-verify-0]
    AssertionError:
      expected: ['fetch_resource', 'verify_permissions', 'write_resource']
      got:      ['fetch_resource', 'write_resource']
    ─────────────────────────────────────────────────
    27 passed, 3 failed in 94.2s
    Model under test: claude-sonnet-4-6 (bumped from claude-sonnet-4-5)
The failure surface is exact: the model skipped verify_permissions on write operations in all three write-path cases, and passed cleanly on read-only operations. That is a clean signal, not flakiness, not a threshold problem. The PR was blocked. The system prompt was updated to make the verification step explicit rather than implied, and the suite passed on rerun.
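The expected/got lists in that failure are short enough to compare by eye, but longer traces are not. A minimal sketch of a helper that names the skipped step directly; `missing_tools` is hypothetical and not part of the harness shown earlier.

```python
def missing_tools(expected, got):
    """Return expected tool names that are absent from `got` or appear out of order.

    Walks `got` left to right, so a tool that was called too early counts as
    missing from the position where it was expected.
    """
    missing = []
    position = 0
    for name in expected:
        try:
            position = got.index(name, position) + 1
        except ValueError:
            missing.append(name)
    return missing


expected = ["fetch_resource", "verify_permissions", "write_resource"]
got = ["fetch_resource", "write_resource"]
skipped = missing_tools(expected, got)  # ['verify_permissions']
```

Used as an assertion message (`assert got == expected, f"skipped: {missing_tools(expected, got)}"`), it puts the one relevant tool name in the first line of the CI log.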
This is what the harness is for. Not to prove the model is good. To prove it has not gotten worse on the things that matter most, fast enough that the signal is actionable before the deploy.
Where it lives in CI
The harness runs in two modes: the fast tier on every PR that touches a model-adjacent file, and the slow tier on scheduled nightly runs and before any production deploy. The fast tier runs all six patterns at three cases each. The slow tier expands to ten cases per pattern and adds a set of domain-specific evals that are too expensive to run on every push.
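[Figure: CI flow. A model bump or a changed model-adjacent file triggers the fast tier (3 cases × 6 patterns, runs in parallel, must pass), followed by the slow tier of evals on nightly runs and deploys (10 cases × 6 + custom, any failure).]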
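The two tiers can be driven from one case corpus. The sketch below assumes that each pattern's cases are ordered most diagnostic first and that "model-adjacent" can be expressed as path globs. The globs, `touches_model` and `select_cases` are hypothetical, and the domain-specific evals in the slow tier are not modeled.

```python
from fnmatch import fnmatch

# Hypothetical globs; substitute whatever counts as model-adjacent in your repo.
MODEL_ADJACENT = ("prompts/*", "agent/*.py", "requirements*.txt")

# Cases per pattern in each tier, as described in the text.
TIER_SIZES = {"fast": 3, "slow": 10}


def touches_model(changed_files, patterns=MODEL_ADJACENT):
    """True if any changed path matches a model-adjacent glob."""
    return any(fnmatch(path, pat) for path in changed_files for pat in patterns)


def select_cases(cases_by_pattern, tier):
    """Keep the first N cases of every pattern, where N depends on the tier."""
    limit = TIER_SIZES[tier]
    return {name: cases[:limit] for name, cases in cases_by_pattern.items()}
```

Because the fast tier is a prefix of the slow tier, a case that fails nightly can be promoted into the PR gate by reordering the list.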
One friction point we hit early: the harness needs a stable model identifier to be meaningful. If MODEL resolves to "latest" at test time, the test result is non-reproducible and the gate is useless. Every test run logs the resolved model version, and the gate blocks if the version hash differs from what was reviewed. The version is what you are testing: make it explicit.
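A minimal sketch of that gate, assuming the resolved identifier and the reviewed identifier are both available as strings at test time. `check_model_pin` and `FLOATING_ALIASES` are hypothetical names, and how the reviewed version is stored (lockfile, PR label, config) is left open.

```python
# Substrings that mark an identifier as floating rather than pinned (assumed list).
FLOATING_ALIASES = ("latest",)


def check_model_pin(resolved: str, reviewed: str) -> None:
    """Raise ValueError if the model under test is floating or not the reviewed one."""
    lowered = resolved.lower()
    if any(alias in lowered for alias in FLOATING_ALIASES):
        raise ValueError(f"model identifier {resolved!r} is a floating alias; pin a version")
    if resolved != reviewed:
        raise ValueError(f"model under test is {resolved!r}, but {reviewed!r} was reviewed")


# Passes silently when the versions match.
check_model_pin("claude-sonnet-4-5", "claude-sonnet-4-5")
```

Running this once at session start, before any model call, keeps a mismatched version from spending API budget on a run whose result would be discarded anyway.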
A second friction point: parallelizing model calls in CI will hit rate limits if you are not careful. We run the six patterns sequentially but parallelize the cases within each pattern. Three cases in parallel is safe at standard rate limits; more than that requires a dedicated eval API key with elevated quotas.
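The scheduling described above, patterns in sequence and cases in parallel with a cap of three, can be sketched with the standard library alone. `run_suite` is a hypothetical driver that sits outside pytest, and each `check` is assumed to be a function that takes a case and returns True or False.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(patterns, max_parallel=3):
    """Run patterns one after another; run each pattern's cases concurrently.

    `patterns` maps a pattern name to a (check_fn, cases) pair. At most
    `max_parallel` model calls are in flight at any moment.
    """
    results = {}
    for name, (check, cases) in patterns.items():
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            # pool.map preserves case order, so results line up with cases
            results[name] = list(pool.map(check, cases))
    return results
```

Threads are enough here because the work is network-bound. The `with` block waits for every case in a pattern to finish before the next pattern starts, which is what keeps concurrency at the cap.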
What the six lines won't catch
This matters. A harness that teams believe covers more than it does is more dangerous than one they know is partial.
If outputs drift a little with each model update, regression_delta with a fixed 0.82 threshold may not catch it. Each individual change is too small to trip the threshold; the cumulative drift is significant. We address this by re-baselining golden outputs quarterly and graphing delta scores over time rather than just checking the boolean.
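Tracking scores over time can also be done numerically. The sketch below compares the average of the earliest runs with the average of the most recent ones for a single golden case. `drift_report`, the window size, the 0.05 drop limit and the sample history are all assumptions chosen for illustration.

```python
from statistics import mean


def drift_report(history, window=3, max_drop=0.05):
    """Compare early and recent similarity scores for one golden case.

    `history` is ordered oldest first. Returns None when there are too few
    runs to fill two non-overlapping windows.
    """
    if len(history) < 2 * window:
        return None
    baseline = mean(history[:window])
    recent = mean(history[-window:])
    drop = baseline - recent
    return {
        "baseline": round(baseline, 4),
        "recent": round(recent, 4),
        "drop": round(drop, 4),
        "drifting": drop > max_drop,
    }


# Made-up history: every run clears the 0.82 floor, yet the trend is downward.
history = [0.95, 0.94, 0.95, 0.92, 0.90, 0.89, 0.87, 0.86, 0.85]
report = drift_report(history)
```

A report like this belongs in the nightly tier as a warning rather than in the PR gate, since it depends on stored history instead of the current run alone.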
The regression_delta check is a similarity floor, not a correctness check. A response can be highly similar to the golden output and still contain a factual error. Catching correctness requires either human review or a model-as-judge layer, neither of which belongs in a sub-two-minute CI gate.
The harness is a floor, not a ceiling. Build up from it. The value is not that it is comprehensive: it is that it is fast enough to actually run, cheap enough to not argue about, and precise enough to block the deploys that matter.
If you would like us to help wire one of these into your CI, the contact form is the fastest way. We do 30-minute reviews for production agent stacks, free.