Consider a research agent that enters an undetected retry loop after receiving an ambiguous tool response. With no termination condition beyond a per-session step cap (which the loop resets on each new sub-task) and no cost alerting in place, the agent makes 24,847 API calls over 9 hours and 14 minutes, accumulating $3,218 in API spend before an engineer notices the anomaly during a routine morning check. This post walks that failure end to end.
Minute by minute, then hour by hour
The incident timeline below records every state change from normal invocation to containment. Spend figures are the running API cost at each checkpoint.
| time | state | what happened | spend |
|---|---|---|---|
| 22:04 | Normal | Research agent invoked for nightly competitive analysis run. Scheduled task. Agent tasked with researching pricing changes across 12 competitor products. Normal invocation. No anomalies in first 3 minutes of execution. | n/a |
| 22:07 | Warning (undetected) | Tool response returns ambiguous result on competitor #7. The web scraper tool returned a 200 status with an empty body, a known edge case on a rate-limited site. The agent interpreted this as "task incomplete, retry." This was the correct interpretation. The problem was what happened next. | ~$0.12 |
| 22:07–22:19 | Loop begins | Agent spawns sub-task, resets its own step counter. The agent's retry logic spawned a new sub-task for competitor #7. The sub-task inherited a fresh step budget. The sub-task also failed. It spawned another sub-task. The step cap, intended as the loop-breaker, was per-task, not per-workflow. The agent had found a gap between the two scopes and was threading through it. | ~$18 · ~$1.50/min |
| 22:19–04:30 overnight | Active loop · undetected | Loop runs for 6 hours and 11 minutes without alert. No cost threshold alert existed. The API spend dashboard updated hourly; the on-call rotation was not subscribed to it. The agent was healthy by every metric that was being monitored: no errors, no timeouts, no failed requests. It was making successful API calls in a loop. From the infrastructure's perspective, it was working perfectly. | ~$2,900 · ~24,000 calls |
| 07:18 | Detection | Engineer notices anomalous API usage during morning check. Not an automated alert. An engineer opened the provider dashboard during their morning routine and saw the usage graph. Manual detection, 9 hours and 14 minutes after the loop began. The agent was terminated immediately. | $3,218 · 24,847 calls |
| 07:22 | Contained | Agent terminated, incident declared, post-mortem begun. The workflow was killed. No customer-facing systems were affected, this was a background analysis task. The competitive analysis report was not produced. The post-mortem was started the same morning. | n/a |
The spend curve, hour by hour
Total API spend over the incident window. The loop warms up after 23:00, plateaus near $400/hour through the night, and is cut off by manual termination at 07:18.
Three missing safeguards. All three required.
This was not a rare edge case. The trigger, an empty 200 response from a rate-limited scraper, happens on roughly 3% of scraper invocations. The loop was a foreseeable outcome of a foreseeable input, with no defense in depth. The post-mortem identified five contributing factors. Three were independent safeguards that, had any one of them been in place, would have contained the incident.
| # | factor | severity |
|---|---|---|
| 01 | No confidence budget, the primary root cause. The agent had no mechanism to detect that its progress was stalling. Each retry was indistinguishable from productive work at the infrastructure level: a call was made, a result was received. A confidence budget would have measured the novelty of each observation against the session context and detected within 8–10 iterations that the agent was receiving no new information. It would have triggered graceful fallback rather than continuing to spend. See Building a planner that knows when to give up. | Root cause |
| 02 | Step cap was per-task, not per-workflow. The existing step cap was a correct idea, incorrectly scoped. Capping steps per sub-task while allowing unlimited sub-task spawning is equivalent to no cap: any agent that can recurse will route around it. The cap needed to be at the workflow level, counting all steps across all sub-tasks in a single invocation tree. This is a design error, not a configuration error. See Why we stopped writing custom orchestrators. | Root cause |
| 03 | No cost alerting, the detection failure. The incident ran for 9 hours and 14 minutes undetected. The provider dashboard updated hourly and was not monitored outside business hours. No alert existed for spend-rate anomalies, cumulative session spend, or calls-per-minute above threshold. Had a $50 per-session cost alert existed, the incident would have been caught within 33 minutes and cost under $100. See $0.0004 per agent step: how we made cost a first-class metric. | Root cause |
| 04 | No regression test for loop behavior. The eval harness at the time tested output quality on happy-path cases. It did not include a test that verified the agent terminated correctly on ambiguous tool responses, the exact input that triggered this incident. A single test case ("empty 200 response from tool, agent should exit gracefully within N steps") would have caught the missing termination condition at deploy time. See The 6-line eval suite we ship with every agent. | Contributing |
| 05 | On-call rotation not subscribed to spend anomalies. Even without automated alerting, the provider dashboard showed anomalous spend from 23:00 onward. The on-call engineer was not subscribed to dashboard notifications and did not check the dashboard during the night. The monitoring gap was organizational, not technical, which is why the technical fix (automated cost alerts) was necessary rather than sufficient. | Missed detection |
What the code looked like. Before and after.
The missing termination condition was three lines: a workflow-level step counter and a check before each sub-task spawn. The cost alert was a single CloudWatch rule. Neither required architectural changes. Both required the incident to make them feel necessary.
# ── BEFORE: sub-task spawn with no workflow-level guard ────────── def spawn_subtask(task: str, context: dict) -> dict: # No check on total workflow steps. This is the gap. sub_agent = ResearchAgent(max_steps=20) # per-task cap only return sub_agent.run(task, context) # ── AFTER: workflow-level budget enforced across all sub-tasks ──── @dataclass class WorkflowBudget: max_steps: int = 100 # across ALL sub-tasks max_spend_usd: float = 5.00 # hard dollar cap steps_used: int = 0 spend_usd: float = 0.0 def check(self) -> None: if self.steps_used >= self.max_steps: raise WorkflowBudgetExceeded(f"Step limit reached") if self.spend_usd >= self.max_spend_usd: raise WorkflowBudgetExceeded(f"Spend limit $ reached") def spawn_subtask(task: str, context: dict, budget: WorkflowBudget) -> dict: budget.check() # enforced before spawn sub_agent = ResearchAgent( max_steps=20, budget=budget, # shared budget object on_step=lambda cost: setattr( budget, "spend_usd", budget.spend_usd + cost ) ) return sub_agent.run(task, context) # The confidence budget (the planner post) sits above this. It # detects stalling before the dollar cap is reached. The dollar # cap is the last line of defense, not the first.
The WorkflowBudget is passed down through every sub-task spawn. Sub-tasks don't get a fresh budget, they draw from the same pool. The dollar cap is set conservatively low for the first deploy of any new workflow and raised as confidence in the workflow's expected spend range grows. The confidence budget from the planner post sits upstream and typically catches runaway loops before they hit the dollar cap; the dollar cap is the last-resort termination.
What changed. All of it.
| action | owner | status |
|---|---|---|
| Ship confidence budget to all agent workflows. The confidence budget primitive deployed to production. Catches novelty-stalling loops before they consume meaningful budget. p99 latency dropped 38% as a side effect. | Infra | Done |
Replace per-task step cap with workflow-level budget. WorkflowBudget shipped. All sub-task spawns require a shared budget reference. Default max_spend_usd: $5 for new workflows, raised per-workflow after review.
| Infra | Done |
| Per-session cost alerting, pages in under 5 minutes. Cost trace from the cost-as-a-metric work deployed. CloudWatch alert fires when any session exceeds $10. On-call paged within 4 minutes. Tested weekly. | Platform | Done |
| Add loop-behavior test cases to eval harness. Three new test cases added to the eval suite: empty-200 tool response, rate-limit response, timeout response. All must terminate within 5 steps via graceful fallback. | QA | Done |
| Migrate remaining workflows to LangGraph with shared budget. 7 of 11 custom-runtime workflows migrated to LangGraph, which enforces workflow-level state including budget. 4 remaining are on the Q3 migration list. | Infra | In progress |
| Spend anomaly runbook, on-call response procedure. Runbook drafted. Covers: confirm the alert, identify the workflow, terminate it, preserve logs for post-mortem, check for downstream effects. Not yet in rotation training. | On-call | Open |
Three things this incident changed permanently
Defense in depth is not optional for agentic loops. Any single safeguard can be routed around: the step cap was. The confidence budget, the workflow-level spend cap, and the real-time cost alert are three independent defenses. If one is missing, the others can still contain an incident. If all three are missing, you're writing a post-mortem.
Invisible success is the hardest failure mode to detect. The loop made 24,847 successful API calls. Every call returned 200. No error was thrown. The monitoring stack saw a healthy system. The only anomalous signal was spend, which nobody was watching. Success metrics don't catch loops; they confirm them.
The safeguards in this series are listed in order of incident prevention, not implementation complexity. The confidence budget was written before this incident happened but hadn't been deployed to all workflows. The eval harness existed but didn't cover the failure mode. The cost tracer didn't exist yet. All three posts were written, at least in part, because of this night.
If you would like us to pressure-test the safeguards on your own agent stack, the contact form is the fastest way. We do 30-minute reviews for production agent stacks, free.