Post-mortem: the loop that cost $3,200 overnight

Consider a research agent that enters an undetected retry loop after receiving an ambiguous tool response. With no termination condition beyond a per-task step cap (which every new sub-task resets) and no cost alerting in place, the agent makes 24,847 API calls over 9 hours and 14 minutes, accumulating $3,218 in API spend before an engineer notices the anomaly during a routine morning check. This post walks through that failure end to end.

↳ executive summary

No customer data was affected. The root cause was the absence of three safeguards that now ship as standard on every agent we build: a confidence budget, a workflow-level spend cap, and real-time cost alerting. The trigger was a known edge case that occurs on roughly 3% of scraper invocations. The loop was a foreseeable outcome of a foreseeable input, with no defense in depth.

duration

9h 14m

From loop start (22:07) to manual termination (07:22).

api spend

$3,218

Recoverable. The assumption that the safeguards were optional was not.

calls made

24,847

Every one returned 200. No error. No timeout. No failed request.

detection

Manual

An engineer opened the provider dashboard during a morning routine.

prevention

Automated

Three independent safeguards now ship on every agent.

Minute by minute, then hour by hour

The incident timeline below records every state change from normal invocation to containment. Spend figures are the running API cost at each checkpoint.

time	state	what happened	spend
22:04	Normal	Research agent invoked for nightly competitive analysis run. Scheduled task. Agent tasked with researching pricing changes across 12 competitor products. Normal invocation. No anomalies in first 3 minutes of execution.	n/a
22:07	Warning (undetected)	Tool response returns ambiguous result on competitor #7. The web scraper tool returned a 200 status with an empty body, a known edge case on a rate-limited site. The agent interpreted this as "task incomplete, retry." This was the correct interpretation. The problem was what happened next.	~$0.12
22:07–22:19	Loop begins	Agent spawns sub-task, resets its own step counter. The agent's retry logic spawned a new sub-task for competitor #7. The sub-task inherited a fresh step budget. The sub-task also failed. It spawned another sub-task. The step cap, intended as the loop-breaker, was per-task, not per-workflow. The agent had found a gap between the two scopes and was threading through it.	~$18 · ~$1.50/min
22:19–04:30 overnight	Active loop · undetected	Loop runs for 6 hours and 11 minutes without alert. No cost threshold alert existed. The API spend dashboard updated hourly; the on-call rotation was not subscribed to it. The agent was healthy by every metric that was being monitored: no errors, no timeouts, no failed requests. It was making successful API calls in a loop. From the infrastructure's perspective, it was working perfectly.	~$2,900 · ~24,000 calls
07:18	Detection	Engineer notices anomalous API usage during morning check. Not an automated alert. An engineer opened the provider dashboard during their morning routine and saw the usage graph. Manual detection, 9 hours and 14 minutes after the loop began. The agent was terminated immediately.	$3,218 · 24,847 calls
07:22	Contained	Agent terminated, incident declared, post-mortem begun. The workflow was killed. No customer-facing systems were affected; this was a background analysis task. The competitive analysis report was not produced. The post-mortem was started the same morning.	n/a
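
To make the scoping gap in the 22:07–22:19 row concrete, here is a minimal simulation, not the production agent. The names `CallCounter`, `scrape`, and `run_task` are hypothetical, the scraper stub always returns an empty 200, and it is an assumption that each task exhausts its own cap before spawning a retry. The `max_depth` argument exists only so the demo halts; the real loop had no such bound.

```python
from dataclasses import dataclass


@dataclass
class CallCounter:
    """Counts every tool call made anywhere in the invocation tree."""
    total: int = 0


def scrape(counter: CallCounter) -> tuple[int, str]:
    """Hypothetical scraper stub: always 'succeeds' with an empty body."""
    counter.total += 1
    return 200, ""


def run_task(counter: CallCounter, max_steps: int, depth: int, max_depth: int) -> bool:
    """Enforce a per-task step cap only; on exhaustion, retry via a new sub-task."""
    for _ in range(max_steps):
        status, body = scrape(counter)
        if status == 200 and body:
            return True  # a real result would end the task here
    if depth < max_depth:  # demo-only bound so the simulation terminates
        # The sub-task starts with a fresh step budget: this is the gap.
        return run_task(counter, max_steps, depth + 1, max_depth)
    return False


counter = CallCounter()
done = run_task(counter, max_steps=20, depth=0, max_depth=49)
print(done, counter.total)  # 50 tasks x 20 steps each
```

Every individual task respects its cap of 20, yet the total grows linearly with the number of spawned sub-tasks. Nothing in the per-task view ever looks wrong.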

The spend curve, hour by hour

Total API spend over the incident window. The loop warms up after 23:00, plateaus near $400/hour through the night, and stops once the agent is terminated after manual detection at 07:18.

[Chart: API spend over incident window (hourly) · total $3,218]
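
As a rough cross-check on the curve, the headline totals can be turned into average rates. This is plain arithmetic on the figures reported in this post; the variable names are illustrative only, and the averages smooth over the warm-up and plateau phases.

```python
# Headline figures from the incident summary.
total_spend_usd = 3218.00
total_calls = 24_847
duration_min = 9 * 60 + 14  # 9h 14m

cost_per_call = total_spend_usd / total_calls    # average dollars per API call
calls_per_min = total_calls / duration_min       # average call rate
spend_per_hour = total_spend_usd / duration_min * 60

print(f"${cost_per_call:.3f} per call")
print(f"{calls_per_min:.1f} calls per minute")
print(f"${spend_per_hour:.0f} per hour on average")
```

The average works out to roughly $0.13 per call and a little under $350 per hour. That sits below the overnight plateau, which is what one would expect if the first hours ran slower than the rest.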

Three missing safeguards. All three required.

This was not a rare edge case. The trigger, an empty 200 response from a rate-limited scraper, happens on roughly 3% of scraper invocations. The loop was a foreseeable outcome of a foreseeable input, with no defense in depth. The post-mortem identified five contributing factors. Three were independent safeguards that, had any one of them been in place, would have contained the incident.

#	factor	severity
01	No confidence budget, the primary root cause. The agent had no mechanism to detect that its progress was stalling. Each retry was indistinguishable from productive work at the infrastructure level: a call was made, a result was received. A confidence budget would have measured the novelty of each observation against the session context and detected within 8–10 iterations that the agent was receiving no new information. It would have triggered graceful fallback rather than continuing to spend. See Building a planner that knows when to give up.	Root cause
02	Step cap was per-task, not per-workflow. The existing step cap was a correct idea, incorrectly scoped. Capping steps per sub-task while allowing unlimited sub-task spawning is equivalent to no cap: any agent that can recurse will route around it. The cap needed to be at the workflow level, counting all steps across all sub-tasks in a single invocation tree. This is a design error, not a configuration error. See Why we stopped writing custom orchestrators.	Root cause
03	No cost alerting, the detection failure. The incident ran for 9 hours and 14 minutes undetected. The provider dashboard updated hourly and was not monitored outside business hours. No alert existed for spend-rate anomalies, cumulative session spend, or calls-per-minute above threshold. Had a $50 per-session cost alert existed, the incident would have been caught within 33 minutes and cost under $100. See $0.0004 per agent step: how we made cost a first-class metric.	Root cause
04	No regression test for loop behavior. The eval harness at the time tested output quality on happy-path cases. It did not include a test that verified the agent terminated correctly on ambiguous tool responses, the exact input that triggered this incident. A single test case ("empty 200 response from tool, agent should exit gracefully within N steps") would have caught the missing termination condition at deploy time. See The 6-line eval suite we ship with every agent.	Contributing
05	On-call rotation not subscribed to spend anomalies. Even without automated alerting, the provider dashboard showed anomalous spend from 23:00 onward. The on-call engineer was not subscribed to dashboard notifications and did not check the dashboard during the night. The monitoring gap was organizational, not technical, which is why the technical fix (automated cost alerts) was necessary but not sufficient.	Missed detection
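
Factor 01 describes the confidence budget only in prose. The sketch below is one simplified way such a check could work, not the implementation from the planner post. It assumes novelty can be approximated by exact-match hashing of observations, and the `ConfidenceBudget` class, its `patience` parameter, and the `observe` method are all hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ConfidenceBudget:
    patience: int = 8  # assumed: consecutive stale observations tolerated
    seen: set[str] = field(default_factory=set)
    stale_streak: int = 0

    def observe(self, observation: str) -> bool:
        """Record an observation; return True while the agent should continue."""
        digest = hashlib.sha256(observation.strip().encode()).hexdigest()
        if digest in self.seen:
            self.stale_streak += 1  # nothing new was learned
        else:
            self.seen.add(digest)
            self.stale_streak = 0   # new information resets the streak
        return self.stale_streak < self.patience


confidence = ConfidenceBudget()
iterations = 0
while confidence.observe(""):  # the scraper keeps returning an empty body
    iterations += 1
print(iterations)  # iterations allowed before fallback triggers
```

With these assumed settings the loop is cut off on the ninth identical observation. A production version would likely need a softer similarity measure than an exact hash, since real tool output can vary in irrelevant ways.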

The most expensive line in this incident report is not the $3,218. It's that the loop was detectable from minute 8 and ran for 9 hours. Every safeguard we've built since then is designed to shrink that window, ideally to zero, practically to under 5 minutes.
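
The 33-minute figure in factor 03 follows from dividing the alert threshold by the burn rate. The helper below is an illustrative sketch with a hypothetical name, `minutes_to_threshold`; it uses the early burn rate of about $1.50/min from the timeline and assumes that rate stays constant.

```python
def minutes_to_threshold(threshold_usd: float, burn_usd_per_min: float) -> float:
    """Minutes until cumulative spend reaches the threshold at a constant burn rate."""
    if burn_usd_per_min <= 0:
        raise ValueError("burn rate must be positive")
    return threshold_usd / burn_usd_per_min


early_burn = 1.50  # dollars per minute, from the 22:07-22:19 timeline row

fifty_dollar_alert = minutes_to_threshold(50.0, early_burn)
ten_dollar_alert = minutes_to_threshold(10.0, early_burn)

print(round(fifty_dollar_alert, 1))  # 33.3
print(round(ten_dollar_alert, 1))    # 6.7
```

At the same burn rate, the $10 per-session threshold adopted in the remediation would trip in under 7 minutes, before paging latency is added.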

What the code looked like. Before and after.

The missing termination condition was three lines: a workflow-level step counter and a check before each sub-task spawn. The cost alert was a single CloudWatch rule. Neither required architectural changes. Both required the incident to make them feel necessary.

research_agent.py · before / after

from dataclasses import dataclass

# ResearchAgent is the existing agent class, defined elsewhere in the codebase.


class WorkflowBudgetExceeded(RuntimeError):
    """Raised when a workflow exhausts its shared step or spend budget."""


# ── BEFORE: sub-task spawn with no workflow-level guard ──────────
def spawn_subtask(task: str, context: dict) -> dict:
    # No check on total workflow steps. This is the gap.
    sub_agent = ResearchAgent(max_steps=20)  # per-task cap only
    return sub_agent.run(task, context)

# ── AFTER: workflow-level budget enforced across all sub-tasks ────
@dataclass
class WorkflowBudget:
    max_steps:     int   = 100   # across ALL sub-tasks
    max_spend_usd: float = 5.00  # hard dollar cap
    steps_used:    int   = 0
    spend_usd:     float = 0.0

    def check(self) -> None:
        if self.steps_used >= self.max_steps:
            raise WorkflowBudgetExceeded(f"Step limit {self.max_steps} reached")
        if self.spend_usd >= self.max_spend_usd:
            raise WorkflowBudgetExceeded(f"Spend limit ${self.max_spend_usd:.2f} reached")

    def record_step(self, cost: float) -> None:
        # Charge one step and its dollar cost to the shared pool.
        self.steps_used += 1
        self.spend_usd += cost

def spawn_subtask(task: str, context: dict, budget: WorkflowBudget) -> dict:
    budget.check()                           # enforced before spawn
    sub_agent = ResearchAgent(
        max_steps=20,
        budget=budget,                        # shared budget object
        on_step=budget.record_step,           # counts steps and spend
    )
    return sub_agent.run(task, context)

# The confidence budget (the planner post) sits above this. It
# detects stalling before the dollar cap is reached. The dollar
# cap is the last line of defense, not the first.

The WorkflowBudget is passed down through every sub-task spawn. Sub-tasks don't get a fresh budget; they draw from the same pool. The dollar cap is set conservatively low for the first deploy of any new workflow and raised as confidence in the workflow's expected spend range grows. The confidence budget from the planner post sits upstream and typically catches runaway loops before they hit the dollar cap; the dollar cap is the last-resort termination.
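
To show the shared pool in action without the real agent, the following self-contained sketch replays the retry pattern against a budget. It is an illustration under assumptions: the `record_step` helper and the stub `spawn_subtask` loop are hypothetical stand-ins for `ResearchAgent`, and the $0.13 step cost is the incident's approximate average cost per call.

```python
from dataclasses import dataclass


class WorkflowBudgetExceeded(RuntimeError):
    """Raised when the shared step or spend budget is exhausted."""


@dataclass
class WorkflowBudget:
    max_steps: int = 100
    max_spend_usd: float = 5.00
    steps_used: int = 0
    spend_usd: float = 0.0

    def check(self) -> None:
        if self.steps_used >= self.max_steps:
            raise WorkflowBudgetExceeded(f"Step limit {self.max_steps} reached")
        if self.spend_usd >= self.max_spend_usd:
            raise WorkflowBudgetExceeded(f"Spend limit ${self.max_spend_usd:.2f} reached")

    def record_step(self, cost_usd: float) -> None:
        """Hypothetical helper: charge one step to the shared pool, then re-check."""
        self.steps_used += 1
        self.spend_usd += cost_usd
        self.check()


def spawn_subtask(budget: WorkflowBudget, per_task_steps: int = 20,
                  cost_per_step: float = 0.13) -> None:
    """Stub agent: burns its per-task steps, fails, and retries via a new sub-task."""
    budget.check()  # enforced before every spawn
    for _ in range(per_task_steps):
        budget.record_step(cost_per_step)
    spawn_subtask(budget, per_task_steps, cost_per_step)  # same budget, not a fresh one


budget = WorkflowBudget()
stop_reason = ""
try:
    spawn_subtask(budget)
except WorkflowBudgetExceeded as exc:
    stop_reason = str(exc)

print(budget.steps_used, stop_reason)
```

With the default $5 cap, the second sub-task is stopped partway through by the spend limit rather than completing and spawning a third. The same retry pattern that ran all night is contained within a few dozen steps.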

What changed. All of it.

action	owner	status
Ship confidence budget to all agent workflows. The confidence budget primitive deployed to production. Catches novelty-stalling loops before they consume meaningful budget. p99 latency dropped 38% as a side effect.	Infra	Done
Replace per-task step cap with workflow-level budget. `WorkflowBudget` shipped. All sub-task spawns require a shared budget reference. Default max_spend_usd: $5 for new workflows, raised per-workflow after review.	Infra	Done
Per-session cost alerting, pages in under 5 minutes. Cost trace from the cost-as-a-metric work deployed. CloudWatch alert fires when any session exceeds $10. On-call paged within 4 minutes. Tested weekly.	Platform	Done
Add loop-behavior test cases to eval harness. Three new test cases added to the eval suite: empty-200 tool response, rate-limit response, timeout response. All must terminate within 5 steps via graceful fallback.	QA	Done
Migrate remaining workflows to LangGraph with shared budget. 7 of 11 custom-runtime workflows migrated to LangGraph, which enforces workflow-level state including budget. 4 remaining are on the Q3 migration list.	Infra	In progress
Spend anomaly runbook, on-call response procedure. Runbook drafted. Covers: confirm the alert, identify the workflow, terminate it, preserve logs for post-mortem, check for downstream effects. Not yet in rotation training.	On-call	Open
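
The loop-behavior test cases in the action list could look something like the sketch below. It is not the eval suite itself: `ToolResponse`, `RunResult`, `run_agent`, and `check_loop_behavior` are hypothetical names, and `run_agent` is a toy stand-in for the agent under test that gives up after a fixed number of attempts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolResponse:
    status: int
    body: str = ""
    timed_out: bool = False


@dataclass
class RunResult:
    steps: int
    outcome: str  # "ok" or "fallback"


def run_agent(tool: Callable[[], ToolResponse], max_attempts: int = 3) -> RunResult:
    """Toy stand-in for the agent under test: retry a few times, then fall back."""
    for step in range(1, max_attempts + 1):
        resp = tool()
        if resp.status == 200 and resp.body and not resp.timed_out:
            return RunResult(steps=step, outcome="ok")
    return RunResult(steps=max_attempts, outcome="fallback")


# The three ambiguous tool responses named in the action list.
AMBIGUOUS_CASES: dict[str, Callable[[], ToolResponse]] = {
    "empty_200": lambda: ToolResponse(status=200, body=""),
    "rate_limited": lambda: ToolResponse(status=429),
    "timeout": lambda: ToolResponse(status=0, timed_out=True),
}


def check_loop_behavior(max_steps: int = 5) -> dict[str, RunResult]:
    """Each ambiguous case must end in graceful fallback within max_steps."""
    results = {}
    for name, tool in AMBIGUOUS_CASES.items():
        result = run_agent(tool)
        assert result.outcome == "fallback", f"{name}: did not fall back"
        assert result.steps <= max_steps, f"{name}: took {result.steps} steps"
        results[name] = result
    return results


results = check_loop_behavior()
print({name: r.steps for name, r in results.items()})
```

The useful property is that the test fails at deploy time if the agent keeps retrying, which is exactly the behavior no happy-path quality check would exercise.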

Three things this incident changed permanently

Defense in depth is not optional for agentic loops. Any single safeguard can be routed around: the step cap was. The confidence budget, the workflow-level spend cap, and the real-time cost alert are three independent defenses. If one is missing, the others can still contain an incident. If all three are missing, you're writing a post-mortem.

Invisible success is the hardest failure mode to detect. The loop made 24,847 successful API calls. Every call returned 200. No error was thrown. The monitoring stack saw a healthy system. The only anomalous signal was spend, which nobody was watching. Success metrics don't catch loops; they confirm them.
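
Since spend was the only signal that moved, one option is to watch it in-process instead of relying on a provider dashboard. The sketch below is a generic, assumption-laden illustration and not the CloudWatch setup described above: `SessionSpendMonitor` and its `on_alert` callback are hypothetical, and the $10 threshold mirrors the per-session alert from the action list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionSpendMonitor:
    threshold_usd: float
    on_alert: Callable[[str, float], None]  # e.g. a pager hook (assumed)
    spend_by_session: dict[str, float] = field(default_factory=dict)
    alerted: set[str] = field(default_factory=set)

    def record(self, session_id: str, cost_usd: float) -> None:
        """Add a call's cost to its session and alert once on crossing the threshold."""
        total = self.spend_by_session.get(session_id, 0.0) + cost_usd
        self.spend_by_session[session_id] = total
        if total > self.threshold_usd and session_id not in self.alerted:
            self.alerted.add(session_id)
            self.on_alert(session_id, total)


alerts: list[tuple[str, float]] = []
monitor = SessionSpendMonitor(
    threshold_usd=10.0,
    on_alert=lambda session, total: alerts.append((session, total)),
)

# Every call "succeeds", so only the running cost reveals the loop.
for _ in range(100):
    monitor.record("nightly-competitive-analysis", 0.13)

print(len(alerts), alerts[0][0])
```

The monitor fires once per session rather than on every call past the threshold, which keeps a runaway loop from also becoming a paging storm.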

The safeguards in this series are listed in order of incident prevention, not implementation complexity. The confidence budget was written before this incident happened but hadn't been deployed to all workflows. The eval harness existed but didn't cover the failure mode. The cost tracer didn't exist yet. All three posts were written, at least in part, because of this night.

↳ read these in order

Referenced throughout this post. Building a planner that knows when to give up is the primary safeguard this incident was missing. The 6-line eval suite now requires loop-behavior test cases. Why we stopped writing custom orchestrators covers the step-cap architecture flaw this exposed. $0.0004 per agent step is the alerting stack that would have caught this in 33 minutes.

The $3,218 was recoverable. What wasn't recoverable was the assumption that the safeguards we hadn't shipped yet were optional. After this incident, every item in the series became a requirement, not a recommendation.

If you would like us to pressure-test the safeguards on your own agent stack, the contact form is the fastest way. We do 30-minute reviews for production agent stacks, free.

· end · tx 020 ·

Jean Pierre Levac

Founder of Acceleratech, the AI and workflow automation services arm of JPL Digital Growth Group. Writes and edits the field notes published here.

