For months our on-call rotation had a recurring villain: the agent that just wouldn't stop. It would re-plan, re-query, re-ask itself the same sub-question in slightly different words, and rack up latency until a timeout finally killed it. The trace would land in our dashboards looking like a bad EKG: a flat line of retries followed by a cliff.
We blamed prompt quality. Then tool reliability. Then the model itself. All of those things were real contributors, but none of them were the root cause. The root cause was architectural: our planner had no principled way to stop trying.
The loop problem
A planning loop happens when an agent's belief about the world doesn't converge. It issues an action, observes a result that it finds ambiguous, updates its internal state in a way that re-triggers the same reasoning branch, and repeats. From the outside it looks like spinning. From the inside, if you could call it that, it looks like diligence.
Standard mitigations (max step counts, timeouts, deduplication of tool calls) all treat the symptom. They cut the loop after it's already running. What we wanted was a mechanism that would prevent the loop from forming in the first place, or at least surface a clean degraded response when confidence genuinely wasn't recoverable.
Introducing the confidence budget
The confidence budget is a first-class value in our planner's execution context. Think of it as a currency: every reasoning step spends some budget, and every piece of new, non-redundant information earns some back. When the balance is effectively zero (we treat anything at or below 5% as depleted), the planner doesn't retry. It escalates to a graceful fallback.
The spend/earn asymmetry is the key design choice. Redundant observations (tool responses that return content semantically equivalent to something already in context) earn zero budget back. Novel, high-information observations earn budget in proportion to how much they reduce uncertainty. Steps that merely reformat or re-examine existing context spend at the normal rate without earning anything.
Implementation sketch
The primitive itself is small. About 30 lines of Python sit inside our planner package, wired into the execution context that already threads through every step. Two pieces of state, three methods, one decision point in Planner.step.
```python
from dataclasses import dataclass, field


@dataclass
class ConfidenceBudget:
    initial: float = 1.0
    # Not a constructor argument: starts at `initial` (see __post_init__),
    # so a non-default starting balance is honored.
    balance: float = field(init=False)
    step_cost: float = 0.08     # flat spend per planning step
    novelty_gain: float = 0.12  # max earn per novel observation

    def __post_init__(self) -> None:
        self.balance = self.initial

    def spend(self) -> None:
        self.balance = max(0.0, self.balance - self.step_cost)

    def earn(self, novelty_score: float) -> None:
        # novelty_score: 0.0 = fully redundant, 1.0 = fully novel
        gain = novelty_score * self.novelty_gain
        self.balance = min(self.initial, self.balance + gain)

    def is_depleted(self) -> bool:
        return self.balance <= 0.05  # 5% threshold, graceful exit


class Planner:
    # PlanContext, PlanResult, self.budget, and the helpers used below
    # are defined elsewhere in the planner package.
    def step(self, ctx: PlanContext) -> PlanResult:
        self.budget.spend()
        if self.budget.is_depleted():
            return self.best_effort_reply(ctx)
        observation = self.execute_action(ctx)
        novelty = self.novelty_scorer.score(observation, ctx.seen)
        self.budget.earn(novelty)
        return PlanResult(observation=observation, budget=self.budget)
```
The novelty scorer deserves its own post. We use a lightweight embedding comparison against a rolling window of recent observations, with a similarity threshold tuned to our specific tool surface. The exact numbers above (0.08 spend, 0.12 earn) were empirically derived and will be different for your workload. The ratio matters more than the absolute values.
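To make the ratio point concrete: with a 0.08 spend and a 0.12 maximum earn, a step only breaks even if its novelty score is at least 0.08 / 0.12, or roughly 0.67. Anything less novel than that drains the budget over time. The sketch below simulates the same spend, check, earn cycle at a constant novelty score. The helper names are ours for illustration and are not part of the planner.

```python
from typing import Optional

STEP_COST = 0.08     # flat spend per step
NOVELTY_GAIN = 0.12  # max earn per observation
THRESHOLD = 0.05     # depletion threshold
INITIAL = 1.0        # starting balance


def break_even_novelty(step_cost: float = STEP_COST,
                       novelty_gain: float = NOVELTY_GAIN) -> float:
    """Novelty score at which a step earns back exactly what it spent."""
    return step_cost / novelty_gain


def steps_until_depleted(novelty: float, max_steps: int = 1000) -> Optional[int]:
    """Return the step on which the budget is found depleted, or None
    if it survives max_steps at this constant novelty score."""
    balance = INITIAL
    for step in range(1, max_steps + 1):
        balance = max(0.0, balance - STEP_COST)   # spend
        if balance <= THRESHOLD:                  # depletion check
            return step
        balance = min(INITIAL, balance + novelty * NOVELTY_GAIN)  # earn
    return None


print(round(break_even_novelty(), 3))
print(steps_until_depleted(0.0))
print(steps_until_depleted(1.0))
```

With fully redundant observations, a fresh budget triggers the fallback on step 12. With fully novel observations it never depletes. Shifting the ratio moves the break-even point, which is why the ratio deserves more tuning attention than either number alone.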
What "graceful fallback" actually means
This is the part that took us the longest to get right. A planner that just crashes on budget depletion is no better than a timeout. The fallback needs to be a genuine response: incomplete, hedged, and clearly partial, but still useful. We settled on three modes.
- Confident partial. The planner knows something. It reports what it found, explicitly names what it couldn't verify, and stops. This is the most common fallback and often good enough for the user.
- Structured uncertainty. The planner has competing hypotheses but no way to resolve them. It surfaces each of them with a rough confidence weighting. Downstream systems can use this as a signal rather than a dead end.
- Hard punt. The planner genuinely has nothing. It returns a structured error with the partial trace, which either escalates to a human or triggers a different planning strategy entirely. We keep this rare, below 2% of invocations in production.
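One way to make the three modes explicit is to choose between them based on what the planner is holding when the budget runs out. The sketch below is illustrative only: `FallbackMode`, `Hypothesis`, `choose_fallback`, and `normalized_weights` are hypothetical names, and the selection rule (nothing means hard punt, one finding means confident partial, several means structured uncertainty) is a simplification of the judgment a real planner would need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FallbackMode(Enum):
    CONFIDENT_PARTIAL = "confident_partial"
    STRUCTURED_UNCERTAINTY = "structured_uncertainty"
    HARD_PUNT = "hard_punt"


@dataclass
class Hypothesis:
    summary: str
    weight: float  # rough, unnormalized confidence


def choose_fallback(hypotheses: List[Hypothesis]) -> FallbackMode:
    """Pick a fallback mode from the hypotheses held at depletion time."""
    if not hypotheses:
        return FallbackMode.HARD_PUNT
    if len(hypotheses) == 1:
        return FallbackMode.CONFIDENT_PARTIAL
    return FallbackMode.STRUCTURED_UNCERTAINTY


def normalized_weights(hypotheses: List[Hypothesis]) -> List[float]:
    """Scale weights to sum to 1; fall back to equal weights if all are zero."""
    if not hypotheses:
        return []
    total = sum(h.weight for h in hypotheses)
    if total <= 0:
        return [1.0 / len(hypotheses)] * len(hypotheses)
    return [h.weight / total for h in hypotheses]
```

Keeping the mode as an explicit enum value, rather than implying it from the shape of the reply, gives downstream systems something stable to branch on.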
Results
| metric | before | after | signal |
|---|---|---|---|
| p50 latency | 1.4s | 1.2s | −14% |
| p99 latency | 42s | 26s | −38% |
| timeout rate | 3.1% | 0.28% | −91% |
| partial-reply share | 0% | 6.4% | net new |
| hard-punt rate | n/a | 1.8% | target < 2% |
The p99 improvement surprised us. We expected p50 to be the main beneficiary. In practice, p99 improved most because runaway loops were concentrated at the tail. Once those resolved quickly into fallbacks, the entire tail of the latency distribution compressed.
The user-satisfaction number on partial replies deserves context: we were comparing against prior behavior where a depleted planner returned a generic error. A well-framed partial response turns out to be considerably more useful than "something went wrong." Users can act on partial information. They can't act on a 500.
What we'd do differently
We initialized budgets uniformly. In hindsight, query complexity should inform the starting balance. A simple lookup gets a modest budget; an open-ended research task should start with more runway before the novelty-spend curve kicks in. We're building this now and expect it to push satisfaction scores another few points.
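Since this is still being built, the following is only a guess at its shape, not a description of the finished feature. It assumes some upstream classifier already labels each query with a complexity tier. The tier names and starting balances are made up for illustration, and `initial_balance` and `redundant_step_runway` are hypothetical helpers.

```python
# Hypothetical tiers and values, for illustration only; real ones need tuning.
INITIAL_BALANCE_BY_COMPLEXITY = {
    "lookup": 0.5,
    "multi_step": 1.0,
    "research": 2.0,
}


def initial_balance(complexity: str, default: float = 1.0) -> float:
    """Starting balance for a query tier; unknown tiers get the default."""
    return INITIAL_BALANCE_BY_COMPLEXITY.get(complexity, default)


def redundant_step_runway(initial: float,
                          step_cost: float = 0.08,
                          threshold: float = 0.05) -> int:
    """Count the steps until depletion if no step earns anything back."""
    if step_cost <= 0:
        raise ValueError("step_cost must be positive")
    balance, steps = initial, 0
    while balance > threshold:
        balance = max(0.0, balance - step_cost)
        steps += 1
    return steps


for tier in ("lookup", "multi_step", "research"):
    print(tier, redundant_step_runway(initial_balance(tier)))
```

Under these assumed values, a lookup falls back after 6 unproductive steps, while a research task gets 25. The runway scales with the starting balance without touching the spend/earn ratio.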
We also didn't account for tool-specific novelty curves. Some tools are inherently noisier: their responses vary in surface form while encoding the same information. Until we added per-tool calibration, the novelty scorer treated that surface variation as new information and mis-scored those tools. If you're implementing this, budget time for tool-specific tuning.
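Per-tool calibration could look roughly like the sketch below: a rolling-window scorer that compares each observation's embedding against recent ones and applies a per-tool similarity threshold. This is a simplified stand-in, not our production scorer. It takes precomputed embedding vectors, its `score(embedding, tool)` signature differs from the call shown in the implementation sketch, and the class name, default threshold, and linear novelty mapping are all assumptions.

```python
import math
from collections import deque
from typing import Deque, Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RollingNoveltyScorer:
    """Scores novelty against a rolling window of recent embeddings."""

    def __init__(self, window: int = 20, default_threshold: float = 0.9,
                 tool_thresholds: Optional[Dict[str, float]] = None) -> None:
        self.recent: Deque[Tuple[float, ...]] = deque(maxlen=window)
        self.default_threshold = default_threshold
        # Noisier tools get a lower threshold, so responses that differ
        # only in surface form still count as redundant.
        self.tool_thresholds = dict(tool_thresholds or {})

    def score(self, embedding: Sequence[float], tool: str) -> float:
        """Return 0.0 for redundant observations, up to 1.0 for novel ones."""
        threshold = self.tool_thresholds.get(tool, self.default_threshold)
        max_sim = max(
            (cosine_similarity(embedding, seen) for seen in self.recent),
            default=0.0,
        )
        self.recent.append(tuple(embedding))
        if max_sim >= threshold:
            return 0.0
        # Linear ramp: 1.0 at zero similarity, 0.0 at the threshold.
        return min(1.0, max(0.0, 1.0 - max_sim / threshold))


scorer = RollingNoveltyScorer(tool_thresholds={"search": 0.8})
print(scorer.score([1.0, 0.0], tool="search"))  # nothing seen yet
print(scorer.score([1.0, 0.0], tool="search"))  # exact repeat
```

The first call scores 1.0 because the window is empty, and the exact repeat scores 0.0. Lowering the threshold for a noisy tool is what stops near-duplicate responses from being scored as novel.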
The deeper takeaway is that termination behavior is a product decision, not just an engineering one. How your agent fails, or refuses to fail gracefully, is part of the user experience. Build it explicitly.
If you'd like us to look at how your agent terminates, the contact form is the fastest way. We do free 30-minute reviews for production systems.