Building a planner that knows when to give up.

For months our on-call rotation had a recurring villain: the agent that just wouldn't stop. It would re-plan, re-query, re-ask itself the same sub-question in slightly different words, and rack up latency until a timeout finally killed it. The trace would land in our dashboards looking like a bad EKG, a flat line of retries followed by a cliff.

We blamed prompt quality. Then tool reliability. Then the model itself. All of those things were real contributors, but none of them were the root cause. The root cause was architectural: our planner had no principled way to stop trying.

↳ tl;dr Termination is a primitive, not a backstop. Wrap every planning step in a confidence budget that spends on reasoning and earns on novel observations. When the balance runs out, escalate to a graceful, partial response. Not a retry. Not a timeout. In our production workload, p99 latency dropped 38% and runaway loops dropped 91%.

The loop problem

A planning loop happens when an agent's belief about the world doesn't converge. It issues an action, observes a result that it finds ambiguous, updates its internal state in a way that re-triggers the same reasoning branch, and repeats. From the outside it looks like spinning. From the inside, if you could call it that, it looks like diligence.

Diligence without a termination condition is just an elegant way to run out of time.

Standard mitigations (max step counts, timeouts, deduplication of tool calls) all treat the symptom. They cut the loop after it's already running. What we wanted was a mechanism that would prevent the loop from forming in the first place, or at least surface a clean degraded response when confidence genuinely wasn't recoverable.

Introducing the confidence budget

The confidence budget is a first-class value in our planner's execution context. Think of it as a currency: every reasoning step spends some budget, and every piece of new, non-redundant information earns some back. When the balance is depleted (in practice, when it falls to a small floor just above zero), the planner doesn't retry. It escalates to a graceful fallback.

fig · 01 · confidence-budget state machine (spend / earn / depleted). The planner step is a state machine over a single scalar: balance. Every step subtracts a flat cost; every novel observation adds budget proportional to its information content. Depletion is a clean escape hatch, not an error.

The spend/earn asymmetry is the key design choice. Redundant observations (tool responses that return content semantically equivalent to something already in context) earn zero budget back. Novel, high-information observations earn budget in proportion to how much they reduce uncertainty. Steps that merely reformat or re-examine existing context spend at the normal rate without earning anything.
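To make the asymmetry concrete, here is a small simulation of how long a budget lasts when every observation carries the same novelty score. It is a sketch under stated assumptions, not our production code: the constants mirror the values used in the implementation sketch later in this post, and the helper name `steps_until_depleted` is made up for illustration.

```python
from typing import Optional

STEP_COST = 0.08     # flat spend per planning step (illustrative)
NOVELTY_GAIN = 0.12  # maximum earn per observation (illustrative)
FLOOR = 0.05         # balance at or below this counts as depleted (illustrative)


def steps_until_depleted(novelty: float, initial: float = 1.0,
                         max_steps: int = 1000) -> Optional[int]:
    """Return the step on which the budget depletes, or None if it survives
    max_steps, assuming every observation has the same novelty score."""
    balance = initial
    for step in range(1, max_steps + 1):
        balance = max(0.0, balance - STEP_COST)  # spend first
        if balance <= FLOOR:
            return step                          # depleted: fall back here
        # Earn in proportion to novelty, capped at the starting balance.
        balance = min(initial, balance + novelty * NOVELTY_GAIN)
    return None


if __name__ == "__main__":
    for novelty in (0.0, 0.5, 1.0):
        print(f"novelty={novelty}: depleted at step {steps_until_depleted(novelty)}")
```

Under these assumptions a planner that learns nothing new stops after a dozen steps, one that keeps finding half-novel observations gets several times that runway, and one that keeps finding fully novel observations never depletes.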

Implementation sketch

The primitive itself is small. About 30 lines of Python sit inside our planner package, wired into the execution context that already threads through every step. A handful of fields, three public methods, one decision point in Planner.step.

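planner/budget.py · python 3.11

from __future__ import annotations  # annotations stay unevaluated, so the sketch imports cleanly

from dataclasses import dataclass, field


@dataclass
class ConfidenceBudget:
    initial: float = 1.0
    step_cost: float = 0.08       # flat spend per planning step
    novelty_gain: float = 0.12    # max earn per novel observation
    balance: float = field(init=False)  # current balance, set in __post_init__

    def __post_init__(self) -> None:
        # Start from `initial` so a non-default starting budget is honored.
        self.balance = self.initial

    def spend(self) -> None:
        # Charge the flat per-step cost; the balance never goes negative.
        self.balance = max(0.0, self.balance - self.step_cost)

    def earn(self, novelty_score: float) -> None:
        # novelty_score: 0.0 = fully redundant, 1.0 = fully novel
        gain = novelty_score * self.novelty_gain
        # Earnings are capped at the starting balance.
        self.balance = min(self.initial, self.balance + gain)

    def is_depleted(self) -> bool:
        return self.balance <= 0.05   # 5% of the default budget, graceful exit


# PlanContext, PlanResult, and the Planner members used below (budget,
# novelty_scorer, execute_action, best_effort_reply) are assumed to be
# defined elsewhere in the planner package.
class Planner:
    def step(self, ctx: PlanContext) -> PlanResult:
        self.budget.spend()
        if self.budget.is_depleted():
            # Out of budget: return a partial answer instead of re-planning.
            return self.best_effort_reply(ctx)

        observation = self.execute_action(ctx)
        novelty = self.novelty_scorer.score(observation, ctx.seen)
        self.budget.earn(novelty)
        return PlanResult(observation=observation, budget=self.budget)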

The novelty scorer deserves its own post. We use a lightweight embedding comparison against a rolling window of recent observations, with a similarity threshold tuned to our specific tool surface. The exact numbers above (0.08 spend, 0.12 earn) were empirically derived and will be different for your workload. The ratio matters more than the absolute values: with these numbers, a step needs a novelty score of about 0.67 (0.08 / 0.12) just to break even.
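For readers who want a starting point before that post exists, here is one way the rolling-window comparison could look. Treat it as a hedged sketch rather than our scorer: the class name `RollingNoveltyScorer`, the window size, and the 0.9 similarity cutoff are all assumptions, it takes precomputed embedding vectors instead of raw observations, and its `score` signature is simpler than the one called in Planner.step.

```python
import math
from collections import deque
from typing import Deque, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RollingNoveltyScorer:
    """Hypothetical scorer: novelty is 1 minus the highest similarity to any
    observation in a rolling window, with near-duplicates forced to zero."""

    def __init__(self, window: int = 8, redundant_above: float = 0.9) -> None:
        self.recent: Deque[Sequence[float]] = deque(maxlen=window)
        self.redundant_above = redundant_above

    def score(self, embedding: Sequence[float]) -> float:
        # Closest match among recent observations (0.0 when the window is empty).
        closest = max(
            (cosine_similarity(embedding, seen) for seen in self.recent),
            default=0.0,
        )
        self.recent.append(embedding)  # oldest entry drops out automatically
        if closest >= self.redundant_above:
            return 0.0                 # semantically redundant: earns nothing
        return min(1.0, max(0.0, 1.0 - closest))
```

The hard cutoff is what produces the "redundant observations earn zero" behavior; below the cutoff the score degrades smoothly, so partially overlapping observations still earn a little.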

What "graceful fallback" actually means

This is the part that took us the longest to get right. A planner that just crashes on budget depletion is no better than a timeout. The fallback needs to be a genuine response: incomplete, hedged, and clearly partial, but useful. We settled on three modes.

Confident partial. The planner knows something. It reports what it found, explicitly names what it couldn't verify, and stops. This is the most common fallback and often good enough for the user.
Structured uncertainty. The planner has competing hypotheses but no way to resolve them. It surfaces each one with a rough confidence weighting. Downstream systems can use this as a signal rather than a dead end.
Hard punt. The planner genuinely has nothing. It returns a structured error with the partial trace, which either escalates to a human or triggers a different planning strategy entirely. We keep this rare, below 2% of invocations in production.
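The three modes above can be sketched as a small result type plus a selector. Everything here is illustrative: the names `FallbackMode`, `FallbackReply`, and `choose_fallback` are hypothetical, and the precedence (competing hypotheses first, then findings, then punt) is an assumption rather than a description of our production logic.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class FallbackMode(Enum):
    CONFIDENT_PARTIAL = "confident_partial"
    STRUCTURED_UNCERTAINTY = "structured_uncertainty"
    HARD_PUNT = "hard_punt"


@dataclass
class FallbackReply:
    mode: FallbackMode
    findings: List[str] = field(default_factory=list)    # what was established
    unverified: List[str] = field(default_factory=list)  # what could not be checked
    hypotheses: List[Tuple[str, float]] = field(default_factory=list)  # (claim, weight)
    trace: List[str] = field(default_factory=list)       # partial trace for escalation


def choose_fallback(
    findings: List[str],
    unverified: List[str],
    hypotheses: List[Tuple[str, float]],
    trace: List[str],
) -> FallbackReply:
    """Pick a fallback mode from whatever the planner has when the budget runs out."""
    if len(hypotheses) >= 2:
        # Competing explanations: normalize the rough weights so they sum to 1.
        total = sum(weight for _, weight in hypotheses) or 1.0
        weighted = [(claim, weight / total) for claim, weight in hypotheses]
        return FallbackReply(FallbackMode.STRUCTURED_UNCERTAINTY,
                             findings, unverified, weighted, trace)
    if findings:
        # Something is known: report it and name what is still unverified.
        return FallbackReply(FallbackMode.CONFIDENT_PARTIAL,
                             findings, unverified, trace=trace)
    # Nothing usable: hand the partial trace to a human or another strategy.
    return FallbackReply(FallbackMode.HARD_PUNT, unverified=unverified, trace=trace)
```

Keeping the mode explicit in the reply lets downstream systems branch on it, for example routing hard punts to escalation while rendering partials directly to the user.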

Results

p99 latency: −38% (vs. same workload, prior planner)
runaway loops: −91% (timeouts on infinite re-plan paths)
csat on partials: +4pt (vs. prior generic-error fallback)

metric	before	after	signal
p50 latency	1.4s	1.2s	−14%
p99 latency	42s	26s	−38%
timeout rate	3.1%	0.28%	−91%
partial-reply share	0%	6.4%	net new
hard-punt rate	n/a	1.8%	target < 2%

The p99 improvement surprised us. We expected p50 to be the main beneficiary. In practice, p99 improved most because runaway loops were concentrated at the tail. Once those resolved quickly into fallbacks, the entire tail of the latency distribution compressed.

The user-satisfaction number on partial replies deserves context: we were comparing against prior behavior where a depleted planner returned a generic error. A well-framed partial response turns out to be considerably more useful than "something went wrong." Users can act on partial information. They can't act on a 500.

What we'd do differently

We initialized budgets uniformly. In hindsight, query complexity should inform the starting balance. A simple lookup gets a modest budget; an open-ended research task should start with more runway before the novelty-spend curve kicks in. We're building this now and expect it to push satisfaction scores another few points.
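As a rough illustration of what complexity-aware initialization might look like, the sketch below maps a coarse complexity tier to a starting balance and shows how much zero-novelty runway each tier buys. The tiers, the balances, and both function names are hypothetical; the step cost and floor reuse the values from the implementation sketch.

```python
import math
from enum import Enum


class QueryComplexity(Enum):
    LOOKUP = "lookup"          # simple fact retrieval
    MULTI_STEP = "multi_step"  # a few dependent tool calls
    OPEN_ENDED = "open_ended"  # research-style tasks


# Hypothetical starting balances; real values would come from tuning.
INITIAL_BALANCE = {
    QueryComplexity.LOOKUP: 0.6,
    QueryComplexity.MULTI_STEP: 1.0,
    QueryComplexity.OPEN_ENDED: 1.6,
}


def initial_balance_for(complexity: QueryComplexity) -> float:
    return INITIAL_BALANCE[complexity]


def max_redundant_steps(initial: float, step_cost: float = 0.08,
                        floor: float = 0.05) -> int:
    """Steps until depletion if no observation ever earns budget back:
    the smallest n with initial - n * step_cost <= floor."""
    return max(0, math.ceil((initial - floor) / step_cost))


if __name__ == "__main__":
    for tier in QueryComplexity:
        start = initial_balance_for(tier)
        print(tier.value, start, max_redundant_steps(start))
```

The runway figure is a useful sanity check when picking balances: it is the worst-case number of steps a fully stuck planner will burn before falling back.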

We also didn't account for tool-specific novelty curves. Some tools are inherently noisier: their responses vary in surface form while encoding the same information. The novelty scorer penalized these tools unfairly until we added per-tool calibration. If you're implementing this, budget time for tool-specific tuning.
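One simple shape per-tool calibration can take is a redundancy threshold looked up by tool name. This is a sketch under assumptions, not our calibration scheme: the tool names and thresholds are invented, `calibrated_novelty` is a hypothetical helper, and which direction each threshold should move is something to measure on your own tool surface.

```python
from typing import Dict

DEFAULT_REDUNDANT_ABOVE = 0.90

# Hypothetical per-tool overrides for the similarity level treated as redundant.
TOOL_REDUNDANT_ABOVE: Dict[str, float] = {
    "web_search": 0.80,
    "sql_query": 0.95,
}


def calibrated_novelty(tool: str, max_similarity: float) -> float:
    """Novelty for an observation, given its highest similarity to recent
    observations and the redundancy threshold configured for its tool."""
    threshold = TOOL_REDUNDANT_ABOVE.get(tool, DEFAULT_REDUNDANT_ABOVE)
    if max_similarity >= threshold:
        return 0.0  # redundant by this tool's standard
    return min(1.0, max(0.0, 1.0 - max_similarity))
```

The same similarity value can therefore count as redundant for one tool and mildly novel for another, which is the whole point of calibrating per tool.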

Knowing when to stop is a capability, not a limitation. The planner that concedes gracefully is more useful than the one that grinds forever.

The deeper takeaway is that termination behavior is a product decision, not just an engineering one. How your agent fails, or refuses to fail gracefully, is part of the user experience. Build it explicitly.

If you'd like us to look at how your agent terminates, the contact form is the fastest way. We do free 30-minute reviews for production systems.

· end · tx 017 ·

Jean Pierre Levac

Founder of Acceleratech, the AI and workflow automation services arm of JPL Digital Growth Group. Writes and edits the field notes published here.

