Most teams know their total API spend. Almost no teams know which agent steps are driving it. The cost appears as a monthly line item from the provider, and it does not decompose into "the document synthesis step on workflow X costs 14× what the status check costs, and it runs on every invocation whether or not the user sees the output." The bill you cannot see is the bill you cannot fix.
This is not a monitoring gap. It is an architecture gap. Token consumption is not measured at the step level because steps are not first-class objects in most agent implementations. Calls happen, tokens accumulate, the bill arrives. The causal chain from "we added a planning step" to "our API spend increased 40%" is invisible without deliberate instrumentation.
The median agent step in our production systems costs $0.0004, a number that sounds trivially small until you multiply it by invocations at scale. At 5,000 daily active workflows, each averaging 11 steps, that is $22 per day in median-cost steps. The outlier steps, roughly 12% of the total, cost about 18× the median. When outlier steps cluster in high-frequency workflows, they compound silently.
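To make the compounding concrete, here is a rough sketch of the arithmetic. It uses the figures quoted above and assumes the 12% refers to the share of steps by count, which the text does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the text
STEP_COST_USD = 0.0004        # typical (median) cost per step
DAILY_WORKFLOWS = 5_000
STEPS_PER_WORKFLOW = 11
OUTLIER_MULTIPLIER = 18       # outliers cost roughly 18x the median

# Assumption: 12% is the share of steps (by count) that are outliers
OUTLIER_SHARE = 0.12

daily_steps = DAILY_WORKFLOWS * STEPS_PER_WORKFLOW

# If every step cost the median amount
baseline_daily = daily_steps * STEP_COST_USD

# Blend: 88% of steps at the median, 12% at 18x the median
blended_step_cost = STEP_COST_USD * (
    (1 - OUTLIER_SHARE) + OUTLIER_SHARE * OUTLIER_MULTIPLIER
)
blended_daily = daily_steps * blended_step_cost

# Fraction of the blended bill that comes from outlier steps
outlier_spend_share = (OUTLIER_SHARE * OUTLIER_MULTIPLIER) / (
    (1 - OUTLIER_SHARE) + OUTLIER_SHARE * OUTLIER_MULTIPLIER
)

print(f"baseline: ${baseline_daily:.2f}/day")
print(f"with outliers: ${blended_daily:.2f}/day")
print(f"outlier share of spend: {outlier_spend_share:.0%}")
```

Under that assumption, the 12% of steps that are outliers account for most of the daily bill, which is why a volume-sorted view hides them.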
The cost tracer. What it measures.
The core object is cost_trace: a context manager that wraps every model call in your agent graph, records token counts and the model used, converts them to a dollar amount at the current rate card, and emits a structured log event. It adds under 1ms overhead per call.
```python
from contextlib import contextmanager
from dataclasses import dataclass
import logging
import time

# Current rates in USD per 1M tokens; update when the provider reprices
RATES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 0.25, "output": 1.25},
}

# Per-step outlier thresholds in USD (5× the historical p50 for each step_name).
# Not defined in the original snippet; populate it from your own trace history.
OUTLIER_THRESHOLDS: dict[str, float] = {}


@dataclass
class StepCost:
    step_name: str
    model: str
    input_tokens: int
    output_tokens: int
    duration_ms: float
    workflow_id: str

    @property
    def dollars(self) -> float:
        r = RATES[self.model]
        return (
            (self.input_tokens / 1_000_000) * r["input"]
            + (self.output_tokens / 1_000_000) * r["output"]
        )

    @property
    def is_outlier(self) -> bool:
        # Flag steps costing > 5× the p50 for their step_name;
        # fall back to $0.005 when no threshold is known for the step
        return self.dollars > OUTLIER_THRESHOLDS.get(self.step_name, 0.005)


@contextmanager
def cost_trace(step_name: str, model: str, workflow_id: str):
    t0 = time.perf_counter()
    usage = {}  # filled by the model call
    try:
        yield usage  # caller populates from response.usage
    finally:
        # Emit the event even if the model call raised (token counts stay 0)
        elapsed = (time.perf_counter() - t0) * 1000
        cost = StepCost(
            step_name=step_name,
            model=model,
            workflow_id=workflow_id,
            input_tokens=usage.get("input_tokens", 0),
            output_tokens=usage.get("output_tokens", 0),
            duration_ms=elapsed,
        )
        logging.info("step_cost", extra={
            "cost_usd": cost.dollars,
            "step": step_name,
            "is_outlier": cost.is_outlier,
            "input_tokens": cost.input_tokens,
            "output_tokens": cost.output_tokens,
            "duration_ms": cost.duration_ms,
            "workflow_id": workflow_id,
        })


# ── Usage in a LangGraph node ─────────────────────────
# AgentState, client, and extract_text come from the surrounding application.
def summarise_node(state: AgentState) -> AgentState:
    with cost_trace("summarise", "claude-sonnet-4-6", state["workflow_id"]) as usage:
        response = client.messages.create(...)
        usage["input_tokens"] = response.usage.input_tokens
        usage["output_tokens"] = response.usage.output_tokens
    return {**state, "summary": extract_text(response.content)}
```
Three things the trace captures that standard API logging misses: the step name (so you can group by workflow node, not just by model), the workflow ID (so you can correlate cost to the specific task being run), and the outlier flag (so your dashboards do not require a SQL join to surface anomalies).
The spend distribution. It is never flat.
After two weeks of tracing across our production agent workflows, the distribution was exactly what we suspected and worse than we hoped. Cost per step follows something close to a Pareto distribution: a small number of step types account for the majority of spend.
One step (doc_synthesis) accounted for 38% of total API spend. It ran on 9% of invocations, which made it look unremarkable in a volume-based view. Only when we sorted by dollars did it surface. The root cause: it was passing the entire document corpus as context on every run, including runs where the synthesis result was never surfaced to the user because a downstream gate short-circuited the workflow.
| Step | Avg tokens (in) | Avg tokens (out) | Avg cost/call | % of volume | % of spend |
|---|---|---|---|---|---|
| doc_synthesis | 42,800 | 1,240 | $0.147 | 9% | 38% |
| planning | 8,400 | 620 | $0.034 | | |
| research | 11,200 | 880 | $0.047 | | |
| summarise | 6,100 | 480 | $0.025 | | |
| status_check | 420 | 80 | $0.0003 | 41% | 3% |
The status_check row is the healthy baseline: 41% of volume, 3% of spend, $0.0003 per call. That is what a well-scoped step looks like. doc_synthesis at $0.147 per call is 490× more expensive, and it was running on 9% of invocations, many of them wasted.
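The per-call figures in the table can be checked against the rate card. The sketch below assumes the four larger steps all ran on Sonnet at $3.00 / $15.00 per 1M tokens (the source only shows this for summarise); the helper name cost_per_call is illustrative.

```python
# Assumption: these steps ran on Sonnet at the rates from the rate card above
SONNET = {"input": 3.00, "output": 15.00}  # USD per 1M tokens


def cost_per_call(input_tokens: int, output_tokens: int, rates=SONNET) -> float:
    """Dollar cost of one call from its token counts."""
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000


# (avg input tokens, avg output tokens) as listed in the table
TABLE = {
    "doc_synthesis": (42_800, 1_240),
    "planning": (8_400, 620),
    "research": (11_200, 880),
    "summarise": (6_100, 480),
}

costs = {step: cost_per_call(*tokens) for step, tokens in TABLE.items()}
for step, usd in costs.items():
    print(f"{step}: ${usd:.4f}")

# Ratio of the most expensive step to the status_check baseline from the table
ratio = 0.147 / 0.0003
print(f"doc_synthesis vs status_check: {ratio:.0f}x")
```

The computed values land within rounding of the table's Avg cost/call column, and almost all of the doc_synthesis cost comes from input tokens rather than output.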
What we changed. And what it cost to not change it.
[Summary figures: monthly spend before optimization and after (first 60 days); no recall regression.]
doc_synthesis was running unconditionally on every invocation of the workflows that include it. A 2-token classifier check before it ("does this workflow actually need document synthesis?") eliminated 61% of its invocations. The classifier costs $0.00004. The gate pays for itself in 4 calls.

The inputs to doc_synthesis were the full document corpus. The step only needed the top-3 retrieved chunks: roughly 3,200 tokens. RAG before synthesis, not during it. Input cost fell by 92% on the surviving invocations. Same output quality; the eval harness confirmed it.

classify and status_check were running on Sonnet. They are structured-output tasks: low creativity, high predictability. Moving them to Haiku reduced their cost by 12× with no measurable quality regression. The eval harness caught no regressions across 500 test cases.

Total monthly reduction: $2,460, or 60% of pre-optimization spend. None of these optimizations required rearchitecting the agent. They required measuring first. The gate, the context truncation, the model tier selection, and the prompt caching were all visible as opportunities only after the cost trace showed us where the money was going.
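The gate and the context trim can be sketched together. This is a minimal illustration, not our production code: gated_synthesis, needs_synthesis, and synthesise are hypothetical names, and the classifier and synthesis calls are passed in as plain callables so the control flow is visible without any model client.

```python
from typing import Callable, Optional, Sequence


def gated_synthesis(
    request: str,
    ranked_chunks: Sequence[str],
    needs_synthesis: Callable[[str], bool],
    synthesise: Callable[[str, Sequence[str]], str],
    top_k: int = 3,
) -> Optional[str]:
    """Run the expensive step only when the cheap gate says it is needed."""
    if not needs_synthesis(request):
        return None  # skipped: no synthesis tokens spent
    # Pass only the top-k retrieved chunks, not the whole corpus
    return synthesise(request, ranked_chunks[:top_k])


# Stand-ins for the classifier and the synthesis model call
calls = []


def fake_gate(request: str) -> bool:
    return "report" in request


def fake_synthesise(request: str, chunks: Sequence[str]) -> str:
    calls.append(len(chunks))  # record how much context was sent
    return " ".join(chunks)


skipped = gated_synthesis("status ping", ["a", "b", "c", "d"], fake_gate, fake_synthesise)
ran = gated_synthesis("build report", ["a", "b", "c", "d"], fake_gate, fake_synthesise)
```

In a real graph the gate would be the cheap classifier call and ranked_chunks would come from the retrieval step, with both wrapped in their own cost trace so the savings show up in the same dashboard.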
What the cost trace looks like in production.
The structured log events from cost_trace feed a simple dashboard: one row per step type, sorted by total spend. We keep it on the same screen as the latency dashboard because cost and latency regressions often arrive together and have similar root causes (context size, retry loops, model calls that were not gated).
The dashboard sorts by 30-day spend, not volume. This is the key design decision: a step that runs rarely but costs a lot should surface at the top, not be buried in an average. The token bar gives a visual sense of context footprint at a glance, and the outlier warning triggers when a step's cost exceeds 5× its historical p50.
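A rough sketch of the aggregation behind such a dashboard, assuming the events have already been filtered to the 30-day window and carry the step and cost_usd fields emitted by the trace. The function names dashboard_rows and outlier_thresholds are illustrative.

```python
from collections import defaultdict
from statistics import median


def dashboard_rows(events):
    """One row per step type, sorted by total spend (highest first)."""
    by_step = defaultdict(list)
    for event in events:
        by_step[event["step"]].append(event["cost_usd"])
    rows = [
        {
            "step": step,
            "calls": len(costs),
            "total_usd": sum(costs),
            "p50_usd": median(costs),
        }
        for step, costs in by_step.items()
    ]
    return sorted(rows, key=lambda row: row["total_usd"], reverse=True)


def outlier_thresholds(history, multiplier=5.0):
    """Per-step dollar threshold: multiplier × the historical p50."""
    return {
        row["step"]: multiplier * row["p50_usd"]
        for row in dashboard_rows(history)
    }


# Made-up events for illustration: many cheap calls, one expensive call
events = [{"step": "status_check", "cost_usd": 0.0003} for _ in range(4)]
events.append({"step": "doc_synthesis", "cost_usd": 0.147})

rows = dashboard_rows(events)
thresholds = outlier_thresholds(events)
```

Sorting on total_usd puts doc_synthesis first here even though it has the fewest calls, which is the behaviour described above. The thresholds dict is the kind of mapping an is_outlier check could read from.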
We run one additional view: cost per completed workflow output, not per step. A workflow that takes 18 steps to produce an output is not obviously cheaper than one that takes 6 steps, because the 18-step workflow might be using cheaper steps. Step-level and output-level cost views tell different stories. You want both.
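The output-level view can be derived from the same events by grouping on workflow_id. The sketch below is one possible definition, and it rests on an assumption: spend on workflows that never completed is charged against the outputs that did. The function names and sample figures are illustrative.

```python
from collections import defaultdict


def workflow_totals(events):
    """Total traced spend per workflow_id."""
    totals = defaultdict(float)
    for event in events:
        totals[event["workflow_id"]] += event["cost_usd"]
    return dict(totals)


def cost_per_completed_output(events, completed_workflow_ids):
    """All traced spend divided by the number of completed outputs.

    Assumption: spend on abandoned workflows counts against completed ones.
    """
    completed = set(completed_workflow_ids)
    if not completed:
        return None
    return sum(workflow_totals(events).values()) / len(completed)


# Made-up example: 18 cheap steps versus 6 more expensive steps
events = [{"workflow_id": "wf-a", "cost_usd": 0.0003} for _ in range(18)]
events += [{"workflow_id": "wf-b", "cost_usd": 0.025} for _ in range(6)]

totals = workflow_totals(events)
per_output = cost_per_completed_output(events, ["wf-a", "wf-b"])
```

In this made-up case the 18-step workflow costs far less per output than the 6-step one, which is the point of keeping both views.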
If you want a 30-minute audit of where your agent's API spend is actually going, the contact form is the fastest way. Send us a sample workflow and a 7-day usage export, and you get a tagged Pareto chart back the same week.