The terminology around LLM-adjacent execution is genuinely confused. Not because engineers are imprecise, but because three different communities (ML researchers, API designers, and product teams) colonized the same words at the same time with different meanings. This glossary draws a bright line between each term, gives you code that makes the distinction concrete, and ends with a decision guide you can actually use.
Why the terminology is broken
Three vocabularies arrived at the same time and pointed at overlapping things. OpenAI named their structured-output mode "function calling" in June 2023, which immediately implied execution that does not happen. Anthropic named the same primitive "tool use", which conflated intent and execution from the other direction. The product community took "agent", a term with decades of meaning in AI research, and pointed it at anything from a chatbot with memory to a single tool-call loop.
The practical cost is that nobody in a meeting knows which one anyone else means. Engineers default to the most generous interpretation; product managers default to the most ambitious one. By the time a build is scoped, "we are using function calling to take actions" can mean three different architectures with different cost, latency, and failure modes.
The fix is to commit to definitions that distinguish the three by structural property, not vendor branding. The rest of this post does exactly that.
Function calling
Function calling is the oldest and narrowest of the three primitives. Its entire job is to make "extract structured data from language" reliable. Before it existed, you would prompt the model to "respond in JSON" and then parse whatever came out, hoping. Function calling gives you output shaped by a schema you declare, with no ad hoc parsing.
The term was coined by OpenAI in their June 2023 API release and immediately caused confusion because it implies execution. Nothing executes. The model outputs a dict that looks like a function call. You decide whether to call the function, log it, or throw it away.
Anthropic's equivalent is tool_use content blocks with a tool_result return: the same pattern, different terminology. Both reduce to one statement. The model declares intent, the caller acts.
```python
import anthropic

client = anthropic.Anthropic()

# Declare the function signature. The model will emit this shape.
extract_order = {
    "name": "extract_order",
    "description": "Extract order details from a customer message",
    "input_schema": {
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "quantity": {"type": "integer"},
            "shipping": {"type": "string", "enum": ["standard", "express"]},
        },
        "required": ["product_id", "quantity"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[extract_order],
    # Force this specific tool so the reply is always structured.
    tool_choice={"type": "tool", "name": "extract_order"},
    messages=[{"role": "user", "content": "I need 3x SKU-7821, express please"}],
)

# Model emits structured intent. WE decide what to do with it.
tool_block = next(b for b in response.content if b.type == "tool_use")
order_data = tool_block.input
# e.g. {'product_id': 'SKU-7821', 'quantity': 3, 'shipping': 'express'}

# Nothing was executed. This is just a very reliable JSON extractor.
create_order(order_data)  # caller decides to run this; create_order is your own code
```
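Because the caller owns the decision to run, log, or discard the emitted intent, it helps to put a small gate in front of the side effect. The following is a minimal sketch, not part of any SDK: `validate_order`, `handle_intent`, and the "discard on any problem" policy are hypothetical names and choices, and the checks simply mirror the `extract_order` schema by hand.

```python
from typing import Any

ALLOWED_SHIPPING = ("standard", "express")


def validate_order(data: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    product_id = data.get("product_id")
    if not isinstance(product_id, str) or not product_id:
        problems.append("product_id must be a non-empty string")
    quantity = data.get("quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity < 1:
        problems.append("quantity must be a positive integer")
    if "shipping" in data and data["shipping"] not in ALLOWED_SHIPPING:
        problems.append("shipping must be 'standard' or 'express'")
    return problems


def handle_intent(data: dict[str, Any]) -> str:
    """Decide what to do with the model's declared intent: 'run' or 'discard'."""
    problems = validate_order(data)
    if problems:
        # Hypothetical policy: log and drop anything that fails validation.
        print("discarding order intent:", "; ".join(problems))
        return "discard"
    return "run"  # the caller would invoke its own create_order(data) here


print(handle_intent({"product_id": "SKU-7821", "quantity": 3, "shipping": "express"}))
print(handle_intent({"product_id": "SKU-7821", "quantity": 0}))
```

The model never learns which branch was taken, which is exactly what keeps this in function-calling territory.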
Tool calling
Tool calling is function calling with a return address. After the model emits a tool_use block, you run the function and send a tool_result back into the conversation. The model then decides what to do next: call another tool, ask for clarification, or produce a final answer.
This is the pattern you want for single-agent tasks with real-world side effects. Look up a database record, call an API, read a file, check the weather. The loop is implicit. The model keeps calling tools until it has enough to answer.
The confusion point: people use "function calling" when they mean "tool calling." Technically, function calling produces the intent; tool calling is the full request-execute-return cycle. In practice, when someone says "I am using function calling," they usually mean tool calling. Ask whether the result goes back to the model.
```python
import anthropic

client = anthropic.Anthropic()

# get_stock_price_tool (a tool schema), execute_tool, and extract_text are
# application code defined elsewhere; they are not part of the SDK.
messages = [{"role": "user", "content": "What's the current price of NVDA?"}]

# THE TOOL CALLING LOOP
while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=[get_stock_price_tool],
        messages=messages,
    )

    # Stop on anything other than a tool request (end_turn, max_tokens, ...).
    # Checking only for "end_turn" would loop with an empty result list otherwise.
    if response.stop_reason != "tool_use":
        break  # model is done; extract final answer

    # Model wants to call a tool
    tool_uses = [b for b in response.content if b.type == "tool_use"]

    # Execute each tool call. WE run it, then return the result.
    tool_results = []
    for tu in tool_uses:
        result = execute_tool(tu.name, tu.input)  # actual execution
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": tu.id,
            "content": result,  # result goes BACK to model
        })

    # Feed result back. The model now knows what happened.
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": tool_results},
    ]

# Key difference from function calling: the model SAW the result.
final = extract_text(response.content)
```
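The test "does the result go back to the model?" can be applied mechanically to a transcript. This is a rough sketch under one assumption: messages are plain dicts in the Anthropic content-block shape (`tool_use` in assistant turns, `tool_result` in user turns). `block_types` and `classify` are hypothetical helper names, and the tool IDs are made up for the example.

```python
from typing import Any


def block_types(message: dict[str, Any]) -> set[str]:
    """Collect content-block types from one message; plain-string content has none."""
    content = message.get("content")
    if isinstance(content, str):
        return set()
    return {b.get("type", "") for b in content or [] if isinstance(b, dict)}


def classify(messages: list[dict[str, Any]]) -> str:
    """Label a transcript by whether any tool result went back to the model."""
    emitted = any(
        "tool_use" in block_types(m) for m in messages if m.get("role") == "assistant"
    )
    returned = any(
        "tool_result" in block_types(m) for m in messages if m.get("role") == "user"
    )
    if emitted and returned:
        return "tool calling"
    if emitted:
        return "function calling"
    return "plain chat"


extraction_only = [
    {"role": "user", "content": "I need 3x SKU-7821, express please"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "toolu_1",
                                       "name": "extract_order", "input": {}}]},
]
full_cycle = extraction_only + [
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "toolu_1",
                                  "content": "order created"}]},
]

print(classify(extraction_only))  # function calling
print(classify(full_cycle))       # tool calling
```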
Agents
"Agent" is the most overloaded word in the ecosystem. A chatbot with memory gets called an agent. A single tool-call loop gets called an agent. The useful definition is narrower: a system where the LLM drives the execution sequence, not just a single step of it.
The structural difference from tool calling: an agent holds state across turns, can revise its plan when a tool fails, and can spawn sub-tasks or other agents. It is the difference between asking someone to look up a fact and asking them to research and write a report. The latter involves decisions about how to proceed that were not specified upfront.
Agents are the right choice when the path to the goal is unknown or variable. If you can write the workflow as a fixed flowchart, you probably want orchestrated tool calling, not an agent. Agents trade predictability for adaptability. Understand that tradeoff before reaching for the pattern.
```python
from typing import List, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

# llm_plan, run_tool, and llm_evaluate are placeholders for your own
# model and tool wrappers; they are not LangGraph functions.


# STATE: persists across the entire agent run
class ResearchState(TypedDict):
    goal: str
    plan: List[str]      # agent generates this dynamically
    findings: List[str]  # accumulates across turns
    iterations: int      # loop counter; caps the run if the agent gets stuck
    done: bool


# NODES: each is a model call that reads + writes state
def planner(state: ResearchState) -> ResearchState:
    # Model decides what to do next. Not pre-specified by the caller.
    plan = llm_plan(state["goal"], state["findings"])
    return {**state, "plan": plan}


def executor(state: ResearchState) -> ResearchState:
    # Runs tool calls determined by the planner, not hardcoded.
    results = [run_tool(step) for step in state["plan"]]
    return {
        **state,
        "findings": state["findings"] + results,
        "iterations": state["iterations"] + 1,
    }


def evaluator(state: ResearchState) -> ResearchState:
    # Model self-assesses: is the goal met? Should I retry?
    done = llm_evaluate(state["goal"], state["findings"])
    return {**state, "done": done}


# GRAPH: model drives the control flow.
# Node names are kept distinct from state keys such as "plan".
graph = StateGraph(ResearchState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("evaluator", evaluator)
graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "evaluator")

# Conditional edge: agent decides whether to loop or finish
graph.add_conditional_edges(
    "evaluator",
    lambda s: END if s["done"] or s["iterations"] >= 5 else "planner",
)

agent = graph.compile(checkpointer=MemorySaver())  # persistent state

# A checkpointer needs a thread_id so it knows which run to save and resume.
result = agent.invoke(
    {"goal": "research Q3 competitor pricing", "plan": [], "findings": [],
     "iterations": 0, "done": False},
    config={"configurable": {"thread_id": "research-1"}},
)
```
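For contrast, here is roughly what the "fixed flowchart" alternative looks like. It is a minimal sketch with hypothetical stand-in steps (`fetch_prices`, `compute_average`, `write_summary`) and hardcoded data; in a real system each step might wrap a tool-calling model request. What matters is that the caller writes down the order and no model chooses it.

```python
from typing import Callable

Step = Callable[[dict], dict]


def fetch_prices(state: dict) -> dict:
    # Hypothetical tool: a real version would call a pricing API.
    return {**state, "prices": {"basic": 10.0, "pro": 25.0}}


def compute_average(state: dict) -> dict:
    prices = state["prices"].values()
    return {**state, "average": sum(prices) / len(prices)}


def write_summary(state: dict) -> dict:
    text = f"{state['competitor']}: average price {state['average']:.2f}"
    return {**state, "summary": text}


# The flowchart, written down by the caller. No model chooses the order.
PIPELINE: list[Step] = [fetch_prices, compute_average, write_summary]


def run_pipeline(competitor: str) -> dict:
    state: dict = {"competitor": competitor}
    for step in PIPELINE:
        state = step(state)
    return state


print(run_pipeline("Acme")["summary"])  # Acme: average price 17.50
```

If every run of your system could be described by a list like `PIPELINE`, the planner and evaluator nodes of an agent graph are buying adaptability you do not need.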
Side-by-side comparison
The single page that should hang above the architecture review whiteboard. Twelve dimensions, three columns, the answer to most "wait, which one are we doing?" questions.
| dimension | function calling | tool calling | agent |
|---|---|---|---|
| core job | structured extraction | observe and react to world | pursue goal autonomously |
| execution | caller only | caller runs, result returns | model-directed, multi-step |
| feedback loop | none | single return trip | continuous, adaptive |
| state | stateless | within-session | persistent, checkpointed |
| planning | none | implicit | explicit, revisable |
| latency | lowest (1 call) | medium (N calls) | highest (N calls + overhead) |
| predictability | highest | medium | lowest |
| correct scope | extract, classify, validate | look up, fetch, post, calculate | research, orchestrate, decide |
| failure mode | schema mismatch | tool error, context overflow | loop, hallucinated plan |
| mitigation | schema validation | result validation, retry | confidence budget, eval harness |
| needs orchestrator? | no | rarely | almost always |
| cost per invocation | cheapest | medium | most expensive |
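The "tool error" failure mode and its "retry" mitigation can be sketched as a thin wrapper around tool execution. This is an illustrative helper, not an SDK function: `run_tool_with_retry`, `flaky_price`, and the `toolu_demo` ID are hypothetical. The returned dict follows the `tool_result` block shape, with `is_error` set when every attempt fails, so the model is told about the failure rather than the loop crashing.

```python
import time
from typing import Any, Callable


def run_tool_with_retry(
    tool_use_id: str,
    tool: Callable[..., Any],
    tool_input: dict[str, Any],
    attempts: int = 3,
    delay_s: float = 0.0,
) -> dict[str, Any]:
    """Run a tool and always return a tool_result block, even on failure."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            result = tool(**tool_input)
            return {"type": "tool_result", "tool_use_id": tool_use_id,
                    "content": str(result)}
        except Exception as exc:  # broad on purpose: any tool failure is reported
            last_error = f"attempt {attempt} failed: {exc}"
            time.sleep(delay_s)
    # Report the failure to the model instead of raising, so it can react.
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": last_error, "is_error": True}


calls = {"n": 0}


def flaky_price(ticker: str) -> float:
    # Hypothetical tool that times out twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timeout")
    return 123.45


ok = run_tool_with_retry("toolu_demo", flaky_price, {"ticker": "NVDA"})
print(ok["content"], "after", calls["n"], "calls")  # 123.45 after 3 calls
```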
Decision guide
Start with the simplest primitive that satisfies the requirement. Reach up the complexity stack only when the simpler option structurally cannot do the job, not when it would require more careful prompting.
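One way to encode "start with the simplest primitive" is as an ordered series of structural questions. The sketch below is one reading of the definitions used in this post, not an official rubric: `Requirement` and `choose_primitive` are hypothetical names, and the two questions are taken from the distinctions drawn earlier (does the result go back to the model, and can the path be drawn as a fixed flowchart).

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    model_must_see_results: bool  # does any tool output go back to the model?
    path_known_upfront: bool      # could you draw the workflow as a fixed flowchart?


def choose_primitive(req: Requirement) -> str:
    """Check the cheapest option first and move up only on a structural need."""
    if not req.model_must_see_results:
        return "function calling"
    if req.path_known_upfront:
        return "tool calling"
    return "agent"


print(choose_primitive(Requirement(model_must_see_results=False, path_known_upfront=True)))
print(choose_primitive(Requirement(model_must_see_results=True, path_known_upfront=True)))
print(choose_primitive(Requirement(model_must_see_results=True, path_known_upfront=False)))
```

Note that "it would need more careful prompting" is deliberately not one of the inputs: only structural properties move a requirement up the stack.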
Sales meeting cheat sheet
When someone says one of these things, here is what they probably mean and what to clarify or correct. For use in meetings where the terminology has already become load-bearing.
Pin this section to the document you share with stakeholders before the scoping meeting. The 30 seconds of vocabulary alignment saves the 30 minutes of "wait, by agent you mean what exactly?" that otherwise happens twice per call.
If you would like a second opinion on whether what you are building is tool calling or actually an agent, the contact form is the fastest way. We do 30-minute architecture reviews on agentic systems in flight, free.