Tool Use: Giving Agents Hands

Topics Covered

How Tool Use Works

The tool use loop

Designing Tool Interfaces

1. Descriptive names

2. Precise descriptions

3. Constrained parameters

4. Minimal parameters

5. Good error messages

Model Context Protocol (MCP)

Tool Selection and Routing

Error Handling in Tool Calls

Failure mode 1: Invalid arguments

Failure mode 2: Execution failure

Failure mode 3: Unexpected results

Timeout handling

Logging and observability

Common Tool Patterns

Read tools

Write tools

Execute tools

Observe tools

Parallel tool calling

Composing tool categories

How Tool Use Works

Large language models can reason, plan, and generate text, but they cannot do anything in the real world. They cannot read a database, call an API, or check the weather. Tool use is the protocol that bridges this gap. It gives the model a way to say "I need to perform an action" and gives your code a way to execute that action and return the result.

Figure: Tool call flow showing the model generating a call, code executing it, and the result returning to the model.

The protocol works in four steps.

Step 1: Define available tools. You send the LLM a list of tool definitions, each with a name, a description, and a JSON schema for parameters. This tells the model what actions are possible and what arguments each action requires.

Step 2: The LLM generates a tool call. Instead of producing text, the model outputs a structured JSON object containing the tool name and arguments. This is not a function call. It is a description of a function call. The model has decided what to do but has not done it.

Step 3: Your code executes the call. Your application receives the JSON, validates the arguments against the schema, and runs the actual function. This is where the real work happens: the API request, the database query, the file read.

Step 4: Return the result. You send the function's output back to the LLM as a tool result message. The model reads the result and either calls another tool (if more work is needed) or generates a final text response for the user.

Here is a concrete example showing all four steps:

python
# Step 1: Tool definition sent to the LLM
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
    }
}]

# Step 2: LLM returns a tool call (not text)
# {"name": "get_weather", "arguments": {"city": "Tokyo", "units": "celsius"}}

# Step 3: Your code executes it
result = get_weather(city="Tokyo", units="celsius")
# result: {"temperature": 22, "condition": "partly cloudy"}

# Step 4: Send result back to LLM, which generates final response
# "The current weather in Tokyo is 22°C and partly cloudy."

The critical insight here is separation of concerns. The LLM never executes anything. It only generates a description of what to execute. Your code is the bridge between the LLM's intent and the actual world. The model proposes an action; your code decides whether to carry it out.

Key Insight

The LLM never actually calls a tool. It generates a JSON description of a call, and your code executes it. This separation is critical for safety. You can validate arguments, check permissions, rate-limit calls, and log everything before any action is taken. The LLM proposes; your code disposes.

This architecture means the LLM's power is bounded by what you choose to expose. If you only provide a search tool, the agent can search but cannot write files. If you add a send_email tool, the agent gains the ability to send emails, but only through your code, which can enforce recipient whitelists, content filtering, and rate limits. The tool definitions are both a capability list and a security boundary.

The tool use loop

A single tool call rarely solves a complex problem. Most useful agent interactions involve multiple rounds of the four-step protocol. The user asks "book me a flight from NYC to London next Friday under $500." The agent calls search_flights, reads the results, finds one that matches, calls book_flight, gets a confirmation, and then generates a summary for the user. Each round adds real-world information to the conversation, letting the LLM make increasingly informed decisions.

This loop (reason, act, observe, repeat) is the fundamental pattern behind all agentic behavior. The tool use protocol provides the "act" and "observe" steps. The LLM's natural language capabilities provide the "reason" step. Together, they create an agent that can accomplish goals in the real world rather than merely discussing them.
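The loop above can be sketched in a few lines. This is a minimal illustration, not a production orchestrator: fake_model is a scripted stand-in for a real LLM call, and get_weather is a hypothetical tool.

```python
# A minimal sketch of the reason-act-observe loop. fake_model stands in
# for an LLM: it returns a tool call until it has seen a tool result.
def fake_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"type": "text", "content": "It is 22°C in Tokyo."}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": {"city": "Tokyo"}}

TOOLS = {"get_weather": lambda city: {"temperature": 22}}

def run_agent(user_message, model, tools, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):              # always bound the loop
        reply = model(messages)
        if reply["type"] == "text":          # model is done: final answer
            return reply["content"]
        result = tools[reply["name"]](**reply["arguments"])   # act
        messages.append({"role": "tool", "content": result})  # observe
    return "Reached turn limit without a final answer."

print(run_agent("What's the weather in Tokyo?", fake_model, TOOLS))
```

Note the `max_turns` bound: a real orchestrator always caps the loop so a confused model cannot spin forever.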

The protocol is model-agnostic. The same tool definitions work with different LLMs: GPT, Claude, Gemini, or open-source models. The JSON format for tool calls and results is a shared convention. This means you can swap the underlying model without rewriting your tools, or run the same agent with different models for different cost and capability tradeoffs.

Designing Tool Interfaces

Tool design is UX, but your user is an LLM, not a human. The LLM reads tool names and descriptions to decide which tool to call and how to call it. A poorly designed tool interface causes the same problems as a poorly designed API: wrong calls, bad arguments, and confused users. The difference is that your user cannot read documentation, browse examples, or ask a colleague. It can only work with what you put in the tool definition.

Five principles make the difference between tools that work reliably and tools that fail unpredictably.

1. Descriptive names

The name alone should convey what the tool does. search_knowledge_base tells the LLM this tool searches a knowledge base. tool_1 or sk tells it nothing. When the LLM sees a user ask "find articles about authentication," it needs to match that intent to a tool name. Descriptive names make that match obvious.

2. Precise descriptions

The description is the most important field in a tool definition. Compare these two:

 
Bad:  "Searches stuff"
Good: "Search the company knowledge base for articles matching
       the query. Returns top 5 results with title, URL, and
       relevance score. Use when the user asks about company
       policies, product features, or troubleshooting steps."

The good description tells the LLM three things: what the tool does, what the output looks like, and when to use it. That last part (when to use it) is crucial for tool selection when multiple tools could plausibly match a query.

3. Constrained parameters

Use enums instead of free text wherever possible. If a status field can only be "open", "closed", or "pending", define it as an enum. This prevents the LLM from inventing values like "maybe_open" or "partially_closed". Every free-text parameter is an opportunity for the LLM to generate an invalid value.

json
{
  "status": {
    "type": "string",
    "enum": ["open", "closed", "pending"],
    "description": "Filter tickets by status"
  }
}

4. Minimal parameters

Only require what is necessary. Every additional parameter increases the chance of the LLM omitting one, using the wrong type, or getting confused about what it means. Optional parameters with good defaults should be handled server-side, not exposed to the LLM. If 90% of calls use the default value, the parameter does not belong in the tool definition.
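One way to apply this principle is to keep defaults in the implementation and out of the schema. In this hypothetical sketch, the LLM only ever sees the required query parameter; limit and language are applied server-side:

```python
# Hypothetical search tool: the schema exposes only the required parameter.
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Search terms"}
    },
    "required": ["query"],
}

def search_knowledge_base(query, limit=5, language="en"):
    # limit and language are defaults handled here, never shown to the LLM
    return {"query": query, "limit": limit, "language": language}

# The LLM only supplies {"query": ...}; defaults stay stable server-side.
print(search_knowledge_base(query="reset password"))
```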

5. Good error messages

Return structured errors that help the LLM recover. Compare:

json
Bad:  {"error": "404"}
Good: {"error": "User not found",
       "suggestion": "Try searching by email instead of name"}

The good error message tells the LLM what went wrong and what to do next. The LLM can then adjust its approach: search by email, ask the user for clarification, or try a different tool entirely.

Interview Tip

When designing tools for an LLM, pretend you are writing documentation for a new developer who can only read the function signature and docstring. If a human developer could figure out how to call your function from its name, description, and parameter types alone (without seeing the implementation), then an LLM can too.

Model Context Protocol (MCP)

As tool ecosystems grow, a standard for tool integration becomes necessary. The Model Context Protocol (MCP) is the emerging answer. MCP provides a universal protocol for connecting LLMs to tools, data sources, and capabilities through three primitives: Tools (actions the LLM can take), Resources (data the LLM can read), and Prompts (templates the LLM can use).

Instead of building custom integrations for each LLM provider, MCP provides one standard interface. Think of it as USB-C for AI: any MCP-compatible tool works with any MCP-compatible LLM host. A tool developer writes one MCP server, and it works with Claude, GPT, Gemini, or any other MCP-compatible client. This decouples tool development from model development, enabling an ecosystem where tools and models evolve independently.

The practical benefit is composability. A company can connect their agent to an MCP server for Slack (messaging), another for Jira (project tracking), another for GitHub (code), and another for their internal database, each maintained by a different team or vendor. The agent sees a unified set of tools, resources, and prompts without knowing or caring which MCP server provides each one. Adding a new capability means connecting a new MCP server, not rewriting the agent's tool integration layer.

Tool Selection and Routing

When an LLM receives a user message alongside a list of tool definitions, it must decide whether to call a tool and, if so, which one. This decision is driven by matching the user's intent against the tool descriptions. If you give an agent three tools (search_web, read_file, and run_code) and the user says "find the latest Python release date," the LLM matches "find" and "latest" to search_web because its description mentions searching the internet for current information. This intent matching works reliably when the agent has 3 to 10 well-described tools.

Figure: Two-stage tool routing architecture for scaling tool selection.

The problem emerges at scale. With 50 or more tools, selection accuracy degrades. The LLM must compare the user's intent against dozens of descriptions, some of which overlap in wording. A query like "get user data" might match get_user_profile, search_users, get_user_activity, and export_user_data. The LLM spends more tokens reasoning about which tool to pick, and the probability of choosing the wrong one increases with every additional candidate.

Four strategies address this scaling challenge.

Strategy 1: Tool categories. Group tools into categories and let the LLM pick in two stages. First, present category descriptions: "User Management," "Analytics," "Notifications." The LLM selects a category. Then present only the tools in that category. This reduces the selection space from 50 to 5-8 tools per decision, restoring accuracy. The tradeoff is an extra LLM call per tool use, adding about 200-500ms of latency.

Strategy 2: Few-shot examples. Include examples of correct tool selections in the system prompt. "When the user asks about order status, call get_order. When the user asks about shipping, call track_shipment." These examples anchor the LLM's decision-making for common queries. The tradeoff is prompt length: each example consumes tokens, and the examples become stale as tools change.

Strategy 3: Two-stage routing. Use a fast, cheap model (like a small classifier or a lightweight LLM) to select the tool, then use the full-capability model to generate the arguments. The routing model only needs to map intent to tool name, a much simpler task than generating structured arguments. This cuts cost because the expensive model only runs after tool selection, and it improves accuracy because the routing model can be fine-tuned specifically for your tool set.

Strategy 4: Dynamic tool loading. Only show the LLM tools relevant to the current task. Use keywords from the user's message, conversation context, or a retrieval step (embed tool descriptions and find the nearest matches) to filter the tool list before each LLM call. If the user is talking about billing, only present billing-related tools. This keeps the selection space small without requiring explicit categories. The tradeoff is that the retrieval step can miss relevant tools if descriptions and queries use different vocabulary.
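As a rough illustration of dynamic tool loading, the filter below scores tools by word overlap between the user message and each description. The tool names are hypothetical, and a production system would typically use embeddings rather than exact word matching:

```python
# Sketch of dynamic tool loading: keep only tools whose descriptions
# share vocabulary with the user message. Tool names are hypothetical.
TOOL_DESCRIPTIONS = {
    "get_invoice": "Fetch a billing invoice by ID",
    "search_orders": "Search customer orders by date or status",
    "send_email": "Send an email to a recipient",
}

def relevant_tools(user_message, tools, max_tools=2):
    words = set(user_message.lower().split())
    scored = [(len(words & set(desc.lower().split())), name)
              for name, desc in tools.items()]
    scored.sort(reverse=True)                    # highest overlap first
    return [name for score, name in scored[:max_tools] if score > 0]

print(relevant_tools("show me the billing invoice", TOOL_DESCRIPTIONS))
```

This illustrates the tradeoff noted above: "show my purchases" would miss search_orders entirely because the query and description share no words, which is exactly the vocabulary-mismatch failure an embedding-based retrieval step is meant to reduce.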

In practice, most production agents combine two or more of these strategies. A common pattern is dynamic tool loading (to reduce the candidate set) plus few-shot examples (to anchor selection among the remaining candidates). The key metric to track is tool selection accuracy: the percentage of tool calls where the LLM chose the correct tool on the first attempt. If this drops below 90%, your tool set needs restructuring.

Tool naming also affects selection. When two tools have similar names (search_orders and search_order_history), the LLM confuses them more often than when names are distinct (get_recent_orders and search_order_archive). Treat tool names as part of your routing strategy: they are the first signal the LLM uses before reading descriptions. A clear naming convention that reflects tool categories (read_*, create_*, delete_*) gives the LLM a structural cue that complements the description.

Finally, measure and iterate. Log every tool call with the selected tool, the user query, and whether the selection was correct. Review misrouted calls weekly. Common patterns emerge quickly: two tools that are frequently confused need clearer differentiation in their descriptions, or should be merged into one tool with a parameter that distinguishes the two behaviors. Tool routing is not a set-and-forget design. It requires ongoing tuning as you add tools, change descriptions, and observe how the LLM's selection behavior shifts.

Error Handling in Tool Calls

Tools fail. APIs return 500 errors. Databases time out. Files are missing. Rate limits are hit. The difference between a useful agent and a frustrating one is how it handles these failures. An agent that crashes on the first error gives the user nothing. An agent that receives the error, understands it, and tries a different approach gives the user a result.

Figure: Tool call error handling flow across the three failure modes.

Tool call failures fall into three categories, each requiring a different response.

Failure mode 1: Invalid arguments

The LLM generates malformed JSON, uses the wrong type for a parameter, or omits a required field. This is the most common failure and the easiest to fix. Validate arguments against the tool's schema before executing the function. If validation fails, return a structured error message that includes the expected schema. The LLM reads this error, understands what went wrong, and generates corrected arguments on the next attempt. In practice, about 5-15% of tool calls from frontier models have argument issues, and the vast majority self-correct on the first retry when given a clear error message.
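A minimal version of this validate-before-execute step might look like the following. This is a simplified sketch that checks only required fields, basic types, and enums; real systems typically use a full JSON Schema validator library:

```python
# Sketch of schema validation before tool execution. Returns a list of
# error strings the LLM can read and correct; empty list means valid.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def validate_arguments(schema, args):
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"Missing required field: {field}")
    for field, value in args.items():
        spec = schema["properties"].get(field)
        if spec is None:
            errors.append(f"Unknown field: {field}")
        elif not isinstance(value, TYPE_MAP[spec["type"]]):
            errors.append(f"Wrong type for {field}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"Invalid value for {field}: must be one of {spec['enum']}")
    return errors

schema = {"properties": {"city": {"type": "string"},
                         "units": {"type": "string",
                                   "enum": ["celsius", "fahrenheit"]}},
          "required": ["city"]}
print(validate_arguments(schema, {"units": "kelvin"}))
```

Because each error names the field and the expected shape, the message sent back to the model is directly actionable on the retry.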

Failure mode 2: Execution failure

The tool receives valid arguments but the underlying operation fails. The API returns an error, the database query times out, the file does not exist. Return the error to the LLM as a tool result, not as an exception that crashes the agent loop. The LLM can then try an alternative approach: a different query, a different tool, or an explanation to the user about why the operation failed. The key distinction is between transient failures (network timeout, rate limit) that are worth retrying and permanent failures (resource deleted, permission denied) that require a different strategy. Include this distinction in your error messages so the LLM can decide accordingly.

Failure mode 3: Unexpected results

The tool succeeds but returns data in a format the LLM does not expect. A search returns an empty array when the LLM expected results. A calculation returns a negative number when the LLM assumed positive. Include the result schema in the tool description so the LLM knows what to expect, and handle edge cases (empty results, null values) explicitly in the tool's output.

Here is a pattern that handles all three failure modes:

python
def execute_tool_safely(tool_call, available_tools):
    tool = available_tools.get(tool_call.name)
    if not tool:
        return {"error": f"Unknown tool: {tool_call.name}"}
    try:
        args = validate_arguments(tool.schema, tool_call.arguments)
        result = tool.execute(**args)
        return {"result": result}
    except ValidationError as e:
        return {"error": f"Invalid arguments: {e}",
                "hint": f"Expected schema: {tool.schema}"}
    except ToolExecutionError as e:
        return {"error": f"Tool failed: {e}",
                "suggestion": "Try a different approach"}

The key principle is: always return errors to the LLM as tool results rather than crashing the agent. Every exception becomes a message the LLM can reason about. A validation error becomes "your arguments were wrong, here is the expected schema." An execution error becomes "the operation failed, try something else." The LLM is surprisingly good at recovering from errors when given clear, actionable error messages.

Common Pitfall

Never let a tool call crash your agent loop. Always catch exceptions and return them as structured error messages to the LLM. A crashed agent gives the user nothing. An agent that receives an error message can try a different tool, adjust its arguments, or explain the problem to the user. Error messages are not failures. They are information the agent uses to adapt.

Retry budgets matter too. An agent that retries a failed tool call indefinitely wastes tokens and time. Set a maximum of 2-3 retries per tool per conversation turn. If the tool still fails after retries, the agent should explain the failure to the user and suggest alternatives. "I could not access the database, but I can search the knowledge base for similar information." Graceful degradation is always better than silent failure or infinite retry loops.
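A retry budget can be enforced with a small wrapper like this sketch; the flaky tool below simply simulates transient failures that succeed on the third attempt:

```python
# Sketch of a per-call retry budget: retry a fixed number of times, then
# return a structured failure message instead of looping forever.
def call_with_retries(tool, args, max_retries=2):
    attempts = 0
    while True:
        try:
            return {"result": tool(**args)}
        except Exception as e:
            attempts += 1
            if attempts > max_retries:
                return {"error": f"Failed after {attempts} attempts: {e}",
                        "suggestion": "Explain the failure to the user "
                                      "and offer an alternative."}

# Simulate a tool that fails twice, then succeeds.
calls = {"n": 0}
def flaky_tool(city):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient network error")
    return {"temperature": 22}

print(call_with_retries(flaky_tool, {"city": "Tokyo"}))
```

With max_retries=2 the tool gets at most three attempts per turn, matching the 2-3 retry budget suggested above.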

Timeout handling

External tools have unpredictable latency. An API call that normally takes 200ms might take 30 seconds during a service degradation. Without timeouts, the agent loop hangs, the user waits, and the experience degrades silently. Set explicit timeouts on every tool execution: typically 5-10 seconds for API calls, 30 seconds for complex computations, and 60 seconds for file operations on large data. When a timeout fires, return it as a structured error: {"error": "Tool timed out after 10 seconds", "suggestion": "The service may be experiencing high load. Try again or use an alternative tool."} The LLM can then decide whether to retry, try a different approach, or inform the user.
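One possible way to enforce per-call timeouts is to run the tool in a worker thread and bound the wait, as in this sketch. One caveat: Python threads cannot be forcibly killed, so a timed-out call keeps running in the background even after the error is returned.

```python
# Sketch of timeout enforcement via a worker thread. The timed-out tool
# thread is abandoned, not killed, so tools should be safe to orphan.
import concurrent.futures
import time

def execute_with_timeout(tool, args, timeout_seconds=10):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tool, **args)
        try:
            return {"result": future.result(timeout=timeout_seconds)}
        except concurrent.futures.TimeoutError:
            return {"error": f"Tool timed out after {timeout_seconds} seconds",
                    "suggestion": "The service may be under load. "
                                  "Retry or use an alternative tool."}

def slow_tool():
    time.sleep(1)          # simulate a degraded, slow service
    return "done"

print(execute_with_timeout(slow_tool, {}, timeout_seconds=0.2))
```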

Logging and observability

Every tool call (successful or failed) should be logged with the conversation ID, tool name, arguments, result (or error), and execution time. This log is essential for debugging agent behavior. When an agent produces a wrong answer, the tool call log tells you exactly what happened: which tools were called, what arguments the LLM generated, what results came back, and where the reasoning went wrong. Without this log, debugging an agent is like debugging a program without stack traces.
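A logging wrapper along these lines captures the fields mentioned above as one JSON line per call; the field names here are illustrative, not a standard:

```python
# Sketch of structured tool-call logging: one JSON line per call with
# conversation ID, tool, arguments, result or error, and elapsed time.
import json, time, uuid

def logged_execute(conversation_id, tool_name, tool_fn, args, log=print):
    start = time.monotonic()
    entry = {"call_id": str(uuid.uuid4()),
             "conversation_id": conversation_id,
             "tool": tool_name,
             "arguments": args}
    try:
        entry["result"] = tool_fn(**args)
    except Exception as e:
        entry["error"] = str(e)          # failures are logged, not raised
    entry["elapsed_ms"] = round((time.monotonic() - start) * 1000, 1)
    log(json.dumps(entry))
    return entry

logged_execute("conv-42", "get_weather",
               lambda city: {"temperature": 22}, {"city": "Tokyo"})
```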

Common Tool Patterns

Most agent tools fall into four categories. Understanding these categories helps you design tool sets that cover an agent's needs without redundancy or gaps.

Figure: Parallel tool calling, showing multiple tools executing simultaneously.

Read tools

Read tools retrieve information without side effects. They search, look up, fetch, and query. Because they do not modify state, they are safe to retry if they fail and safe to call in parallel if you need multiple pieces of information at once. Examples include search_database, read_file, get_user_profile, and check_inventory.

Read tools are the foundation of most agents. An agent that can only read is still useful. It can answer questions, find information, and synthesize data from multiple sources. The safety profile is excellent because no call can cause damage. The worst case is a wasted API call, not a corrupted database. When starting a new agent project, begin with read tools. They deliver immediate value with minimal risk, and you can add write, execute, and observe tools incrementally as you gain confidence in the agent's behavior.

Write tools

Write tools modify state. They create, update, and delete records. Because they change the world, they carry risk. A write tool called with wrong arguments can create duplicate records, overwrite good data, or delete something important.

Design write tools with safety in mind. Make operations idempotent where possible: creating a user with a specific ID should return the existing user if one already exists, not create a duplicate. For destructive actions (delete, overwrite), consider requiring confirmation: the tool returns a preview of the change, and the agent must call a confirm_action tool to proceed. This gives the user a chance to intervene before irreversible changes happen.

Examples include create_issue, update_record, send_email, and delete_file.
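The confirmation pattern described above might be sketched like this, with a hypothetical delete_file tool that returns a preview plus a token instead of acting immediately:

```python
# Sketch of the preview/confirm pattern for destructive writes.
# Tool names and fields are hypothetical.
import uuid

PENDING = {}  # pending destructive actions awaiting confirmation

def delete_file(path):
    """First stage: return a preview and a token instead of deleting."""
    token = str(uuid.uuid4())
    PENDING[token] = ("delete", path)
    return {"preview": f"This will permanently delete {path}",
            "confirmation_token": token}

def confirm_action(confirmation_token):
    """Second stage: the agent must call this to actually execute."""
    action = PENDING.pop(confirmation_token, None)
    if action is None:
        return {"error": "Unknown or expired confirmation token"}
    # ... perform the real deletion here ...
    return {"status": "executed", "action": action}

step1 = delete_file("/tmp/report.csv")
step2 = confirm_action(step1["confirmation_token"])
print(step2)
```

Because the token is single-use (popped on confirmation), the agent cannot replay a confirmed action, and the user or orchestrator can inspect the preview before the second call is allowed.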

Execute tools

Execute tools run code or commands. They are the most powerful category and the most dangerous. A run_code tool can compute anything, and also delete everything. Execute tools need sandboxing: run code in a container with no network access and limited CPU/memory, set execution timeouts, and restrict file system access.

The power of execute tools is that they make agents programmable. An agent with a run_python_code tool can perform calculations, transform data, generate charts, and solve problems that would be impossible with pre-built tools alone. The risk is proportional to the power, and every execute tool needs resource limits and security boundaries.

Examples include run_python_code, execute_sql_query, and run_shell_command.

Observe tools

Observe tools give the agent perception beyond text. They read screenshots, parse PDFs, analyze images, and process audio. These tools expand what the agent can understand by converting non-text data into text descriptions or structured data that the LLM can reason about.

Observe tools are becoming increasingly important as agents interact with visual interfaces. A take_screenshot tool lets an agent see what a web page looks like. A read_pdf tool lets an agent extract data from documents. An analyze_image tool lets an agent describe what is in a photo. Each observation becomes context that the LLM uses for its next decision.

Examples include take_screenshot, read_pdf, analyze_image, and transcribe_audio.

Parallel tool calling

When an agent needs multiple independent pieces of information, calling tools sequentially wastes time. If the agent needs both the user's profile and their recent orders, it does not need to wait for the profile before requesting orders. These are independent operations.

Parallel tool calling lets the LLM generate multiple tool calls in a single response. The orchestrator dispatches all calls simultaneously, waits for all results, and sends all results back to the LLM in one message. For two independent calls that each take 200ms, sequential execution takes 400ms. Parallel execution takes 200ms: the latency of the slowest call, not the sum of all calls.

This pattern is most valuable with read tools, which are safe to run in parallel by definition. Write tools can also run in parallel when they modify independent resources (creating a ticket and sending a notification). Execute tools require more caution, because parallel code execution may need isolated sandboxes to prevent interference.

The LLM decides when to parallelize based on the independence of the operations. If fetching weather requires a city name that comes from the user's profile, those calls are dependent and must be sequential. If fetching weather and stock prices are both independent of each other, the LLM generates both calls at once.
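A minimal dispatcher for independent calls can be built on asyncio.gather, as in this sketch. The two hypothetical tools below each simulate a 200ms API call, so the pair completes in roughly 200ms rather than 400ms:

```python
# Sketch of parallel dispatch for independent tool calls. Total latency
# is roughly the slowest call, not the sum of all calls.
import asyncio

async def get_profile(user_id):
    await asyncio.sleep(0.2)          # simulate a 200 ms API call
    return {"user_id": user_id, "name": "Ada"}

async def get_recent_orders(user_id):
    await asyncio.sleep(0.2)          # independent 200 ms call
    return {"user_id": user_id, "orders": 3}

async def dispatch_parallel(tool_calls):
    # Run all calls concurrently; results come back in call order.
    return await asyncio.gather(*(fn(**args) for fn, args in tool_calls))

results = asyncio.run(dispatch_parallel([
    (get_profile, {"user_id": 7}),
    (get_recent_orders, {"user_id": 7}),
]))
print(results)
```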

Composing tool categories

The most capable agents combine all four categories. Consider a DevOps agent: it reads logs (read tool), restarts a failing service (write tool), runs a diagnostic script (execute tool), and takes a screenshot of the monitoring dashboard to confirm recovery (observe tool). Each category contributes a different capability. Removing any one category limits what the agent can accomplish. The art of tool design is choosing the right set of tools from each category to cover the agent's use cases without creating redundancy or confusion in tool selection.