Last Updated: March 15, 2026
LLMs started as powerful text generators. You give them a prompt, and they produce coherent, useful text in response. But modern AI systems are expected to do more than just generate answers. They need to take actions.
In real-world applications, generating text is often only the first step. A useful AI system might need to search the web, query a database, call an API, run code, or interact with external tools to complete a task. Instead of simply responding with information, the model becomes part of a system that can observe, decide, and act.
This shift marks an important transition in AI engineering. We move from building systems that generate text to systems that use language models as reasoning engines that orchestrate actions.
In this chapter, we will explore how LLMs can be connected to tools, APIs, and external systems, enabling them to move beyond conversation and start getting real work done.
In the previous module, you learned how RAG lets models access external knowledge by retrieving documents at query time. That solved the "stale knowledge" problem. But retrieval is read-only. What about when the user wants to write, create, update, or interact with external systems?
Consider what an LLM can and cannot do today:

- It can summarize a document, draft an email, explain a concept, or even write out the exact API call that would create a calendar event.
- It cannot actually send that email, query your live database, execute that API call, or create a record in an external system.
The pattern is clear. Anything that requires interacting with the world outside the model's context window is impossible with text generation alone. The model can describe the action perfectly, but it has no hands to carry it out.
Function calling gives the model hands. The model still only outputs text. But instead of outputting a natural language response, it outputs a structured JSON message that says "I want to call function X with these parameters." Your application code then executes that function, feeds the result back to the model, and the model incorporates the real data into its response.
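Concretely, that structured request looks something like this. The shape below follows the OpenAI Chat Completions format; the ID and argument values are illustrative:

```python
# Illustrative shape of a tool call the model might emit.
# Note that "arguments" is a JSON *string*, not a parsed object.
tool_call = {
    "id": "call_abc123",  # placeholder; real IDs are generated by the API
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "arguments": '{"location": "Tokyo, Japan", "unit": "celsius"}',
    },
}
```

Your application parses the `arguments` string, runs the matching function, and returns the result to the model.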
Key Insight: The model never executes code. It never makes HTTP requests. It never touches your database. It produces a structured request, and your code does the work. This distinction matters because it means you stay in full control of what actions are allowed, what parameters are valid, and what gets executed.
The diagram above shows the two paths a model can take when it receives a message. If the user asks something the model can answer from its training data, it responds directly with text (Option A). If the user asks something that requires external data or an action, the model outputs a tool call (Option B), your code executes it, the result goes back to the model, and then the model responds with real information.
Now that you understand why function calling exists, let's walk through exactly how it works, step by step. The entire mechanism comes down to four phases.
When you send a request to the LLM API, you include a list of tool definitions alongside the user's message. Each tool definition describes a function: its name, what it does, and what parameters it accepts. Think of this as handing the model a menu of capabilities.
Based on the user's message and the available tool definitions, the model makes a decision. If the user's question can be answered without any tools, the model just responds with text. If a tool is needed, the model outputs a special tool_calls object containing the function name and arguments as structured JSON.
You receive the model's tool call, extract the function name and arguments, and run the actual function in your application. This might mean calling a weather API, querying a database, or creating a record in an external system. The model has no involvement in this step.
You take the function's return value and send it back to the model as a new message with the role tool. The model reads this result and generates a natural language response that incorporates the real data.
Here is a sequence diagram showing this full loop:
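In code terms, the loop corresponds to a growing message list. A minimal sketch of what crosses the wire in each round trip (message shapes follow the OpenAI API; IDs and values are illustrative):

```python
# Round trip 1: user message (+ tool definitions) go up; a tool call comes back.
# Round trip 2: the tool result goes up; the final text answer comes back.
messages = [
    {"role": "user", "content": "What's the weather in Tokyo right now?"},
    # ...the model responds with an assistant message containing tool_calls,
    # your code runs the function, then appends the result:
    {
        "role": "tool",
        "tool_call_id": "call_abc123",  # must match the ID from the model's tool call
        "content": '{"temp": 22, "condition": "Sunny"}',
    },
    # ...the model now has real data and responds with the final text answer.
]
```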
Notice the two round trips to the LLM API. The first sends the user's message and gets back a tool call. The second sends the tool's result and gets back the final response. This is different from a normal API call, which only requires one round trip. Every function call adds latency because of this extra round trip, something to keep in mind when designing your application.
The quality of your tool definitions directly determines how well the model uses them. A vague description leads to wrong tool calls. A precise description leads to accurate ones. The model reads your tool definitions the same way a developer reads API documentation, so write them with the same care.
Each tool definition has three parts:
- A name (e.g., get_current_weather, create_ticket)
- A description of what the function does and when the model should use it
- A parameters schema defining the arguments it accepts

Here is what a complete tool definition looks like for the OpenAI API:
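A sketch of a complete definition in the OpenAI Chat Completions tools format (the function name and fields are illustrative):

```python
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": (
            "Get the current weather conditions for a specific location. "
            "Use this when the user asks about current weather, temperature, "
            "or conditions in a city or region."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country, e.g. 'Tokyo, Japan'",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],  # restricts the model to valid values
                    "description": "Temperature unit to use in the response",
                },
            },
            "required": ["location"],  # unit is optional
        },
    },
}
```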
Let's break down why each part matters.
The name should be a verb-noun pair that clearly describes the action: get_current_weather, search_users, create_ticket. The model uses this name to understand what the function does at a glance.
The description is the most important field. This is where you tell the model when to use this function. Be specific. Instead of "Gets weather data," write "Get the current weather conditions for a specific location. Use this when the user asks about current weather, temperature, or conditions in a city or region." The more context you provide, the better the model's decisions.
The parameters use standard JSON Schema. Each property needs a type (string, number, boolean, array, object) and a description. The enum field restricts values to a specific set, which prevents the model from inventing invalid options. The required array lists which parameters must be provided.
Here is a more complete example with multiple related tools:
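A sketch of such a set, assuming three functions named get_current_weather, get_weather_forecast, and get_historical_weather (the third name, the parameter details, and the 14-day limit are illustrative assumptions):

```python
def make_tool(name, description, extra_properties=None, extra_required=None):
    """Helper to build an OpenAI-style tool definition with a location parameter."""
    properties = {
        "location": {"type": "string", "description": "City and country, e.g. 'Paris, France'"}
    }
    properties.update(extra_properties or {})
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": ["location"] + (extra_required or []),
            },
        },
    }

tools = [
    make_tool(
        "get_current_weather",
        "Get the current, real-time weather conditions for a location. Use ONLY for "
        "questions about the weather right now, not for future or past weather.",
    ),
    make_tool(
        "get_weather_forecast",
        "Get the predicted weather for a location up to 14 days ahead. Use when the "
        "user asks about future weather ('tomorrow', 'next week').",
        {"days_ahead": {"type": "integer", "description": "Days from today (1-14)"}},
        ["days_ahead"],
    ),
    make_tool(
        "get_historical_weather",  # name assumed for illustration
        "Get recorded weather for a location on a past date. Use when the user asks "
        "what the weather was like on a previous day.",
        {"date": {"type": "string", "description": "Past date in YYYY-MM-DD format"}},
        ["date"],
    ),
]
```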
Notice how the three tools cover different time horizons: current, future, and past. The descriptions explicitly state when to use each one. This is critical because a user asking "What will the weather be like in Paris next week?" needs the forecast tool, not the current weather tool. Without clear descriptions, the model might pick the wrong one.
Key Insight: The most common cause of wrong tool calls is not a model limitation. It is a bad tool description. If you find the model calling the wrong tool, improve the description before blaming the model. Add more context about when to use it, when not to use it, and what distinguishes it from similar tools.
Now let's put it all together in real code. The core pattern for function calling is a loop: send a message, check if the model wants to call a tool, execute the tool if so, send the result back, and repeat until the model gives a final text response.
Here is the complete implementation:
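A minimal sketch of that loop, written against the OpenAI Python SDK's Chat Completions interface. The model name, the mock weather function, and its return values are placeholders; in a real system get_current_weather would call an actual weather API:

```python
import json

def get_current_weather(location, unit="celsius"):
    """Stand-in for a real weather API call; values are illustrative."""
    return {"location": location, "temp": 22, "condition": "Sunny", "humidity": 45, "unit": unit}

# Dispatch table: tool name (as the model sees it) -> Python function to run.
AVAILABLE_FUNCTIONS = {"get_current_weather": get_current_weather}

def run_conversation(client, messages, tools, model="gpt-4o"):
    """Call the model, execute any requested tools, feed results back,
    and stop once the model answers with plain text."""
    while True:
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # final natural-language answer
        messages.append(message)  # the tool_calls message must stay in history
        for tool_call in message.tool_calls:
            func = AVAILABLE_FUNCTIONS[tool_call.function.name]
            args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
            result = func(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,  # ties this result to the matching call
                "content": json.dumps(result),
            })
```

You would pass in a client created with `OpenAI()` from the openai package, along with a starting message list and your tool definitions.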
Let's walk through what happens when this code runs.
1. The model returns a tool_calls response with get_current_weather(location="Tokyo, Japan").
2. Your code looks up get_current_weather in the AVAILABLE_FUNCTIONS dictionary and calls it with the model's arguments.
3. The function returns {"temp": 22, "condition": "Sunny", "humidity": 45, ...}.
4. Your code sends the result back to the model as a message with the role tool and the matching tool_call_id.

The while loop is important. In some cases, the model might need to call multiple tools in sequence, or call the same tool multiple times with different arguments. The loop continues until the model responds with text instead of tool calls.
Notice the tool_call_id field when sending results back. Each tool call has a unique ID, and the result must reference that ID so the model knows which call produced which result. If you are handling multiple parallel tool calls, getting the IDs wrong will confuse the model.
When you send a message with tool definitions, the model goes through a decision process that resembles how a developer reads API documentation. It looks at the user's intent, scans the available tool descriptions, and picks the best match. If no tool matches, it just responds with text.
This decision process is not a simple keyword match. The model understands semantics. If a user asks "Is it going to rain tomorrow in Berlin?", the model understands that "tomorrow" means future weather, which maps to get_weather_forecast, not get_current_weather. Even though the user never said the word "forecast."
You can influence the model's tool selection behavior using the tool_choice parameter:
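The accepted values, as defined by the OpenAI Chat Completions API (the function name in the last option is illustrative):

```python
# Let the model decide whether to call a tool (the default behavior).
choice_auto = "auto"

# Forbid tool calls; the model must answer with text.
choice_none = "none"

# Force the model to call at least one tool.
choice_required = "required"

# Force a call to one specific function.
choice_specific = {"type": "function", "function": {"name": "get_current_weather"}}
```

Any of these can be passed as the `tool_choice` argument alongside `tools` in the request.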
Here is how to use each option:

- "auto" (the default): the model decides for itself whether to answer with text or call a tool.
- "none": the model must respond with text and cannot call any tools.
- "required": the model must call at least one tool before responding.
- A specific function: the model must call exactly that function.

Most of the time, "auto" works well. But there are cases where you want more control:

- Use "none" when you want the model to summarize tool results without making additional calls.
- Use "required" when you know the user's request definitely needs a tool (e.g., the first message in a data lookup flow).

The model sometimes picks the wrong tool, especially in these situations:

- Two tools have overlapping purposes, such as current weather versus forecast.
- The user's phrasing is ambiguous about which action they actually want.
- A tool's description is too vague to distinguish it from similar tools.
The fix for all of these is the same: improve your tool descriptions. Add negative examples ("Do NOT use this for forecast questions"), add explicit triggers ("Use this ONLY when the user asks about current, real-time conditions"), and test with edge cases.
Let's bring everything together into a complete, interactive weather assistant. This version supports multi-turn conversations, handles multiple tool calls, and includes proper error handling.
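A sketch of such an assistant, following the loop pattern described in this section. The weather function is still a mock, the model name is a placeholder, and the error-handling strategy (returning an error payload to the model instead of raising) is one reasonable choice among several:

```python
import json

def get_current_weather(location, unit="celsius"):
    """Stand-in for a real weather API (illustrative values)."""
    return {"location": location, "temp": 22, "condition": "Sunny", "humidity": 45}

AVAILABLE_FUNCTIONS = {"get_current_weather": get_current_weather}

def execute_tool_call(tool_call):
    """Run one tool call, returning an error payload instead of raising,
    so the model can see what went wrong and recover gracefully."""
    func = AVAILABLE_FUNCTIONS.get(tool_call.function.name)
    if func is None:
        return {"error": f"Unknown function: {tool_call.function.name}"}
    try:
        args = json.loads(tool_call.function.arguments)
        return func(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        return {"error": str(exc)}

def chat_loop(client, tools, model="gpt-4o"):
    """Interactive multi-turn assistant: type 'quit' to exit."""
    messages = [{"role": "system", "content": "You are a helpful weather assistant."}]
    while True:
        user_input = input("You: ")
        if user_input.strip().lower() == "quit":
            break
        messages.append({"role": "user", "content": user_input})
        while True:  # inner loop: resolve all tool calls for this turn
            response = client.chat.completions.create(
                model=model, messages=messages, tools=tools
            )
            message = response.choices[0].message
            if not message.tool_calls:
                messages.append({"role": "assistant", "content": message.content})
                print("Assistant:", message.content)
                break
            messages.append(message)
            for tool_call in message.tool_calls:
                result = execute_tool_call(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result),
                })
```

Because the full history stays in `messages`, the model can answer follow-up questions that refer back to earlier turns.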
Try running this with a sequence of messages to see how multi-turn conversation works:
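For example, a sequence like this (the exact wording is just an illustration; the fourth message deliberately has nothing to do with weather):

```python
test_messages = [
    "What's the weather like in Tokyo right now?",
    "And what about the forecast for Paris next week?",
    "Which of those two cities is warmer today?",
    "By the way, why is the sky blue?",  # general knowledge: no tool needed
]
```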
The fourth message is interesting. The model recognizes this is a general knowledge question, not a weather question, so it skips the tools entirely and responds with text. This is the "auto" behavior of tool_choice at work.