LLM APIs and Model Selection
Every interaction with a modern LLM goes through the same API pattern: you send a list of messages, the model returns a response. The API is stateless: it has no memory of previous calls. Understanding this interface is the foundation for building anything with LLMs, from simple chatbots to complex agent systems.

The Messages Array
The core input to the API is an ordered list of messages, each with a role. Three roles exist: system (instructions that shape the model's behavior), user (human input), and assistant (previous model responses). The model receives the entire array as context and generates the next assistant message.
The order matters. The system message sets the frame, and the conversation history gives the model context for the current question. Placing the system message anywhere other than the beginning produces unpredictable behavior.
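A minimal sketch of such an array, in the common OpenAI-style chat shape (the specific contents are illustrative): the system message comes first, followed by alternating user and assistant turns, ending with the message the model should answer next.

```python
# System message first, then the conversation history in order.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What is a context window?"},
    {"role": "assistant", "content": "The maximum number of tokens the model can process at once."},
    {"role": "user", "content": "And why does it matter?"},  # the turn the model answers next
]
```

The entire array is sent on every call; the model generates the next assistant message from this context.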
Role Behavior
The system role carries special weight: it sets the model's persona, constraints, and output format. A system message saying "You are a JSON API. Respond only with valid JSON." changes the model's output format entirely. The user and assistant messages form the conversation history. By including previous assistant responses in the array, you teach the model the tone, format, and style it should continue using. If you want the model to always respond in bullet points, include an assistant message with bullet points. The model learns by example from its own history in the messages array.
Stateless by Design
Each API call is completely independent. The model does not remember your previous call. If a user asks a follow-up question, your application must include the full conversation history in the next request. This means a 20-turn conversation sends all 20 turns every time, and you pay for re-processing those input tokens on every call.
The API has no memory. Every call starts from zero. Your application is responsible for assembling the conversation history and sending it each time. This is why context window size matters so much: it limits how much history you can include.
This statelessness is a feature, not a limitation. It means the API is horizontally scalable: any server can handle any request. It means you control exactly what the model sees. And it means there is no hidden state that could cause surprising behavior.
The practical consequence is that your application's memory management code is more important than the model itself. A conversation that runs for 100 turns with a naive implementation resends every message on every call, so cumulative input costs grow quadratically with conversation length. A conversation with smart truncation (keeping the system prompt, a summary of earlier turns, and the last 10 messages) keeps costs flat. How you manage the messages array determines whether your application costs $100 or $10,000 per month at scale.
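The truncation strategy described above can be sketched as a small helper. This is a minimal version that keeps the system prompt and the most recent turns; a real implementation would also summarize the dropped middle turns rather than discarding them.

```python
def truncate_history(messages, keep_last=10):
    """Keep the system prompt plus the most recent turns.

    Minimal sketch: production code would summarize the dropped
    middle of the conversation instead of discarding it.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

On a 100-turn conversation this sends 11 messages instead of 101, and the per-call input cost stays flat no matter how long the conversation runs.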
Tokens and Costs
LLMs do not process text character by character. They use tokens, which are roughly 3-4 characters or about 0.75 words. Every API call has two token counts: input tokens (the messages you send) and output tokens (what the model generates). You pay for both, and output tokens are typically 3-5x more expensive than input tokens.
Token counts determine both cost and whether your messages fit within the model's context window. A model with a 128K context window can process roughly 96,000 words of combined input and output. Exceeding the context window causes the API to return an error: the request is rejected entirely, not silently truncated. Production code must track token counts and truncate or summarize conversation history before it exceeds the limit.
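A rough budget check along these lines can be written with the characters-per-token heuristic from above. This is a sketch: real code should use the provider's actual tokenizer (such as the tiktoken library for OpenAI models) rather than a character count.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use the provider's
    # real tokenizer in production; this only approximates.
    return max(1, len(text) // 4)

def fits_context(messages, limit=128_000, reserve_for_output=4_000):
    # The context window covers input AND output, so reserve
    # headroom for the model's reply before checking the limit.
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + reserve_for_output <= limit
```

Running this check before every call lets you truncate or summarize proactively instead of handling a rejected request after the fact.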
Temperature and Sampling
The temperature parameter controls randomness in the model's output. At temperature: 0, the model always picks the most likely next token, producing deterministic, consistent output. At temperature: 1, the model samples more broadly, producing creative and varied responses. For factual tasks (data extraction, classification, code generation), use low temperature. For creative tasks (writing, brainstorming), use higher temperature. Most production applications use 0 or 0.1 to ensure consistent, reproducible outputs.
Response Structure
The API returns a structured response containing the generated message, a finish reason, and usage metadata.
The finish_reason tells you why the model stopped generating. stop means it finished naturally. length means it hit the max token limit: the response was truncated and you are missing content. tool_calls means the model wants to call a function instead of responding with text.
Max Tokens
You can set max_tokens to cap the output length. If the model hits this limit, the response cuts off mid-sentence and finish_reason returns length. This is useful for cost control but dangerous if set too low: you get incomplete answers with no warning in the content itself. Always check finish_reason in production code.
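The check might look like this. The response shape shown (choices, finish_reason, message) is the common OpenAI-style layout and is an assumption here; adjust the field access for your provider.

```python
def check_finish(response: dict) -> str:
    """Return the content of an OpenAI-style response dict,
    failing loudly if the output was truncated."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        # The content looks normal but is incomplete: surface it.
        raise RuntimeError("Response truncated: raise max_tokens or shorten the prompt")
    return choice["message"]["content"]
```

Raising on truncation (or retrying with a higher limit) is almost always better than silently passing a half-answer downstream.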
Putting It Together
A production API call combines all of these elements: a carefully constructed messages array with system instructions and conversation history, a model selection, temperature settings for the task type, and a max_tokens limit based on expected output length. The response gives you the generated text, the reason the model stopped, and a token usage breakdown for cost tracking. Every field in the request and response exists for a reason. Understanding each one gives you precise control over the model's behavior and your costs.
Function calling is the mechanism that transforms an LLM from a text generator into a system that can interact with the world. Instead of generating text, the model generates a structured function call (a name and arguments) that your code executes. This is the foundation of tool use in AI agents.

How Function Calling Works
You define available functions in the API request alongside your messages. Each function definition includes a name, a description, and a parameter schema. The model reads the user's message, considers the available functions, and decides whether to respond with text or generate a structured function call. If it chooses a function, it returns the function name and arguments as JSON, but it does not execute anything itself.
Function calling is what makes agents possible. Without it, an LLM can only generate text. With it, an LLM can search databases, call APIs, execute code, and interact with any system your code can reach. The model decides WHAT to call and with WHAT arguments. Your code handles the actual execution.
Writing Effective Tool Definitions
The description field is the most important part of a tool definition. The model uses it to decide when to call the function and what arguments to pass. A vague description like "gets data" leads to incorrect tool selection. A specific description like "Retrieves the current stock price for a given ticker symbol from the NYSE" gives the model enough information to make good decisions.
Parameter descriptions matter too. If a parameter accepts a date, specify the expected format: "Date in YYYY-MM-DD format". If it accepts an enum, list the valid values. The more precise your schema, the more accurate the model's generated arguments.
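Putting those guidelines together, a tool definition in the common JSON-schema style might look like the following. The tool itself (get_stock_price) is the hypothetical example from above, not a real API.

```python
# Hypothetical tool definition illustrating specific descriptions,
# an explicit date format, and an enum of valid values.
get_stock_price = {
    "name": "get_stock_price",
    "description": "Retrieves the current stock price for a given ticker symbol from the NYSE.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Ticker symbol, e.g. 'IBM'"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
            "session": {
                "type": "string",
                "enum": ["regular", "pre-market", "after-hours"],
                "description": "Trading session to quote",
            },
        },
        "required": ["ticker"],
    },
}
```

Every field here is something the model reads when deciding whether and how to call the tool; none of it is decoration.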
Structured Outputs and JSON Mode
Beyond function calling, most APIs now support structured output modes that force the model to return valid JSON matching a provided schema. This eliminates the need to parse free-form text and handle malformed responses. When combined with function calling, structured outputs guarantee that the arguments object is valid JSON conforming to your parameter schema, no try/catch parsing logic needed. This is particularly valuable for pipelines where the model's output feeds directly into downstream code.
The Tool Use Protocol
Function calling is a multi-step process, not a single API call:
- Your code sends messages and tool definitions to the API
- The model returns a tool_calls response with the function name and arguments
- Your code executes the function with the provided arguments
- Your code sends the result back as a message with role tool
- The model generates a final response using the function result
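The protocol can be sketched as a loop. Here call_model and execute_tool are stand-ins for the actual API call and your own tool implementations, and the response shape (a tool_calls list with id, name, and JSON-encoded arguments) is an assumption modeled on common chat APIs.

```python
import json

def run_tool_loop(messages, tools, call_model, execute_tool, max_steps=5):
    """Sketch of the tool use protocol. `call_model` and `execute_tool`
    are placeholders for the API call and your tool dispatch code."""
    for _ in range(max_steps):
        response = call_model(messages, tools)
        if response.get("tool_calls"):
            messages.append(response)  # record the assistant's tool request
            for call in response["tool_calls"]:
                result = execute_tool(call["name"], json.loads(call["arguments"]))
                messages.append({"role": "tool", "tool_call_id": call["id"],
                                 "content": json.dumps(result)})
            continue  # let the model see the results on the next call
        return response["content"]  # plain text response: we are done
    raise RuntimeError("Tool loop did not terminate")
```

The max_steps cap matters: without it, a confused model can loop on tool calls indefinitely, burning tokens on every iteration.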
Parallel Tool Calling
Models can request multiple function calls in a single response when operations are independent. If a user asks "What is the weather in Tokyo and New York?", the model returns two tool calls in one response. Your code should execute both in parallel and send both results back before the model generates its final answer. This reduces round trips and latency.
Your code executes both calls concurrently and sends both results back in the same messages array. The model then synthesizes both results into a single response: "It is 22 degrees and sunny in Tokyo, and 15 degrees and cloudy in New York."
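One way to run independent calls concurrently is a thread pool, since tool execution is typically I/O-bound (network calls to other services). Again, execute_tool is a stand-in for your own dispatch function.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_calls, execute_tool):
    """Run independent tool calls concurrently.

    Results come back in the same order as the calls, which keeps
    matching results to tool_call ids straightforward.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(execute_tool, c["name"], c["arguments"])
                   for c in tool_calls]
        return [f.result() for f in futures]
```

With two weather lookups of 300ms each, sequential execution costs 600ms; parallel execution costs roughly 300ms.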
Error Handling in Tool Calls
When a function call fails (the API is down, the arguments are invalid, the operation times out), send the error back to the model as the tool result. Do not crash or silently drop the call. The model can adapt: it might try different arguments, select a different tool, or explain the failure to the user. The model handles errors surprisingly well when given clear error messages.
The key principle is to treat errors as data, not exceptions. If a database query returns zero results, that is information the model can use ("No orders found for that email. Would you like to try a different email address?"). If an API returns a 404, the model can explain that the resource does not exist. By feeding errors back into the conversation, you let the model act as a graceful error handler rather than crashing the entire interaction.
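A minimal wrapper implementing the errors-as-data principle: catch the failure and return it as an ordinary tool result rather than letting the exception escape. The tool_call message shape is the same assumed OpenAI-style format as above.

```python
import json

def safe_execute(call, execute_tool):
    """Return either the tool result or a structured error message
    the model can read and react to."""
    try:
        content = json.dumps(execute_tool(call["name"], call["arguments"]))
    except Exception as exc:  # API down, bad arguments, timeout, ...
        content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return {"role": "tool", "tool_call_id": call["id"], "content": content}
```

The error string reaches the model as plain content, so it can retry with different arguments or explain the failure to the user instead of the whole interaction crashing.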
Controlling Tool Use Behavior
Most APIs let you control how aggressively the model uses tools. Setting tool_choice: "auto" (the default) lets the model decide whether to call a tool or respond with text. Setting tool_choice: "required" forces the model to call at least one tool, useful for pipelines where you always need structured output. Setting tool_choice: "none" disables tool use for that request, even if tools are defined. You can also specify a particular tool by name, forcing the model to call that specific function.
This control is important for multi-step agent loops. On the first turn, you might use auto to let the model decide. After receiving a confusing result, you might force a specific clarification tool. After the final tool call, you might set none to force a text summary. Fine-grained tool choice control gives your orchestration code precise authority over the agent's behavior at each step.
When a model generates a 500-token response, waiting for all 500 tokens before showing anything creates a poor user experience. Streaming delivers tokens as they are generated, so the user sees the response building in real-time. The total generation time is the same, but the perceived latency drops from seconds to milliseconds.

Server-Sent Events
The standard streaming protocol for LLM APIs is Server-Sent Events (SSE). The client opens an HTTP connection, and the server pushes events down that connection as tokens are generated. Each event contains a small chunk of the response, typically one or a few tokens. The client processes each chunk as it arrives rather than waiting for the complete response.
Unlike WebSockets, SSE is a one-directional protocol: the server pushes data to the client. The client cannot send new data over the same connection. This is fine for LLM APIs because the interaction pattern is strictly request-then-stream-response. SSE also handles reconnection gracefully: if the connection drops, the client can reconnect and resume.
One practical consideration: proxies, load balancers, and CDNs sometimes buffer SSE connections or impose timeout limits. If your infrastructure sits between the client and the LLM API, verify that streaming works end-to-end. A reverse proxy that buffers the entire response before forwarding it to the client defeats the purpose of streaming entirely: the client sees nothing until the full response is buffered, which is worse than a non-streaming call because it adds proxy latency on top.
Time to First Token
The most important latency metric for user experience is Time to First Token (TTFT): the delay between sending your request and receiving the first token. This is what determines how responsive your application feels. TTFT has two components: network latency (fixed, usually small) and input processing time (proportional to prompt length). A 500-token prompt might have a TTFT of 200ms. A 50,000-token prompt might take 2 seconds before the first token appears.
TTFT is the metric users feel most directly. A 2-second TTFT followed by fast streaming feels sluggish at the start. A 200ms TTFT followed by the same streaming feels instant. Users perceive responsiveness based on when the first content appears, not when the last content arrives.
For context, human reading speed is roughly 250 words per minute, or about 4 words per second. Most models generate at 30-100 tokens per second (roughly 22-75 words per second). This means the model generates text faster than humans can read it. Once streaming starts, the user never catches up. The experience feels instantaneous. The only bottleneck the user perceives is the TTFT delay before that first token appears.
Implementing Streaming
The response arrives as a series of delta objects. Each delta contains the next token or tokens. Your code concatenates them to build the complete response. The key difference from a non-streaming call is that you process a stream of partial results instead of a single complete result.
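In code, the consuming side is a loop over chunks. The chunk shape below (a dict with a "delta" string) is a simplified assumption; real SDKs wrap the deltas in richer objects, but the accumulation pattern is the same.

```python
def collect_stream(chunks):
    """Concatenate streamed delta chunks into the full response text."""
    parts = []
    for chunk in chunks:
        delta = chunk.get("delta", "")
        parts.append(delta)
        # In a UI you would render `delta` immediately here,
        # instead of waiting for the loop to finish.
    return "".join(parts)
```

The only structural difference from a non-streaming call is that the display step moves inside the loop.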
Streaming with Tool Calls
When the model decides to call a tool during streaming, the function name and arguments arrive as deltas just like text tokens. Your code must buffer the incoming chunks until the complete tool call is assembled: you cannot execute a function with half an argument string. Watch for finish_reason: tool_calls to know when the tool call is complete and ready for execution.
This creates an important design consideration: streaming text to the user is straightforward (display each token), but streaming tool calls requires buffering logic. Many implementations use a state machine that switches between "streaming text to UI" and "buffering tool call" modes based on the delta content type.
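A minimal version of that state machine, with an invented chunk format (text_delta, tool_name, arg_delta, finish_reason keys) standing in for whatever your SDK actually emits: text deltas pass straight through, argument deltas are buffered and only parsed once the stream signals the call is complete.

```python
import json

def process_stream(chunks):
    """Stream text deltas through; buffer tool-call argument deltas
    until finish_reason marks the call complete."""
    text_parts, tool_name, arg_buffer = [], None, []
    for chunk in chunks:
        if "tool_name" in chunk:       # start of a tool call
            tool_name = chunk["tool_name"]
        if "arg_delta" in chunk:       # partial JSON: buffer, never parse early
            arg_buffer.append(chunk["arg_delta"])
        if "text_delta" in chunk:      # ordinary text: safe to display now
            text_parts.append(chunk["text_delta"])
        if chunk.get("finish_reason") == "tool_calls":
            return {"tool": tool_name, "args": json.loads("".join(arg_buffer))}
    return {"text": "".join(text_parts)}
```

The key invariant is that json.loads only ever runs on the fully assembled argument string, never on an individual delta.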
Latency Breakdown
Total response time breaks down into two phases. First, input processing: the model reads all input tokens. This step is parallelized on GPUs, so doubling input length does not double processing time, but it does increase TTFT. Second, generation: the model produces output tokens one at a time (auto-regressively). This step is sequential. 500 output tokens take roughly 500 steps regardless of input length. The generation speed is measured in tokens per second, typically 30-100 for frontier models.
This breakdown reveals an important optimization insight: for short responses, TTFT dominates total latency. For long responses, generation speed dominates. Optimize accordingly: reduce input length (shorter prompts, prompt caching) for short-output tasks, and choose faster models for long-output tasks.
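The two-phase breakdown gives a simple back-of-envelope latency model, useful for deciding which phase to optimize:

```python
def estimate_latency(ttft_s, output_tokens, tokens_per_second):
    """Total latency = time to first token + sequential generation time."""
    return ttft_s + output_tokens / tokens_per_second
```

For a 500-token answer at 50 tokens/second with a 200ms TTFT, generation dominates (10.2s total); for a 20-token classification label with the same numbers, TTFT is a third of the total (0.6s), so shortening the prompt pays off more than a faster model.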
When Not to Stream
Streaming is not always the right choice. For background processing tasks (batch summarization, data extraction pipelines, automated classification), no human is watching the output. Streaming adds complexity to your code (buffering logic, chunk concatenation, error handling for dropped connections) without any benefit when there is no user interface. For agent pipelines where the model's output feeds into the next processing step, you need the complete response before continuing anyway. Use streaming for user-facing interactions and non-streaming for backend pipelines.
Prompt Caching
Providers like OpenAI and Anthropic offer prompt caching: when multiple requests share the same prefix (e.g., a long system prompt), the provider caches the processed prefix. Subsequent requests skip re-processing the cached portion, reducing both TTFT and cost. This is particularly valuable for applications with large, stable system prompts that are sent with every request. Cached input tokens are typically billed at 50-90% discount.
To maximize cache hit rates, structure your messages so the stable content (system prompt, few-shot examples, reference documents) comes first, and the variable content (user message) comes last. The cache matches on prefix: if the first 10,000 tokens are identical across requests, those 10,000 tokens are cached regardless of what follows. Rearranging your system prompt between requests invalidates the cache entirely.
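A cache-friendly assembly function makes the ordering rule hard to violate by accident. The function and its argument names are illustrative, not a provider API:

```python
def build_messages(system_prompt, examples, documents, user_message):
    """Stable content first (the cacheable prefix), variable content last.

    Reordering any stable part between requests breaks the prefix
    match and invalidates the cache.
    """
    return (
        [{"role": "system", "content": system_prompt}]
        + examples                                                  # few-shot examples, fixed
        + [{"role": "user", "content": doc} for doc in documents]   # reference docs, fixed
        + [{"role": "user", "content": user_message}]               # only this varies
    )
```

Centralizing assembly in one function also makes it easy to audit: if cache hit rates drop, there is exactly one place where the prefix could have changed.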
Choosing the right model is a three-way trade-off between quality, cost, and speed. The best model for a given task is rarely the most capable one. It is the cheapest model that meets the quality threshold. Since new models are released every few weeks, model selection is an ongoing decision, not a one-time choice.

Model Tiers
Models fall into roughly three tiers. Frontier models (Claude Opus, Gemini Ultra) deliver the highest quality reasoning but are the most expensive and slowest. Mid-tier models (Claude Sonnet, GPT-4o) offer strong quality at moderate cost and are the workhorses for most production applications. Small models (Claude Haiku, GPT-4o-mini) are the fastest and cheapest, suitable for simple tasks like classification, extraction, and routing. The tier boundaries shift constantly as new releases push mid-tier quality into the small model price range.
Quality vs Cost
A frontier model might cost $15 per million input tokens. A small model costs $0.25 per million. If the small model handles 80% of your tasks correctly, using the frontier model for everything wastes money on the easy 80%. The question is never "which model is best"; it is "which model is good enough for this specific task at the lowest cost?"
To put this in concrete numbers: if your application handles 1 million requests per month with an average of 1,000 input tokens and 500 output tokens per request, a frontier model at $15/$60 per million tokens costs roughly $45,000 per month. A small model at $0.25/$1.25 per million tokens costs roughly $875 per month. That is a 50x difference. If the small model produces acceptable quality for your use case, you save $44,000 per month by switching.
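The arithmetic above generalizes to a one-line cost model you can keep next to your routing code (prices in USD per million tokens):

```python
def monthly_cost(requests, in_tokens, out_tokens, in_price, out_price):
    """Monthly spend given per-request token counts and per-million-token prices."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# The scenario from the text: 1M requests/month, 1,000 in / 500 out tokens each.
frontier = monthly_cost(1_000_000, 1000, 500, 15.00, 60.00)  # -> 45000.0
small = monthly_cost(1_000_000, 1000, 500, 0.25, 1.25)       # -> 875.0
```

Plugging in your own traffic numbers before choosing a model turns "the frontier model feels better" into a concrete dollar figure you can weigh against measured quality.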
Model Routing
The most cost-effective architecture uses different models for different tasks. A small model classifies incoming requests, routes simple ones to itself, and forwards complex ones to a frontier model. This pattern is called model routing or a model cascade. A well-tuned routing system can reduce costs by 70-80% compared to using a frontier model for all requests, with minimal quality degradation on the tasks that matter.
For example, a customer support agent might route "What are your business hours?" to a small model (instant, cheap) and "Help me debug why my integration is returning 403 errors with this specific OAuth configuration" to a frontier model (slower, expensive, but needs the reasoning capability).
The routing decision itself can be made by a small model. A classifier that reads the user message and outputs one of three labels ("simple", "moderate", "complex") costs fractions of a cent per request and takes under 100ms. The routing overhead is negligible compared to the savings on the downstream model choice. Some teams use rule-based routing (keyword matching, message length) for speed, then fall back to model-based routing for ambiguous cases.
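A hybrid router along those lines might look like this. The thresholds, model names, and the optional classify callback (a stand-in for the small-model classifier) are all illustrative assumptions:

```python
MODEL_BY_TIER = {  # hypothetical model identifiers
    "simple": "small-model",
    "complex": "frontier-model",
}

def route(message, classify=None):
    """Rule-based routing first; fall back to a model-based
    classifier (the `classify` callback) for ambiguous cases."""
    if len(message) < 80 and "?" in message:  # short question: cheap model
        return MODEL_BY_TIER["simple"]
    if classify is not None:  # stand-in for a small-model classifier call
        return MODEL_BY_TIER.get(classify(message), MODEL_BY_TIER["complex"])
    return MODEL_BY_TIER["complex"]  # default to quality when unsure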
Benchmarks vs Real-World Performance
Benchmark scores (MMLU, HumanEval, GPQA) give directional guidance but do not predict performance on your specific task. A model scoring 5% higher on a coding benchmark might perform worse on your particular codebase. Always evaluate models on your actual use cases with your actual prompts before making a selection decision.
The only evaluation that matters is testing each model on a representative sample of your real inputs with your real prompts. Build an evaluation dataset of 50-100 examples with expected outputs, run each candidate model against it, and measure accuracy. This takes a few hours and a few dollars, far less than the cost of choosing the wrong model and running it in production for months.
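The core of such an evaluation harness is a few lines. Here model_fn stands in for your actual API call with your real prompt, and the dataset is a list of (input, expected-output) pairs:

```python
def evaluate(model_fn, dataset):
    """Run a candidate model over (input, expected) pairs; return accuracy.

    `model_fn` is a placeholder for your API call with your real prompt.
    Exact-match comparison suits classification; summarization or code
    generation need task-specific checks instead.
    """
    correct = sum(1 for inp, expected in dataset if model_fn(inp) == expected)
    return correct / len(dataset)
```

Run this once per candidate model and compare accuracy alongside the per-request cost; the cheapest model above your quality threshold wins.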
Your evaluation should measure what your application actually cares about. For a summarization tool, measure factual accuracy and completeness. For a classification pipeline, measure precision and recall. For a code generator, measure whether the output compiles and passes tests. Generic "quality" is not a metric. Define specific, measurable criteria tied to your application's success.
Multimodal Capabilities
Not all models support all input types. Vision (image understanding), audio transcription, PDF processing, and video analysis are available on some models but not others. If your application needs to process images, this requirement immediately narrows your model choices. Check the provider's documentation for supported modalities before committing to a model.
Multimodal capabilities also affect cost. Sending an image to a vision model consumes tokens based on the image resolution. A high-resolution image might cost 1,000+ tokens. Audio inputs are similarly tokenized by duration. Factor these costs into your model selection when your application processes non-text inputs at scale.
Additionally, multimodal quality varies significantly between models. One model might excel at reading text from images (OCR-like tasks) but struggle with diagram interpretation. Another might handle charts well but miss fine details in photographs. If multimodal processing is core to your application, test each candidate model on your specific image types: product photos, screenshots, handwritten notes, or whatever your users submit.
Provider Considerations
The model itself is only part of the decision. Rate limits determine your maximum throughput: if you need 1,000 requests per minute and the provider caps you at 500, you need either a higher tier or a second provider. Uptime and reliability affect your SLA. A provider with 99.5% uptime means roughly 3.6 hours of downtime per month. Geographic availability matters for data residency requirements. Some regulations require that data never leaves a specific region. Data privacy policies govern whether your inputs are used for training. Some providers use your data to improve their models unless you opt out. The cheapest model from an unreliable provider costs more in the long run than a slightly more expensive model from a provider with 99.9% uptime.
Future-Proofing Your Architecture
Build abstractions that let you swap models without changing application code. Use a model configuration layer that maps task types to model identifiers. When a better model launches (and it will, every few weeks), you change a configuration value, not a codebase. Hard-coding a specific model name throughout your application creates technical debt that compounds with every model release.
A practical approach is a configuration file or environment variable that maps task names to model identifiers:
task: summarization     -> model: gpt-4o-mini
task: complex_reasoning -> model: claude-opus
task: classification    -> model: gpt-4o-mini
task: code_generation   -> model: claude-sonnet
When a new model launches, you update one line per task, run your evaluation suite, and deploy. This takes minutes instead of days. The abstraction also makes A/B testing trivial: route 10% of traffic to the new model, compare metrics, and promote or rollback based on data.
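In code, the mapping above could be a dict with an environment-variable override, which is one simple way to get the deploy-free swaps and A/B tests described (the override naming scheme is an assumption):

```python
import os

# Task-to-model mapping; values mirror the example configuration above.
MODEL_CONFIG = {
    "summarization": "gpt-4o-mini",
    "complex_reasoning": "claude-opus",
    "classification": "gpt-4o-mini",
    "code_generation": "claude-sonnet",
}

def model_for(task: str) -> str:
    """Look up the model for a task, with an env-var override
    (e.g. MODEL_SUMMARIZATION) for testing new models without a deploy."""
    return os.environ.get(f"MODEL_{task.upper()}", MODEL_CONFIG[task])
```

Application code calls model_for("summarization") and never names a model directly, so a new release touches this file and nothing else.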