How Large Language Models Work

Topics Covered

Transformer Architecture

Tokens, Not Words

Self-Attention: The Core Mechanism

Multi-Head Attention

Feed-Forward Layers

Layer Stacking

Parameters: Learned Weights

Pretraining and Emergent Abilities

The Pretraining Objective

Training Data

Scaling Laws

Emergent Abilities

Fine-Tuning and RLHF

Knowledge Cutoff

Inference and Token Generation

Autoregressive Generation

Probability Distribution and Sampling

A Simple Generation Example

Context Window: The Model's Working Memory

Speed Characteristics

Capabilities and Limitations

What LLMs Can Do

Hallucination: The Fundamental Failure Mode

No Real-Time Knowledge

No Persistent Memory

Cannot Take Actions

Why This Matters for System Design

Every modern LLM (GPT, Claude, Gemini, Llama) is built on the same architecture: the transformer. Understanding what a transformer does (not how to implement one) is essential for knowing why LLMs behave the way they do. The transformer is not the only neural network architecture that has been tried for language, but it is the one that won, and the reason it won is a single mechanism called self-attention.

Tokens, Not Words

Before a transformer processes text, it breaks the input into tokens. Tokens are subword units, not whole words. The word "unhappiness" might become three tokens such as "un", "happi", and "ness". Common words like "the" are single tokens. Rare words get split into smaller pieces. A typical LLM vocabulary contains roughly 100,000 tokens. Every piece of text the model reads or writes is a sequence of these tokens.

This tokenization step explains several quirks of LLM behavior. The model does not see characters. It sees tokens. This is why LLMs struggle with tasks like counting letters in a word or reversing a string: the individual characters are not the units the model operates on. It also explains why token limits matter more than word counts when working with LLM APIs.
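The subword splitting can be sketched with a toy greedy longest-match tokenizer. Real tokenizers use byte-pair encoding learned from data; the tiny vocabulary here is invented purely for illustration.

```python
# Toy subword tokenizer: greedily match the longest known piece.
# The vocabulary is invented for illustration; real models learn ~100K pieces via BPE.
VOCAB = {"the", "un", "happi", "ness", "cat", "sat", " "}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone (real tokenizers use byte fallback).
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("unhappiness"))   # ['un', 'happi', 'ness']
print(tokenize("the cat sat"))   # ['the', ' ', 'cat', ' ', 'sat']
```

Note that the model only ever sees the pieces, never the characters inside them, which is exactly why character-level tasks are hard.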

Self-Attention: The Core Mechanism

Self-attention is the reason transformers outperform every previous architecture. The idea is simple: when processing a token, the model looks at every other token in the sequence to decide what that token means in context.

Consider the sentence: "The cat sat on the mat because it was tired." When the model processes the token "it," attention lets it look back at every other token and assign high weight to "cat", because "cat" is what "it" refers to. Without attention, the model would process "it" in isolation and have no way to resolve the reference.

[Figure: Attention Mechanism Connecting Input Tokens]
Key Insight

Attention is the key innovation that separates transformers from all prior language models. Previous architectures (RNNs, LSTMs) processed tokens sequentially, so distant context faded over long sequences. Attention lets every token directly connect to every other token regardless of distance, which is why transformers handle long documents where earlier models failed.
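The weighted-lookup idea can be sketched in plain Python. This is a toy single-head version with random vectors standing in for token representations; real models use learned query, key, and value projections and run this across thousands of dimensions.

```python
import math
import random

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    # X: one vector per token. For simplicity this sketch uses the raw vectors
    # as queries, keys, and values (real models learn separate Wq, Wk, Wv projections).
    d = len(X[0])
    out, all_weights = [], []
    for q in X:  # each token queries every other token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)  # attention distribution: how much each token matters here
        all_weights.append(w)
        # New representation: weighted mix of all token vectors.
        out.append([sum(wj * vj[i] for wj, vj in zip(w, X)) for i in range(d)])
    return out, all_weights

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # 5 tokens, dim 8
out, weights = self_attention(X)
print(len(weights), len(weights[0]))  # 5 5: each token attends over all 5 tokens
```

The key property is visible in the weights: every token gets a full probability distribution over every other token, regardless of distance.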

Multi-Head Attention

A single attention mechanism learns one pattern of relationships. But language has many simultaneous structures: syntax, semantics, coreference, temporal ordering. Multi-head attention runs multiple attention patterns in parallel. One head might learn that adjectives attend to the nouns they modify. Another might learn that pronouns attend to their antecedents. A third might track subject-verb agreement. Modern LLMs use 32 to 128 attention heads per layer.

Feed-Forward Layers

After attention determines how tokens relate to each other, a feed-forward network processes each token's updated representation. Think of attention as gathering information from context and the feed-forward layer as processing that information. Research suggests that feed-forward layers act as key-value memory stores: they encode factual associations like "Eiffel Tower is in Paris" as patterns in their weights. This is why larger feed-forward layers tend to store more factual knowledge.

Layer Stacking

A transformer is not one layer but many layers stacked. GPT-4-class models stack 120 or more transformer layers. Each layer takes the output of the previous layer as input and refines the representation further.

Early layers capture surface-level patterns: syntax, punctuation, local word relationships. Middle layers build semantic representations: what entities are being discussed, how they relate to each other, what role each phrase plays. Later layers capture abstract patterns: reasoning structure, task format, tone, and intent. This hierarchical refinement is why depth matters more than width: a deeper model builds richer abstractions, while a wider model at the same depth simply processes more features in parallel without the additional abstraction layers.

Parameters: Learned Weights

Every connection in the transformer has a weight: a number that was tuned during training. These weights are the model's parameters. GPT-4 is widely estimated to have roughly 1.8 trillion parameters, and Claude and Gemini are believed to be in the same order of magnitude. The model learns no explicit rules: grammar, facts, and reasoning patterns all emerge from statistical patterns encoded in these parameters during training.

An LLM starts as random noise: billions of randomly initialized numbers. Pretraining transforms those numbers into a model that can write code, translate languages, and reason about complex problems. The training process is remarkably simple in concept: predict the next token. The results are extraordinary.

[Figure: Emergent abilities appearing at scale during pretraining]

The Pretraining Objective

The core training task is next-token prediction. Given a sequence of tokens, predict what comes next. Given "The capital of France is," the model should predict "Paris." Given a half-finished Python function, the model should predict the next line of code. This single objective, applied to trillions of tokens scraped from the internet, produces general capabilities that no one explicitly programmed.

Why does next-token prediction produce general intelligence-like behavior? Because predicting the next token in arbitrary text requires understanding grammar, facts, reasoning, context, style, and intent. A model that can predict the next token in a medical textbook has learned medical knowledge. A model that can predict the next token in a coding tutorial has learned programming. The objective is simple, but the knowledge required to do it well is vast.
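Concretely, the training signal is cross-entropy loss on the next token. The sketch below uses a hand-written probability table standing in for the model's output; a real model computes these probabilities from its parameters.

```python
import math

# Toy illustration of the pretraining objective on a single prediction step.
# The probability table is invented; a real model produces it from its weights.
probs = {"Paris": 0.85, "Lyon": 0.03, "Berlin": 0.01}

def next_token_loss(probs: dict[str, float], target: str) -> float:
    # Cross-entropy: negative log of the probability assigned to the true next token.
    return -math.log(probs[target])

print(f"{next_token_loss(probs, 'Paris'):.3f}")  # low loss: confident and correct
print(f"{next_token_loss(probs, 'Lyon'):.3f}")   # high loss: training pushes "Lyon" up
```

Pretraining is nothing more than this, repeated over trillions of tokens: nudge the parameters so the probability of each actual next token goes up.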

Training Data

The breadth of training data determines the breadth of capabilities. Modern LLMs train on books, websites, academic papers, code repositories, conversation transcripts, and more: essentially the entire accessible internet. The dataset for a frontier model typically contains 10 to 15 trillion tokens. The model sees each example as a next-token prediction problem, adjusting its parameters slightly after each batch to improve its predictions.

Data quality matters as much as quantity. Models trained on curated, high-quality text outperform models trained on larger but noisier datasets. This is why training data curation (filtering out low-quality web pages, deduplicating content, balancing domains) is a critical engineering effort at every frontier lab.

Scaling Laws

Larger models trained on more data perform better, and this relationship is predictable. Researchers discovered that model quality follows a power law: loss falls smoothly as parameters, data, and compute increase, at a rate you can extrapolate from smaller training runs. This means you can estimate how well a model will perform before spending months training it. Scaling laws drove the industry shift toward building ever-larger models: the returns on scale were reliable and substantial. The practical implication is that training a frontier model is a predictable engineering problem, not a research gamble: if you invest enough compute and data, the model will reach a known quality level.
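The power-law shape can be sketched in a few lines. The constants below are loosely inspired by published scaling-law fits but should be treated as purely illustrative, not as measurements of any real model family.

```python
# Illustrative scaling-law curve: loss as a power law in parameter count.
# Constants are placeholders in the spirit of published fits, not real measurements.
def predicted_loss(params: float, alpha: float = 0.076, scale: float = 8.8e13) -> float:
    return (scale / params) ** alpha

# Each 10x increase in parameters buys a predictable reduction in loss.
for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The point is not the specific numbers but the shape: smooth, monotone improvement that lets labs budget compute against a target quality level in advance.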

Emergent Abilities

This is where pretraining gets surprising. Some capabilities do not improve gradually with scale. They appear suddenly. Small models cannot do chain-of-thought reasoning no matter how you prompt them. Past a certain scale threshold, chain-of-thought works. Small models fail at multi-step arithmetic. Large models succeed. Small models cannot translate between language pairs they only saw separately during training. Large models can.

Interview Tip

Emergent abilities are why scaling laws matter so much. If every capability improved gradually, you could stop scaling once returns diminished. But emergent abilities mean that the next doubling of scale might unlock entirely new capabilities that the current model cannot do at all. This unpredictability is both the promise and the challenge of frontier AI research.

These abilities were never explicitly trained. The model was only trained to predict the next token. Somehow, at sufficient scale, the internal representations become rich enough to support reasoning, translation, and code generation as emergent byproducts. This is one of the most debated phenomena in AI research. Some researchers argue that emergent abilities are genuine phase transitions, while others suggest they may be artifacts of how we measure performance. Regardless of the theoretical debate, the practical observation holds: larger models can do things smaller models cannot.

Fine-Tuning and RLHF

A pretrained model is a powerful next-token predictor, but it is not a useful assistant. It will complete any text, including toxic, harmful, or unhelpful text, because that is what it was trained to do. Fine-tuning and Reinforcement Learning from Human Feedback (RLHF) transform the base model into an assistant.

Fine-tuning trains the model on curated examples of helpful, accurate responses. The training data shows the model what a good assistant response looks like: it follows instructions, answers questions directly, and avoids harmful content. RLHF goes further: human raters compare different model outputs and rank them by quality. The model learns to prefer outputs that humans rated highly. This process aligns the model's behavior with human preferences, producing responses that are helpful, harmless, and honest rather than simply probable.

Knowledge Cutoff

The model knows only what was in its training data. Every LLM has a knowledge cutoff date: events, publications, and discoveries after that date are unknown to the model. The model cannot access the internet during inference. This is not a limitation that can be solved with a bigger model. It is a design constraint that requires external tools (web search, RAG) to address. For system designers, this means any application requiring current information must include a retrieval component alongside the LLM.

When you send a prompt to an LLM, the model does not understand your question, search a database, or reason from first principles. It does one thing: predict the most likely next token, then repeat. Understanding this mechanism explains everything about how LLMs behave: their strengths, their failures, and their speed characteristics. Once you internalize that an LLM is a next-token predictor, every capability and every failure mode becomes predictable.

Autoregressive Generation

The model generates one token at a time. Given your prompt, it predicts the first token of the response. That token is appended to the sequence, and the entire sequence (prompt + generated token) is fed back as input to predict the second token. This repeats until the model produces a stop token or hits the maximum length.

[Figure: Token by Token Text Generation]

This is why the process is called autoregressive: each output becomes part of the input for the next step. The consequence is that a single poorly predicted token early in the sequence can derail the entire response, because every subsequent token is conditioned on it.

This also means the model cannot plan ahead. A human writer might outline an essay before writing it. The model has no such mechanism. It commits to each word as it generates it, with no ability to revise earlier tokens based on where the response is heading. Techniques like chain-of-thought prompting partially mitigate this by encouraging the model to "think out loud" before committing to a final answer.
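The append-and-repeat loop can be shown with a toy lookup-table "model." The table below is invented for illustration; a real LLM scores all ~100,000 vocabulary tokens at every step rather than following a fixed table.

```python
# Minimal autoregressive loop over a toy next-token table.
# The table is invented; a real model computes a full distribution each step.
NEXT = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<eos>",
}

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    seq = list(prompt)
    for _ in range(max_tokens):
        token = NEXT.get(seq[-1], "<eos>")  # predict the next token from the sequence
        if token == "<eos>":                # stop token ends generation
            break
        seq.append(token)                   # output becomes input for the next step
    return seq

print(generate(["The"]))  # ['The', 'capital', 'of', 'France', 'is', 'Paris']
```

Notice that once a token is appended, nothing ever removes it: the loop has no mechanism for revising earlier choices, which is exactly the "cannot plan ahead" limitation described above.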

Probability Distribution and Sampling

At each step, the model does not pick a single "answer." It computes a probability distribution over its entire vocabulary: roughly 100,000 tokens. Each token gets a probability score. The token "Paris" might get 0.85 probability, "Lyon" might get 0.03, and so on. The next token is then sampled from this distribution.

How that sampling works is controlled by two key parameters:

Temperature controls randomness. At temperature 0, the model always picks the highest-probability token, producing deterministic, repetitive output. At temperature 1, the model samples proportionally to the probabilities, producing varied, creative output. Above 1, the distribution flattens further, increasing randomness.

Top-p (nucleus sampling) limits the candidate pool. Instead of considering all 100,000 tokens, top-p=0.9 means: sort tokens by probability, take the smallest set whose cumulative probability exceeds 0.9, and sample only from that set. This prevents the model from occasionally picking a wildly improbable token while preserving diversity within the plausible range.
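Both parameters can be sketched in one small sampler. This is a simplified version (real implementations work on raw logits over the full vocabulary); the probability table is invented for illustration.

```python
import math
import random

def sample(probs: dict[str, float], temperature: float = 1.0, top_p: float = 1.0) -> str:
    if temperature == 0:
        return max(probs, key=probs.get)  # greedy decoding: always the top token
    # Temperature: divide log-probabilities by T, then renormalize.
    logits = {t: math.log(p) / temperature for t, p in probs.items()}
    m = max(logits.values())
    exp = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(exp.values())
    scaled = {t: e / z for t, e in exp.items()}
    # Nucleus (top-p): keep the smallest high-probability set reaching top_p.
    kept, cum = {}, 0.0
    for t, p in sorted(scaled.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*kept.items())
    return random.choices(tokens, weights=weights)[0]

probs = {"Paris": 0.85, "Lyon": 0.03, "the": 0.02, "a": 0.01}
print(sample(probs, temperature=0))  # always "Paris"
```

With a low top_p, the candidate set shrinks to just the most probable tokens; with a high temperature, the renormalized distribution flattens and rarer tokens are chosen more often.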

A Simple Generation Example

Here is what a single generation step looks like conceptually:

 
Input:  "The capital of France is"

Model output (probability distribution):
  "Paris"   0.85
  "Lyon"    0.03
  "the"     0.02
  "a"       0.01
  ...       (100K other tokens with tiny probabilities)

With temperature=0:  always select "Paris"
With temperature=1:  select "Paris" 85% of the time, occasionally "Lyon" or others
With top-p=0.9:      keep the highest-probability tokens until the cumulative
                     probability reaches 0.9: "Paris" (0.85) + "Lyon" (0.03) + "the" (0.02)
With top-p=0.95:     the candidate set grows further into the tail:
                     "Paris" + "Lyon" + "the" + "a" + ...

Context Window: The Model's Working Memory

The context window is the maximum number of tokens the model can see at once. Everything (system prompt, conversation history, tool call results, and the response being generated) must fit within this window. Modern models offer 128K to 1M tokens, but the principle is the same: if information falls outside the window, the model cannot see it.

Think of the context window as working memory. The model has no long-term memory between API calls. Each call starts fresh with only the tokens provided in the prompt. This is why conversation history must be explicitly included in every request and why very long conversations eventually require summarization or truncation.

Context window management is one of the most important practical concerns when building LLM applications. A system prompt might consume 2,000 tokens. A multi-turn conversation might accumulate 50,000 tokens of history. Tool call results might add another 10,000. If the total exceeds the context window, older messages must be dropped or summarized, and the model loses access to whatever was removed.
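A minimal trimming strategy looks like this: walk the history from newest to oldest and keep messages until the token budget runs out. The token counter here is a crude characters-per-token heuristic; production systems would use the model's actual tokenizer.

```python
# Sketch of context-window trimming under an assumed token budget.
def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Real systems use the model's tokenizer.
    return max(1, len(text) // 4)

def fit_to_window(system_prompt: str, history: list[str], budget: int) -> list[str]:
    used = count_tokens(system_prompt)  # the system prompt always stays
    kept = []
    for msg in reversed(history):       # newest messages are kept preferentially
        cost = count_tokens(msg)
        if used + cost > budget:
            break                       # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = ["msg one " * 50, "msg two " * 50, "msg three " * 50]
print(len(fit_to_window("You are helpful.", history, budget=250)))  # 2: oldest dropped
```

Anything dropped here is simply gone from the model's view, which is why production systems often summarize older turns instead of discarding them outright.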

Speed Characteristics

Input processing is parallelized: all input tokens are processed simultaneously through the transformer layers. This is why doubling your prompt length does not double the latency. The parallel processing phase is called the "prefill" stage, and it is GPU-compute bound.

Output generation is sequential: each token must be generated before the next one can start. This is the bottleneck. A 500-token response takes roughly 500 generation steps regardless of hardware. This sequential phase is called the "decode" stage, and it is memory-bandwidth bound: the model must read its entire parameter set from memory for each token generated.

This asymmetry explains why streaming exists: rather than waiting for the entire response, tokens are sent to the client as they are generated. The user sees the first token almost immediately (after the fast prefill stage) and then watches the response appear token by token during the slower decode stage. For applications where perceived latency matters, streaming transforms a 10-second wait into an immediate, progressively updating response.

It also explains why a short prompt with a long output costs more than a long prompt with a short output of the same total length: LLM API billing typically charges more per output token than per input token, reflecting the higher computational cost of sequential generation.
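The asymmetry is easy to see in a back-of-envelope latency model. The throughput numbers below are placeholders chosen to make the shape visible, not measurements of any particular model or GPU.

```python
# Back-of-envelope prefill/decode latency model. Throughput numbers are
# illustrative placeholders, not benchmarks of any real system.
def estimate_latency(input_tokens: int, output_tokens: int,
                     prefill_tok_per_s: float = 10_000.0,
                     decode_tok_per_s: float = 50.0) -> float:
    prefill = input_tokens / prefill_tok_per_s   # parallel: cheap per token
    decode = output_tokens / decode_tok_per_s    # sequential: expensive per token
    return prefill + decode

# Long prompt, short answer vs. short prompt, long answer (same total tokens):
print(f"{estimate_latency(5000, 100):.2f}s")   # prefill-dominated: 0.50s + 2.00s
print(f"{estimate_latency(100, 5000):.2f}s")   # decode-dominated: 0.01s + 100.00s
```

Two requests with identical total token counts can differ in latency by orders of magnitude depending on how the tokens split between input and output.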

LLMs are both more capable and more limited than most people expect. They can write working code, translate between languages, and reason through novel problems, but they also confidently fabricate facts, cannot count reliably, and have no ability to take actions in the world without external tools. Understanding both sides is essential for building reliable systems on top of LLMs.

[Figure: LLM capabilities and limitations bridged by tool use]

What LLMs Can Do

The capabilities of modern LLMs are broad: text generation, code generation, translation, summarization, reasoning (with chain-of-thought prompting), instruction following, and few-shot learning from examples provided in the prompt.

These capabilities work because of pattern matching at scale. The model has seen millions of code examples during training, so it can generate code. It has seen millions of translations, so it can translate. It has seen chain-of-thought reasoning in training data, so it can reason step-by-step when prompted to do so. None of these capabilities were explicitly programmed. They all emerge from the next-token prediction objective applied at sufficient scale.

Few-shot learning deserves special attention because it is one of the most practically useful capabilities. By providing a few examples of the desired input-output pattern in the prompt, you can steer the model to perform tasks it was never fine-tuned for. Show it three examples of converting natural language to SQL, and it will convert the fourth. This works because the model has learned to recognize and continue patterns: the examples in the prompt establish a pattern that the model extends to the new input.
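Assembling such a prompt is plain string construction. The examples and table names below are invented; the point is the repeated request/response pattern that the model is expected to continue.

```python
# Building a few-shot prompt for natural-language-to-SQL.
# Examples and table names are hypothetical; the pattern is what matters.
EXAMPLES = [
    ("List all users", "SELECT * FROM users;"),
    ("Count the orders", "SELECT COUNT(*) FROM orders;"),
    ("Show product names", "SELECT name FROM products;"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Convert each request to SQL.", ""]
    for request, sql in EXAMPLES:
        lines += [f"Request: {request}", f"SQL: {sql}", ""]
    # End exactly where the model should continue the established pattern.
    lines += [f"Request: {query}", "SQL:"]
    return "\n".join(lines)

print(build_few_shot_prompt("List all orders"))
```

The trailing "SQL:" is the whole trick: the prompt ends mid-pattern, so the most probable continuation is another SQL statement in the same style.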

Hallucination: The Fundamental Failure Mode

Hallucination is not a bug. It is a direct consequence of how the model works. The model is optimized to produce plausible next tokens, not true ones. It has no mechanism to verify facts against reality. When asked about a topic where its training data is sparse or contradictory, it generates the most plausible-sounding completion, which may be completely false.

Common Pitfall

Hallucination is the single biggest risk when deploying LLMs in production. The model does not know what it does not know. It will generate confident, detailed, well-structured answers that are entirely fabricated. Any system that surfaces LLM output to users without verification is exposing users to potential misinformation. Always design verification layers for factual claims.

Hallucination is particularly dangerous because it is confident. The model does not signal uncertainty. A fabricated answer looks identical to an accurate one. This means any production system using LLMs for factual information must include verification mechanisms. Retrieval-augmented generation (RAG), tool use for fact-checking, and confidence calibration are not optional enhancements. They are requirements.

No Real-Time Knowledge

The model's knowledge is frozen at its training cutoff date. It cannot access the internet during inference. It does not know about events, publications, or discoveries that occurred after training. This is a hard architectural constraint, not a limitation that improves with scale. A model with 10 trillion parameters still cannot tell you yesterday's stock price.

What makes this limitation dangerous is that the model does not know what it does not know. If asked about an event after its cutoff, it will not say "I have no information about this." It will generate a plausible response based on patterns from its training data, which may be outdated or entirely fabricated.

No Persistent Memory

Each API call is independent. The model does not remember previous conversations. If you ask a question in one API call and a follow-up in another, the model has no knowledge of the first interaction unless the full conversation history is included in the new prompt. Continuity is simulated by passing prior context in the prompt, not by the model actually remembering.

Cannot Take Actions

The model can only produce text. It cannot send emails, edit files, query databases, or call APIs on its own. Every action requires external tooling: a system that reads the model's text output, interprets it as a tool call, executes the action, and feeds the result back to the model. This is the foundation of AI agents: the model reasons and plans, and external systems execute.

The tool-use pattern works as follows: the model is given a list of available tools with descriptions. When it determines that a tool is needed, it generates a structured tool call (a JSON object with the tool name and arguments). The external system executes the tool, captures the result, and feeds it back into the model's context. The model then continues generating its response using the tool result. This loop (reason, call tool, observe result, continue) is the core mechanism behind every AI agent.
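The dispatch side of this loop can be sketched as follows. The model is faked with a scripted queue of outputs, and the tool name and JSON shape are illustrative rather than any particular vendor's API.

```python
import json

# Minimal tool-dispatch loop. The "model" is a scripted queue of outputs;
# the tool name and JSON format are illustrative, not a real vendor API.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real external API call

TOOLS = {"get_weather": get_weather}

scripted_model_outputs = [
    '{"tool": "get_weather", "args": {"city": "Paris"}}',  # model requests a tool
    "It is sunny in Paris today.",                         # model's final answer
]

def run_agent_loop() -> str:
    context = []
    for output in scripted_model_outputs:
        try:
            call = json.loads(output)      # structured output -> treat as tool call
        except json.JSONDecodeError:
            return output                  # plain text -> final answer for the user
        result = TOOLS[call["tool"]](**call["args"])  # execute the requested tool
        context.append(result)             # in a real system, fed back to the model
    return context[-1]

print(run_agent_loop())  # It is sunny in Paris today.
```

Everything outside the model (parsing the call, executing it, returning the result) is ordinary application code, which is why the surrounding system matters as much as the model itself.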

Why This Matters for System Design

Every limitation is a design requirement:

  • Hallucination means you need verification layers. RAG, fact-checking tools, or human review for critical claims.
  • Knowledge cutoff means you need retrieval or search tools to provide current information.
  • No persistent memory means you need external storage (databases, vector stores) for conversation state and user context.
  • No actions means you need tool-use frameworks that let the model call APIs, query databases, and interact with external systems.

The model is the reasoning engine. The system you build around it handles everything else. Understanding these boundaries is what separates a demo that impresses from a production system that delivers reliable results.