Context Engineering: Managing What Your LLM Sees

Topics Covered

Context Window Fundamentals

Why Bigger Windows Are Not Enough

Tokens Are Not Characters

The Context Engineering Paradigm

The Four Strategies

Why Prompt Tone Matters

Conversation History Management

Three Approaches to History Management

Hybrid: Summary Plus Recent Window

What to Preserve, What to Drop

Summarization and Compression

When to Summarize

Summarization Prompts

Compressing Tool Outputs

Token Budgeting

Anatomy of a Context Budget

Monitoring and Enforcement

Cost Optimization Strategies

Every LLM call works the same way: you send a sequence of tokens in, and the model generates tokens out. The context window is the maximum number of tokens the model can process in a single call, input and output combined. Think of it as working memory. Everything the model knows about your current task must fit inside this window, because LLMs have no memory between API calls.

Early models had tiny windows. GPT-3.5 started at 4,096 tokens, roughly 3,000 words, or about 6 pages of text. Try fitting a system prompt, conversation history, retrieved documents, and a complex question into 6 pages. You cannot. This constraint forced developers into aggressive compression and truncation strategies from day one.

Modern models have dramatically larger windows. Claude offers 200K tokens. Gemini offers 1 million or more. GPT-4o supports 128K. These larger windows seem to solve the problem: just send everything. But they do not, for three reasons.

Why Bigger Windows Are Not Enough

Cost scales linearly with tokens. Sending 100K tokens costs 25x more than sending 4K tokens. For an agent that makes 50 LLM calls per task, the difference between a lean context and a bloated one is the difference between $0.50 and $12.50 per task. At 10,000 tasks per day, that is $5,000 versus $125,000 daily.
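
To make that arithmetic concrete, here is a minimal sketch of the comparison. The $2.50-per-million-input-tokens price is an assumption chosen only so the numbers match the illustration above; substitute your provider's actual rates.

```python
# Back-of-the-envelope input-token cost for an agent task.
# PRICE_PER_MILLION is a hypothetical rate used for illustration.
PRICE_PER_MILLION = 2.50  # USD per 1M input tokens (assumed)

def task_cost(tokens_per_call: int, calls_per_task: int = 50) -> float:
    """Input-token cost of one agent task, in dollars."""
    return tokens_per_call * calls_per_task * PRICE_PER_MILLION / 1_000_000

lean = task_cost(4_000)       # ~$0.50 per task
bloated = task_cost(100_000)  # ~$12.50 per task

print(f"lean:    ${lean:.2f}/task, ${lean * 10_000:,.0f}/day at 10k tasks")
print(f"bloated: ${bloated:.2f}/task, ${bloated * 10_000:,.0f}/day at 10k tasks")
```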

Latency increases with context size. More tokens mean more time for the model to process the input before generating the first output token. Time-to-first-token (TTFT) can jump from 500ms with 4K tokens to 5 seconds or more with 200K tokens. For interactive agents, this delay is noticeable and frustrating.

More context does not mean better answers. Research on the "lost-in-the-middle" phenomenon shows that LLMs attend most strongly to information at the beginning and end of the context window. Information buried in the middle (between positions 20% and 80% of the total context) gets significantly less attention. Dumping an entire knowledge base into the window does not help if the relevant paragraph is on page 47 of 100.
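
If you do have to send many retrieved documents, one common mitigation is to keep the strongest matches out of the middle. Below is a rough sketch, assuming each document arrives with a relevance score; the assemble_context helper and the score field are illustrative, not part of any particular library.

```python
def assemble_context(docs: list[dict]) -> str:
    """Order documents so the highest-scoring ones sit at the start and end
    of the prompt, pushing the weakest matches toward the middle, where
    lost-in-the-middle attention is weakest.

    Each doc is assumed to look like {"text": str, "score": float}.
    """
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    front, back = [], []
    for i, doc in enumerate(ranked):
        # Alternate: best doc first, second-best last, and so on inward.
        (front if i % 2 == 0 else back).append(doc)
    return "\n\n".join(d["text"] for d in front + back[::-1])
```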

Key Insight

Andrej Karpathy framed context engineering as an operating system problem. The LLM is the CPU: it processes whatever is in working memory. The context window is RAM: limited, expensive, and volatile. You are the operating system. You decide what gets loaded into RAM, what gets swapped to disk, and what gets evicted. Most agent failures are not model failures. They are context failures: the right information was not in the window when the model needed it.

Tokens Are Not Characters

Tokenization splits text into subword units, not characters or words. The word "unbelievable" is one word but typically 3-4 tokens ("un", "believ", "able" or similar). Code is especially token-hungry. A 50-line Python function might consume 500 tokens because of variable names, indentation, and operators. JSON is worse: a 1KB JSON payload can consume 400+ tokens due to curly braces, colons, quotes, and repeated key names.

This matters for budgeting. If your system prompt is 2,000 characters, do not assume it is 500 tokens. Use your provider's tokenizer to count exactly. OpenAI provides tiktoken. Anthropic provides token counts in API responses. Always measure, never estimate.
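
For example, here is a minimal counting sketch with tiktoken. It assumes the tiktoken package is installed and that a GPT-4o-family model is the target; Anthropic models use a different tokenizer, so for those rely on the token counts the API returns.

```python
import json
import tiktoken

# Map the model you actually call to its tokenizer; encodings differ by model family.
enc = tiktoken.encoding_for_model("gpt-4o")

system_prompt = "You are a careful assistant. Answer only from the provided documents."
payload = json.dumps({"user_id": 42, "query": "context engineering", "top_k": 5})

for label, text in [("system prompt", system_prompt), ("JSON payload", payload)]:
    print(f"{label}: {len(text)} characters -> {len(enc.encode(text))} tokens")
```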