Introduction to Agentic AI
LLM Foundations
The Agent Paradigm
Reasoning and Planning
Agent Architectures
Safety and Reliability
Production Engineering
Real-World Agent Patterns
Long-Term Memory and Knowledge Management
LLMs are stateless. Every API call starts fresh. The model has no recollection of anything that happened five minutes ago unless you explicitly include it in the prompt. This is the fundamental constraint that memory systems exist to solve.
Conversation history is the most common form of short-term memory. You append previous turns to the prompt so the model can reference what was said earlier in the session. But conversation history has three critical limitations. It lives in the context window, so it disappears when the session ends. It grows linearly with each turn, eventually exceeding the context window limit. And it contains raw dialogue (greetings, filler, clarifying questions) rather than distilled facts. Conversation history is not memory. It is a temporary buffer.
Persistent memory solves these problems by storing facts, preferences, and learned patterns externally. These memories survive across sessions, across devices, and across months of interaction. The agent that remembers "this user prefers Python over Java" or "the API uses REST, not gRPC" or "the refund policy changed in March 2025" can skip repetitive clarification questions and provide contextually relevant answers from the first message of every new session.
The value of persistent memory compounds over time. In the first week, the agent learns basic preferences: language, framework, coding style. By the second month, it knows the project architecture, the team structure, and the user's common patterns. By the sixth month, it has a deep model of the domain: which approaches work for this codebase, which dependencies are fragile, which parts of the system are undergoing active development. Each new memory makes future interactions slightly more efficient. The agent that has been working with a user for six months provides fundamentally better assistance than one meeting them for the first time, because it has accumulated hundreds of contextual facts that eliminate guesswork.
A persistent memory system supports CRUD operations. The agent creates new memories when it learns something noteworthy. It reads relevant memories before responding to a query. It updates memories when information changes: if the user says "we switched to Rust last month," the existing "prefers Python" memory must be updated, not duplicated. And it deletes memories that are incorrect or no longer relevant. Treating memory as a living data store rather than a write-once log is what separates useful memory systems from noisy ones.
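A minimal in-memory sketch of these CRUD operations (the class and method names are illustrative; a production store would persist to a database and use indexed retrieval rather than linear scans):

```python
from dataclasses import dataclass


@dataclass
class Memory:
    id: int
    text: str


class MemoryStore:
    """Toy CRUD store for agent memories."""

    def __init__(self) -> None:
        self._items: dict[int, Memory] = {}
        self._next_id = 1

    def create(self, text: str) -> Memory:
        mem = Memory(self._next_id, text)
        self._items[mem.id] = mem
        self._next_id += 1
        return mem

    def read(self, keyword: str) -> list[Memory]:
        # Naive keyword scan; real systems retrieve by embedding similarity.
        return [m for m in self._items.values() if keyword.lower() in m.text.lower()]

    def update(self, mem_id: int, new_text: str) -> None:
        self._items[mem_id].text = new_text

    def delete(self, mem_id: int) -> None:
        self._items.pop(mem_id, None)


store = MemoryStore()
py = store.create("user prefers Python over Java")
store.create("the API uses REST, not gRPC")

# The user reports a change: update the existing memory instead of duplicating it.
store.update(py.id, "user prefers Rust (switched from Python in 2025)")
```

The key move is in the last line: "we switched to Rust" triggers an update of the existing preference memory, so the store never holds both "prefers Python" and "prefers Rust" as contradictory entries.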
The scale of the memory store matters for CRUD design. A personal agent with 200 memories can afford linear scans for duplicate detection. An enterprise agent with 50,000 memories across hundreds of users needs indexed lookups. Plan for the target scale early. Migrating from a simple list to an indexed store after the fact usually requires re-embedding every memory and rebuilding the retrieval pipeline.
The core workflow has two phases. After each interaction, the agent runs an extraction step: an LLM call that identifies noteworthy facts from the conversation and stores them in a structured format. Before each new interaction, the agent runs a retrieval step: querying the memory store for facts relevant to the current context and injecting them into the prompt. This write-then-read loop is the foundation of every persistent memory architecture.
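The write-then-read loop can be sketched as two functions around a model call. Here `llm` is a hypothetical stand-in that returns canned output so the flow runs end to end; retrieval is a crude word-overlap filter standing in for embedding search:

```python
memory_store: list[str] = []


def llm(prompt: str) -> str:
    # Stand-in for a real model API call; returns canned output.
    if "Extract" in prompt:
        return "user chose Python for their backend"
    return f"(answer using context)\n{prompt}"


def after_session(transcript: str) -> None:
    """Write phase: extract noteworthy facts from the session and store them."""
    facts = llm(f"Extract noteworthy facts:\n{transcript}").splitlines()
    memory_store.extend(f for f in facts if f.strip())


def before_response(user_message: str) -> str:
    """Read phase: retrieve relevant facts and inject them into the prompt."""
    words = set(user_message.lower().split())
    relevant = [m for m in memory_store if words & set(m.lower().split())]
    context = "Known facts:\n" + "\n".join(relevant)
    return llm(f"{context}\n\nUser: {user_message}")


after_session("User: What language for my backend? ... decided on Python.")
reply = before_response("Write a backend function to parse CSV files")
```

The fact extracted after session one is injected before the response in session two, so the answer can assume Python without asking.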
Consider a concrete example. Without memory: in Session 1, the user asks "What language should I use for my backend?" The agent responds with a balanced overview of Python, JavaScript, and Go. In Session 2, the user says "Write a function to parse CSV files." The agent asks "Would you prefer Python, JavaScript, or Go?" because it has no memory of the previous session. With memory: the agent extracted "user chose Python for their backend project" from Session 1. In Session 2, it writes the CSV parser in Python without asking, because it already knows the answer. This is the difference between a tool that starts fresh every time and a colleague who remembers context.
Not every piece of conversation is worth storing. The extraction step must distinguish signal from noise. Good memories are specific, falsifiable, and actionable. "User's project uses PostgreSQL 15 on AWS RDS" is a good memory. It is precise, verifiable, and directly useful when answering database-related questions. "User works on software" is a bad memory. It is vague and adds no actionable context. "User had a good conversation" is a bad memory. It is subjective and cannot inform future responses. The extraction prompt should instruct the LLM to produce only memories that would change how the agent responds to a future question.
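These criteria translate directly into the extraction prompt. A sketch of such a prompt template (the wording is illustrative, not a canonical prompt):

```python
EXTRACTION_PROMPT = """\
Review the conversation below and extract facts worth remembering.

Rules:
- Only include facts that would change how you respond to a future question.
- Each fact must be specific, falsifiable, and actionable.
  Good: "User's project uses PostgreSQL 15 on AWS RDS"
  Bad:  "User works on software" (vague)
  Bad:  "User had a good conversation" (subjective, not actionable)
- Return one fact per line. Return nothing if no fact qualifies.

Conversation:
{transcript}
"""


def build_extraction_prompt(transcript: str) -> str:
    return EXTRACTION_PROMPT.format(transcript=transcript)


prompt = build_extraction_prompt("User: our staging DB is Postgres 15 on RDS.")
```

Embedding good and bad examples in the prompt gives the model concrete anchors for the signal-versus-noise distinction, which is more reliable than abstract instructions alone.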
Memories fall into several types, each serving a different purpose. Preference memories record user choices. "prefers tabs over spaces," "uses vim keybindings," "wants concise answers without lengthy explanations." Factual memories record project and domain facts. "the API rate limit is 1000 requests per minute," "the staging environment uses a separate database." Contextual memories record the state of ongoing work. "currently migrating from monolith to microservices," "the auth service rewrite is blocked on the SSO integration." Relationship memories record connections between entities. "Alice owns the payments service," "the mobile team depends on the notification API." Each type requires different retrieval strategies and different update frequencies.
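Tagging each memory with its type at write time is what makes type-specific retrieval and refresh policies possible later. A minimal sketch of such a schema (names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    PREFERENCE = "preference"      # user choices; change rarely
    FACTUAL = "factual"            # project/domain facts; change on events
    CONTEXTUAL = "contextual"      # state of ongoing work; goes stale quickly
    RELATIONSHIP = "relationship"  # links between entities


@dataclass
class TypedMemory:
    text: str
    type: MemoryType


memories = [
    TypedMemory("prefers tabs over spaces", MemoryType.PREFERENCE),
    TypedMemory("the API rate limit is 1000 requests per minute", MemoryType.FACTUAL),
    TypedMemory("currently migrating from monolith to microservices", MemoryType.CONTEXTUAL),
    TypedMemory("Alice owns the payments service", MemoryType.RELATIONSHIP),
]

# Type tags let retrieval apply different policies, e.g. re-verify
# contextual memories more aggressively than preferences.
stale_candidates = [m for m in memories if m.type is MemoryType.CONTEXTUAL]
```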
Every memory system faces the cold-start problem. During the first few sessions, the agent has no accumulated memories. It cannot personalize responses, cannot skip clarification questions, and cannot leverage past context. The user experience during cold-start is identical to a stateless agent. Three strategies mitigate this. First, front-load extraction by asking a few onboarding questions at the start of the first session. "What language do you primarily work in? What is the project about? What tools do you use?" Second, import existing context from configuration files, READMEs, or project documentation to bootstrap the memory store with factual memories. Third, be transparent: tell the user that the agent learns over time, so they understand why early sessions feel less personalized and are motivated to provide the context the agent needs.
Memory deduplication is another challenge that surfaces early. When the user mentions "we use PostgreSQL" in five different sessions, the extraction step produces five nearly identical memories. A naive system stores all five, wasting retrieval slots when the agent queries for database-related context. The solution is deduplication at write time. Before storing a new memory, embed it and check for existing memories with cosine similarity above 0.95. If a near-duplicate exists, merge the metadata (update the "last confirmed" timestamp and increment a confidence counter) rather than creating a separate entry. This keeps the memory store compact and ensures retrieval returns unique facts rather than five copies of the same one.
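The write-time deduplication check can be sketched as follows. The bag-of-words "embedding" here is a toy stand-in so the example is self-contained; a real system would call an embedding model and compare dense vectors:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a production system would use a real embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


store: list[dict] = []  # each entry: {"text", "confidence", "last_confirmed"}


def write_memory(text: str, now: str, threshold: float = 0.95) -> None:
    vec = embed(text)
    for entry in store:
        if cosine(vec, embed(entry["text"])) >= threshold:
            # Near-duplicate: merge metadata instead of adding a new row.
            entry["last_confirmed"] = now
            entry["confidence"] += 1
            return
    store.append({"text": text, "confidence": 1, "last_confirmed": now})


write_memory("we use PostgreSQL", "2025-06-01")
write_memory("we use PostgreSQL", "2025-06-08")  # merged, not duplicated
write_memory("the staging environment uses a separate database", "2025-06-08")
```

The second write bumps the confidence counter and refreshes the timestamp on the existing entry, so repeated mentions strengthen one memory instead of cluttering the store with copies.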
The retrieval step also requires careful design. Embedding the user's current message and searching for the top-k nearest memories works for simple cases, but production systems need more sophistication. Hybrid retrieval combines semantic similarity with metadata filters: retrieve memories that are both semantically relevant and tagged with the current project. Re-ranking applies a second-pass model that scores retrieved memories for actual relevance to the query, filtering out false positives from the embedding search. Recency weighting boosts memories that were recently confirmed or updated, since fresh memories are more likely to reflect the current state of the world. Without these refinements, the agent retrieves plausible-sounding but outdated or off-topic memories that degrade response quality.
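A sketch of hybrid retrieval combining a metadata filter, a similarity score, and recency weighting. Word-overlap (Jaccard) similarity stands in for embedding similarity, and the recency decay constant of 30 days is an arbitrary illustrative choice:

```python
from datetime import date


def jaccard(a: str, b: str) -> float:
    # Stand-in for embedding similarity; real systems compare dense vectors.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(query: str, memories: list[dict], project: str,
             today: date, k: int = 3) -> list[dict]:
    # Metadata filter: only memories tagged with the current project.
    candidates = [m for m in memories if m["project"] == project]

    def score(m: dict) -> float:
        sim = jaccard(query, m["text"])
        age_days = (today - m["updated"]).days
        recency = 1.0 / (1.0 + age_days / 30)  # boost recently confirmed memories
        return sim * recency

    return sorted(candidates, key=score, reverse=True)[:k]


memories = [
    {"text": "the staging environment uses a separate database",
     "project": "api", "updated": date(2025, 6, 1)},
    {"text": "the API rate limit is 1000 requests per minute",
     "project": "api", "updated": date(2025, 6, 1)},
    {"text": "prefers dark mode",
     "project": "mobile", "updated": date(2025, 6, 1)},
]

results = retrieve("what is the rate limit of the API", memories,
                   project="api", today=date(2025, 6, 10))
```

The mobile-project memory never reaches scoring at all, and among the remaining candidates the rate-limit fact ranks first on similarity. A re-ranking pass would apply a second model over `results` before injection.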
The number of memories injected into the prompt also matters. Too few and the agent misses relevant context. Too many and the agent's context window fills with memory content, leaving less room for the actual conversation and instructions. Most production systems retrieve 10 to 20 candidate memories, re-rank them, and inject the top 3 to 5 into the prompt. The injected memories are formatted as a structured block ("Known facts about this user: ...") placed before the conversation history so the model treats them as established context rather than part of the dialogue.
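The final injection step can be sketched as simple prompt assembly, with the memory block placed ahead of the dialogue (the block wording and the cap of five memories follow the figures above; exact formatting varies by system):

```python
def build_prompt(memories: list[str], history: list[str], user_message: str) -> str:
    """Place retrieved memories before the conversation history so the
    model treats them as established context, not as part of the dialogue."""
    lines = ["Known facts about this user:"]
    lines += [f"- {m}" for m in memories[:5]]  # inject only the top few
    lines.append("")
    lines += history
    lines.append(f"User: {user_message}")
    return "\n".join(lines)


prompt = build_prompt(
    memories=["user chose Python for their backend project"],
    history=["User: hi", "Agent: Hello! How can I help?"],
    user_message="Write a function to parse CSV files",
)
```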
Memory transforms an agent from a tool into a colleague. Without memory, every conversation starts from zero. The agent asks the same clarifying questions, makes the same mistakes, and ignores everything it learned yesterday. With memory, the agent builds a working model of the user, the project, and the domain that improves with every interaction.
