Memory Systems in AI Agents
A language model without memory is stateless — it begins each interaction knowing nothing about past interactions, learned facts, or established preferences. For agents that handle multi-step tasks over extended time, this is an insurmountable limitation. Memory systems are the infrastructure that gives agents continuity, allowing them to accumulate knowledge, recall relevant context, and behave consistently across sessions. Understanding the four distinct types of agent memory — and how to architect them together — is fundamental to building agents that actually work on real tasks.
The four types of agent memory
Agent memory researchers and practitioners have converged on a taxonomy of four distinct memory types, each with different characteristics, costs, and appropriate use cases. Effective agent architectures typically combine multiple memory types, choosing the right one for each kind of information.
Vector databases explained
Vector databases are the storage infrastructure that makes semantic and procedural memory possible at scale. Understanding what they do — and what distinguishes the leading options — is essential for anyone designing agent memory architectures.
A vector database stores items as high-dimensional numerical vectors (embeddings) alongside their source content. When you query with a new vector, the database returns the stored vectors most similar to the query vector, typically measured by cosine similarity or dot product. This approximate nearest neighbor (ANN) search is what enables "find me everything semantically related to this query" rather than "find me the exact text 'annual revenue.'"
The leading vector databases each have distinct positioning:
| Database | Positioning | Deployment | Best For |
|---|---|---|---|
| Pinecone | Managed, production-grade | Cloud-only SaaS | Production agents needing zero ops overhead; scales automatically |
| Weaviate | Full-featured, open-source | Self-hosted or cloud | Complex schemas, multi-tenancy, built-in hybrid search |
| Chroma | Developer-focused, lightweight | Local or self-hosted | Local development, prototyping, smaller scale deployments |
| pgvector | PostgreSQL extension | Wherever Postgres runs | Teams already on Postgres who want vectors without a new system |
| Qdrant | High-performance, Rust-native | Self-hosted or cloud | High-throughput production workloads with filtering requirements |
For most new agent projects, the choice comes down to: Chroma for local development (zero setup, runs in memory or on disk), Pinecone for production when you want a managed service, and pgvector when you want to minimize infrastructure complexity by extending an existing Postgres database. Weaviate and Qdrant become attractive when you need advanced filtering, multi-tenancy, or very high query throughput.
Embedding and retrieval basics
Before anything can be stored in a vector database, it must be converted to an embedding — a numerical vector that captures the semantic meaning of the content. The embedding model is the function that performs this conversion. Choosing the right embedding model is a foundational decision that determines the quality of all semantic memory retrieval.
Embedding models take text as input and produce a fixed-length vector as output, typically 384 to 3072 dimensions depending on the model. The key property is that semantically similar texts produce vectors that are close together in vector space. Popular embedding models include OpenAI's text-embedding-3-large (1536 or 3072 dimensions, high quality, API-based), Google's text-embedding-004, and open-source options like all-MiniLM-L6-v2 (384 dimensions, fast, runs locally) and BGE-large (1024 dimensions, strong open-source performance).
You must use the same embedding model for both indexing and retrieval. If you embed your documents with OpenAI's text-embedding-3-large and then query with all-MiniLM-L6-v2, the vectors will be in completely incompatible spaces — the similarity scores will be meaningless and retrieval will be random. This sounds obvious but is a common source of subtle bugs when refactoring agent code or switching providers.
The retrieval process works as follows: (1) the agent receives a query or identifies something it needs to remember; (2) the query text is passed through the embedding model to produce a query vector; (3) the vector database performs an approximate nearest neighbor search against all stored vectors; (4) the top-k most similar results are returned along with their source content; (5) that content is injected into the agent's context for reasoning. The entire round-trip for a well-optimized vector database query typically takes 20–100ms, making it fast enough for interactive use.
Memory retrieval strategies
The simplest retrieval strategy — find the k most semantically similar items — works well in many cases but fails in others. More sophisticated retrieval strategies dramatically improve the quality of what gets surfaced to the agent.
Similarity search
The baseline: embed the query, return the top-k nearest neighbors by cosine similarity. Fast, simple, and often sufficient. The limitation is that pure similarity search can surface redundant results (many near-identical items that all say the same thing) and may miss important items that are phrased differently from the query.
Recency weighting
Recent events are often more relevant than older ones. A memory system that stores timestamps can combine recency scores with similarity scores, boosting results from the last hour or day while still considering semantic relevance. The challenge is calibrating the recency decay function: how quickly should old memories lose relevance? For conversation history, decay can be fast; for factual reference material, recency matters less.
Importance weighting
Not all memories are equally important. Some events — a user explicitly stating a preference, a critical decision made, an error that caused a task to fail — warrant higher retrieval priority than routine background information. Importance can be scored at write time (by the agent itself, using a separate importance-rating prompt) and stored as metadata, then factored into retrieval ranking alongside similarity. This is the approach used in the Stanford "Generative Agents" research paper, where importance was one of three factors (along with recency and relevance) used to score which memories to surface.
MMR: Maximal Marginal Relevance
MMR retrieves results that are both relevant to the query and diverse from each other, reducing redundancy. For each new result to add to the retrieved set, it selects the item that maximizes similarity to the query minus similarity to already-selected items. LangChain implements MMR retrieval as a first-class option. It is particularly valuable when the knowledge base contains many near-duplicate documents or when you want to retrieve a diverse set of perspectives rather than the single most similar cluster.
When to use each memory type
The decision of which memory type to use for a given piece of information is an architectural choice with significant performance and cost implications.
| Memory Type | Use When... | Avoid When... |
|---|---|---|
| In-context | Information is needed immediately for the current reasoning step; short conversations; high-reliability retrieval needed | Content exceeds context limit; multi-session continuity needed; cost per token is a concern |
| Episodic (DB) | User preferences, past decisions, session continuity; structured data about known entities | Large volumes of unstructured text; when semantic search is needed rather than exact recall |
| Semantic (Vector) | Large knowledge bases; document Q&A; open-ended "find relevant info" queries; dynamic knowledge that changes | Structured lookups where exact match is needed; very small knowledge bases that fit in context |
| Procedural | Task templates, style examples, worked demonstrations; when few-shot prompting improves output quality | One-off tasks with no established pattern; when generic instruction is sufficient |
Practical memory architecture patterns
Real-world agents rarely use a single memory type in isolation. The most effective architectures layer multiple memory systems, each handling the information type it is best suited for.
The working memory pattern
Distinguish between "working memory" (in-context, for the current task) and "long-term memory" (stored externally). At the start of each agent turn, relevant long-term memories are retrieved and loaded into the context window — the working memory. As the task progresses, new information is generated in working memory. At the end of a session or when something important happens, the agent writes important new information back to long-term storage. This mirrors how human working memory and long-term memory interact.
# Simplified working memory pattern
async def agent_turn(user_message, user_id):
# 1. Retrieve relevant long-term memories
relevant_memories = await vector_db.search(user_message, top_k=5)
user_profile = await episodic_db.get(user_id)
# 2. Build context (working memory)
context = build_context(
system_prompt=SYSTEM_PROMPT,
user_profile=user_profile,
memories=relevant_memories,
conversation_history=current_session_history,
user_message=user_message
)
# 3. Run the agent
response = await llm.complete(context)
# 4. Write important new info back to long-term memory
if is_important(response):
await vector_db.upsert(response.key_facts)
return response
Memory compression and summarization
As conversation history grows, injecting the full history into every context window becomes prohibitively expensive. A common pattern is progressive summarization: after a conversation exceeds a certain length, the agent summarizes older turns into a compact representation, replacing the raw history with the summary. This compresses the context used while preserving the key information. The summary is generated by the model itself, using a prompt like "Summarize the key decisions, preferences, and facts established in this conversation." Multiple rounds of summarization can be applied recursively for very long sessions.
Reflection and memory distillation
Advanced agent architectures include a periodic "reflection" step where the agent reviews its recent memories and derives higher-level insights. Rather than only storing raw events ("user said they prefer Python"), the agent can distill patterns ("user is an experienced Python developer who values code readability"). These higher-level reflections are themselves stored as semantic memories, making retrieval more powerful because the agent can surface summarized insights rather than only raw events. This is computationally expensive but dramatically improves the quality of long-running agents.
Information stored in external memory databases can become stale — a fact that was true three months ago may no longer be true today. Agents that retrieve stale memories without checking for recency can confidently act on outdated information. Best practices include: storing a timestamp with every memory, implementing TTL (time-to-live) expiration for time-sensitive facts, and having the agent validate important retrieved facts against current tool output when the stakes are high.
The scratchpad pattern
For long, multi-step reasoning tasks, agents benefit from maintaining a scratchpad — a structured in-context document that accumulates the results of completed steps, intermediate conclusions, and notes for future steps. Rather than trying to hold everything in the model's unstructured reasoning, the scratchpad provides an explicit, readable record that the model updates as it works. This is particularly important for tasks that exceed the natural span of a single reasoning chain.
Memory and the context window budget
Every element of in-context memory competes for space in the context window. The system prompt uses tokens. Retrieved memories use tokens. Conversation history uses tokens. Tool definitions use tokens. The actual user message and the space needed for the agent's response use tokens. Managing this budget is a real engineering challenge for complex agents.
Practical guidelines for context budget management: (1) rank retrieved memories by relevance and truncate at a maximum token count; (2) compress conversation history progressively rather than keeping all turns; (3) use separate embedding retrieval for procedural memory (few-shot examples) so only the most relevant examples are included; (4) consider streaming approaches that process long documents in chunks rather than all at once; (5) monitor context usage at runtime and implement graceful degradation when approaching limits.
When debugging agent memory issues, the fastest diagnostic is to log the full context passed to the model at each step. The most common memory bugs are: retrieved memories with too-low relevance scores, memory not being written back after important events, context overflow silently truncating important information from the end, and embedding mismatches causing retrieval to surface unrelated content. All of these are visible when you can see the full context.