Module 428 min read · Agentic AI and Autonomous Systems

Memory Systems in AI Agents

A language model without memory is stateless — it begins each interaction knowing nothing about past interactions, learned facts, or established preferences. For agents that handle multi-step tasks over extended time, this is an insurmountable limitation. Memory systems are the infrastructure that gives agents continuity, allowing them to accumulate knowledge, recall relevant context, and behave consistently across sessions. Understanding the four distinct types of agent memory — and how to architect them together — is fundamental to building agents that actually work on real tasks.

The four types of agent memory

Agent memory researchers and practitioners have converged on a taxonomy of four distinct memory types, each with different characteristics, costs, and appropriate use cases. Effective agent architectures typically combine multiple memory types, choosing the right one for each kind of information.

In-context memory — the prompt window itself

The simplest form of memory: information that is literally present in the model's current context window. Anything the model can "see" in its input is in-context memory. This includes the conversation history, system prompt, retrieved documents, tool outputs, and everything else packed into the current request. It is instant to access (no lookup required), requires no external infrastructure, and the model can reason over it as fluently as its own knowledge. The severe limitation is capacity — context windows, even very large ones, have a hard ceiling. Long conversations, large codebases, or multi-day tasks will inevitably exceed what can fit. In-context memory is best for information needed in the current reasoning step.

External episodic memory — conversation history in a database

Episodic memory stores records of past events and interactions in an external database. When a user returns for a second session, the agent retrieves relevant past conversation turns and injects them into the current context. This solves the continuity problem across sessions without the in-context window filling up. The tradeoff is retrieval latency and the challenge of deciding what to retrieve. Not all past conversation is relevant to the current query — fetching everything is wasteful; fetching too little loses important context. Episodic memory is well-suited for user preferences, past decisions, and established facts about a specific user or task.

Semantic memory — vector embeddings of facts and documents

Semantic memory stores facts, documents, and knowledge as vector embeddings in a vector database. When the agent needs information, it embeds the query and retrieves semantically similar stored content. This allows agents to draw on large knowledge bases — thousands of documents, millions of facts — far beyond what any context window could hold. The retrieval is meaning-based rather than keyword-based, so "car" and "automobile" retrieve similar results. Semantic memory is the foundation of Retrieval-Augmented Generation (RAG) and is covered in depth in Module 7. It is best suited for factual knowledge, reference documents, and domain-specific information that needs to scale.

Procedural memory — few-shot examples of how to do tasks

Procedural memory stores demonstrations of how to perform tasks correctly — essentially, worked examples or templates. When an agent needs to complete a task type it has seen before, it retrieves similar past examples and uses them as few-shot prompts to guide its current behavior. This is particularly powerful for tasks with consistent but complex structure: formatting reports, writing code in a specific style, filling out forms. Procedural memory encodes the "how," not just the "what." It is stored like semantic memory (as vectors for retrieval) but the content is task examples rather than factual documents.

Vector databases explained

Vector databases are the storage infrastructure that makes semantic and procedural memory possible at scale. Understanding what they do — and what distinguishes the leading options — is essential for anyone designing agent memory architectures.

A vector database stores items as high-dimensional numerical vectors (embeddings) alongside their source content. When you query with a new vector, the database returns the stored vectors most similar to the query vector, typically measured by cosine similarity or dot product. This approximate nearest neighbor (ANN) search is what enables "find me everything semantically related to this query" rather than "find me the exact text 'annual revenue.'"

The leading vector databases each have distinct positioning:

Database	Positioning	Deployment	Best For
Pinecone	Managed, production-grade	Cloud-only SaaS	Production agents needing zero ops overhead; scales automatically
Weaviate	Full-featured, open-source	Self-hosted or cloud	Complex schemas, multi-tenancy, built-in hybrid search
Chroma	Developer-focused, lightweight	Local or self-hosted	Local development, prototyping, smaller scale deployments
pgvector	PostgreSQL extension	Wherever Postgres runs	Teams already on Postgres who want vectors without a new system
Qdrant	High-performance, Rust-native	Self-hosted or cloud	High-throughput production workloads with filtering requirements

For most new agent projects, the choice comes down to: Chroma for local development (zero setup, runs in memory or on disk), Pinecone for production when you want a managed service, and pgvector when you want to minimize infrastructure complexity by extending an existing Postgres database. Weaviate and Qdrant become attractive when you need advanced filtering, multi-tenancy, or very high query throughput.

Embedding and retrieval basics

Before anything can be stored in a vector database, it must be converted to an embedding — a numerical vector that captures the semantic meaning of the content. The embedding model is the function that performs this conversion. Choosing the right embedding model is a foundational decision that determines the quality of all semantic memory retrieval.

Embedding models take text as input and produce a fixed-length vector as output, typically 384 to 3072 dimensions depending on the model. The key property is that semantically similar texts produce vectors that are close together in vector space. Popular embedding models include OpenAI's text-embedding-3-large (1536 or 3072 dimensions, high quality, API-based), Google's text-embedding-004, and open-source options like all-MiniLM-L6-v2 (384 dimensions, fast, runs locally) and BGE-large (1024 dimensions, strong open-source performance).

Critical Implementation Detail

You must use the same embedding model for both indexing and retrieval. If you embed your documents with OpenAI's text-embedding-3-large and then query with all-MiniLM-L6-v2, the vectors will be in completely incompatible spaces — the similarity scores will be meaningless and retrieval will be random. This sounds obvious but is a common source of subtle bugs when refactoring agent code or switching providers.

The retrieval process works as follows: (1) the agent receives a query or identifies something it needs to remember; (2) the query text is passed through the embedding model to produce a query vector; (3) the vector database performs an approximate nearest neighbor search against all stored vectors; (4) the top-k most similar results are returned along with their source content; (5) that content is injected into the agent's context for reasoning. The entire round-trip for a well-optimized vector database query typically takes 20–100ms, making it fast enough for interactive use.

Memory retrieval strategies

The simplest retrieval strategy — find the k most semantically similar items — works well in many cases but fails in others. More sophisticated retrieval strategies dramatically improve the quality of what gets surfaced to the agent.

Similarity search

The baseline: embed the query, return the top-k nearest neighbors by cosine similarity. Fast, simple, and often sufficient. The limitation is that pure similarity search can surface redundant results (many near-identical items that all say the same thing) and may miss important items that are phrased differently from the query.

Recency weighting

Recent events are often more relevant than older ones. A memory system that stores timestamps can combine recency scores with similarity scores, boosting results from the last hour or day while still considering semantic relevance. The challenge is calibrating the recency decay function: how quickly should old memories lose relevance? For conversation history, decay can be fast; for factual reference material, recency matters less.

Importance weighting

Not all memories are equally important. Some events — a user explicitly stating a preference, a critical decision made, an error that caused a task to fail — warrant higher retrieval priority than routine background information. Importance can be scored at write time (by the agent itself, using a separate importance-rating prompt) and stored as metadata, then factored into retrieval ranking alongside similarity. This is the approach used in the Stanford "Generative Agents" research paper, where importance was one of three factors (along with recency and relevance) used to score which memories to surface.

MMR: Maximal Marginal Relevance

MMR retrieves results that are both relevant to the query and diverse from each other, reducing redundancy. For each new result to add to the retrieved set, it selects the item that maximizes similarity to the query minus similarity to already-selected items. LangChain implements MMR retrieval as a first-class option. It is particularly valuable when the knowledge base contains many near-duplicate documents or when you want to retrieve a diverse set of perspectives rather than the single most similar cluster.

When to use each memory type

The decision of which memory type to use for a given piece of information is an architectural choice with significant performance and cost implications.

Memory Type	Use When...	Avoid When...
In-context	Information is needed immediately for the current reasoning step; short conversations; high-reliability retrieval needed	Content exceeds context limit; multi-session continuity needed; cost per token is a concern
Episodic (DB)	User preferences, past decisions, session continuity; structured data about known entities	Large volumes of unstructured text; when semantic search is needed rather than exact recall
Semantic (Vector)	Large knowledge bases; document Q&A; open-ended "find relevant info" queries; dynamic knowledge that changes	Structured lookups where exact match is needed; very small knowledge bases that fit in context
Procedural	Task templates, style examples, worked demonstrations; when few-shot prompting improves output quality	One-off tasks with no established pattern; when generic instruction is sufficient

Practical memory architecture patterns

Real-world agents rarely use a single memory type in isolation. The most effective architectures layer multiple memory systems, each handling the information type it is best suited for.

The working memory pattern

Distinguish between "working memory" (in-context, for the current task) and "long-term memory" (stored externally). At the start of each agent turn, relevant long-term memories are retrieved and loaded into the context window — the working memory. As the task progresses, new information is generated in working memory. At the end of a session or when something important happens, the agent writes important new information back to long-term storage. This mirrors how human working memory and long-term memory interact.

# Simplified working memory pattern
async def agent_turn(user_message, user_id):
    # 1. Retrieve relevant long-term memories
    relevant_memories = await vector_db.search(user_message, top_k=5)
    user_profile = await episodic_db.get(user_id)

    # 2. Build context (working memory)
    context = build_context(
        system_prompt=SYSTEM_PROMPT,
        user_profile=user_profile,
        memories=relevant_memories,
        conversation_history=current_session_history,
        user_message=user_message
    )

    # 3. Run the agent
    response = await llm.complete(context)

    # 4. Write important new info back to long-term memory
    if is_important(response):
        await vector_db.upsert(response.key_facts)

    return response

Memory compression and summarization

As conversation history grows, injecting the full history into every context window becomes prohibitively expensive. A common pattern is progressive summarization: after a conversation exceeds a certain length, the agent summarizes older turns into a compact representation, replacing the raw history with the summary. This compresses the context used while preserving the key information. The summary is generated by the model itself, using a prompt like "Summarize the key decisions, preferences, and facts established in this conversation." Multiple rounds of summarization can be applied recursively for very long sessions.

Reflection and memory distillation

Advanced agent architectures include a periodic "reflection" step where the agent reviews its recent memories and derives higher-level insights. Rather than only storing raw events ("user said they prefer Python"), the agent can distill patterns ("user is an experienced Python developer who values code readability"). These higher-level reflections are themselves stored as semantic memories, making retrieval more powerful because the agent can surface summarized insights rather than only raw events. This is computationally expensive but dramatically improves the quality of long-running agents.

Memory Staleness

Information stored in external memory databases can become stale — a fact that was true three months ago may no longer be true today. Agents that retrieve stale memories without checking for recency can confidently act on outdated information. Best practices include: storing a timestamp with every memory, implementing TTL (time-to-live) expiration for time-sensitive facts, and having the agent validate important retrieved facts against current tool output when the stakes are high.

The scratchpad pattern

For long, multi-step reasoning tasks, agents benefit from maintaining a scratchpad — a structured in-context document that accumulates the results of completed steps, intermediate conclusions, and notes for future steps. Rather than trying to hold everything in the model's unstructured reasoning, the scratchpad provides an explicit, readable record that the model updates as it works. This is particularly important for tasks that exceed the natural span of a single reasoning chain.

Memory and the context window budget

Every element of in-context memory competes for space in the context window. The system prompt uses tokens. Retrieved memories use tokens. Conversation history uses tokens. Tool definitions use tokens. The actual user message and the space needed for the agent's response use tokens. Managing this budget is a real engineering challenge for complex agents.

Practical guidelines for context budget management: (1) rank retrieved memories by relevance and truncate at a maximum token count; (2) compress conversation history progressively rather than keeping all turns; (3) use separate embedding retrieval for procedural memory (few-shot examples) so only the most relevant examples are included; (4) consider streaming approaches that process long documents in chunks rather than all at once; (5) monitor context usage at runtime and implement graceful degradation when approaching limits.

Practical Tip

When debugging agent memory issues, the fastest diagnostic is to log the full context passed to the model at each step. The most common memory bugs are: retrieved memories with too-low relevance scores, memory not being written back after important events, context overflow silently truncating important information from the end, and embedding mismatches causing retrieval to surface unrelated content. All of these are visible when you can see the full context.