Agent Memory Systems: In-Context, Semantic, Episodic, and Procedural
LLM context windows are finite and expensive. Learn how production agents implement hierarchical memory — in-context buffers, vector DB semantic retrieval, episodic event logs, and procedural fine-tuning — and when each layer is worth the engineering investment.
The Memory Problem: Context Windows Are Finite and Expensive
Every production agent system eventually hits the same wall: context windows are finite, and conversation history grows without bound. Even with GPT-4o's 128K context window, a 6-month conversation history with a user who chats daily would contain ~1.8M tokens — 14× over the limit, and at $5/1M input tokens, that's $9 per query just for context.
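A quick back-of-the-envelope check of those figures (the ~10K-tokens-per-day volume is an assumption chosen to reach the ~1.8M-token total above, not a number from the text):

```python
# Back-of-the-envelope context cost using the figures above.
# Assumption: ~10K tokens of conversation per day (not stated in the text).
DAYS = 180                        # ~6 months of daily chats
TOKENS_PER_DAY = 10_000           # assumed average daily volume
PRICE_PER_M_INPUT = 5.00          # GPT-4o input price, $ per 1M tokens

history_tokens = DAYS * TOKENS_PER_DAY                           # 1,800,000
overflow = history_tokens / 128_000                               # ~14x the 128K window
cost_per_query = history_tokens / 1_000_000 * PRICE_PER_M_INPUT   # ~$9.00

print(f"{history_tokens:,} tokens, {overflow:.0f}x over 128K, ${cost_per_query:.2f} per query")
```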
The naive solution — "just use a bigger context window" — fails in two ways. First, cost scales linearly with context length: a 128K-token context costs $0.64 in GPT-4o input tokens per query. Second, and less obvious, performance degrades in long contexts even when the information fits.
The lost-in-the-middle problem (Liu et al., 2023) quantified this: LLMs show measurably lower accuracy when relevant information is placed in the middle of a long context vs. at the beginning or end. For 20+ retrieved documents in the context, models like Llama 2 had dramatically worse performance on information located at positions 5–15 vs. positions 1–2 or 19–20. This isn't a bug — it's a fundamental property of how attention mechanisms score tokens, and it means that a longer context is not a substitute for good memory architecture.
What production agents actually need: a memory system that stores information permanently across sessions, retrieves only what's relevant to the current query, and costs proportionally to what's retrieved — not to the total history.
The Cognitive Framework Interviewers Expect
The authoritative taxonomy comes from the Generative Agents paper (Park et al., 2023, Stanford). They describe three memory types: in-context (what the agent knows right now), external (vector DB / episodic log), and procedural (baked into model weights). A fourth type, working memory compression (rolling summarization), is the most practical production technique and the one most candidates miss. Anchoring your answer to this framework signals depth.
Memory Types Comparison
| Memory Type | Capacity | Latency | Update Cost | Persistence | Retrieval Method | Best For |
|---|---|---|---|---|---|---|
| In-Context (short-term) | ~4K–128K tokens | 0ms (already loaded) | Free | Session only | Linear scan (implicit) | Recent turns, immediate context |
| Episodic (event log) | Millions of events | ~50–100ms | Low (append-only) | Permanent | Recency + semantic scoring | Time-stamped user interactions |
| Semantic (vector DB) | Billions of chunks | ~50–150ms | Low per chunk | Permanent | ANN similarity search | Domain knowledge, past conversations |
| Procedural (fine-tuning) | Unlimited (in weights) | 0ms (baked in) | Very high (~$1K–100K) | Permanent | No retrieval needed | Task-specific skills, output formats |
In-Context Memory: Simple but Fragile
The simplest memory strategy is to prepend the full conversation history to every prompt. This works well for short sessions — up to ~4K–8K tokens, roughly 20–40 exchanges — and has zero retrieval latency because the context is already loaded.
Where it breaks: Beyond 8K tokens, the lost-in-the-middle problem becomes dominant. The earliest turns in the conversation (often the most important for understanding user preferences and constraints) get buried in the middle of a growing context and receive lower attention weights than the most recent turns. The LLM appears to "forget" what you told it at the start of a long conversation, even though it's technically in the context.
Sliding window mitigation: Keep only the last N turns in the context (N=5–10 is typical). Simple, predictable, but loses early context entirely. Fine for stateless tasks; catastrophic for long-running projects where early decisions constrain later ones.
Rolling summarization: Summarize older turns into a compressed "conversation summary" and prepend the summary instead of the raw turns. This preserves the information while reducing token count by 5–10×. The LLM-generated summary is lossy (nuance and exact wording are lost), but preserves the semantic content of key decisions, preferences, and constraints. This is the most practical technique for most production systems.
Critical nuance most candidates miss: When implementing rolling summarization, keep the last 3 raw turns unsummarized in addition to the summary. The LLM needs verbatim recent context for coherent follow-up; the summary provides background. Summary-only contexts cause awkward tonal discontinuities.
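A minimal sketch of rolling summarization with the last few raw turns preserved (the `call_llm` helper, the 3-turn tail, and the prompt wording are placeholders, not details from the article):

```python
# Rolling summarization sketch: compress older turns, keep the last few verbatim.
# `call_llm` is a hypothetical helper standing in for your chat-completion client.

KEEP_RAW = 3  # last N turns stay verbatim so follow-ups remain coherent

def compress_history(turns: list[dict], prior_summary: str, call_llm) -> tuple[str, list[dict]]:
    """Return (updated_summary, raw_tail). `turns` are {"role", "content"} dicts."""
    older, tail = turns[:-KEEP_RAW], turns[-KEEP_RAW:]
    if not older:
        return prior_summary, tail
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in older)
    prompt = (
        "Summarize the conversation below in under 150 words. "
        "Preserve decisions, preferences, and constraints.\n\n"
        f"Existing summary:\n{prior_summary or '(none)'}\n\nNew turns:\n{transcript}"
    )
    return call_llm(prompt), tail
```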
Semantic Memory: Vector DB for Long-Term Knowledge
External semantic memory stores past conversations, documents, and learned facts as vector embeddings in a database (Pinecone, Qdrant, Chroma). At query time, the current user message is embedded and the top-k most semantically relevant memories are retrieved and injected into the prompt.
Mem0 (mem0.ai) is the production library for this pattern. It wraps memory storage and retrieval with LLM-powered extraction: when a user says "I'm vegetarian", Mem0 extracts {preference: "vegetarian"} and stores it as a structured memory. On future queries, it retrieves relevant preferences and injects them as a "memory block" in the system prompt. The LLM then personalizes responses without the user repeating their preferences.
Architecture for scale: Each user has a separate vector namespace in the DB (Pinecone supports multi-tenancy at the namespace level). Memory entries are stored as embedding + metadata (user_id, timestamp, category, raw_text). Retrieval is top-5 by cosine similarity to the current query, filtered by user_id.
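A dependency-free sketch of that per-user retrieval path (`embed` and the in-memory `entries` list are stand-ins; in production the linear scan becomes an ANN query against a per-user Pinecone namespace or a user_id-filtered Qdrant search):

```python
import math

# Per-user semantic retrieval sketch. In production, `entries` lives in a vector DB
# partitioned by user_id and the scan below becomes an ANN search; `embed` is a
# placeholder for an embedding model such as text-embedding-3-small or BGE-M3.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve(query: str, user_id: str, entries: list[dict], embed, top_k: int = 5) -> list[dict]:
    """entries: [{"user_id", "text", "embedding", "timestamp"}] — metadata travels with each vector."""
    q = embed(query)
    candidates = [e for e in entries if e["user_id"] == user_id]   # tenant filter first
    ranked = sorted(candidates, key=lambda e: cosine(q, e["embedding"]), reverse=True)
    return ranked[:top_k]
```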
The embedding staleness problem: User preferences change. If a user tells the agent they're vegetarian in January but switches to keto in March, both memories will be retrieved and conflict. Mitigation: memory entries have a recency score as a secondary ranking signal (more recent = higher weight). On write, check for conflicting memories and mark older ones as superseded. Mem0 does this with an LLM-powered conflict resolver.
When semantic memory is overkill: Sessions shorter than 10 turns, stateless APIs (each request is independent with no returning users), or when the agent task genuinely has no persistent context. Don't add a vector DB to your architecture if users never expect personalization or session continuity.
[Figure: Hierarchical Memory System Architecture]
Episodic Memory: Time-Stamped Event Logs
Episodic memory, modeled after the Generative Agents paper (Park et al., 2023), stores a time-stamped log of events — things the agent observed, actions it took, and outcomes it received. Unlike semantic memory (which stores facts), episodic memory stores experiences.
The Generative Agents implementation: Each simulated agent maintained a memory stream — a list of observations with timestamps and an "importance" score (how significant is this event?). Retrieval ranked memories by a weighted combination of: recency (exponential decay from current time), relevance (cosine similarity to current context), and importance (importance score assigned at storage time).
retrieval_score = α·recency + β·relevance + γ·importance
This formula is the key insight: pure semantic retrieval (only relevance) misses important recent events. Pure recency (only recent turns) misses highly relevant but older memories. The weighted combination approximates how humans actually recall things — recent and relevant memories surface strongest.
Production use case: A customer service agent stores each support ticket interaction as an episodic memory. When a user calls back, the agent retrieves the 3 most relevant past interactions by the formula above — not just the most recent. This surfaces a 6-month-old incident that matches the current issue even if more recent interactions are about different topics.
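A sketch of the weighted scoring in code; the α/β/γ weights and the 24-hour decay constant are illustrative choices, not values from the paper:

```python
import math
from datetime import timedelta

# Sketch of the combined retrieval score. Weights and the decay constant are
# illustrative assumptions, not canonical values from Generative Agents.

def retrieval_score(relevance: float, importance: float, age: timedelta,
                    alpha: float = 0.3, beta: float = 0.5, gamma: float = 0.2,
                    decay_hours: float = 24.0) -> float:
    """relevance: cosine similarity to the query; importance: 0.0-1.0 assigned at write time."""
    recency = math.exp(-age.total_seconds() / 3600 / decay_hours)  # exponential decay toward 0
    return alpha * recency + beta * relevance + gamma * importance

# A 6-month-old incident with high relevance can outrank yesterday's unrelated ticket:
old_but_relevant = retrieval_score(relevance=0.9, importance=0.8, age=timedelta(days=180))
recent_but_offtopic = retrieval_score(relevance=0.2, importance=0.3, age=timedelta(hours=2))
assert old_but_relevant > recent_but_offtopic
```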
Procedural Memory: Skills Baked into Model Weights
Procedural memory is knowledge that's been internalized into the model's weights through fine-tuning — "how to do things" rather than "what things are." An agent fine-tuned on thousands of code reviews learns to apply specific review criteria automatically without needing retrieval. A customer service agent fine-tuned on your company's policies responds correctly without needing those policies in context.
When procedural memory is the right choice:
- The behavior is stable and changes infrequently (update cost: $1K–100K+ per fine-tune)
- The behavior is required on every query (eliminating retrieval for this info saves latency)
- The task requires precise output formatting or consistent persona that's hard to specify via prompts
- Privacy constraints prevent storing certain information in a retrievable database
When fine-tuning is wrong: Don't proceduralize fast-changing knowledge (product prices, current events, organizational charts). The training data becomes stale within weeks, and the fine-tuned model's responses will confidently cite outdated information. This is one of the most common mistakes in production agent deployment — fine-tuning company FAQs that change monthly and then not having a clear update cadence.
Practical rule: Use procedural memory for skills and style, use semantic memory for facts. Fine-tune for "always respond in JSON with this schema" and "code reviewer should flag these specific anti-patterns." Use RAG for "the current return policy is 30 days."
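For the "skills and style" side, the training data is typically a few thousand chat-format records. A sketch of one record in OpenAI's chat fine-tuning JSONL shape (the review content itself is invented for illustration):

```python
import json

# Sketch: one training record in OpenAI's chat fine-tuning JSONL format.
# The example content is invented to illustrate proceduralizing style and output schema.
record = {
    "messages": [
        {"role": "system", "content": "You are a code reviewer. Respond only in JSON."},
        {"role": "user", "content": "def get(d, k): return d[k]"},
        {"role": "assistant", "content": json.dumps({
            "severity": "medium",
            "issue": "Raises KeyError on a missing key; prefer d.get(k) or explicit handling.",
        })},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one JSON object per line
```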
The Lost-in-the-Middle Problem: Memory Layout Matters
The lost-in-the-middle effect (Liu et al., 2023, Stanford) is one of the most practically important findings in LLM context research and one that most candidates have never heard of. In a controlled experiment, models were given 20 documents and asked to find the one relevant document. Accuracy was ~80% when the relevant document was at position 1 or 20 (beginning/end). It dropped to ~50% when the document was at position 10 (middle).
Mechanism: Transformer attention is biased toward the beginning of the context (primacy effect, from positional embeddings) and the end of the context (recency effect, from the autoregressive generation direction). Tokens in the middle receive lower aggregate attention scores.
Practical implications for memory layout:
- Put the system prompt (persona, instructions) at the very beginning — it will receive the highest attention
- Put the most recent user turn and retrieved memories at the end — recency effect benefits the most critical context
- Put older conversation summary in the middle — it's less critical than instructions or current context
- Never put critical one-time constraints in the middle of a long context — the model will ignore them
The interview insight: When asked to design a memory system, mention that context layout is as important as what you store. This is the non-obvious detail that separates candidates who've built real systems from those who've only read tutorials.
Production Memory Manager with Semantic Retrieval
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# In production: replace stubs with real vector DB client (Qdrant, Pinecone)
# and real embedding model (text-embedding-3-small, BGE-M3)


@dataclass
class Memory:
    content: str
    user_id: str
    timestamp: datetime
    importance: float   # 0.0 – 1.0
    category: str       # "preference", "fact", "event", "instruction"
    embedding: Optional[list[float]] = None


@dataclass
class MemoryManager:
    user_id: str
    max_context_turns: int = 5        # Raw turns kept in-context
    max_retrieved_memories: int = 5   # Semantic memories injected per query
    rolling_summary: str = ""         # Compressed older turns
    recent_turns: list[dict] = field(default_factory=list)

    def add_turn(self, role: str, content: str) -> None:
        """Add a conversation turn; compress if buffer exceeds limit."""
        self.recent_turns.append({"role": role, "content": content})
        if len(self.recent_turns) > self.max_context_turns * 2:
            self._compress_older_turns()

    def _compress_older_turns(self) -> None:
        """Summarize oldest half of turns into rolling_summary."""
        turns_to_compress = self.recent_turns[:self.max_context_turns]
        self.recent_turns = self.recent_turns[self.max_context_turns:]
        # Production: call LLM to generate summary
        new_summary = f"[Summary of {len(turns_to_compress)} turns]"
        if self.rolling_summary:
            # Merge with existing summary
            self.rolling_summary = f"{self.rolling_summary}\n{new_summary}"
        else:
            self.rolling_summary = new_summary

    def store_memory(self, content: str, importance: float = 0.5,
                     category: str = "fact") -> None:
        """
        Store a new long-term memory.
        Production: embed content, upsert to vector DB with user_id filter.
        """
        memory = Memory(
            content=content,
            user_id=self.user_id,
            timestamp=datetime.now(),
            importance=importance,
            category=category,
        )
        # vector_db.upsert(memory)  # Replace with real DB call
        print(f"Stored memory: {content[:50]}... (importance={importance})")

    def retrieve_memories(self, query: str, top_k: int = 5) -> list[Memory]:
        """
        Retrieve relevant memories using recency + relevance + importance scoring.
        Production: embed query → ANN search in vector DB filtered by user_id.
        """
        # Stub: production uses cosine similarity to query embedding
        # retrieval_score = 0.3*recency + 0.5*relevance + 0.2*importance
        return []  # Replace with real vector DB query

    def build_context(self, current_query: str) -> list[dict]:
        """
        Assemble the final context in optimal layout order:
        1. System prompt (beginning → high attention)
        2. User profile from retrieved memories
        3. Rolling summary (middle — less critical)
        4. Recent raw turns (end → recency effect)
        """
        memories = self.retrieve_memories(current_query, top_k=self.max_retrieved_memories)
        memory_block = "\n".join(f"- {m.content}" for m in memories)
        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant with memory of past interactions.\n\n"
                    f"What you know about this user:\n{memory_block or 'No memories yet.'}"
                ),
            }
        ]
        # Add rolling summary as assistant context if exists
        if self.rolling_summary:
            messages.append({
                "role": "assistant",
                "content": f"[Earlier conversation summary: {self.rolling_summary}]",
            })
        # Add recent raw turns (verbatim, at the end for recency bias)
        messages.extend(self.recent_turns)
        messages.append({"role": "user", "content": current_query})
        return messages
```
The Hierarchical Memory Answer That Lands Well
Design memory in layers, ordered by retrieval cost and scope: (1) In-context buffer: last 3–5 raw turns, 0ms retrieval, session-scoped. (2) Rolling summary: compressed older turns, 0ms, session-scoped. (3) Semantic memories: top-5 retrieved by cosine similarity + recency, ~80ms, user-scoped permanent storage. (4) User profile: structured JSON of extracted facts, prepended to the system prompt. Most production systems need only layers 1–3; procedural memory (fine-tuning for skills) sits outside this hierarchy and is warranted only when a behavior is required on every query and the fine-tuning cost is justified.
The Memory Staleness Trap
The most subtle production failure in semantic memory systems: outdated memories that are semantically similar to the current query will be retrieved with high confidence scores, contradicting current user state. 'I prefer Python' stored 2 years ago conflicts with 'I've switched to TypeScript' stored last month. Mitigation requires: (1) timestamp-based recency decay in retrieval scoring, (2) LLM-powered conflict detection when writing new memories (check if incoming memory contradicts an existing one), and (3) explicit memory supersession — mark old memories as inactive rather than deleting them so you preserve the history. Most open-source memory libraries skip conflict detection entirely.
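A minimal sketch of write-time conflict detection with supersession; the `contradicts` callable stands in for an LLM judgment and is not an API from Mem0 or any specific library:

```python
from datetime import datetime

# Write-time conflict handling sketch. `contradicts(a, b)` is a placeholder for an
# LLM call that answers "does statement a contradict statement b?".

def write_memory(new_text: str, user_id: str, store: list[dict], contradicts) -> None:
    for existing in store:
        if (existing["user_id"] == user_id
                and existing["active"]
                and contradicts(new_text, existing["text"])):
            existing["active"] = False                      # supersede, don't delete: keep history
            existing["superseded_at"] = datetime.now().isoformat()
    store.append({
        "user_id": user_id,
        "text": new_text,
        "active": True,
        "created_at": datetime.now().isoformat(),
    })

# Retrieval then filters on active=True and applies recency decay as a secondary
# ranking signal, so "switched to TypeScript" outranks the stale "prefer Python".
```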
Write Gate — What Earns Persistence
- Durability: Will this fact still be true next session? Persist only memories with a stable shelf life; reject ephemeral observations (e.g., 'user is currently on the checkout page').
- Confidence: Is the source verified? Distinguish tool-output (high confidence), user-confirmed (high), and inferred-from-conversation (low), and tag the confidence at write time so retrieval can downweight uncertain memories.
- Reuse value: Will storing this reduce future cost or improve accuracy? If a fact is unique to this session and unlikely to recur, keep it transient.
- Required metadata: Every entry carries source, confidence, category, created_at, and expires_at. Without metadata, stale or wrong memories become hard to detect and silently degrade quality.
- Audit cadence: Run periodic sweeps to detect contradiction clusters, low-hit entries (retrieved but never used in the final response), and stale high-confidence facts; archive or delete noisy items. This converts memory from an append-only log into a maintained subsystem — a staff-level design point.
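A sketch of the gate as a single write path; the confidence values, TTL policy, and field names are illustrative assumptions, not requirements from the article:

```python
from datetime import datetime, timedelta
from typing import Optional

# Write-gate sketch: only facts that pass durability, confidence, and reuse checks
# are persisted, and every persisted entry carries the required metadata.
# Confidence labels and the default expiry below are illustrative assumptions.

CONFIDENCE = {"tool_output": 0.9, "user_confirmed": 0.9, "inferred": 0.4}

def gate_and_write(fact: str, source: str, category: str,
                   durable: bool, reusable: bool,
                   store: list[dict], ttl_days: Optional[int] = 365) -> bool:
    """Return True if the fact was persisted, False if kept transient."""
    if not durable or not reusable:
        return False                       # ephemeral or one-off: keep in-context only
    now = datetime.now()
    store.append({
        "text": fact,
        "source": source,                  # tool_output | user_confirmed | inferred
        "confidence": CONFIDENCE.get(source, 0.4),  # retrieval downweights low confidence
        "category": category,              # "preference", "fact", "event", "instruction"
        "created_at": now.isoformat(),
        "expires_at": (now + timedelta(days=ttl_days)).isoformat() if ttl_days else None,
    })
    return True
```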
Memory Hygiene Beats Memory Volume
Writing everything to long-term memory creates contamination and retrieval noise — the model retrieves five conflicting memories about the same topic and produces incoherent responses. Strict write policies (the gate above) plus periodic audits matter more than raw memory capacity. Interviewers reward candidates who mention memory hygiene, not just memory storage.