Memory is one of the hardest problems in production AI agent design. Most tutorials show you how to pass chat history into a context window—and stop there. But real enterprise agents need multi-tier memory architectures that balance speed, cost, and recall accuracy. Here's what we've learned building memory systems at scale.
The Four Types of Agent Memory
Cognitive science distinguishes four memory types, and agent architecture mirrors this taxonomy:
- Working memory (in-context) — The current conversation and task state, limited by context window size
- Episodic memory (recent history) — A log of past interactions retrievable by recency or relevance
- Semantic memory (knowledge base) — Factual knowledge stored as embeddings in a vector database
- Procedural memory (skills) — Codified workflows and tool-use patterns, typically hardcoded in the agent graph
Most agent frameworks only address working memory. To build agents that learn from past interactions and scale beyond a single conversation, you need all four.
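The four-type taxonomy above can be sketched as a single container, so each tier has an explicit home rather than living implicitly in the prompt. This is an illustrative shape only; the field names and types are our assumptions, not a framework API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative container mirroring the four memory types (names are ours)."""
    working: list = field(default_factory=list)    # in-context turns and task state
    episodic: list = field(default_factory=list)   # summaries of past interactions
    semantic: dict = field(default_factory=dict)   # knowledge-base entries (RAG layer)
    procedural: dict = field(default_factory=dict) # named workflows / tool-use patterns
```

Making each tier an explicit attribute forces every read and write to name which kind of memory it touches, which is where most of the management logic in the following sections attaches.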
Working Memory: The Context Window
Working memory is whatever fits in the LLM's context window at inference time. For most production agents, this includes the system prompt and tool definitions, the current task state, the last few conversation turns, and any pre-retrieved context chunks.
The critical insight: working memory is not free. Every token in context costs money and adds latency. You need a compression strategy to keep working memory efficient.
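A minimal compression strategy is a token budget applied to the turn history: keep the most recent turns that fit, drop the rest. The sketch below assumes a crude characters-per-token estimate (real systems should use the model's tokenizer) and a budget we picked for illustration:

```python
def fit_to_budget(turns, budget_tokens=2000, est=lambda t: len(t) // 4):
    """Keep the most recent turns whose estimated token cost fits the budget.

    `est` is a rough len/4 heuristic, an assumption; swap in a real tokenizer.
    Older turns that overflow the budget are dropped (or summarized upstream).
    """
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-to-oldest
        cost = est(turn)
        if used + cost > budget_tokens:
            break                         # budget exhausted: stop including history
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order
```

In practice the dropped prefix would be replaced by an LLM-generated summary (the episodic-memory layer below), not silently discarded.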
Episodic Memory: Retrievable History
Episodic memory stores past agent interactions in a retrievable format. We use a combination of structured storage and vector embeddings:
- Structured store — PostgreSQL with JSONB for full interaction logs with metadata (timestamp, task type, outcome)
- Vector index — Embeddings of interaction summaries for semantic retrieval
- Summary layer — LLM-generated summaries of past sessions, compressed to 200–300 tokens each
When a new task arrives, we retrieve the top-3 most semantically similar past interactions and inject their summaries into working memory. This gives the agent context from past experience without blowing the context budget.
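The top-3 retrieval step reduces to cosine similarity between the new task's embedding and the stored summary embeddings. A self-contained sketch, with toy list vectors standing in for real embedding-model output and the vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_episodes(query_vec, episodes, k=3):
    """Return the k episode summaries most similar to the query.

    `episodes` is a list of (summary_text, embedding) pairs -- in production
    this lookup is delegated to the vector index rather than a linear scan.
    """
    ranked = sorted(episodes, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [summary for summary, _ in ranked[:k]]
```

The returned summaries are what gets injected into working memory; the full interaction logs stay in the structured store and are only fetched if the agent drills down.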
Semantic Memory: The Knowledge Base
Semantic memory is your RAG layer—the indexed knowledge the agent draws on to answer questions. Design considerations:
| Decision | Options | Our Recommendation |
|---|---|---|
| Chunk size | 256–2048 tokens | 512 tokens with 50-token overlap |
| Embedding model | Ada-002, BGE, E5 | text-embedding-3-large |
| Retrieval strategy | Dense, Sparse, Hybrid | Hybrid (BM25 + dense) |
| Reranking | None, Cross-encoder | Cross-encoder for top-20 candidates |
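One common way to combine the BM25 and dense result lists recommended above is reciprocal rank fusion (RRF), which merges rankings without needing the two scoring scales to be comparable. A sketch; the constant `k=60` is the conventional default from the RRF literature, and the doc-id lists stand in for real retriever output:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids into one ranking.

    Each doc scores sum(1 / (k + rank)) across the input rankings, so documents
    ranked highly by either retriever (BM25 or dense) float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused top-20 would then go to the cross-encoder reranker from the table for final ordering.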
Memory Tier Architecture
In production, we use a three-tier architecture that balances speed, cost, and recall:
- Hot tier (in-context) — Current task state, last 3 turns, pre-retrieved context chunks
- Warm tier (Redis cache) — Recent sessions cached by user/project, 24-hour TTL
- Cold tier (vector + SQL) — Full history, semantic search on demand
The retrieval decision tree runs before every agent invocation: check the warm cache first, fall back to vector search only when needed. This reduces latency by ~40% and cuts embedding costs, since warm-cache hits skip the embedding call entirely.
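The warm-before-cold decision tree can be sketched as a TTL cache in front of a slower lookup. Here a plain dict stands in for Redis and an injected callable stands in for the vector + SQL cold path; both substitutions are assumptions for the sake of a runnable example:

```python
import time

class TieredMemory:
    """Warm TTL cache in front of a slower cold store (illustrative sketch)."""

    def __init__(self, cold_lookup, ttl_seconds=24 * 3600):
        self.cold_lookup = cold_lookup   # vector + SQL search in production
        self.ttl = ttl_seconds           # 24-hour TTL, per the warm-tier design
        self.warm = {}                   # key -> (value, expiry); stand-in for Redis

    def get(self, key):
        hit = self.warm.get(key)
        if hit and hit[1] > time.time():
            return hit[0]                # warm hit: no embedding call, no vector search
        value = self.cold_lookup(key)    # cold path: semantic search on demand
        self.warm[key] = (value, time.time() + self.ttl)
        return value
```

Repeated lookups for the same user/project within the TTL never touch the cold tier, which is where the latency and embedding-cost savings come from.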
Common Memory Anti-Patterns
- Naive history injection — Appending the full conversation log without compression. Causes context bloat and increased costs.
- No memory namespacing — Sharing a vector store across users/tenants without isolation. Causes data leakage.
- Stale embeddings — Not re-indexing when source documents change. Causes retrieval of outdated information.
- Memory without TTL — Storing every interaction forever without pruning. Causes recall degradation over time as irrelevant memories dominate.
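The namespacing anti-pattern in particular has a cheap fix worth showing: scope every stored key by tenant, and filter retrieval results by tenant as defense in depth. The key format and metadata field name below are our conventions, not a vector-store API:

```python
def namespaced_key(tenant_id, key):
    """Prefix every memory key with its tenant so stores are never shared raw."""
    return f"{tenant_id}:{key}"

def tenant_filter(results, tenant_id):
    """Defense in depth: drop any retrieved chunk not owned by the caller's tenant.

    Even with namespaced writes, filtering on a `tenant_id` metadata field at
    read time catches mis-indexed documents before they leak across tenants.
    """
    return [r for r in results if r.get("tenant_id") == tenant_id]
```

Most managed vector databases support this natively via per-tenant collections or metadata filters; the point is that isolation must be enforced on both the write path and the read path.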
Conclusion
Effective agent memory requires treating each tier differently: compress working memory aggressively, cache recent episodes for fast retrieval, and maintain a high-quality semantic index for deep knowledge access. The agents that perform best in production are those with explicit memory management—not those that simply grow the context window.
Need a production-grade agent memory architecture?
We design and implement multi-tier memory systems for enterprise AI agents. Book a technical discovery call.