Architecture · January 20, 2026 · 11 min read

Agent Memory Patterns: Short-Term vs Long-Term Storage

How to design robust memory architectures for production AI agents—from in-context working memory to persistent episodic stores.

Memory is one of the hardest problems in production AI agent design. Most tutorials show you how to pass chat history into a context window—and stop there. But real enterprise agents need multi-tier memory architectures that balance speed, cost, and recall accuracy. Here's what we've learned building memory systems at scale.

The Four Types of Agent Memory

Cognitive science distinguishes four memory types, and agent architecture mirrors this taxonomy:

  1. Working memory — the active context the model reasons over right now
  2. Episodic memory — records of specific past interactions
  3. Semantic memory — general knowledge and facts
  4. Procedural memory — learned skills and routines (prompts, tools, workflows)

Most agent frameworks only address working memory. To build agents that learn from past interactions and scale beyond a single conversation, you need all four.

Working Memory: The Context Window

Working memory is whatever fits in the LLM's context window at inference time. For most production agents, this includes:

# Working memory structure
System prompt          ~500-2000 tokens
Retrieved context      ~1000-4000 tokens
Conversation history   ~500-2000 tokens
Current task state     ~200-500 tokens
─────────────────────────────────────
Total budget           ~8000 tokens

The critical insight: working memory is not free. Every token in context costs money and adds latency. You need a compression strategy to keep working memory efficient.
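One simple compression strategy is to treat the system prompt and retrieved context as fixed, then admit history turns newest-first until the budget is exhausted. A minimal sketch (the function names are hypothetical, and token counts are approximated by whitespace splitting; a real system would use the model's tokenizer):

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: ~1 token per whitespace-separated word."""
    return len(text.split())

def build_working_memory(system_prompt: str, retrieved: list[str],
                         history: list[str], budget: int = 8000) -> list[str]:
    """Assemble context, dropping the oldest history turns first
    when the total would exceed the token budget."""
    fixed = [system_prompt] + retrieved
    used = sum(count_tokens(t) for t in fixed)
    kept: list[str] = []
    # Walk history from newest to oldest, keeping turns while budget allows.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.insert(0, turn)
        used += cost
    return fixed + kept
```

Dropping whole turns is the bluntest option; summarizing evicted turns into a single running note is a common refinement.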

Episodic Memory: Retrievable History

Episodic memory stores past agent interactions in a retrievable format. We use a combination of structured storage and vector embeddings:

When a new task arrives, we retrieve the top-3 most semantically similar past interactions and inject their summaries into working memory. This gives the agent context from past experience without blowing the context budget.
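The lookup itself can be sketched as a cosine-similarity scan over stored interaction embeddings (the storage layout and names here are assumptions; production systems would use a vector database rather than a linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_episodes(task_embedding: list[float],
                      store: list[tuple[list[float], str]],
                      k: int = 3) -> list[str]:
    """store holds (embedding, summary) pairs from past interactions;
    return the summaries of the k most similar episodes."""
    ranked = sorted(store, key=lambda e: cosine(task_embedding, e[0]),
                    reverse=True)
    return [summary for _, summary in ranked[:k]]
```

Injecting summaries rather than raw transcripts is what keeps this within the working-memory budget.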

Semantic Memory: The Knowledge Base

Semantic memory is your RAG layer—the indexed knowledge the agent draws on to answer questions. Design considerations:

Decision             Options                  Our Recommendation
Chunk size           256–2048 tokens          512 tokens with 50-token overlap
Embedding model      Ada-002, BGE, E5         text-embedding-3-large
Retrieval strategy   Dense, Sparse, Hybrid    Hybrid (BM25 + dense)
Reranking            None, Cross-encoder      Cross-encoder for top-20 candidates
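One common way to merge the BM25 and dense rankings in a hybrid setup is reciprocal rank fusion (RRF); the sketch below assumes each retriever returns a ranked list of document IDs, and uses the conventional k=60 smoothing constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs into one ranking.
    Each document scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it avoids calibrating BM25 scores against cosine similarities; the fused top-20 then goes to the cross-encoder reranker.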

Memory Tier Architecture

In production, we use a three-tier architecture that balances speed, cost, and recall:

  1. Hot tier (in-context) — Current task state, last 3 turns, pre-retrieved context chunks
  2. Warm tier (Redis cache) — Recent sessions cached by user/project, 24-hour TTL
  3. Cold tier (vector + SQL) — Full history, semantic search on demand

The retrieval decision tree runs before every agent invocation: check warm cache first, fall back to vector search only when needed. This reduces latency by ~40% and cuts embedding costs significantly.
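The decision tree reduces to a short cache-aside pattern. A minimal sketch, assuming a Redis-like client with `get`/`set` and a `vector_search` callable (key scheme and names are hypothetical; TTL handling is left to the cache):

```python
def get_context(cache, vector_search, user_id: str,
                project_id: str, query: str):
    """Warm-cache-first retrieval: hit Redis before the vector store."""
    key = f"ctx:{user_id}:{project_id}"
    cached = cache.get(key)           # warm tier: recent sessions
    if cached is not None:
        return cached
    result = vector_search(query)     # cold tier: semantic search on demand
    cache.set(key, result)            # warm the cache for subsequent calls
    return result
```

The savings come from the second and later calls in a session: the vector store and embedding API are only touched on a cache miss.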

Common Memory Anti-Patterns

  1. Context stuffing — growing the context window instead of compressing working memory
  2. Treating tokens as free — every in-context token adds cost and latency
  3. Single-tier storage — hitting the vector store on every call instead of caching recent sessions

Conclusion

Effective agent memory requires treating each tier differently: compress working memory aggressively, cache recent episodes for fast retrieval, and maintain a high-quality semantic index for deep knowledge access. The agents that perform best in production are those with explicit memory management—not those that simply grow the context window.

Need a production-grade agent memory architecture?

We design and implement multi-tier memory systems for enterprise AI agents. Book a technical discovery call.
