Architecture · January 20, 2026 · 11 min read

Agent Memory Patterns: Short-Term vs Long-Term Storage

How to design robust memory architectures for production AI agents—from in-context working memory to persistent episodic stores.

Memory is one of the hardest problems in production AI agent design. Most tutorials show you how to pass chat history into a context window—and stop there. But real enterprise agents need multi-tier memory architectures that balance speed, cost, and recall accuracy. Here's what we've learned building memory systems at scale.

The Four Types of Agent Memory

Cognitive science distinguishes four memory types, and agent architecture mirrors this taxonomy:

  1. Working memory — the active context the model reasons over right now
  2. Episodic memory — records of specific past interactions
  3. Semantic memory — general knowledge and facts
  4. Procedural memory — learned skills and routines (prompts, tools, workflows)

Most agent frameworks only address working memory. To build agents that learn from past interactions and scale beyond a single conversation, you need all four.

Working Memory: The Context Window

Working memory is whatever fits in the LLM's context window at inference time. For most production agents, this includes:

# Working memory structure
System prompt          ~500-2000 tokens
Retrieved context      ~1000-4000 tokens
Conversation history   ~500-2000 tokens
Current task state     ~200-500 tokens
─────────────────────────────────────
Total budget           ~8000 tokens

The critical insight: working memory is not free. Every token in context costs money and adds latency. You need a compression strategy to keep working memory efficient.
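One simple compression strategy is to treat the system prompt and retrieved context as fixed, then admit history turns newest-first until the budget is exhausted. A minimal sketch (the function names are hypothetical, and token counts are approximated by whitespace splitting; a real system would use the model's tokenizer):

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: ~1 token per whitespace-separated word."""
    return len(text.split())

def build_working_memory(system_prompt: str, retrieved: list[str],
                         history: list[str], budget: int = 8000) -> list[str]:
    """Assemble context, dropping the oldest history turns first
    when the total would exceed the token budget."""
    fixed = [system_prompt] + retrieved
    used = sum(count_tokens(t) for t in fixed)
    kept: list[str] = []
    # Walk history from newest to oldest, keeping turns while budget allows.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.insert(0, turn)
        used += cost
    return fixed + kept
```

Dropping whole turns is the bluntest option; summarizing evicted turns into a single running note is a common refinement.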

Episodic Memory: Retrievable History

Episodic memory stores past agent interactions in a retrievable format. We use a combination of structured storage and vector embeddings:

When a new task arrives, we retrieve the top-3 most semantically similar past interactions and inject their summaries into working memory. This gives the agent context from past experience without blowing the context budget.
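The lookup itself can be sketched as a cosine-similarity scan over stored interaction embeddings (the storage layout and names here are assumptions; production systems would use a vector database rather than a linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_episodes(task_embedding: list[float],
                      store: list[tuple[list[float], str]],
                      k: int = 3) -> list[str]:
    """store holds (embedding, summary) pairs from past interactions;
    return the summaries of the k most similar episodes."""
    ranked = sorted(store, key=lambda e: cosine(task_embedding, e[0]),
                    reverse=True)
    return [summary for _, summary in ranked[:k]]
```

Injecting summaries rather than raw transcripts is what keeps this within the working-memory budget.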

Semantic Memory: The Knowledge Base

Semantic memory is your RAG layer—the indexed knowledge the agent draws on to answer questions. Design considerations:

Decision             Options                  Our Recommendation
Chunk size           256–2048 tokens          512 tokens with 50-token overlap
Embedding model      Ada-002, BGE, E5         text-embedding-3-large
Retrieval strategy   Dense, Sparse, Hybrid    Hybrid (BM25 + dense)
Reranking            None, Cross-encoder      Cross-encoder for top-20 candidates
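One common way to merge the BM25 and dense rankings in a hybrid setup is reciprocal rank fusion (RRF); the sketch below assumes each retriever returns a ranked list of document IDs, and uses the conventional k=60 smoothing constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs into one ranking.
    Each document scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it avoids calibrating BM25 scores against cosine similarities; the fused top-20 then goes to the cross-encoder reranker.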

Memory Tier Architecture

In production, we use a three-tier architecture that balances speed, cost, and recall:

  1. Hot tier (in-context) — Current task state, last 3 turns, pre-retrieved context chunks
  2. Warm tier (Redis cache) — Recent sessions cached by user/project, 24-hour TTL
  3. Cold tier (vector + SQL) — Full history, semantic search on demand

The retrieval decision tree runs before every agent invocation: check warm cache first, fall back to vector search only when needed. This reduces latency by ~40% and cuts embedding costs significantly.
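The decision tree reduces to a short cache-aside pattern. A minimal sketch, assuming a Redis-like client with `get`/`set` and a `vector_search` callable (key scheme and names are hypothetical; TTL handling is left to the cache):

```python
def get_context(cache, vector_search, user_id: str,
                project_id: str, query: str):
    """Warm-cache-first retrieval: hit Redis before the vector store."""
    key = f"ctx:{user_id}:{project_id}"
    cached = cache.get(key)           # warm tier: recent sessions
    if cached is not None:
        return cached
    result = vector_search(query)     # cold tier: semantic search on demand
    cache.set(key, result)            # warm the cache for subsequent calls
    return result
```

The savings come from the second and later calls in a session: the vector store and embedding API are only touched on a cache miss.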

Common Memory Anti-Patterns

  1. Context stuffing — growing the context window instead of compressing working memory
  2. Treating tokens as free — every in-context token adds cost and latency
  3. Single-tier storage — hitting the vector store on every call instead of caching recent sessions

Conclusion

Effective agent memory requires treating each tier differently: compress working memory aggressively, cache recent episodes for fast retrieval, and maintain a high-quality semantic index for deep knowledge access. The agents that perform best in production are those with explicit memory management—not those that simply grow the context window.

Need a production-grade agent memory architecture?

We design and implement multi-tier memory systems for enterprise AI agents. Book a technical discovery call.
