LLM Technology

Context Window

The maximum number of tokens an LLM can process in a single inference call, defining the total capacity for system prompt, retrieved context, conversation history, and output.

Definition

The Context Window is the maximum number of tokens an LLM can process in a single inference call. It defines the total available capacity shared among all inputs: the system prompt, retrieved context documents, conversation history, tool definitions, and the model's output. Everything the model can "see" and "remember" during a single inference call must fit within this window. Modern frontier models offer context windows of 128k to 1M tokens, but larger windows come with higher latency and cost.
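The shared-capacity constraint can be sketched in a few lines. This is an illustrative check, not any provider's API: token counts use a crude ~4-characters-per-token heuristic, and `fits_in_window` and its parameters are hypothetical names. A real system would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(components: dict[str, str], max_output_tokens: int,
                   context_window: int = 128_000) -> bool:
    """Check that all input components plus reserved output space fit."""
    input_tokens = sum(estimate_tokens(t) for t in components.values())
    return input_tokens + max_output_tokens <= context_window

# All inputs compete for the same window the output must also fit into.
parts = {
    "system_prompt": "You are a helpful assistant. " * 10,
    "retrieved_context": "Document chunk. " * 500,
    "history": "User: hi\nAssistant: hello\n" * 50,
}
print(fits_in_window(parts, max_output_tokens=2_000))
```

The key point the sketch makes concrete: reserving output space up front matters, because generation fails or truncates if the inputs alone consume the whole window.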

Engineering Context

Context window management is a core concern in agent design. With 128k-token windows, the bottleneck is often cost and latency rather than raw capacity. A common best practice is to allocate the context budget explicitly by component: system prompt (approximately 1k tokens), retrieved context (approximately 4k), conversation history (approximately 2k), tool definitions (approximately 1k), and reserved output space (approximately 2k). Summarize conversation history to compress it when it grows long, and have RAG retrieval return only the most relevant chunks rather than all available content. "Lost in the middle" is a real phenomenon: models attend better to content at the beginning and end of the context window than to content in the middle.
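The per-component budgeting and history compression described above can be sketched as follows. This is a minimal illustration under assumed budgets, not a production implementation: `BUDGETS`, `trim_history`, and the 4-chars-per-token heuristic are all placeholders for a real tokenizer and a real summarization step.

```python
# Illustrative token budgets per component, mirroring the split above.
BUDGETS = {
    "system_prompt": 1_000,
    "retrieved_context": 4_000,
    "history": 2_000,
    "tools": 1_000,
    "output": 2_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token heuristic

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until the remainder fits the history budget."""
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = [f"turn {i}: " + "word " * 200 for i in range(40)]
trimmed = trim_history(history, BUDGETS["history"])
print(len(trimmed), "of", len(history), "messages kept")
```

Trimming oldest-first keeps the most recent turns intact; a fuller version would replace the dropped prefix with an LLM-generated summary rather than discarding it, which also sidesteps some "lost in the middle" attention loss by keeping the window short.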
