LLM Technology

Tokenization

The process of converting raw text into numerical tokens for LLM processing, where token count determines API cost and whether input fits within the context window.

Definition

Tokenization is the process of converting raw text into a sequence of numerical tokens that an LLM can process. A tokenizer (such as OpenAI's tiktoken) splits text into sub-word units—not words, but frequent sequences of characters—and maps each to a numerical ID. The model operates on this token sequence, and all LLM pricing, context window limits, and rate limits are denominated in tokens rather than characters or words. Understanding tokenization is essential for budgeting, cost estimation, and context management in production AI systems.

Engineering Context

Engineering decisions directly affected by tokenization: (1) Cost estimation: one token approximates 0.75 English words, so a 1,000-word document is roughly 1,300 tokens. (2) Context budgeting: even a 128k-token context window fills quickly with system prompts, retrieved chunks, and conversation history, so budget each allocation explicitly. (3) Prompt compression: prompts written in token-efficient language (concise, avoiding filler words) cost less and leave room for more content. Non-English text typically tokenizes less efficiently than English, requiring more tokens per semantic unit. Always count tokens programmatically before sending a request to the API to avoid unexpected context window overflows.
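The estimation and budgeting heuristics above can be sketched as a small helper. The 0.75 words-per-token ratio and the 128k window are the rough figures from this section, and the per-component allocations are illustrative placeholders, not recommended values:

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English text, from this section


def estimate_tokens(text: str) -> int:
    """Rough token estimate from word count (English text only)."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(allocations: dict[str, int], window: int = 128_000) -> bool:
    """Check explicit per-component token budgets against the context window."""
    return sum(allocations.values()) <= window


# A 1,000-word document is roughly 1,300 tokens under this heuristic.
doc_tokens = estimate_tokens("word " * 1000)

# Budget each allocation explicitly, including room for the model's output.
budget = {
    "system_prompt": 1_500,
    "retrieved_chunks": 8_000,
    "conversation_history": 4_000,
    "document": doc_tokens,
    "reserved_for_output": 4_000,
}
print(doc_tokens, fits_in_window(budget))
```

This heuristic is only for rough budgeting; before an actual API call, count tokens with the model's real tokenizer, since sub-word splits make word-based estimates unreliable for code, URLs, and non-English text.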

