Deployment & Infrastructure

Throughput

The number of tokens or requests a model serving system processes per unit of time, measured in tokens/second or requests/second, determining batch processing capacity.

Definition

Throughput measures the productive output rate of an LLM serving system: how many tokens it generates or how many requests it completes per second across all concurrent users. It is the key metric for batch processing workloads where total processing capacity matters more than individual request latency. High throughput means more work done per dollar of GPU infrastructure, making it the primary cost efficiency metric for AI agent pipelines processing large volumes of documents or requests.
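The two figures the definition mentions (tokens/second and requests/second) can be computed directly from a serving run's counters. A minimal sketch, with the function name and example numbers assumed for illustration:

```python
def measure_throughput(tokens_per_request: list[int], elapsed_s: float) -> dict:
    """Compute the two common throughput figures for one serving run.

    tokens_per_request: generated-token counts for each completed request.
    elapsed_s: wall-clock duration of the run in seconds.
    """
    total_tokens = sum(tokens_per_request)
    return {
        "tokens_per_sec": total_tokens / elapsed_s,
        "requests_per_sec": len(tokens_per_request) / elapsed_s,
    }

# Example: 8 concurrent requests, 256 generated tokens each, in 4 seconds.
stats = measure_throughput([256] * 8, 4.0)
# 2048 tokens / 4 s = 512 tokens/sec; 8 requests / 4 s = 2 requests/sec
```

System-level throughput counts tokens across all concurrent users, which is why it can rise with batch size even as each individual request slows down.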

Engineering Context

Throughput and latency are in tension: maximizing throughput (via larger batches) increases per-request latency. For batch processing workloads (overnight document analysis, bulk extraction), optimize for throughput. For interactive agents, optimize for latency. Continuous batching in vLLM significantly improves throughput by dynamically grouping requests without fixed batch size limits. GPU utilization is the primary throughput lever: idle GPU cycles are wasted capacity. Speculative decoding (using a small draft model to propose tokens that the main model verifies in parallel) can increase throughput by 2-3x for certain workloads without quality degradation.
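The throughput/latency tension above can be illustrated with a toy cost model (all numbers assumed, not measured): each decode step pays a fixed per-step cost plus a small per-sequence cost, so larger batches amortize the fixed cost and raise aggregate tokens/second, while each request's completion time grows.

```python
def step_time_s(batch_size: int, base_s: float = 0.02, per_seq_s: float = 0.002) -> float:
    # Assumed toy cost model: a fixed per-step cost (weight loads, kernel
    # launches) plus a small marginal cost for each sequence in the batch.
    return base_s + per_seq_s * batch_size

def batch_metrics(batch_size: int, tokens_per_request: int = 128) -> tuple[float, float]:
    t = step_time_s(batch_size)
    # One token per sequence per decode step, so the whole batch emits
    # batch_size tokens every t seconds.
    throughput_tok_s = batch_size / t
    # A single request must wait tokens_per_request steps to finish.
    latency_s = tokens_per_request * t
    return throughput_tok_s, latency_s

for bs in (1, 8, 64):
    tp, lat = batch_metrics(bs)
    print(f"batch={bs:3d}  throughput={tp:7.1f} tok/s  per-request latency={lat:5.2f} s")
```

Under this model, going from batch size 1 to 64 multiplies aggregate throughput roughly tenfold while also multiplying per-request latency, which is the trade continuous batching navigates dynamically rather than with a fixed batch size.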
