Definition
Model serving is the operational infrastructure that makes a trained LLM accessible as a production service. It encompasses the software layer between raw model weights and application code: loading model weights into GPU memory, managing the inference runtime, handling concurrent requests, streaming outputs to clients, and exposing a standardized API. Model serving transforms a research artifact (model weights) into a reliable, scalable production service.
Engineering Context
Popular serving frameworks include vLLM (high-throughput serving), TGI (Text Generation Inference), TensorRT-LLM (NVIDIA-optimized), and Ollama (local development). Key optimizations include continuous batching (admitting and retiring requests at token granularity rather than waiting for a whole batch to finish), KV-cache reuse for shared prompt prefixes, and weight quantization for memory efficiency. In production, model servers expose OpenAI-compatible APIs, enabling drop-in model swapping. Horizontal scaling requires attention to GPU memory: each replica loads a full copy of the model weights, so scaling out means proportionally more GPU hardware. Tie autoscaling policies to request queue depth rather than CPU or memory metrics, since GPU-bound inference saturates long before those metrics move.
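The throughput benefit of continuous batching comes from filling a slot the moment a request finishes, instead of letting the batch drain before admitting new work. A minimal toy simulation of the idea (token counts and batch size are illustrative, not tied to any particular framework):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int          # request id
    tokens_left: int  # decode steps remaining


def continuous_batching(requests, max_batch):
    """Toy scheduler: finished requests leave the batch and waiting
    requests join immediately, at token granularity."""
    waiting = deque(requests)
    running, completed, steps = [], [], 0
    while waiting or running:
        # Admit new requests as soon as slots free up.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        for r in running:
            r.tokens_left -= 1
        steps += 1
        # Retire finished requests; their slots are reused next step.
        for r in [r for r in running if r.tokens_left == 0]:
            completed.append(r.rid)
            running.remove(r)
    return steps, completed


def static_batching(lengths, max_batch):
    """Baseline: the whole batch drains before new requests enter,
    so every batch costs as many steps as its longest request."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))
```

With four requests needing 1, 4, 1, and 4 decode steps and two slots, the continuous scheduler finishes in 6 steps while the static baseline takes 8, because short requests no longer hold their slot while waiting for long neighbors to finish.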