Definition
Model serving is the operational infrastructure that makes a trained LLM accessible as a production service. It encompasses the software layer between raw model weights and application code: loading model weights into GPU memory, managing the inference runtime, handling concurrent requests, streaming outputs to clients, and exposing a standardized API. Model serving transforms a research artifact (model weights) into a reliable, scalable production service.
Engineering Context
Popular serving frameworks include vLLM (high-throughput serving), TGI (Text Generation Inference), TensorRT-LLM (NVIDIA-optimized), and Ollama (local development). Key optimizations include continuous batching (admitting and retiring requests at token granularity rather than waiting for a whole batch to finish), KV-cache reuse for shared prompt prefixes, and weight quantization for memory efficiency. In production, model servers expose OpenAI-compatible APIs, enabling drop-in model swapping. Horizontal scaling requires attention to GPU memory: each replica loads a full copy of the model weights, so scaling out means proportionally more GPU hardware. Tie autoscaling policies to request queue depth rather than CPU or memory metrics, since GPU-bound inference saturates long before those metrics move.
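The throughput benefit of continuous batching comes from filling a slot the moment a request finishes, instead of letting the batch drain before admitting new work. A minimal toy simulation of the idea (token counts and batch size are illustrative, not tied to any particular framework):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int          # request id
    tokens_left: int  # decode steps remaining


def continuous_batching(requests, max_batch):
    """Toy scheduler: finished requests leave the batch and waiting
    requests join immediately, at token granularity."""
    waiting = deque(requests)
    running, completed, steps = [], [], 0
    while waiting or running:
        # Admit new requests as soon as slots free up.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        for r in running:
            r.tokens_left -= 1
        steps += 1
        # Retire finished requests; their slots are reused next step.
        for r in [r for r in running if r.tokens_left == 0]:
            completed.append(r.rid)
            running.remove(r)
    return steps, completed


def static_batching(lengths, max_batch):
    """Baseline: the whole batch drains before new requests enter,
    so every batch costs as many steps as its longest request."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))
```

With four requests needing 1, 4, 1, and 4 decode steps and two slots, the continuous scheduler finishes in 6 steps while the static baseline takes 8, because short requests no longer hold their slot while waiting for long neighbors to finish.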