Definition
Model quantization converts the floating-point weights of a trained neural network from high-precision formats (FP32 or FP16) to lower-precision integer representations (INT8 or INT4). Since model weights represent the majority of memory consumed during inference, reducing their precision dramatically shrinks VRAM requirements and accelerates memory-bandwidth-bound inference operations. Quantization makes larger, more capable models deployable on hardware that would otherwise be insufficient.
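To make the mapping concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization using NumPy. The function names and the toy weight matrix are illustrative, not from any specific library; real quantizers (GPTQ, AWQ) use more sophisticated per-group schemes.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Symmetric per-tensor quantization: map the FP32 range
    # [-max|w|, +max|w|] onto the INT8 range [-127, 127].
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale

# A toy weight matrix stands in for a real layer's parameters.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Storage drops 4x (1 byte per weight instead of 4), and the
# round-trip error per element is bounded by half the scale step.
assert q.nbytes * 4 == w.nbytes
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Only the INT8 tensor and the single scale are stored; dequantization happens on the fly during inference, which is why memory-bandwidth-bound workloads speed up.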
Engineering Context
Quantization is the primary technique for deploying large models on constrained GPU budgets. A 70B parameter model at FP16 requires 140GB of VRAM for weights alone; at INT4 (4-bit quantization with GPTQ or AWQ), the weights shrink to roughly 35GB and the model fits in ~40GB with runtime overhead. The quality trade-off: INT8 is nearly lossless, while INT4 shows measurable degradation on complex reasoning tasks. QLoRA enables fine-tuning on quantized models. In practice, use AWQ or GPTQ for static quantization, or the GGUF format with llama.cpp for CPU-offload scenarios. Always benchmark quantized model quality against your specific task distribution before production deployment—degradation varies significantly by task type.
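The memory arithmetic above can be checked with a short sketch. The helper below counts weight storage only (no KV cache or activations), and the 16-bits-per-128-weights scale overhead for 4-bit group quantization is an assumed, typical configuration:

```python
def weight_gb(n_params: float, bits: float) -> float:
    # Bytes of weight storage only; excludes KV cache and activations.
    return n_params * bits / 8 / 1e9

n = 70e9  # 70B-parameter model
print(f"FP16: {weight_gb(n, 16):.0f} GB")  # prints "FP16: 140 GB"
print(f"INT8: {weight_gb(n, 8):.0f} GB")   # prints "INT8: 70 GB"
# 4-bit group quantization (e.g. group size 128) stores one FP16 scale
# per group, adding 16/128 = 0.125 effective bits per weight.
print(f"INT4: {weight_gb(n, 4 + 16/128):.1f} GB")  # prints "INT4: 36.1 GB"
```

This is why the INT4 figure lands near 40GB in practice: ~36GB of quantized weights plus runtime buffers and activations.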