Definition
GPUs are specialized processors containing thousands of small compute cores optimized for parallel floating-point arithmetic. Where CPUs excel at sequential logic with a few powerful cores, GPUs excel at executing the same operation across massive arrays of data at once. This architecture maps directly onto transformer attention and feed-forward computations, which are fundamentally large matrix multiplications that can be parallelized across thousands of cores.
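The "same operation across massive arrays of data" point can be seen in plain CPU code: in a matrix multiplication, every output element is an independent dot product, so nothing stops thousands of cores from computing them all at once. A minimal illustration (toy Python, not GPU code; the sizes are arbitrary):

```python
# Toy CPU illustration of why matmul parallelizes well: each output
# cell y[i][j] is an independent dot product over row i of `a` and
# column j of `b`, so a GPU can assign each cell to a separate core.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    # Every (i, j) cell below is computable independently of the others:
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

In a transformer, the `a` and `b` here stand in for activations and weight matrices that are thousands of elements on a side, which is why the parallelism pays off.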
Engineering Context
GPUs are the dominant compute substrate for LLMs because matrix multiplication, the core transformer operation, maps naturally to GPU parallelism. NVIDIA A100 (40GB/80GB) and H100 (80GB) are the standard enterprise inference GPUs. Rule of thumb: roughly 2GB of VRAM per 1B parameters at FP16 precision (2 bytes per parameter), plus overhead for the KV cache and activations. For on-premise deployment, VRAM is the primary constraint determining which model sizes are feasible. The H100 offers up to ~3x inference throughput over the A100 due to higher memory bandwidth and Transformer Engine support. AMD MI300X and Intel Gaudi are emerging alternatives but require software-ecosystem investment. Monitor GPU compute utilization and memory-bandwidth utilization, not just VRAM usage, when optimizing performance.
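The sizing rule of thumb above can be sketched as a back-of-envelope calculation. This is a rough estimate only; the `overhead` factor is an assumed fudge for KV cache, activations, and framework buffers, not a measured constant:

```python
# Hedged sketch of FP16 VRAM sizing. The 2-bytes-per-parameter figure
# follows from FP16 width; `overhead` (1.2x) is an assumed allowance
# for KV cache, activations, and runtime buffers.
def min_vram_gb(params_billion: float, bytes_per_param: int = 2,
                overhead: float = 1.2) -> float:
    """Approximate minimum GPU memory (GB) to serve a model for inference."""
    return params_billion * bytes_per_param * overhead

print(round(min_vram_gb(7), 1))   # ~7B model at FP16: 16.8 GB, fits one A100 40GB
print(round(min_vram_gb(70), 1))  # ~70B model at FP16: 168.0 GB, needs multiple GPUs
```

Lower-precision formats shift the estimate accordingly: at INT8 (`bytes_per_param=1`) the same 70B model needs roughly half the memory.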