GPU Memory Bandwidth: Complete Guide to HBM, GDDR, and LPDDR
Quick Summary
- HBM3/HBM3e: 3.35 TB/s (H100, HBM3) to 8.0 TB/s (B200, HBM3e), stack-based, highest bandwidth
- GDDR6: 864 GB/s (L40S), cost-effective, lower bandwidth than HBM
- GDDR7: roughly 1.5-1.8 TB/s expected, next-gen graphics memory
- Bandwidth Impact: Directly determines LLM training tokens/second
- Guideline: 1 GB/s bandwidth per 1B model parameters for efficient training
GPU memory bandwidth, the rate at which data can be read from or written to GPU memory, is often the single most important performance determinant for AI training workloads. Unlike CPU workloads, which benefit from deep caching hierarchies, AI training is dominated by streaming memory access patterns that depend directly on raw memory bandwidth. Understanding memory bandwidth characteristics is therefore essential for GPU selection, workload optimization, and infrastructure planning.
Memory Bandwidth Comparison by GPU
| GPU | Memory Type | Bandwidth | Capacity | Bandwidth per Watt |
|---|---|---|---|---|
| NVIDIA A100 | HBM2e | 1.6/2.0 TB/s | 40/80 GB | 5.0 GB/s/W |
| NVIDIA H100 SXM | HBM3 | 3.35 TB/s | 80 GB | 4.8 GB/s/W |
| NVIDIA H200 SXM | HBM3e | 4.8 TB/s | 141 GB | 6.9 GB/s/W |
| NVIDIA B200 | HBM3e | 8.0 TB/s | 192 GB | 8.0 GB/s/W |
| AMD MI300X | HBM3 | 5.3 TB/s | 192 GB | 7.1 GB/s/W |
| NVIDIA L40S | GDDR6 | 864 GB/s | 48 GB | 2.5 GB/s/W |
| NVIDIA L4 | GDDR6 | 300 GB/s | 24 GB | 4.2 GB/s/W |
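The last column of the table can be reproduced from published specs. A minimal sketch in Python; the board-power figures below are typical TDP values assumed for illustration, not part of the table:

```python
# Derive "Bandwidth per Watt" from bandwidth and board power.
# TDP values are assumed typical figures for illustration.
GPUS = {
    # name: (bandwidth in GB/s, board power in W)
    "H100 SXM": (3350, 700),
    "B200": (8000, 1000),
    "L40S": (864, 350),
}

def bandwidth_per_watt(gb_per_s: float, watts: float) -> float:
    """GB/s delivered per watt of board power."""
    return gb_per_s / watts

for name, (bw, tdp) in GPUS.items():
    print(f"{name}: {bandwidth_per_watt(bw, tdp):.1f} GB/s/W")
```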
Bandwidth Impact on Training Performance
LLM training throughput scales nearly linearly with memory bandwidth up to the point where compute capacity becomes the bottleneck. For transformer models with moderate batch sizes, each 1% increase in memory bandwidth translates to approximately 0.8-1% increase in training throughput. The relationship is strongest for memory-bound workloads like long-sequence transformers and attention-heavy architectures.
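A simple roofline model captures this behavior: step time is the slower of the compute time and the memory-traffic time, so memory-bound steps speed up almost one-for-one with bandwidth. A sketch with illustrative numbers, not measurements:

```python
def step_time(flops: float, bytes_moved: float,
              peak_flops: float, mem_bw: float) -> float:
    """Roofline model: a step is bounded by the slower of
    compute time and memory-traffic time."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

# A memory-bound step (low arithmetic intensity): doubling memory
# bandwidth nearly halves the step time.
t1 = step_time(flops=1e12, bytes_moved=1e12, peak_flops=1e15, mem_bw=3.35e12)
t2 = step_time(flops=1e12, bytes_moved=1e12, peak_flops=1e15, mem_bw=6.7e12)
```

Once arithmetic intensity is high enough that the compute term dominates, extra bandwidth stops helping, which is why the near-linear scaling holds only up to the compute ceiling.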
Practical Guidelines
For AI infrastructure planning, a practical rule of thumb is that each 1 billion model parameters requires approximately 1 GB/s of memory bandwidth for efficient training. A 70B parameter model benefits from 70 GB/s of aggregate bandwidth across the GPU configuration. Eight H100 GPUs provide 26.8 TB/s of aggregate bandwidth (3.35 TB/s × 8), leaving substantial headroom for 70B model training.
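The rule of thumb above can be sketched as a small calculator; the constants come from this section, and the function names are illustrative:

```python
def required_bandwidth_gbps(params_billion: float) -> float:
    """Rule of thumb from this guide: ~1 GB/s per 1B parameters."""
    return params_billion * 1.0

def aggregate_bandwidth_tbps(per_gpu_tbps: float, gpu_count: int) -> float:
    """Total memory bandwidth across a multi-GPU configuration."""
    return per_gpu_tbps * gpu_count

need = required_bandwidth_gbps(70)       # 70B model -> 70 GB/s
agg = aggregate_bandwidth_tbps(3.35, 8)  # 8x H100 SXM -> 26.8 TB/s
```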
Frequently Asked Questions
Does memory bandwidth affect inference as much as training?
Inference is more memory bandwidth-bound than training because batch sizes are smaller (often batch size 1 for interactive applications) and compute utilization is lower. For inference, memory bandwidth is often the primary bottleneck, making high-bandwidth GPUs particularly valuable for serving applications.
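One way to see why: during batch-1 decoding, each generated token streams essentially all model weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. A back-of-envelope sketch assuming FP16 weights and ignoring KV-cache traffic:

```python
def decode_tokens_per_sec(mem_bw_gbps: float,
                          params_billion: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper bound on batch-1 decode speed: every token reads all
    weights once, so rate <= bandwidth / model size in bytes.
    bytes_per_param=2.0 assumes FP16 weights."""
    model_gb = params_billion * bytes_per_param
    return mem_bw_gbps / model_gb

# 70B model in FP16 on one H100 SXM (3350 GB/s): ~24 tokens/s ceiling.
est = decode_tokens_per_sec(3350, 70)
```

Real systems land below this ceiling, but the estimate explains why higher-bandwidth parts are disproportionately valuable for interactive serving.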
How does memory bandwidth scale across multiple GPUs?
Effective memory bandwidth scales linearly with GPU count for independent workloads. For distributed training with model parallelism, effective bandwidth is limited by the slowest interconnect path: NVLink within a node, InfiniBand between nodes. Overall system bandwidth is therefore roughly min(GPU count × per-GPU memory bandwidth, interconnect bandwidth).
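That relationship can be sketched as follows; the function name and units are illustrative, not a standard API:

```python
def effective_bandwidth_tbps(gpu_count: int,
                             mem_bw_per_gpu_tbps: float,
                             interconnect_tbps: float) -> float:
    """Effective bandwidth for communication-coupled workloads:
    aggregate local bandwidth, capped by the interconnect."""
    return min(gpu_count * mem_bw_per_gpu_tbps, interconnect_tbps)

# 8x H100 within one NVLink domain: aggregate bandwidth dominates.
in_node = effective_bandwidth_tbps(8, 3.35, interconnect_tbps=100.0)
# Across a slow inter-node link, the interconnect becomes the cap.
cross_node = effective_bandwidth_tbps(8, 3.35, interconnect_tbps=0.4)
```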