Enterprise GPU Memory Hierarchy: HBM3, GDDR7, and LPDDR5X
Quick Summary
- HBM3/HBM3e: 3.35-8 TB/s, 3D-stacked architecture, used in H100, H200, and B200
- GDDR7: 32-48 Gb/s per pin, next-gen graphics memory, 2025+
- LPDDR5X: 8.5 Gb/s per pin, low power, used in Grace Hopper unified memory
- Selection: HBM for AI training, GDDR for graphics, LPDDR for edge
- Bandwidth Gap: HBM3's 3.35 TB/s is roughly 40-90x the bandwidth of a desktop DDR5 system (one to two channels of DDR5-4800, ~38-77 GB/s)
Understanding GPU Memory Hierarchy
GPU memory architecture determines how quickly data can move between storage and compute units, directly impacting AI training throughput and inference latency. Modern enterprise GPUs employ a multi-level memory hierarchy that balances capacity, bandwidth, and latency trade-offs. Understanding this hierarchy is essential for selecting the right GPU configuration for specific AI workloads.
| Memory Type | Bandwidth | Capacity per GPU | Latency | Used In |
|---|---|---|---|---|
| HBM3 | 3.35-5.3 TB/s | 80-192 GB | ~100 ns | H100, MI300X |
| HBM3e | 4.8-8 TB/s | 141-192 GB | ~100 ns | H200, B200 |
| GDDR6 | 300-864 GB/s | 24-48 GB | ~200 ns | L4, L40S, RTX 6000 |
| GDDR7 | 1.5-1.8 TB/s | 24-48 GB | ~180 ns | RTX 5090, future data center GPUs |
| LPDDR5X | ~512 GB/s | 480 GB (system) | ~250 ns | Grace Hopper unified memory |
| SRAM (on-chip) | 20-50 TB/s | 20-50 MB | ~5 ns | All GPUs (L1/L2 cache) |
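To make the table concrete, a quick back-of-envelope calculation shows how long a single pass over a model's weights takes at each tier's bandwidth, which sets a hard floor on per-token latency for memory-bound inference. This is a minimal sketch: the bandwidth figures are the representative values from the table, and the 13B-parameter FP16 model is an illustrative assumption, not a benchmark.

```python
# Time for one full pass over model weights at each memory tier's bandwidth.
# Bandwidth figures are representative values from the table above; the
# 13B-parameter FP16 model is an illustrative assumption.

TIER_BANDWIDTH_GBPS = {
    "HBM3 (H100)":      3350,
    "HBM3e (B200)":     8000,
    "GDDR6 (L40S)":      864,
    "GDDR7 (RTX 5090)": 1790,
    "LPDDR5X (Grace)":   512,
}

MODEL_BYTES = 13e9 * 2  # 13B parameters at FP16 (2 bytes each) = 26 GB

for tier, bw in TIER_BANDWIDTH_GBPS.items():
    ms = MODEL_BYTES / (bw * 1e9) * 1e3  # milliseconds per full weight read
    print(f"{tier:>18}: {ms:5.1f} ms/pass  (~{1000 / ms:4.0f} tokens/s ceiling)")
```

Because each generated token requires streaming essentially all of the weights through the compute units, these figures approximate the per-GPU decode ceiling before batching or quantization.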
HBM3: The AI Training Standard
High Bandwidth Memory (HBM) uses a 3D stack architecture with through-silicon vias (TSVs) connecting multiple DRAM dies vertically. This provides an extremely wide memory bus (1024 bits per stack) that delivers bandwidth far exceeding traditional GDDR memory. HBM3, used in the H100, achieves 3.35 TB/s of bandwidth from 80 GB of stacked memory; the H200 steps up to HBM3e, reaching 141 GB and 4.8 TB/s. The disadvantage of HBM is cost and complexity: it requires an interposer substrate that adds manufacturing expense and limits maximum capacity per stack.
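The aggregate bandwidth follows directly from the stack parameters. A minimal sketch, assuming the HBM3 spec's 1024-bit per-stack interface and the commonly cited figure of five active stacks on the 80 GB H100 (the per-pin rates here are back-calculated illustrations, not official specifications):

```python
# Aggregate HBM bandwidth = stacks * interface_width_bits * rate_per_pin / 8

def hbm_bandwidth_gbps(stacks: int, width_bits: int, gbps_per_pin: float) -> float:
    """Returns aggregate bandwidth in GB/s across all stacks."""
    return stacks * width_bits * gbps_per_pin / 8

# One HBM3 stack at the spec's 6.4 Gb/s per pin:
print(hbm_bandwidth_gbps(1, 1024, 6.4))   # 819.2 GB/s per stack
# Five active stacks at ~5.2 Gb/s per pin lands near the H100's 3.35 TB/s:
print(hbm_bandwidth_gbps(5, 1024, 5.2))   # 3328.0 GB/s
```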
GDDR Memory for Inference
GDDR (Graphics Double Data Rate) memory uses a traditional PCB-based architecture with individual memory chips surrounding the GPU die. GDDR6, used in the L40S (864 GB/s) and L4 (300 GB/s), offers lower bandwidth than HBM but provides useful capacity at lower cost. For AI inference workloads, where model weights must fit in GPU memory but bandwidth requirements are lower than in training, GDDR offers a favorable cost-performance balance.
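As a rough illustration of the quantization point, the sketch below checks whether a model's weights fit in a GDDR-class card at common precisions. The 1.2x overhead factor for KV cache and activations is an assumption for illustration, not a measured figure.

```python
# Does a quantized model fit in GPU memory? Bytes-per-parameter follow the
# standard precision formats; the 1.2x overhead factor is a rough allowance
# for KV cache and activations (an assumption, not a measurement).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits(params_billions: float, precision: str, vram_gb: float,
         overhead: float = 1.2) -> bool:
    needed_gb = params_billions * BYTES_PER_PARAM[precision] * overhead
    return needed_gb <= vram_gb

# A 70B-parameter model on a 48 GB GDDR6 card such as the L40S:
for precision in BYTES_PER_PARAM:
    print(precision, fits(70, precision, vram_gb=48))
# fp16: False (168 GB), int8: False (84 GB), int4: True (42 GB)
```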
Selecting the Right Memory Technology
The choice between HBM and GDDR GPUs depends on workload characteristics. Training workloads benefit from HBM's extreme bandwidth for gradient accumulation and weight updates. Inference workloads can effectively use GDDR memory, particularly when combined with model quantization to reduce memory requirements. For government deployments where data security is paramount, hardware-encrypted memory paths in HBM-based GPUs provide enhanced protection for classified AI workloads.
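One way to formalize the training-versus-inference distinction is a roofline-style check: a workload saturates compute only if its arithmetic intensity exceeds the GPU's balance point (peak FLOPS divided by memory bandwidth). The peak compute figure below is a rough assumption for an H100-class GPU, not an official specification.

```python
# Roofline-style check: is a workload memory-bound or compute-bound?
# PEAK_TFLOPS is an assumed FP16 tensor-core figure for an H100-class GPU.

PEAK_TFLOPS = 1000.0   # assumption, not an official spec
PEAK_TBPS = 3.35       # HBM3 bandwidth from the table above

def bound_by(flops_per_byte: float) -> str:
    """flops_per_byte: useful FLOPs per byte moved to or from memory."""
    ridge = PEAK_TFLOPS / PEAK_TBPS   # ~299 FLOPs/byte balance point
    return "compute-bound" if flops_per_byte >= ridge else "memory-bound"

print(bound_by(2))     # single-token decode (matrix-vector): memory-bound
print(bound_by(600))   # large-batch training GEMM: compute-bound
```

Element-wise steps such as optimizer updates sit far below the ridge and are purely bandwidth-limited, which is why training throughput tracks HBM bandwidth even when the large GEMMs themselves are compute-bound.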
Frequently Asked Questions
Why don't inference GPUs use HBM memory?
Cost is the primary factor. HBM costs roughly 3-5x more per GB than GDDR. For inference serving where GDDR bandwidth is sufficient, the HBM premium is hard to justify. However, as inference model sizes grow, HBM's capacity and bandwidth are becoming increasingly relevant for large-scale serving.
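A toy model, using only figures quoted in this article (the 3-5x per-GB premium and the table's bandwidth and capacity values), shows that the premium looks different when normalized per unit of bandwidth rather than per GB of capacity:

```python
# Toy cost model using only figures quoted above: GDDR is normalized to
# 1.0 per GB and HBM assumed at 4x (midpoint of the 3-5x premium).

def cost_per_gbps(capacity_gb: float, price_per_gb: float,
                  bandwidth_gbps: float) -> float:
    return capacity_gb * price_per_gb / bandwidth_gbps

gddr = cost_per_gbps(48, 1.0, 864)    # L40S-class: ~0.056 units per GB/s
hbm = cost_per_gbps(80, 4.0, 3350)    # H100-class: ~0.096 units per GB/s
print("HBM premium per GB of capacity: ~4x")
print(f"HBM premium per GB/s of bandwidth: ~{hbm / gddr:.1f}x")
```

In this toy model the bandwidth-normalized premium is modest; what inference buyers mostly need is capacity, and that is where GDDR wins decisively.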
Will GDDR7 replace HBM for AI?
No, GDDR7 targets a different price-performance point. HBM will remain the standard for AI training due to its roughly 2-5x higher per-GPU bandwidth and lower power consumption per bit. GDDR7 improves inference GPU capabilities but does not approach HBM3e bandwidth levels.