GPU Sizing Guide for LLM Workloads
Quick Summary
- 7B Model: ~14GB VRAM minimum; single GPU feasible with quantization
- 70B Model: ~140GB VRAM; requires 4-8 GPUs with model parallelism
- 405B Model: ~800GB VRAM; requires multi-node cluster
- Memory Bandwidth: Training throughput scales nearly linearly with per-GPU memory bandwidth
- Fine-tuning: Requires roughly one-quarter to one-half the memory of full training
Accurately sizing GPU infrastructure for large language model (LLM) workloads is one of the most critical decisions facing AI teams. Misestimating GPU requirements leads either to wasted capital on over-provisioned hardware or to crippling performance bottlenecks from under-provisioned clusters. This guide provides a rigorous, step-by-step methodology for calculating GPU requirements across model sizes from 7B to 405B+ parameters, with specific guidance for enterprise and government deployments.
GPU Memory Requirements by Model Size
The primary constraint for LLM training is GPU memory (VRAM). Model weights, optimizer states, gradients, and activations each consume significant memory. The total memory requirement follows this formula:
Total Memory = Model Weights (FP16) + Optimizer States (2x weights) + Gradients (1x weights) + Activations (variable)
For a 70B parameter model at FP16 (2 bytes per parameter): weights = 140GB, optimizer states = 280GB (using Adam), gradients = 140GB, activations = 20-60GB depending on sequence length and batch size. Total: 580-620GB minimum, requiring 8x H100 80GB GPUs or 4x MI300X 192GB GPUs.
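The arithmetic above is easy to script. The sketch below follows this guide's simplified accounting (optimizer states = 2x weights, gradients = 1x weights, plus a user-supplied activation estimate) and assumes FP16 weights; per-GPU overheads such as framework buffers and fragmentation are ignored.

```python
import math

BYTES_PER_PARAM_FP16 = 2  # FP16 stores each parameter in 2 bytes

def training_memory_gb(params_billions: float, activations_gb: float) -> float:
    """Total training memory (GB): weights + 2x optimizer states + 1x gradients + activations."""
    weights_gb = params_billions * BYTES_PER_PARAM_FP16   # e.g. 70B -> 140 GB
    optimizer_gb = 2 * weights_gb                          # Adam optimizer states (2x weights)
    gradients_gb = 1 * weights_gb                          # one gradient copy
    return weights_gb + optimizer_gb + gradients_gb + activations_gb

def min_gpus(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the estimate."""
    return math.ceil(total_gb / vram_per_gpu_gb)

total = training_memory_gb(params_billions=70, activations_gb=60)  # pessimistic activation estimate
print(f"70B: ~{total:.0f} GB -> {min_gpus(total, 80)}x H100 80GB or {min_gpus(total, 192)}x MI300X 192GB")
```

The same function reproduces the table below; only the activation estimate changes with sequence length and batch size.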
| Model Size | FP16 Weights | Optimizer States | Gradients | Activations | Total Memory | Min GPUs |
|---|---|---|---|---|---|---|
| 7B (Llama 3 8B) | 16 GB | 32 GB | 16 GB | 4-8 GB | 68-72 GB | 1x H100 |
| 13B (Llama 2) | 26 GB | 52 GB | 26 GB | 6-12 GB | 110-116 GB | 2x H100 |
| 34B (CodeLlama) | 68 GB | 136 GB | 68 GB | 12-24 GB | 284-296 GB | 4x H100 |
| 70B (Llama 3) | 140 GB | 280 GB | 140 GB | 20-60 GB | 580-620 GB | 8x H100 / 4x MI300X |
| 130B | 260 GB | 520 GB | 260 GB | 40-80 GB | 1,080-1,120 GB | 14x H100 / 6x MI300X |
| 405B (Llama 3) | 810 GB | 1,620 GB | 810 GB | 80-200 GB | 3,320-3,440 GB | 42x H100 / 18x MI300X |
Compute Requirements: FLOPs and Training Time
Training compute requirements scale with model size and dataset size. The industry-standard approximation is roughly 6 FLOPs per parameter per token, i.e. total compute ≈ 6 * N * D, where N = parameters and D = dataset tokens. For Llama 3 70B trained on 15 trillion tokens: 6 * 70e9 * 15e12 ≈ 6.3e24 FLOPs.
A single 8x H100 node delivering 3.2 petaFLOPS of sustained throughput would need roughly 6.3e24 / 3.2e15 ≈ 2.0e9 seconds, on the order of 60 years, which is why pre-training at this scale runs on large clusters. At roughly 3.2 exaFLOPS of sustained throughput (on the order of 8,000 H100s), the same run takes about 1.97e6 seconds ≈ 22.8 days of continuous training. Real-world factors (communication overhead, recompilation, checkpointing, failures) add 15-30% overhead, extending actual time to roughly 26-30 days.
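A back-of-envelope time estimate follows directly from the 6 * N * D approximation and a sustained-throughput figure. A minimal sketch, assuming ~0.4 petaFLOPS of sustained throughput per H100 (the per-GPU figure implied above) and a 20% real-world overhead:

```python
SUSTAINED_FLOPS_PER_GPU = 0.4e15   # ~0.4 petaFLOPS sustained per H100 (assumed)

def training_flops(params: float, tokens: float) -> float:
    """Total training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def training_days(flops: float, num_gpus: int, overhead: float = 0.20) -> float:
    """Wall-clock days at a given cluster size, including real-world overhead (15-30%)."""
    seconds = flops / (num_gpus * SUSTAINED_FLOPS_PER_GPU)
    return seconds * (1 + overhead) / 86_400

flops = training_flops(params=70e9, tokens=15e12)              # Llama 3 70B on 15T tokens
print(f"{flops:.1e} FLOPs")                                    # ~6.3e24
print(f"8,000 GPUs: ~{training_days(flops, 8_000):.0f} days")  # ~27 days including overhead
print(f"1,024 GPUs: ~{training_days(flops, 1_024):.0f} days")  # ~214 days including overhead
```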
Sizing for Fine-Tuning vs Pre-Training
Fine-tuning requires far less compute than pre-training, typically well under 1% of pre-training FLOPs depending on dataset size and number of epochs. For Llama 3 70B fine-tuned on 100M tokens for 3 epochs: approximately 6 * 70e9 * 3e8 ≈ 1.3e20 FLOPs, roughly 11 hours on a single 8x H100 node at 3.2 petaFLOPS sustained. This makes fine-tuning accessible on modest GPU clusters.
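Plugging the fine-tuning example into the same 6 * N * D approximation (again assuming ~0.4 petaFLOPS sustained per H100):

```python
flops = 6 * 70e9 * (100e6 * 3)          # 70B model, 100M tokens, 3 epochs -> ~1.3e20 FLOPs
seconds = flops / (8 * 0.4e15)          # 8x H100 at ~0.4 petaFLOPS sustained each
print(f"{flops:.1e} FLOPs, ~{seconds / 3600:.0f} hours")   # roughly half a day of wall-clock time
```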
Pre-training, by contrast, demands the full cluster for weeks or months. The decision to pre-train vs fine-tune should be based on domain specificity requirements and available compute budget. For most enterprise and government applications, fine-tuning a pre-trained foundation model is more cost-effective.
Memory Bandwidth Requirements
GPU memory bandwidth determines how fast weights and data move through the compute pipeline. For training, inadequate bandwidth causes GPU compute units to stall while waiting for data, resulting in utilization rates of 30-50% instead of the theoretical 80-90%.
Minimum bandwidth thresholds: For LLM training, each GPU requires at least 2 TB/s memory bandwidth for acceptable utilization. H100's 3.35 TB/s is adequate for 70B models with batch sizes of 8-16. MI300X's 5.2 TB/s enables larger batch sizes and higher utilization for memory-bound workloads. H200's 4.8 TB/s provides a balanced profile for mixed workloads.
Interconnect Bandwidth Scaling
For multi-GPU training, interconnect bandwidth determines how efficiently GPUs synchronize gradients and share intermediate results. The rule of thumb: interconnect bandwidth should be at least 10% of GPU memory bandwidth per direction for near-linear scaling.
NVLink (600-900 GB/s): Provides adequate bandwidth for 8-GPU single-node training. Tensor parallelism over 2-8 GPUs achieves 80-95% scaling efficiency with NVLink.
InfiniBand NDR400 (50 GB/s per link): For multi-node training, each GPU requires at least 25 GB/s inter-node bandwidth. An 8-GPU node with 4x NDR400 links provides 200 GB/s aggregate inter-node bandwidth, adequate for 70B models with data parallelism.
Ethernet (100/200/400GbE): Suitable for inference and fine-tuning but insufficient for large-scale pre-training. RDMA over Converged Ethernet (RoCE) improves performance but generally achieves 60-70% of InfiniBand efficiency for collective operations.
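The bandwidth guidance above reduces to two simple checks: at least 2 TB/s of HBM bandwidth per GPU for training, and at least 25 GB/s of inter-node bandwidth per GPU. A minimal sketch using the figures from this section:

```python
NDR400_GBPS = 50                 # GB/s per InfiniBand NDR400 link
MIN_HBM_TBPS = 2.0               # per-GPU memory bandwidth floor for training
MIN_INTERNODE_GBPS_PER_GPU = 25  # inter-node bandwidth needed per GPU

def check_node(gpus_per_node: int, hbm_tbps: float, ndr400_links: int) -> None:
    """Flag whether a node design meets the two bandwidth rules of thumb."""
    internode = ndr400_links * NDR400_GBPS
    needed = gpus_per_node * MIN_INTERNODE_GBPS_PER_GPU
    print(f"HBM {hbm_tbps} TB/s per GPU: {'ok' if hbm_tbps >= MIN_HBM_TBPS else 'below 2 TB/s floor'}")
    print(f"Inter-node {internode} GB/s vs {needed} GB/s needed: {'ok' if internode >= needed else 'undersized'}")

# 8x H100 node (3.35 TB/s HBM each) with 4x NDR400 links -> 200 GB/s aggregate
check_node(gpus_per_node=8, hbm_tbps=3.35, ndr400_links=4)
```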
Cost-Benefit Analysis by Workload Type
| Workload | Recommended GPU | Min GPUs | Estimated Monthly Cost* | Alternative |
|---|---|---|---|---|
| Research / Experimentation | NVIDIA A100 80GB | 1-4 | $2,000-$8,000 | Cloud (Lambda, RunPod) |
| Fine-tuning (7B-70B) | NVIDIA H100 80GB | 4-8 | $12,000-$30,000 | Single 8-GPU server |
| Fine-tuning (70B-405B) | NVIDIA H200 141GB | 8-32 | $30,000-$120,000 | 4-node HGX cluster |
| Pre-training (7B-34B) | NVIDIA H100 80GB | 16-64 | $60,000-$250,000 | 8-32 node cluster |
| Pre-training (70B-405B) | NVIDIA H200 / B200 | 64-1024+ | $250,000-$4M+ | Enterprise cluster |
* Estimated monthly cost includes hardware depreciation (3-year), power, cooling, facility, and support. Cloud comparable costs are typically 2-4x higher.
How much GPU memory is needed for Llama 3 70B inference?
Llama 3 70B requires ~140GB at FP16 for weights alone. With KV cache for a 4096-token serving context, total memory is ~160GB, so a single H100 80GB cannot serve this model without quantization. INT8 quantization reduces the weights to ~70GB, fitting on a single H100 with a constrained KV-cache budget; INT4 reduces them to ~35GB, enabling deployment on 48GB L40S GPUs.
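The quantization arithmetic behind this answer is simply bytes per parameter times parameter count, with KV cache and activations added on top. A minimal sketch:

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB at a given precision (KV cache and activations add on top)."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"Llama 3 70B at {precision}: ~{weight_memory_gb(70, precision):.0f} GB of weights")
```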
What happens if I undersize my GPU cluster?
Undersized clusters experience: out-of-memory errors during training, severely limited batch sizes (reducing convergence quality), excessive gradient accumulation steps (increasing training time 2-5x), and reduced GPU utilization due to memory swapping.
Should I buy or rent GPU capacity for LLM training?
For training runs under 3 months total duration, cloud rental is cost-effective. For sustained training programs exceeding 6 months, on-premise infrastructure provides 40-60% TCO savings. Government agencies with classified workloads require on-premise deployment regardless of cost.