GPU Sizing Guide for LLM Workloads
Quick Summary
- 7B Model: ~14GB VRAM minimum; single GPU feasible with quantization
- 70B Model: ~140GB VRAM; requires 4-8 GPUs with model parallelism
- 405B Model: ~800GB VRAM; requires multi-node cluster
- Memory Bandwidth: Training throughput scales nearly linearly with per-GPU memory bandwidth
- Fine-tuning: Requires roughly one-quarter to one-half the memory of full training
Accurately sizing GPU infrastructure for large language model (LLM) workloads is one of the most critical decisions facing AI teams. Misestimating GPU requirements leads either to wasted capital on over-provisioned hardware or to crippling performance bottlenecks from under-provisioned clusters. This guide provides a rigorous, step-by-step methodology for calculating GPU requirements across model sizes from 7B to 405B+ parameters, with specific guidance for enterprise and government deployments.
GPU Memory Requirements by Model Size
The primary constraint for LLM training is GPU memory (VRAM). Model weights, optimizer states, gradients, and activations each consume significant memory. The total memory requirement follows this formula:
Total Memory = Model Weights (FP16) + Optimizer States (2x weights) + Gradients (1x weights) + Activations (variable)
For a 70B parameter model at FP16 (2 bytes per parameter): weights = 140GB, optimizer states = 280GB (using Adam), gradients = 140GB, activations = 20-60GB depending on sequence length and batch size. Total: 580-620GB minimum, requiring 8x H100 80GB GPUs or 4x MI300X 192GB GPUs.
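The arithmetic above is easy to script. The sketch below follows this guide's simplified accounting (optimizer states = 2x weights, gradients = 1x weights, plus a user-supplied activation estimate) and assumes FP16 weights; per-GPU overheads such as framework buffers and fragmentation are ignored.

```python
import math

BYTES_PER_PARAM_FP16 = 2  # FP16 stores each parameter in 2 bytes

def training_memory_gb(params_billions: float, activations_gb: float) -> float:
    """Total training memory (GB): weights + 2x optimizer states + 1x gradients + activations."""
    weights_gb = params_billions * BYTES_PER_PARAM_FP16   # e.g. 70B -> 140 GB
    optimizer_gb = 2 * weights_gb                          # Adam optimizer states (2x weights)
    gradients_gb = 1 * weights_gb                          # one gradient copy
    return weights_gb + optimizer_gb + gradients_gb + activations_gb

def min_gpus(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the estimate."""
    return math.ceil(total_gb / vram_per_gpu_gb)

total = training_memory_gb(params_billions=70, activations_gb=60)  # pessimistic activation estimate
print(f"70B: ~{total:.0f} GB -> {min_gpus(total, 80)}x H100 80GB or {min_gpus(total, 192)}x MI300X 192GB")
```

The same function reproduces the table below; only the activation estimate changes with sequence length and batch size.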
| Model Size | FP16 Weights | Optimizer States | Gradients | Activations | Total Memory | Min GPUs |
|---|---|---|---|---|---|---|
| 7B (Llama 3 8B) | 16 GB | 32 GB | 16 GB | 4-8 GB | 68-72 GB | 1x H100 |
| 13B (Llama 2) | 26 GB | 52 GB | 26 GB | 6-12 GB | 110-116 GB | 2x H100 |
| 34B (CodeLlama) | 68 GB | 136 GB | 68 GB | 12-24 GB | 284-296 GB | 4x H100 |
| 70B (Llama 3) | 140 GB | 280 GB | 140 GB | 20-60 GB | 580-620 GB | 8x H100 / 4x MI300X |
| 130B | 260 GB | 520 GB | 260 GB | 40-80 GB | 1,080-1,120 GB | 14x H100 / 6x MI300X |
| 405B (Llama 3) | 810 GB | 1,620 GB | 810 GB | 80-200 GB | 3,320-3,440 GB | 42x H100 / 18x MI300X |
Compute Requirements: FLOPs and Training Time
Training compute requirements scale with model size and dataset size. The industry-standard approximation is roughly 6 FLOPs per parameter per token, i.e. total compute ≈ 6 * N * D, where N = parameters and D = dataset tokens. For Llama 3 70B trained on 15 trillion tokens: 6 * 70e9 * 15e12 ≈ 6.3e24 FLOPs.
A single 8x H100 node delivering 3.2 petaFLOPS of sustained throughput would need roughly 6.3e24 / 3.2e15 ≈ 2.0e9 seconds, on the order of 60 years, which is why pre-training at this scale runs on large clusters. At roughly 3.2 exaFLOPS of sustained throughput (on the order of 8,000 H100s), the same run takes about 1.97e6 seconds ≈ 22.8 days of continuous training. Real-world factors (communication overhead, recompilation, checkpointing, failures) add 15-30% overhead, extending actual time to roughly 26-30 days.
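A back-of-envelope time estimate follows directly from the 6 * N * D approximation and a sustained-throughput figure. A minimal sketch, assuming ~0.4 petaFLOPS of sustained throughput per H100 (the per-GPU figure implied above) and a 20% real-world overhead:

```python
SUSTAINED_FLOPS_PER_GPU = 0.4e15   # ~0.4 petaFLOPS sustained per H100 (assumed)

def training_flops(params: float, tokens: float) -> float:
    """Total training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def training_days(flops: float, num_gpus: int, overhead: float = 0.20) -> float:
    """Wall-clock days at a given cluster size, including real-world overhead (15-30%)."""
    seconds = flops / (num_gpus * SUSTAINED_FLOPS_PER_GPU)
    return seconds * (1 + overhead) / 86_400

flops = training_flops(params=70e9, tokens=15e12)              # Llama 3 70B on 15T tokens
print(f"{flops:.1e} FLOPs")                                    # ~6.3e24
print(f"8,000 GPUs: ~{training_days(flops, 8_000):.0f} days")  # ~27 days including overhead
print(f"1,024 GPUs: ~{training_days(flops, 1_024):.0f} days")  # ~214 days including overhead
```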
Sizing for Fine-Tuning vs Pre-Training
Fine-tuning requires far less compute than pre-training, typically well under 1% of pre-training FLOPs depending on dataset size and number of epochs. For Llama 3 70B fine-tuned on 100M tokens for 3 epochs: approximately 6 * 70e9 * 3e8 ≈ 1.3e20 FLOPs, roughly 11 hours on a single 8x H100 node at 3.2 petaFLOPS sustained. This makes fine-tuning accessible on modest GPU clusters.
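Plugging the fine-tuning example into the same 6 * N * D approximation (again assuming ~0.4 petaFLOPS sustained per H100):

```python
flops = 6 * 70e9 * (100e6 * 3)          # 70B model, 100M tokens, 3 epochs -> ~1.3e20 FLOPs
seconds = flops / (8 * 0.4e15)          # 8x H100 at ~0.4 petaFLOPS sustained each
print(f"{flops:.1e} FLOPs, ~{seconds / 3600:.0f} hours")   # roughly half a day of wall-clock time
```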
Pre-training, by contrast, demands the full cluster for weeks or months. The decision to pre-train vs fine-tune should be based on domain specificity requirements and available compute budget. For most enterprise and government applications, fine-tuning a pre-trained foundation model is more cost-effective.
Memory Bandwidth Requirements
GPU memory bandwidth determines how fast weights and data move through the compute pipeline. For training, inadequate bandwidth causes GPU compute units to stall while waiting for data, resulting in utilization rates of 30-50% instead of the theoretical 80-90%.
Minimum bandwidth thresholds: For LLM training, each GPU requires at least 2 TB/s memory bandwidth for acceptable utilization. H100's 3.35 TB/s is adequate for 70B models with batch sizes of 8-16. MI300X's 5.2 TB/s enables larger batch sizes and higher utilization for memory-bound workloads. H200's 4.8 TB/s provides a balanced profile for mixed workloads.
Interconnect Bandwidth Scaling
For multi-GPU training, interconnect bandwidth determines how efficiently GPUs synchronize gradients and share intermediate results. The rule of thumb: interconnect bandwidth should be at least 10% of GPU memory bandwidth per direction for near-linear scaling.
NVLink (600-900 GB/s): Provides adequate bandwidth for 8-GPU single-node training. Tensor parallelism over 2-8 GPUs achieves 80-95% scaling efficiency with NVLink.
InfiniBand NDR400 (50 GB/s per link): For multi-node training, each GPU requires at least 25 GB/s inter-node bandwidth. An 8-GPU node with 4x NDR400 links provides 200 GB/s aggregate inter-node bandwidth, adequate for 70B models with data parallelism.
Ethernet (100/200/400GbE): Suitable for inference and fine-tuning but insufficient for large-scale pre-training. RDMA over Converged Ethernet (RoCE) improves performance but generally achieves 60-70% of InfiniBand efficiency for collective operations.
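The bandwidth guidance above reduces to two simple checks: at least 2 TB/s of HBM bandwidth per GPU for training, and at least 25 GB/s of inter-node bandwidth per GPU. A minimal sketch using the figures from this section:

```python
NDR400_GBPS = 50                 # GB/s per InfiniBand NDR400 link
MIN_HBM_TBPS = 2.0               # per-GPU memory bandwidth floor for training
MIN_INTERNODE_GBPS_PER_GPU = 25  # inter-node bandwidth needed per GPU

def check_node(gpus_per_node: int, hbm_tbps: float, ndr400_links: int) -> None:
    """Flag whether a node design meets the two bandwidth rules of thumb."""
    internode = ndr400_links * NDR400_GBPS
    needed = gpus_per_node * MIN_INTERNODE_GBPS_PER_GPU
    print(f"HBM {hbm_tbps} TB/s per GPU: {'ok' if hbm_tbps >= MIN_HBM_TBPS else 'below 2 TB/s floor'}")
    print(f"Inter-node {internode} GB/s vs {needed} GB/s needed: {'ok' if internode >= needed else 'undersized'}")

# 8x H100 node (3.35 TB/s HBM each) with 4x NDR400 links -> 200 GB/s aggregate
check_node(gpus_per_node=8, hbm_tbps=3.35, ndr400_links=4)
```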
Cost-Benefit Analysis by Workload Type
| Workload | Recommended GPU | Min GPUs | Estimated Monthly Cost* | Alternative |
|---|---|---|---|---|
| Research / Experimentation | NVIDIA A100 80GB | 1-4 | $2,000-$8,000 | Cloud (Lambda, RunPod) |
| Fine-tuning (7B-70B) | NVIDIA H100 80GB | 4-8 | $12,000-$30,000 | Single 8-GPU server |
| Fine-tuning (70B-405B) | NVIDIA H200 141GB | 8-32 | $30,000-$120,000 | 4-node HGX cluster |
| Pre-training (7B-34B) | NVIDIA H100 80GB | 16-64 | $60,000-$250,000 | 8-32 node cluster |
| Pre-training (70B-405B) | NVIDIA H200 / B200 | 64-1024+ | $250,000-$4M+ | Enterprise cluster |
* Estimated monthly cost includes hardware depreciation (3-year), power, cooling, facility, and support. Cloud comparable costs are typically 2-4x higher.
How much GPU memory is needed for Llama 3 70B inference?
Llama 3 70B requires ~140GB at FP16 for weights alone. With KV cache for a 4096-token serving context, total memory is ~160GB, so a single H100 80GB cannot serve this model without quantization. INT8 quantization reduces the weights to ~70GB, fitting on a single H100 with a constrained KV-cache budget; INT4 reduces them to ~35GB, enabling deployment on 48GB L40S GPUs.
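The quantization arithmetic behind this answer is simply bytes per parameter times parameter count, with KV cache and activations added on top. A minimal sketch:

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB at a given precision (KV cache and activations add on top)."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"Llama 3 70B at {precision}: ~{weight_memory_gb(70, precision):.0f} GB of weights")
```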
What happens if I undersize my GPU cluster?
Undersized clusters experience: out-of-memory errors during training, severely limited batch sizes (reducing convergence quality), excessive gradient accumulation steps (increasing training time 2-5x), and reduced GPU utilization due to memory swapping.
Should I buy or rent GPU capacity for LLM training?
For training runs under 3 months total duration, cloud rental is cost-effective. For sustained training programs exceeding 6 months, on-premise infrastructure provides 40-60% TCO savings. Government agencies with classified workloads require on-premise deployment regardless of cost.