GPU Sizing Guide for LLM Workloads

May 13, 2026 · Technical Deep Dives
Reviewed by NTS AI Infrastructure Engineer · Technical accuracy verified for enterprise & federal deployment
NVIDIA Ampere A100 80 GB PCIe 4.0 Graphic Card – Dual Slot Passive Cooling

Quick Summary

  • 7B model: ~14 GB VRAM for FP16 weights; fits a single GPU with quantization
  • 70B model: ~140 GB VRAM for weights; needs 4-8 GPUs with model parallelism
  • 405B model: ~800 GB VRAM for weights; needs a multi-node cluster
  • Memory bandwidth: training throughput scales nearly linearly with it
  • Fine-tuning: needs roughly one-quarter to one-half the memory of full pre-training

Accurately sizing GPU infrastructure for large language model (LLM) workloads is one of the most critical decisions facing AI teams. Misestimating GPU requirements leads either to wasted capital on over-provisioned hardware or to crippling performance bottlenecks from under-provisioned clusters. This guide provides a rigorous, step-by-step methodology for calculating GPU requirements across model sizes from 7B to 405B+ parameters, with specific guidance for enterprise and government deployments.

GPU Memory Requirements by Model Size

The primary constraint for LLM training is GPU memory (VRAM). Model weights, optimizer states, gradients, and activations each consume significant memory. The total memory requirement follows this formula:

Total Memory = Model Weights (FP16) + Optimizer States (2x weights) + Gradients (1x weights) + Activations (variable)

For a 70B parameter model at FP16 (2 bytes per parameter): weights = 140GB, optimizer states = 280GB (using Adam), gradients = 140GB, activations = 20-60GB depending on sequence length and batch size. Total: 580-620GB minimum, requiring 8x H100 80GB GPUs or 4x MI300X 192GB GPUs.
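The formula above can be turned into a quick estimator. A minimal sketch (the 20-60 GB activation range and the 2x-weights Adam accounting are the assumptions this guide uses; `training_memory_gb` is a hypothetical helper, not a library function):

```python
# Estimate FP16 training memory: weights + Adam optimizer states (2x weights,
# per this guide's accounting) + gradients (1x weights) + activations.
# Activation memory is workload-dependent, so it is passed in as a range.

def training_memory_gb(params_billions, activations_gb=(20, 60)):
    """Return (low, high) total training memory in GB for an FP16 model."""
    weights = params_billions * 2      # 2 bytes per parameter -> GB
    optimizer = 2 * weights            # Adam moments, per the formula above
    gradients = weights                # one gradient copy of the weights
    fixed = weights + optimizer + gradients
    return fixed + activations_gb[0], fixed + activations_gb[1]

lo, hi = training_memory_gb(70)            # Llama 3 70B
print(f"70B model: {lo}-{hi} GB")          # 580-620 GB, matching the table
print(f"H100 80GB GPUs needed: {-(-hi // 80)}")
```

Running it for 70B parameters reproduces the 580-620 GB figure above and the 8x H100 80GB minimum.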

Model Size         | FP16 Weights | Optimizer States | Gradients | Activations | Total Memory   | Min GPUs
7B (Llama 3 8B)    | 16 GB        | 32 GB            | 16 GB     | 4-8 GB      | 68-72 GB       | 1x H100
13B (Llama 2)      | 26 GB        | 52 GB            | 26 GB     | 6-12 GB     | 110-116 GB     | 2x H100
34B (CodeLlama)    | 68 GB        | 136 GB           | 68 GB     | 12-24 GB    | 284-296 GB     | 4x H100
70B (Llama 3)      | 140 GB       | 280 GB           | 140 GB    | 20-60 GB    | 580-620 GB     | 8x H100 / 4x MI300X
130B (Yi-34B x4)   | 260 GB       | 520 GB           | 260 GB    | 40-80 GB    | 1,080-1,120 GB | 14x H100 / 6x MI300X
405B (Llama 3)     | 810 GB       | 1,620 GB         | 810 GB    | 80-200 GB   | 3,320-3,440 GB | 42x H100 / 18x MI300X

Compute Requirements: FLOPs and Training Time

Training compute requirements scale with model size and dataset size. The industry-standard approximation is that training requires roughly 6 FLOPs per parameter per training token (total ≈ 6 * N * D FLOPs, where N = parameters and D = dataset tokens). For Llama 3 70B trained on 15 trillion tokens: 6 * 70e9 * 15e12 = 6.3e24 FLOPs (6,300 zettaFLOPs).

On a cluster of roughly 8,000 H100s delivering about 3.2 exaFLOPS of sustained throughput (~400 teraFLOPS per GPU, around 40% of peak BF16), this requires approximately 6.3e24 / 3.2e18 = 1.97e6 seconds = 22.8 days of continuous training. A single 8x H100 node at ~3.2 petaFLOPS sustained would need a thousand times longer, which is why frontier pre-training runs on thousands of GPUs. Real-world factors (communication overhead, recompilation, checkpointing, failures) add 15-30% overhead, extending actual time to 26-30 days.
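The calculation above can be sketched as a small helper. The sustained-throughput figure (~3.2 exaFLOPS, on the order of 8,000 H100s at ~400 sustained teraFLOPS each) and the 15-30% overhead range are this guide's assumptions; `training_days` is a hypothetical helper:

```python
# Training-time estimate from the 6*N*D approximation.

def training_days(params, tokens, sustained_flops, overhead=(0.15, 0.30)):
    """Return (ideal_days, with_min_overhead, with_max_overhead)."""
    total_flops = 6 * params * tokens
    ideal = total_flops / sustained_flops / 86_400  # seconds -> days
    return ideal, ideal * (1 + overhead[0]), ideal * (1 + overhead[1])

ideal, lo, hi = training_days(70e9, 15e12, 3.2e18)  # ~3.2 exaFLOPS sustained
print(f"ideal: {ideal:.1f} days; with overhead: {lo:.1f}-{hi:.1f} days")
```

For Llama 3 70B on 15T tokens this yields ~22.8 ideal days, or 26-30 days with overhead.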

Sizing for Fine-Tuning vs Pre-Training

Fine-tuning requires significantly less compute than pre-training—often well under 0.1% of pre-training FLOPs, since fine-tuning datasets are orders of magnitude smaller than pre-training corpora. For Llama 3 70B fine-tuned on 100M tokens for 3 epochs: 6 * 70e9 * 3e8 ≈ 1.3e20 FLOPs, requiring roughly 8-12 hours on an 8x H100 node depending on achieved utilization. This makes fine-tuning accessible on modest GPU clusters.
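The arithmetic for this fine-tuning estimate, reusing the 6*N*D rule from the previous section (the ~3.2 petaFLOPS sustained figure for an 8x H100 node, roughly 40% utilization, is an assumption):

```python
# Fine-tuning compute for Llama 3 70B on 100M tokens for 3 epochs,
# using the 6*N*D approximation.

flops = 6 * 70e9 * 100e6 * 3        # ~1.26e20 FLOPs total
hours = flops / 3.2e15 / 3600       # on 8x H100 at ~3.2 PFLOPS sustained
print(f"{flops:.2e} FLOPs, ~{hours:.0f} hours")
```

The result lands around 11 hours, roughly five orders of magnitude less compute than the 15T-token pre-training run.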

Pre-training, by contrast, demands the full cluster for weeks or months. The decision to pre-train vs fine-tune should be based on domain specificity requirements and available compute budget. For most enterprise and government applications, fine-tuning a pre-trained foundation model is more cost-effective.

Memory Bandwidth Requirements

GPU memory bandwidth determines how fast weights and data move through the compute pipeline. For training, inadequate bandwidth causes GPU compute units to stall while waiting for data, resulting in utilization rates of 30-50% instead of the theoretical 80-90%.

Minimum bandwidth thresholds: For LLM training, each GPU requires at least 2 TB/s memory bandwidth for acceptable utilization. H100's 3.35 TB/s is adequate for 70B models with batch sizes of 8-16. MI300X's 5.3 TB/s enables larger batch sizes and higher utilization for memory-bound workloads. H200's 4.8 TB/s provides a balanced profile for mixed workloads.

Interconnect Bandwidth Scaling

For multi-GPU training, interconnect bandwidth determines how efficiently GPUs synchronize gradients and share intermediate results. The rule of thumb: interconnect bandwidth should be at least 10% of GPU memory bandwidth per direction for near-linear scaling.

NVLink (600-900 GB/s): Provides adequate bandwidth for 8-GPU single-node training. Tensor parallelism over 2-8 GPUs achieves 80-95% scaling efficiency with NVLink.

InfiniBand NDR400 (50 GB/s per link): For multi-node training, each GPU requires at least 25 GB/s inter-node bandwidth. An 8-GPU node with 4x NDR400 links provides 200 GB/s aggregate inter-node bandwidth, adequate for 70B models with data parallelism.

Ethernet (100/200/400GbE): Suitable for inference and fine-tuning but insufficient for large-scale pre-training. RDMA over Converged Ethernet (RoCE) improves performance but generally achieves 60-70% of InfiniBand efficiency for collective operations.
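The 10% rule of thumb above is easy to check mechanically. A minimal sketch using the figures quoted in this section (`scales_well` is a hypothetical helper):

```python
# Check the rule of thumb: per-GPU interconnect bandwidth should be at
# least 10% of per-GPU memory bandwidth for near-linear scaling.

def scales_well(mem_bw_tb_s, interconnect_gb_s):
    """True if interconnect bandwidth meets the 10%-of-memory-bandwidth rule."""
    return interconnect_gb_s >= 0.10 * mem_bw_tb_s * 1000  # TB/s -> GB/s

print(scales_well(3.35, 900))  # H100 + NVLink 4 (900 GB/s): True
print(scales_well(3.35, 50))   # H100 + a single NDR400 link: False
```

A single NDR400 link fails the check, which is why multi-node designs aggregate several links per node and lean on data parallelism rather than cross-node tensor parallelism.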

Cost-Benefit Analysis by Workload Type

Workload                   | Recommended GPU    | Min GPUs | Estimated Monthly Cost* | Alternative
Research / Experimentation | NVIDIA A100 80GB   | 1-4      | $2,000-$8,000           | Cloud (Lambda, RunPod)
Fine-tuning (7B-70B)       | NVIDIA H100 80GB   | 4-8      | $12,000-$30,000         | Single 8-GPU server
Fine-tuning (70B-405B)     | NVIDIA H200 141GB  | 8-32     | $30,000-$120,000        | 4-node HGX cluster
Pre-training (7B-34B)      | NVIDIA H100 80GB   | 16-64    | $60,000-$250,000        | 8-32 node cluster
Pre-training (70B-405B)    | NVIDIA H200 / B200 | 64-1024+ | $250,000-$4M+           | Enterprise cluster

* Estimated monthly cost includes hardware depreciation (3-year), power, cooling, facility, and support. Cloud comparable costs are typically 2-4x higher.


Frequently Asked Questions

How much GPU memory is needed for Llama 3 70B inference?

Llama 3 70B requires ~140GB at FP16 for weights alone. With KV-cache for 4096-token context, total memory is ~160GB. A single H100 80GB cannot serve this model without quantization. INT8 quantization reduces to 70GB, fitting on a single H100. INT4 further reduces to 35GB, enabling deployment on L40S 48GB GPUs.
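The KV-cache portion of that budget can be estimated per sequence. A sketch using Llama 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with an FP16 cache assumed; `kv_cache_gb` is a hypothetical helper:

```python
# Per-sequence KV-cache size: 2 tensors (K and V) per layer, each
# kv_heads * head_dim values per token, at bytes_per bytes each.

def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    """KV-cache size in GB for one sequence of seq_len tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len / 1e9

per_seq = kv_cache_gb(4096)
print(f"{per_seq:.2f} GB per 4096-token sequence")  # ~1.34 GB
```

At ~1.34 GB per sequence, the ~20 GB implied above (160 GB total minus 140 GB of weights) corresponds to serving roughly 15 such sequences concurrently.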

What happens if I undersize my GPU cluster?

Undersized clusters experience: out-of-memory errors during training, severely limited batch sizes (reducing convergence quality), excessive gradient accumulation steps (increasing training time 2-5x), and reduced GPU utilization due to memory swapping.

Should I buy or rent GPU capacity for LLM training?

For training runs under 3 months total duration, cloud rental is cost-effective. For sustained training programs exceeding 6 months, on-premise infrastructure provides 40-60% TCO savings. Government agencies with classified workloads require on-premise deployment regardless of cost.
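A rough cumulative-cost comparison illustrates the breakeven logic. The 2x cloud premium comes from this guide's cost-table footnote; the $30k/month on-prem figure (an 8x H100-class node) and the 3-month procurement lead time are hypothetical inputs, not vendor quotes:

```python
# Illustrative buy-vs-rent TCO over a training program of `months` duration.
# On-prem pays for idle procurement lead time; cloud pays a per-month premium.

def cumulative_cost(months, onprem_monthly=30_000, cloud_premium=2.0,
                    procurement_months=3):
    """Return (on_prem_total, cloud_total) after `months` of training."""
    on_prem = (months + procurement_months) * onprem_monthly
    cloud = months * onprem_monthly * cloud_premium
    return on_prem, cloud

for m in (2, 6, 12):
    op, cl = cumulative_cost(m)
    print(f"{m:2d} mo: on-prem ${op:,.0f} vs cloud ${cl:,.0f}")
```

Under these assumptions cloud is cheaper below the 3-month breakeven and on-prem pulls ahead thereafter, consistent with the guidance above.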