AI Model Parallelism Explained: Data, Tensor, and Pipeline Parallelism
Quick Summary
- Data Parallelism: Each GPU holds full model copy, processes different data
- Tensor Parallelism: Model layers split across GPUs, each handles partial computation
- Pipeline Parallelism: Different layers on different GPUs in sequential stages
- Hybrid: Most 70B+ models use all three strategies simultaneously
- Scaling: Combination enables efficient training up to 10,000+ GPUs
Distributed Training Strategies for Large Models
Training large language models with billions of parameters requires distributing computation across multiple GPUs using parallelism strategies that optimize for memory capacity, compute utilization, and communication efficiency. The three primary parallelism techniques—data parallelism, tensor parallelism, and pipeline parallelism—can be combined in hybrid approaches to train models exceeding the memory capacity of any single GPU.
Data Parallelism
Data parallelism replicates the entire model on each GPU, with each GPU processing a different micro-batch of training data. Gradients are synchronized across GPUs after each backward pass using all-reduce communication. Data parallelism is the simplest strategy to implement and scales efficiently for models that fit within single-GPU memory. PyTorch Distributed Data Parallel (DDP) is the standard implementation; Fully Sharded Data Parallel (FSDP) extends the approach by sharding parameters, gradients, and optimizer states across ranks so that larger models can still train data-parallel.
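A minimal DDP sketch, assuming a launch via torchrun so that each process drives one GPU; the model, batch size, and learning rate here are placeholders, not values from the article:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")

    # Each rank holds a full replica of the model.
    model = torch.nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model)  # gradients are all-reduced automatically in backward()
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    # Each rank processes a different micro-batch of the global batch.
    x = torch.randn(32, 1024, device=device)
    loss = ddp_model(x).pow(2).mean()
    loss.backward()          # triggers the gradient all-reduce across ranks
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```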
Tensor Parallelism
Tensor parallelism splits individual model layers across multiple GPUs, with each GPU computing a portion of each layer's matrix operations. This requires high-bandwidth GPU-to-GPU communication (NVLink or NVSwitch), since partial results must be all-reduced or all-gathered at the partition boundaries inside every layer. Tensor parallelism is essential for models where individual layers exceed single-GPU memory capacity, which is common in 70B+ parameter models.
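To make the partitioning concrete, here is a toy column-parallel split of one linear layer, simulated on a single process; in a real tensor-parallel setup each weight shard would live on its own GPU and the final concatenation would be an all-gather collective over NVLink:

```python
import torch

torch.manual_seed(0)
batch, d_in, d_out, tp = 4, 8, 6, 2   # tp = tensor-parallel degree

x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

# Reference: the full matmul on a single device.
y_full = x @ w

# Column parallelism: split W along its output dimension across tp "GPUs".
w_shards = w.chunk(tp, dim=1)
y_shards = [x @ ws for ws in w_shards]   # each shard computes a slice of Y

# Communication step: concatenating the partial outputs stands in for the
# all-gather that real tensor parallelism performs across devices.
y_tp = torch.cat(y_shards, dim=1)

assert torch.allclose(y_full, y_tp, atol=1e-6)
print("column-parallel result matches the single-device matmul")
```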
Pipeline Parallelism
Pipeline parallelism places different model layers on different GPUs in a sequential pipeline. Micro-batches flow through the pipeline, with each GPU computing its assigned layers before passing activations to the next stage. Pipeline parallelism reduces communication volume compared to tensor parallelism but introduces idle time (pipeline bubbles) that reduces efficiency. An optimal configuration assigns layers so that compute is balanced across stages and uses enough micro-batches to keep the bubble fraction small; a back-of-envelope estimate follows.
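A quick sketch of the bubble cost for a GPipe-style schedule, using the standard analysis: with p stages and m micro-batches, each stage sits idle for (p - 1) micro-batch slots out of (m + p - 1) total. The stage and micro-batch counts below are illustrative, not from the article:

```python
# bubble fraction = (p - 1) / (m + p - 1) for a GPipe-style schedule
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for p, m in [(4, 4), (4, 16), (4, 64), (8, 64)]:
    print(f"stages={p:2d} micro-batches={m:3d} -> "
          f"bubble={bubble_fraction(p, m):.1%} of step time")
```

More micro-batches shrink the bubble (from ~43% at m=4 down to ~4% at m=64 for four stages), which is why pipeline schedules favor many small micro-batches.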
Hybrid Parallelism in Practice
Most production LLM training uses all three strategies simultaneously. The degrees must multiply to the total GPU count (DP × TP × PP = world size), so a typical 8-node H100 cluster with 64 GPUs might use tensor parallelism within each node (TP=8 over NVLink), pipeline parallelism across the four nodes of each replica (PP=4), and data parallelism across the two resulting four-node replicas (DP=2). This hybrid approach, called 3D parallelism, enables training of 100B+ parameter models at roughly 50-60% MFU.
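A sketch of the rank-to-grid mapping for that (DP=2, PP=4, TP=8) layout; the Megatron-style ordering (TP varying fastest, so TP peers share a node and its NVLink) is an assumption here, as the exact mapping is framework-specific:

```python
WORLD_SIZE, TP, PP = 64, 8, 4
DP = WORLD_SIZE // (TP * PP)
assert DP * TP * PP == WORLD_SIZE   # degrees must multiply to the GPU count

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP                  # intra-node neighbors: NVLink domain
    pp = (rank // TP) % PP          # pipeline stage
    dp = rank // (TP * PP)          # data-parallel replica
    return dp, pp, tp

for rank in (0, 7, 8, 31, 63):
    dp, pp, tp = coords(rank)
    print(f"rank {rank:2d} -> dp={dp} pp={pp} tp={tp}")
```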
Frequently Asked Questions
Which parallelism strategy should I use first?
Start with data parallelism (DDP, or FSDP as memory gets tight) for models that fit in single-GPU memory. Add tensor parallelism when individual layers exceed GPU memory. Add pipeline parallelism for models requiring multiple nodes.
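As a rough illustration of that decision process, here is a hypothetical heuristic; the memory model (2 bytes per parameter for bf16 weights, roughly 8x that for gradients, Adam state, and activations) is a simplifying assumption, not a hard rule:

```python
def suggest_parallelism(params_b: float, gpu_mem_gb: float,
                        gpus_per_node: int = 8) -> str:
    weights_gb = params_b * 2              # bf16 weights: 2 bytes/param
    train_state_gb = weights_gb * 8        # + grads, Adam state, activations
    if train_state_gb <= gpu_mem_gb:
        return "DDP (full replica fits on one GPU)"
    if train_state_gb <= gpu_mem_gb * gpus_per_node:
        return "FSDP, adding TP if single layers exceed GPU memory"
    return "3D parallelism: DP/FSDP + TP within nodes + PP across nodes"

for size in (1.5, 13, 70, 175):
    print(f"{size:>6.1f}B params on 80 GB GPUs -> {suggest_parallelism(size, 80)}")
```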
How does NVLink affect parallelism choice?
NVLink enables efficient tensor parallelism by providing the bandwidth needed for the frequent all-reduce and all-gather collectives inside each layer. Without NVLink, tensor parallelism becomes communication-bound, and pipeline or data parallelism should be preferred.
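A rough illustration of the bandwidth gap, using ballpark figures (assumed ~900 GB/s aggregate for H100 NVLink, ~64 GB/s for PCIe Gen5 x16) and a simplified model of two all-reduces per transformer layer over an activation tensor of batch × sequence × hidden elements:

```python
def allreduce_ms(batch: int, seq: int, hidden: int,
                 bw_gb_s: float, bytes_per_elem: int = 2) -> float:
    # Two all-reduces per layer over a bf16 activation tensor (rough model;
    # real collectives move ~2x the tensor size in a ring, ignored here).
    volume_gb = 2 * batch * seq * hidden * bytes_per_elem / 1e9
    return volume_gb / bw_gb_s * 1e3

cfg = dict(batch=4, seq=4096, hidden=8192)
print(f"NVLink: {allreduce_ms(**cfg, bw_gb_s=900):.2f} ms per layer")
print(f"PCIe:   {allreduce_ms(**cfg, bw_gb_s=64):.2f} ms per layer")
```

At these assumed numbers the per-layer communication cost grows by more than an order of magnitude without NVLink, which is why tensor parallelism is normally confined to a single NVLink domain.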