What is GPU Interconnect Bandwidth? Understanding Communication in AI Infrastructure

May 13, 2026 · Technical Deep Dives
Reviewed by NTS AI Infrastructure Engineer · Technical accuracy verified for enterprise & federal deployment

Quick Summary

  • NVLink 4.0: 900 GB/s per GPU total bidirectional (450 GB/s each direction), NVIDIA proprietary
  • InfiniBand NDR400: 400 Gb/s per port, open standard for cluster networking
  • PCIe Gen5: 64 GB/s per direction, universal compatibility
  • Training Impact: NVLink eliminates GPU communication bottleneck; InfiniBand scales across nodes
  • Selection Guide: NVLink for single-node, InfiniBand for multi-node clusters

GPU interconnect bandwidth—the speed at which GPUs communicate with each other and with the broader computing system—is a critical yet often overlooked factor in AI infrastructure performance. While GPU compute capability (FLOPS) and memory bandwidth receive the most attention, interconnect bandwidth frequently becomes the bottleneck in distributed AI training at scale. This comprehensive guide explains GPU interconnect technologies, their performance characteristics, and how to select the right interconnect for your AI workloads.

The Role of Interconnect in AI Computing

In distributed AI training, model parameters and gradients must be synchronized across GPUs after each training step. The size of these gradient exchanges scales with model size: for a 70B parameter model at FP16 precision (2 bytes per parameter), each gradient synchronization round exchanges 140GB of data across all GPUs. With training steps completing in 1-5 seconds, the interconnect must move that 140GB in roughly 100ms to keep communication overhead near 10% of each step.
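
As a back-of-envelope check of those numbers (a minimal sketch; the 1-second step time and 10% communication budget are illustrative assumptions, not measurements):

```python
# Back-of-envelope gradient-synchronization math for the 70B example above.
# Assumptions (illustrative): FP16 gradients (2 bytes/param), a 1 s training
# step, and a 10% budget for communication.

def gradient_sync_requirements(params_billions: float,
                               bytes_per_param: int = 2,
                               step_time_s: float = 1.0,
                               comm_budget_fraction: float = 0.10) -> None:
    grad_bytes = params_billions * 1e9 * bytes_per_param
    budget_s = step_time_s * comm_budget_fraction
    required_bw = grad_bytes / budget_s  # aggregate bytes/second
    print(f"Gradient volume per sync: {grad_bytes / 1e9:.0f} GB")
    print(f"Communication budget:     {budget_s * 1e3:.0f} ms")
    print(f"Required aggregate BW:    {required_bw / 1e12:.2f} TB/s")

gradient_sync_requirements(70)
# Gradient volume per sync: 140 GB
# Communication budget:     100 ms
# Required aggregate BW:    1.40 TB/s
```

Note that 1.40 TB/s is the aggregate requirement; with a ring all-reduce, each GPU sends and receives roughly 2(N-1)/N times the gradient volume, so per-link demands are lower than the aggregate figure suggests.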

The efficiency of distributed training is determined by Amdahl's Law applied to communication: if communication overhead consumes 10% of each training step, the maximum theoretical scaling efficiency is 90%. In practice, scaling efficiency of 70-85% is typical for InfiniBand-connected clusters, while NVLink-connected clusters achieve 90-95%.
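
In symbols, with t_comp the compute time per step and t_comm the non-overlapped communication time:

```latex
E = \frac{t_{\mathrm{comp}}}{t_{\mathrm{comp}} + t_{\mathrm{comm}}} = 1 - f_{\mathrm{comm}},
\qquad
f_{\mathrm{comm}} = \frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}} + t_{\mathrm{comm}}}
```

With f_comm = 0.10 this gives E = 0.90, the 90% ceiling quoted above; overlapping communication with computation is what lets well-tuned clusters approach that ceiling in practice.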

Interconnect Technologies Compared

| Technology | Peak Bandwidth (per direction) | Typical Latency | Cable Reach | Power per Port | Best For |
|---|---|---|---|---|---|
| PCIe Gen5 x16 | 64 GB/s | 200-500 ns | Board-level | ~15W | GPU-to-CPU, storage |
| PCIe Gen6 x16 | 128 GB/s | 150-350 ns | Board-level | ~20W | Future GPU platforms |
| NVLink 4.0 | 450 GB/s | 100-200 ns | Server-internal | ~5W per link | Multi-GPU within server |
| NVLink 5.0 | 900 GB/s | 80-150 ns | Server-internal + NVSwitch | ~8W per link | Next-gen GPU clusters |
| InfiniBand NDR200 | 25 GB/s | 500-1000 ns | ~3m (copper) / 100m+ (optical) | ~12W | Mid-size clusters |
| InfiniBand NDR400 | 50 GB/s | 400-800 ns | ~3m (copper) / 100m+ (optical) | ~16W | Large training clusters |
| InfiniBand XDR800 | 100 GB/s | 300-600 ns | ~3m (copper) / 100m+ (optical) | ~22W | Future ultra-scale clusters |
| 400GbE (RoCEv2) | 50 GB/s | 1000-3000 ns | ~3m (copper) / 10km+ (optical) | ~14W | Inference, storage, management |
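
One caution when reading the table: network line rates are quoted in gigabits per second (Gb/s) while GPU interconnects are quoted in gigabytes per second (GB/s). A small helper makes the conversion explicit (a sketch that ignores encoding and protocol overhead, which trims a few percent off real-world throughput):

```python
# Convert network line rates (Gb/s) to the GB/s-per-direction figures used
# above. Ignores encoding/protocol overhead (a few percent in practice).

def gbps_to_gigabytes(line_rate_gbps: float, links: int = 1) -> float:
    """Aggregate one-direction bandwidth in GB/s for `links` ports."""
    return line_rate_gbps / 8 * links

print(gbps_to_gigabytes(400))            # NDR400: 50.0 GB/s per port
print(gbps_to_gigabytes(400, links=4))   # 4x NDR400 per node: 200.0 GB/s
print(gbps_to_gigabytes(800))            # XDR800: 100.0 GB/s per port
```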

Interconnect Requirements by Workload

Different AI workloads have dramatically different interconnect requirements. Understanding these requirements prevents over-investing in expensive interconnect fabrics while ensuring adequate performance for target workloads.

Single-GPU inference: No GPU-to-GPU interconnect required. PCIe Gen4 x16 provides adequate bandwidth for model loading and data transfer. Total interconnect budget: $0 (no additional cost beyond GPU server configuration).

Multi-GPU inference (tensor parallelism): Requires 50-200 GB/s per GPU for models split across 2-8 GPUs. NVLink (for GPU-to-GPU links within a server) or InfiniBand (for links across servers) provides adequate bandwidth. Total interconnect budget: 5-10% of total cluster cost.
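
To see where the 50-200 GB/s figure comes from, consider Megatron-style tensor parallelism, where each transformer layer performs two activation all-reduces. The sketch below estimates per-GPU traffic per generated token; the model dimensions and the ring all-reduce cost model are illustrative assumptions:

```python
# Rough per-GPU communication estimate for tensor-parallel transformer
# inference. Assumes two activation all-reduces per layer (Megatron-style)
# and a ring all-reduce, which moves ~2*(N-1)/N of the buffer per GPU.
# All constants below are illustrative assumptions.

def tp_comm_gb_per_token(hidden: int, layers: int, tp_degree: int,
                         bytes_per_elem: int = 2) -> float:
    allreduce_bytes = hidden * bytes_per_elem          # one activation vector
    per_layer = 2 * allreduce_bytes                    # two all-reduces/layer
    ring_factor = 2 * (tp_degree - 1) / tp_degree      # per-GPU ring traffic
    return layers * per_layer * ring_factor / 1e9

# A 70B-class model (hidden ~8192, ~80 layers) split across 8 GPUs:
print(f"{tp_comm_gb_per_token(8192, 80, 8) * 1e3:.1f} MB per token")
# 4.6 MB per token
```

Multiply the per-token figure by batch size and tokens per second to get sustained bandwidth; with large batches and high token rates this can reach tens of GB/s per GPU, consistent with the range above.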

Single-node training (8 GPUs): NVLink is strongly recommended. With 8 H100 GPUs connected via NVLink Switch (900 GB/s total bidirectional per GPU), a single node can train models up to 175B parameters efficiently. Interconnect cost: included in HGX platform cost.

Multi-node training (16-256 GPUs): Requires NVLink internally + InfiniBand NDR400 externally. Minimum 4x NDR400 links per node (200 GB/s aggregate). Interconnect cost: 15-25% of total cluster cost.

Ultra-scale training (256+ GPUs): Requires NVLink Switch System or InfiniBand XDR (800 Gb/s) with Fat Tree topology. Interconnect cost: 25-35% of total cluster cost—a significant but necessary investment for efficient large-scale training.

Topology Design for GPU Interconnects

The network topology connecting GPUs determines aggregate bandwidth, latency distribution, and fault tolerance. Common topologies include:

Fat Tree (Clos): The most scalable topology, providing full bisection bandwidth across any number of nodes. Full bisection requires a non-blocking (1:1) leaf-to-spine design: each leaf switch dedicates half its ports to hosts and half to spine uplinks. Recommended for clusters exceeding 64 GPUs.
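
A minimal sizing sketch for a non-blocking two-tier fat tree (the 64-port radix is an assumption chosen to match current-generation switches, not a recommendation):

```python
# Two-tier non-blocking (1:1) fat tree sizing. Each leaf switch dedicates
# half of its ports to hosts and half to spine uplinks, which is what
# preserves full bisection bandwidth.

def two_tier_fat_tree(radix: int) -> dict:
    hosts_per_leaf = radix // 2          # downlinks to hosts
    uplinks_per_leaf = radix // 2        # one link to each spine switch
    spines = uplinks_per_leaf            # spine count = uplinks per leaf
    leaves = radix                       # each spine port feeds one leaf
    return {
        "max_hosts": leaves * hosts_per_leaf,
        "leaf_switches": leaves,
        "spine_switches": spines,
    }

# 64-port switches:
print(two_tier_fat_tree(64))
# {'max_hosts': 2048, 'leaf_switches': 64, 'spine_switches': 32}
```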

Dragonfly+: Optimized for all-to-all communication patterns common in AI training. Group-based topology with high-bandwidth intra-group connections and reduced inter-group bandwidth. Efficient for clusters up to 512 GPUs with 30-40% lower cabling requirements than Fat Tree.

3D Torus: Used primarily in HPC systems (Fugaku, Frontier). Excellent for nearest-neighbor communication patterns in scientific computing but suboptimal for all-to-all gradient synchronization in AI training. Not recommended for pure AI workloads.

Frequently Asked Questions

Can I use Ethernet for AI training instead of InfiniBand?

Ethernet with RoCEv2 (RDMA over Converged Ethernet) works for AI training up to 32-GPU clusters with acceptable efficiency. Beyond 32 GPUs, InfiniBand provides 20-30% better scaling efficiency due to lower latency and better congestion management. For clusters under 16 GPUs, the difference is minimal.

How do I determine my interconnect bandwidth requirements?

The industry rule of thumb: interconnect bandwidth per GPU should be at least 10% of GPU memory bandwidth for near-linear scaling. For H100 (3.35 TB/s memory bandwidth), minimum interconnect bandwidth is 335 GB/s per GPU—achievable only with NVLink. For inter-node connections, 50 GB/s per GPU (one NDR400 port per GPU, i.e., 8x NDR400 per 8-GPU node) provides adequate performance for most workloads.
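
Applying that rule of thumb in code (a trivial sketch; the 10% figure is a heuristic, not a hard requirement):

```python
# The 10%-of-memory-bandwidth heuristic from the answer above.
def min_interconnect_gbs(mem_bw_tbs: float, fraction: float = 0.10) -> float:
    """Minimum per-GPU interconnect bandwidth (GB/s) for near-linear scaling."""
    return mem_bw_tbs * 1000 * fraction

print(f"{min_interconnect_gbs(3.35):.0f} GB/s per GPU")  # H100: 335 GB/s
```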

Does interconnect bandwidth affect inference latency?

For single-GPU inference, interconnect bandwidth has negligible impact. For multi-GPU inference with tensor parallelism (required for models exceeding single GPU memory), interconnect bandwidth directly affects p99 latency. Slow interconnect adds 10-50ms per inference request for 70B+ parameter models.