What is GPU Interconnect Bandwidth? Understanding Communication in AI Infrastructure

May 13, 2026 · Technical Deep Dives
Reviewed by NTS AI Infrastructure Engineer · Technical accuracy verified for enterprise & federal deployment

Quick Summary

  • NVLink 4.0: 900 GB/s per GPU total bidirectional (450 GB/s each direction), NVIDIA proprietary
  • InfiniBand NDR400: 400 Gb/s per port, open standard for cluster networking
  • PCIe Gen5: 64 GB/s per direction, universal compatibility
  • Training Impact: NVLink eliminates GPU communication bottleneck; InfiniBand scales across nodes
  • Selection Guide: NVLink for single-node, InfiniBand for multi-node clusters

GPU interconnect bandwidth—the speed at which GPUs communicate with each other and with the broader computing system—is a critical yet often overlooked factor in AI infrastructure performance. While GPU compute capability (FLOPS) and memory bandwidth receive the most attention, interconnect bandwidth frequently becomes the bottleneck in distributed AI training at scale. This comprehensive guide explains GPU interconnect technologies, their performance characteristics, and how to select the right interconnect for your AI workloads.

The Role of Interconnect in AI Computing

In distributed AI training, model parameters and gradients must be synchronized across GPUs after each training step. The size of these gradient exchanges scales with model size: for a 70B parameter model at FP16 precision (2 bytes per parameter), each gradient synchronization round exchanges 140GB of data across all GPUs. With training steps completing in 1-5 seconds, the interconnect must move that 140GB in roughly 100ms to keep communication overhead near 10% of each step.
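
As a back-of-envelope check of those numbers (a minimal sketch; the 1-second step time and 10% communication budget are illustrative assumptions, not measurements):

```python
# Back-of-envelope gradient-synchronization math for the 70B example above.
# Assumptions (illustrative): FP16 gradients (2 bytes/param), a 1 s training
# step, and a 10% budget for communication.

def gradient_sync_requirements(params_billions: float,
                               bytes_per_param: int = 2,
                               step_time_s: float = 1.0,
                               comm_budget_fraction: float = 0.10) -> None:
    grad_bytes = params_billions * 1e9 * bytes_per_param
    budget_s = step_time_s * comm_budget_fraction
    required_bw = grad_bytes / budget_s  # aggregate bytes/second
    print(f"Gradient volume per sync: {grad_bytes / 1e9:.0f} GB")
    print(f"Communication budget:     {budget_s * 1e3:.0f} ms")
    print(f"Required aggregate BW:    {required_bw / 1e12:.2f} TB/s")

gradient_sync_requirements(70)
# Gradient volume per sync: 140 GB
# Communication budget:     100 ms
# Required aggregate BW:    1.40 TB/s
```

Note that 1.40 TB/s is the aggregate requirement; with a ring all-reduce, each GPU sends and receives roughly 2(N-1)/N times the gradient volume, so per-link demands are lower than the aggregate figure suggests.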

The efficiency of distributed training is determined by Amdahl's Law applied to communication: if communication overhead consumes 10% of each training step, the maximum theoretical scaling efficiency is 90%. In practice, scaling efficiency of 70-85% is typical for InfiniBand-connected clusters, while NVLink-connected clusters achieve 90-95%.
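
In symbols, with t_comp the compute time per step and t_comm the non-overlapped communication time:

```latex
E = \frac{t_{\mathrm{comp}}}{t_{\mathrm{comp}} + t_{\mathrm{comm}}} = 1 - f_{\mathrm{comm}},
\qquad
f_{\mathrm{comm}} = \frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}} + t_{\mathrm{comm}}}
```

With f_comm = 0.10 this gives E = 0.90, the 90% ceiling quoted above; overlapping communication with computation is what lets well-tuned clusters approach that ceiling in practice.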

Interconnect Technologies Compared

| Technology | Peak Bandwidth (per direction) | Typical Latency | Cable Reach | Power per Port | Best For |
|---|---|---|---|---|---|
| PCIe Gen5 x16 | 64 GB/s | 200-500 ns | Board-level | ~15W | GPU-to-CPU, storage |
| PCIe Gen6 x16 | 128 GB/s | 150-350 ns | Board-level | ~20W | Future GPU platforms |
| NVLink 4.0 | 450 GB/s | 100-200 ns | Server-internal | ~5W per link | Multi-GPU within server |
| NVLink 5.0 | 900 GB/s | 80-150 ns | Server-internal + NVSwitch | ~8W per link | Next-gen GPU clusters |
| InfiniBand NDR200 | 25 GB/s | 500-1000 ns | ~3m (copper) / 100m+ (optical) | ~12W | Mid-size clusters |
| InfiniBand NDR400 | 50 GB/s | 400-800 ns | ~3m (copper) / 100m+ (optical) | ~16W | Large training clusters |
| InfiniBand XDR800 | 100 GB/s | 300-600 ns | ~3m (copper) / 100m+ (optical) | ~22W | Future ultra-scale clusters |
| 400GbE (RoCEv2) | 50 GB/s | 1000-3000 ns | ~3m (copper) / 10km+ (optical) | ~14W | Inference, storage, management |
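
One caution when reading the table: network line rates are quoted in gigabits per second (Gb/s) while GPU interconnects are quoted in gigabytes per second (GB/s). A small helper makes the conversion explicit (a sketch that ignores encoding and protocol overhead, which trims a few percent off real-world throughput):

```python
# Convert network line rates (Gb/s) to the GB/s-per-direction figures used
# above. Ignores encoding/protocol overhead (a few percent in practice).

def gbps_to_gigabytes(line_rate_gbps: float, links: int = 1) -> float:
    """Aggregate one-direction bandwidth in GB/s for `links` ports."""
    return line_rate_gbps / 8 * links

print(gbps_to_gigabytes(400))            # NDR400: 50.0 GB/s per port
print(gbps_to_gigabytes(400, links=4))   # 4x NDR400 per node: 200.0 GB/s
print(gbps_to_gigabytes(800))            # XDR800: 100.0 GB/s per port
```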

Interconnect Requirements by Workload

Different AI workloads have dramatically different interconnect requirements. Understanding these requirements prevents over-investing in expensive interconnect fabrics while ensuring adequate performance for target workloads.

Single-GPU inference: No GPU-to-GPU interconnect required. PCIe Gen4 x16 provides adequate bandwidth for model loading and data transfer. Total interconnect budget: $0 (no additional cost beyond GPU server configuration).

Multi-GPU inference (tensor parallelism): Requires 50-200 GB/s per GPU for models split across 2-8 GPUs. NVLink (for GPU-to-GPU links within a server) or InfiniBand (for links across servers) provides adequate bandwidth. Total interconnect budget: 5-10% of total cluster cost.
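
To see where the 50-200 GB/s figure comes from, consider Megatron-style tensor parallelism, where each transformer layer performs two activation all-reduces. The sketch below estimates per-GPU traffic per generated token; the model dimensions and the ring all-reduce cost model are illustrative assumptions:

```python
# Rough per-GPU communication estimate for tensor-parallel transformer
# inference. Assumes two activation all-reduces per layer (Megatron-style)
# and a ring all-reduce, which moves ~2*(N-1)/N of the buffer per GPU.
# All constants below are illustrative assumptions.

def tp_comm_gb_per_token(hidden: int, layers: int, tp_degree: int,
                         bytes_per_elem: int = 2) -> float:
    allreduce_bytes = hidden * bytes_per_elem          # one activation vector
    per_layer = 2 * allreduce_bytes                    # two all-reduces/layer
    ring_factor = 2 * (tp_degree - 1) / tp_degree      # per-GPU ring traffic
    return layers * per_layer * ring_factor / 1e9

# A 70B-class model (hidden ~8192, ~80 layers) split across 8 GPUs:
print(f"{tp_comm_gb_per_token(8192, 80, 8) * 1e3:.1f} MB per token")
# 4.6 MB per token
```

Multiply the per-token figure by batch size and tokens per second to get sustained bandwidth; with large batches and high token rates this can reach tens of GB/s per GPU, consistent with the range above.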

Single-node training (8 GPUs): NVLink is strongly recommended. With 8 H100 GPUs connected via NVLink Switch (900 GB/s total bidirectional per GPU), a single node can train models up to 175B parameters efficiently. Interconnect cost: included in HGX platform cost.

Multi-node training (16-256 GPUs): Requires NVLink internally + InfiniBand NDR400 externally. Minimum 4x NDR400 links per node (200 GB/s aggregate). Interconnect cost: 15-25% of total cluster cost.

Ultra-scale training (256+ GPUs): Requires NVLink Switch System or InfiniBand XDR (800 Gb/s) with Fat Tree topology. Interconnect cost: 25-35% of total cluster cost—a significant but necessary investment for efficient large-scale training.

Topology Design for GPU Interconnects

The network topology connecting GPUs determines aggregate bandwidth, latency distribution, and fault tolerance. Common topologies include:

Fat Tree (Clos): The most scalable topology, providing full bisection bandwidth across any number of nodes. Full bisection requires a non-blocking (1:1) leaf-to-spine design: each leaf switch dedicates half its ports to hosts and half to spine uplinks. Recommended for clusters exceeding 64 GPUs.
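
A minimal sizing sketch for a non-blocking two-tier fat tree (the 64-port radix is an assumption chosen to match current-generation switches, not a recommendation):

```python
# Two-tier non-blocking (1:1) fat tree sizing. Each leaf switch dedicates
# half of its ports to hosts and half to spine uplinks, which is what
# preserves full bisection bandwidth.

def two_tier_fat_tree(radix: int) -> dict:
    hosts_per_leaf = radix // 2          # downlinks to hosts
    uplinks_per_leaf = radix // 2        # one link to each spine switch
    spines = uplinks_per_leaf            # spine count = uplinks per leaf
    leaves = radix                       # each spine port feeds one leaf
    return {
        "max_hosts": leaves * hosts_per_leaf,
        "leaf_switches": leaves,
        "spine_switches": spines,
    }

# 64-port switches:
print(two_tier_fat_tree(64))
# {'max_hosts': 2048, 'leaf_switches': 64, 'spine_switches': 32}
```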

Dragonfly+: Optimized for all-to-all communication patterns common in AI training. Group-based topology with high-bandwidth intra-group connections and reduced inter-group bandwidth. Efficient for clusters up to 512 GPUs with 30-40% lower cabling requirements than Fat Tree.

3D Torus: Used primarily in HPC systems (Fugaku, Frontier). Excellent for nearest-neighbor communication patterns in scientific computing but suboptimal for all-to-all gradient synchronization in AI training. Not recommended for pure AI workloads.

Frequently Asked Questions

Can I use Ethernet for AI training instead of InfiniBand?

Ethernet with RoCEv2 (RDMA over Converged Ethernet) works for AI training up to 32-GPU clusters with acceptable efficiency. Beyond 32 GPUs, InfiniBand provides 20-30% better scaling efficiency due to lower latency and better congestion management. For clusters under 16 GPUs, the difference is minimal.

How do I determine my interconnect bandwidth requirements?

The industry rule of thumb: interconnect bandwidth per GPU should be at least 10% of GPU memory bandwidth for near-linear scaling. For H100 (3.35 TB/s memory bandwidth), minimum interconnect bandwidth is 335 GB/s per GPU—achievable only with NVLink. For inter-node connections, 50 GB/s per GPU (one NDR400 port per GPU, i.e., 8x NDR400 per 8-GPU node) provides adequate performance for most workloads.
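
Applying that rule of thumb in code (a trivial sketch; the 10% figure is a heuristic, not a hard requirement):

```python
# The 10%-of-memory-bandwidth heuristic from the answer above.
def min_interconnect_gbs(mem_bw_tbs: float, fraction: float = 0.10) -> float:
    """Minimum per-GPU interconnect bandwidth (GB/s) for near-linear scaling."""
    return mem_bw_tbs * 1000 * fraction

print(f"{min_interconnect_gbs(3.35):.0f} GB/s per GPU")  # H100: 335 GB/s
```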

Does interconnect bandwidth affect inference latency?

For single-GPU inference, interconnect bandwidth has negligible impact. For multi-GPU inference with tensor parallelism (required for models exceeding single GPU memory), interconnect bandwidth directly affects p99 latency. Slow interconnect adds 10-50ms per inference request for 70B+ parameter models.