What is NVLink? Complete Guide to GPU Interconnect Technology

May 13, 2026 · Technical Deep Dives
Reviewed by NTS AI Infrastructure Engineer · Technical accuracy verified for enterprise & federal deployment
NVIDIA H100 NVL PCIE RETAIL SCB (product image)

Quick Summary

  • Technology: NVIDIA high-speed direct GPU-to-GPU interconnect
  • Bandwidth: Up to 900 GB/s total per GPU (NVLink 4.0) vs ~128 GB/s bidirectional for a PCIe Gen5 x16 slot
  • Topology: Full-mesh NVSwitch enables all-to-all GPU communication
  • Scaling: Supports up to 256 GPUs in a single NVLink domain
  • Impact: 3-5x faster multi-GPU training vs PCIe-based interconnects

NVIDIA NVLink represents one of the most significant technological advancements in GPU computing architecture. As a high-bandwidth, low-latency interconnect technology, NVLink fundamentally changes how multiple GPUs communicate within a server, enabling near-linear scaling for AI training and HPC workloads that traditional PCIe interconnects cannot sustain. This comprehensive guide covers NVLink architecture from first principles through practical deployment considerations.

NVLink Architecture and Generations

NVLink is NVIDIA's proprietary GPU-to-GPU interconnect technology, designed to overcome the bandwidth limitations of PCI Express for multi-GPU communication. Each NVLink generation has progressively increased bandwidth while reducing latency and power consumption:

| Generation | GPU | Bandwidth per Link (bidirectional) | Links per GPU | Total BW per GPU | Topology |
|---|---|---|---|---|---|
| NVLink 1.0 | P100 | 40 GB/s | 4 | 160 GB/s | Hybrid Cube Mesh |
| NVLink 2.0 | V100 | 50 GB/s | 6 | 300 GB/s | Hybrid Cube Mesh / NVSwitch |
| NVLink 3.0 | A100 | 50 GB/s | 12 | 600 GB/s | NVSwitch 2 |
| NVLink 4.0 | H100/H200 | 50 GB/s | 18 | 900 GB/s | NVSwitch 3 |
| NVLink 5.0 | B100/B200 | 100 GB/s | 18 | 1,800 GB/s | NVSwitch 4 |

NVLink Switch technology enables scaling beyond 8 GPUs by creating a fully connected NVLink domain across up to 256 GPUs in the DGX H100 SuperPOD architecture. The third-generation NVSwitch is a specialized ASIC providing 64 NVLink 4.0 ports on a single chip, with 3.2 TB/s of full-duplex all-to-all bandwidth per switch.
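To see whether the GPUs in a given server can actually reach each other directly, a quick probe from PyTorch is often enough. The sketch below is an illustrative example rather than anything from NVIDIA's documentation: it uses torch.cuda.can_device_access_peer to print a peer-access matrix. Note that peer access alone does not distinguish NVLink from PCIe P2P; nvidia-smi topo -m reports the actual link type.

```python
# Minimal sketch: probe GPU-to-GPU peer access with PyTorch.
# A "1" means direct P2P is possible between the pair; this does not by
# itself prove NVLink (PCIe P2P also qualifies).
import torch

def peer_access_matrix() -> None:
    n = torch.cuda.device_count()
    print("GPU peer-access matrix (1 = direct P2P possible):")
    for src in range(n):
        row = []
        for dst in range(n):
            if src == dst:
                row.append("-")
            else:
                row.append("1" if torch.cuda.can_device_access_peer(src, dst) else "0")
        print(f"GPU{src}: " + " ".join(row))

if __name__ == "__main__":
    peer_access_matrix()
```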

NVLink vs PCIe: Performance Comparison

For multi-GPU AI training, NVLink provides roughly 7-14x higher bandwidth than a PCIe Gen5 x16 link, dramatically reducing communication overhead in parallel training strategies. In tensor parallelism, where activations are split across GPUs with frequent all-reduce operations, NVLink's bandwidth advantage translates directly to training speed improvements.

Measured performance impact: GPT-3 175B training on 8x H100 with NVLink achieves 3.2 petaFLOPS sustained throughput. The same GPUs connected via PCIe Gen5 achieve approximately 1.8-2.0 petaFLOPS—a 40-60% reduction in training throughput due to communication bottlenecks. For models using pipeline parallelism with less frequent communication, the NVLink advantage reduces to 20-30%.

NVLink is essential for: Tensor parallelism (splitting layers across GPUs), fully-sharded data parallelism (FSDP with fine-grained all-gather/reduce-scatter), mixture-of-experts (MoE) routing with expert parallelism, and large-batch training requiring frequent all-reduce synchronization.
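The interconnect difference is easy to observe with a small all-reduce microbenchmark. The following is a minimal sketch using torch.distributed with the NCCL backend, launched with torchrun; the payload size, iteration count, and the file name allreduce_bench.py are illustrative choices, and NCCL routes the collective over NVLink/NVSwitch automatically when it is available.

```python
# Minimal all-reduce bandwidth sketch (NCCL backend).
# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    numel = 256 * 1024 * 1024  # 256M fp16 elements = 512 MB payload
    x = torch.ones(numel, dtype=torch.float16, device="cuda")

    for _ in range(5):  # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()

    secs = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
    size_bytes = numel * x.element_size()
    # Standard "bus bandwidth" estimate for ring all-reduce: 2*(n-1)/n * size / time
    bus_bw = 2 * (world - 1) / world * size_bytes / secs / 1e9
    if rank == 0:
        print(f"all-reduce bus bandwidth: {bus_bw:.1f} GB/s over {world} GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Running the same script on an NVLink-connected and a PCIe-only system makes the bandwidth gap described above directly visible.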

NVLink Topologies and Configuration

NVIDIA HGX platforms implement NVLink through specific topology configurations that determine inter-GPU communication patterns:

HGX A100 (NVSwitch): All 8 GPUs connect through 6 NVSwitch ASICs, providing full bisection bandwidth between any GPU pair. Because every GPU links to every switch, latency and bandwidth are uniform regardless of which GPUs communicate.

HGX H100 (NVSwitch 3): Third-generation NVSwitch paired with fourth-generation NVLink (NVLink 4.0). Each H100 GPU connects to all four NVSwitch ASICs simultaneously, providing 900 GB/s of total bidirectional bandwidth per GPU. The topology supports multi-node NVLink domains of up to 256 GPUs.

HGX B200 (NVSwitch 4): Fourth-generation NVSwitch with NVLink 5.0, doubling bandwidth to 1.8 TB/s per GPU. The topology introduces adaptive routing for congestion management in large multi-node configurations.
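On an actual HGX host, the effective topology can be confirmed from userspace. A minimal sketch that shells out to nvidia-smi is shown below: topo -m prints the link-type matrix (NV#, PIX, SYS, ...) and nvlink -s reports per-link NVLink status for each GPU.

```python
# Minimal sketch: inspect the host's GPU interconnect topology via nvidia-smi.
# `topo -m` prints the GPU-to-GPU link-type matrix; `nvlink -s` prints
# per-link NVLink status for each GPU.
import subprocess

def show_gpu_topology() -> None:
    for args in (["nvidia-smi", "topo", "-m"], ["nvidia-smi", "nvlink", "-s"]):
        print(f"$ {' '.join(args)}")
        result = subprocess.run(args, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)

if __name__ == "__main__":
    show_gpu_topology()
```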

NVLink for Multi-Node Clusters

NVLink Switch System (NVSS) extends NVLink beyond single-server boundaries. NVSS connects up to 32 HGX servers (256 GPUs) in a single NVLink domain, eliminating the performance penalty of inter-node communication over InfiniBand or Ethernet.

NVSS architecture: Each HGX H100 server connects to external NVSwitch trays via 8x OSFP optical transceivers (eight ~100 Gbps lanes per port, roughly 800 Gbps aggregate per port). The switch trays provide non-blocking all-to-all connectivity across the entire 256-GPU domain with 3.6 TB/s aggregate throughput.

Performance benefit: Multi-node training with NVSS achieves 95%+ scaling efficiency up to 32 nodes, compared to 70-80% with InfiniBand. For large model training (70B+ parameters), NVSS reduces inter-node communication time by 3-5x compared to InfiniBand NDR400.
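As a worked example of what a scaling-efficiency figure means, the snippet below computes efficiency as measured throughput divided by ideal throughput (single-node throughput times node count). The 32-node number used here is hypothetical, chosen only to illustrate a 95% result; it is not a benchmark measurement.

```python
# Illustrative scaling-efficiency calculation (inputs are hypothetical).
# Efficiency = measured throughput / (nodes * single-node throughput).
def scaling_efficiency(single_node_tput: float, n_nodes: int, measured_tput: float) -> float:
    ideal = single_node_tput * n_nodes
    return measured_tput / ideal

# Example: 3.2 PFLOPS on 1 node, 97.3 PFLOPS (hypothetical) on 32 nodes
print(f"{scaling_efficiency(3.2, 32, 97.3):.1%}")  # -> 95.0%
```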

Frequently Asked Questions

Is NVLink required for multi-GPU AI training?

NVLink is not strictly required but strongly recommended. PCIe-connected GPUs can train models, but with 30-50% lower GPU utilization and 40-60% longer training times for models requiring tensor parallelism. For models under roughly 13B parameters that can be trained with pure data parallelism (each GPU holding a full model replica), NVLink provides less benefit.

Can NVLink connect GPUs across different servers without NVSwitch?

Standard NVLink connections are server-internal. Cross-server NVLink requires NVLink Switch System (NVSS) hardware, which is a significant infrastructure investment justified only for clusters exceeding 32 GPUs.

Does NVLink benefit AI inference?

NVLink provides marginal benefit for single-GPU inference but significant benefit for multi-GPU inference using tensor parallelism. For serving Llama 3 70B across 4 H100 GPUs, NVLink reduces inter-GPU communication latency by 5-10x compared to PCIe, enabling lower p99 inference latency.
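As a concrete illustration (assuming the vLLM serving library, which this article does not otherwise cover), the sketch below shards a 70B model across 4 GPUs with tensor parallelism; the framework's NCCL collectives use NVLink automatically when it is present.

```python
# Minimal sketch of tensor-parallel inference, assuming vLLM as the serving
# stack (an assumption, not a requirement of NVLink itself).
# tensor_parallel_size=4 shards the model's weights and layers across 4 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain NVLink in one paragraph."], params)
print(outputs[0].outputs[0].text)
```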