What is NVLink? Complete Guide to GPU Interconnect Technology

May 13, 2026 · Technical Deep Dives
Reviewed by NTS AI Infrastructure Engineer · Technical accuracy verified for enterprise & federal deployment
NVIDIA H100 NVL PCIE RETAIL SCB (product image)

Quick Summary

  • Technology: NVIDIA high-speed direct GPU-to-GPU interconnect
  • Bandwidth: Up to 900 GB/s total per GPU (NVLink 4.0) vs ~128 GB/s bidirectional for a PCIe Gen5 x16 slot
  • Topology: Full-mesh NVSwitch enables all-to-all GPU communication
  • Scaling: Supports up to 256 GPUs in a single NVLink domain
  • Impact: 3-5x faster multi-GPU training vs PCIe-based interconnects

NVIDIA NVLink represents one of the most significant technological advancements in GPU computing architecture. As a high-bandwidth, low-latency interconnect technology, NVLink fundamentally changes how multiple GPUs communicate within a server, enabling near-linear scaling for AI training and HPC workloads that traditional PCIe interconnects cannot sustain. This comprehensive guide covers NVLink architecture from first principles through practical deployment considerations.

NVLink Architecture and Generations

NVLink is NVIDIA's proprietary GPU-to-GPU interconnect technology, designed to overcome the bandwidth limitations of PCI Express for multi-GPU communication. Each NVLink generation has progressively increased bandwidth while reducing latency and power consumption:

| Generation | GPU | Bandwidth per Link (bidirectional) | Links per GPU | Total BW per GPU | Topology |
|---|---|---|---|---|---|
| NVLink 1.0 | P100 | 40 GB/s | 4 | 160 GB/s | Hybrid Cube Mesh |
| NVLink 2.0 | V100 | 50 GB/s | 6 | 300 GB/s | Hybrid Cube Mesh / NVSwitch |
| NVLink 3.0 | A100 | 50 GB/s | 12 | 600 GB/s | NVSwitch 2 |
| NVLink 4.0 | H100/H200 | 50 GB/s | 18 | 900 GB/s | NVSwitch 3 |
| NVLink 5.0 | B100/B200 | 100 GB/s | 18 | 1,800 GB/s | NVSwitch 4 |

NVLink Switch technology enables scaling beyond 8 GPUs by creating a fully connected NVLink domain across up to 256 GPUs in the DGX H100 SuperPOD architecture. The third-generation NVSwitch is a specialized ASIC providing 64 NVLink 4.0 ports on a single chip, with 3.2 TB/s of full-duplex all-to-all bandwidth per switch.
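To see whether the GPUs in a given server can actually reach each other directly, a quick probe from PyTorch is often enough. The sketch below is an illustrative example rather than anything from NVIDIA's documentation: it uses torch.cuda.can_device_access_peer to print a peer-access matrix. Note that peer access alone does not distinguish NVLink from PCIe P2P; nvidia-smi topo -m reports the actual link type.

```python
# Minimal sketch: probe GPU-to-GPU peer access with PyTorch.
# A "1" means direct P2P is possible between the pair; this does not by
# itself prove NVLink (PCIe P2P also qualifies).
import torch

def peer_access_matrix() -> None:
    n = torch.cuda.device_count()
    print("GPU peer-access matrix (1 = direct P2P possible):")
    for src in range(n):
        row = []
        for dst in range(n):
            if src == dst:
                row.append("-")
            else:
                row.append("1" if torch.cuda.can_device_access_peer(src, dst) else "0")
        print(f"GPU{src}: " + " ".join(row))

if __name__ == "__main__":
    peer_access_matrix()
```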

NVLink vs PCIe: Performance Comparison

For multi-GPU AI training, NVLink provides roughly 7-14x higher bandwidth than a PCIe Gen5 x16 link, dramatically reducing communication overhead in parallel training strategies. In tensor parallelism, where activations are split across GPUs with frequent all-reduce operations, NVLink's bandwidth advantage translates directly to training speed improvements.

Measured performance impact: GPT-3 175B training on 8x H100 with NVLink achieves 3.2 petaFLOPS sustained throughput. The same GPUs connected via PCIe Gen5 achieve approximately 1.8-2.0 petaFLOPS—a 40-60% reduction in training throughput due to communication bottlenecks. For models using pipeline parallelism with less frequent communication, the NVLink advantage reduces to 20-30%.

NVLink is essential for: Tensor parallelism (splitting layers across GPUs), fully-sharded data parallelism (FSDP with fine-grained all-gather/reduce-scatter), mixture-of-experts (MoE) routing with expert parallelism, and large-batch training requiring frequent all-reduce synchronization.
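The interconnect difference is easy to observe with a small all-reduce microbenchmark. The following is a minimal sketch using torch.distributed with the NCCL backend, launched with torchrun; the payload size, iteration count, and the file name allreduce_bench.py are illustrative choices, and NCCL routes the collective over NVLink/NVSwitch automatically when it is available.

```python
# Minimal all-reduce bandwidth sketch (NCCL backend).
# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    numel = 256 * 1024 * 1024  # 256M fp16 elements = 512 MB payload
    x = torch.ones(numel, dtype=torch.float16, device="cuda")

    for _ in range(5):  # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()

    secs = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
    size_bytes = numel * x.element_size()
    # Standard "bus bandwidth" estimate for ring all-reduce: 2*(n-1)/n * size / time
    bus_bw = 2 * (world - 1) / world * size_bytes / secs / 1e9
    if rank == 0:
        print(f"all-reduce bus bandwidth: {bus_bw:.1f} GB/s over {world} GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Running the same script on an NVLink-connected and a PCIe-only system makes the bandwidth gap described above directly visible.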

NVLink Topologies and Configuration

NVIDIA HGX platforms implement NVLink through specific topology configurations that determine inter-GPU communication patterns:

HGX A100 (NVSwitch): All 8 GPUs connect through 6 NVSwitch ASICs, providing full bisection bandwidth between any GPU pair. Because every GPU links to every switch, latency and bandwidth are uniform regardless of which GPUs communicate.

HGX H100 (NVSwitch 3): Third-generation NVSwitch paired with fourth-generation NVLink (NVLink 4.0). Each H100 GPU connects to all four NVSwitch ASICs simultaneously, providing 900 GB/s of total bidirectional bandwidth per GPU. The topology supports multi-node NVLink domains of up to 256 GPUs.

HGX B200 (NVSwitch 4): Fourth-generation NVSwitch with NVLink 5.0, doubling bandwidth to 1.8 TB/s per GPU. The topology introduces adaptive routing for congestion management in large multi-node configurations.
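On an actual HGX host, the effective topology can be confirmed from userspace. A minimal sketch that shells out to nvidia-smi is shown below: topo -m prints the link-type matrix (NV#, PIX, SYS, ...) and nvlink -s reports per-link NVLink status for each GPU.

```python
# Minimal sketch: inspect the host's GPU interconnect topology via nvidia-smi.
# `topo -m` prints the GPU-to-GPU link-type matrix; `nvlink -s` prints
# per-link NVLink status for each GPU.
import subprocess

def show_gpu_topology() -> None:
    for args in (["nvidia-smi", "topo", "-m"], ["nvidia-smi", "nvlink", "-s"]):
        print(f"$ {' '.join(args)}")
        result = subprocess.run(args, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)

if __name__ == "__main__":
    show_gpu_topology()
```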

NVLink for Multi-Node Clusters

NVLink Switch System (NVSS) extends NVLink beyond single-server boundaries. NVSS connects up to 32 HGX servers (256 GPUs) in a single NVLink domain, eliminating the performance penalty of inter-node communication over InfiniBand or Ethernet.

NVSS architecture: Each HGX H100 server connects to external NVSwitch trays via 8x OSFP optical transceivers (eight ~100 Gbps lanes per port, roughly 800 Gbps aggregate per port). The switch trays provide non-blocking all-to-all connectivity across the entire 256-GPU domain with 3.6 TB/s aggregate throughput.

Performance benefit: Multi-node training with NVSS achieves 95%+ scaling efficiency up to 32 nodes, compared to 70-80% with InfiniBand. For large model training (70B+ parameters), NVSS reduces inter-node communication time by 3-5x compared to InfiniBand NDR400.
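As a worked example of what a scaling-efficiency figure means, the snippet below computes efficiency as measured throughput divided by ideal throughput (single-node throughput times node count). The 32-node number used here is hypothetical, chosen only to illustrate a 95% result; it is not a benchmark measurement.

```python
# Illustrative scaling-efficiency calculation (inputs are hypothetical).
# Efficiency = measured throughput / (nodes * single-node throughput).
def scaling_efficiency(single_node_tput: float, n_nodes: int, measured_tput: float) -> float:
    ideal = single_node_tput * n_nodes
    return measured_tput / ideal

# Example: 3.2 PFLOPS on 1 node, 97.3 PFLOPS (hypothetical) on 32 nodes
print(f"{scaling_efficiency(3.2, 32, 97.3):.1%}")  # -> 95.0%
```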

Frequently Asked Questions

Is NVLink required for multi-GPU AI training?

NVLink is not strictly required but strongly recommended. PCIe-connected GPUs can train models, but with 30-50% lower GPU utilization and 40-60% longer training times for models requiring tensor parallelism. For models under roughly 13B parameters that can be trained with pure data parallelism (each GPU holding a full model replica), NVLink provides less benefit.

Can NVLink connect GPUs across different servers without NVSwitch?

Standard NVLink connections are server-internal. Cross-server NVLink requires NVLink Switch System (NVSS) hardware, which is a significant infrastructure investment justified only for clusters exceeding 32 GPUs.

Does NVLink benefit AI inference?

NVLink provides marginal benefit for single-GPU inference but significant benefit for multi-GPU inference using tensor parallelism. For serving Llama 3 70B across 4 H100 GPUs, NVLink reduces inter-GPU communication latency by 5-10x compared to PCIe, enabling lower p99 inference latency.
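As a concrete illustration (assuming the vLLM serving library, which this article does not otherwise cover), the sketch below shards a 70B model across 4 GPUs with tensor parallelism; the framework's NCCL collectives use NVLink automatically when it is present.

```python
# Minimal sketch of tensor-parallel inference, assuming vLLM as the serving
# stack (an assumption, not a requirement of NVLink itself).
# tensor_parallel_size=4 shards the model's weights and layers across 4 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain NVLink in one paragraph."], params)
print(outputs[0].outputs[0].text)
```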