What is an HGX Platform? NVIDIA AI Supercomputing Architecture
Quick Summary
- Definition: NVIDIA reference architecture for multi-GPU AI supercomputing
- Components: 8x GPU baseboard, NVLink Switch, reference motherboard design
- OEM Partners: Supermicro, Dell, HPE, Lenovo build HGX-based systems
- Current Gen: HGX H100 (80GB HBM3) and HGX H200 (141GB HBM3e)
- Federal: TAA-compliant HGX configurations available for US government
The NVIDIA HGX platform is the definitive architecture for enterprise AI computing. HGX is NVIDIA's reference design for multi-GPU servers, integrating 4 or 8 high-performance GPUs with NVLink interconnect, optimized power delivery, and validated thermal management into a standardized platform adopted by every major server OEM. This guide explains HGX architecture, platform generations, and how to select the right HGX configuration for your AI workloads.
HGX Platform Architecture
The HGX platform is designed as a modular "baseboard" that hosts multiple NVIDIA GPUs with integrated NVLink switching, power regulation, and thermal monitoring. Server OEMs integrate the HGX baseboard into their chassis designs, adding CPUs, memory, storage, and networking to create complete AI server solutions.
HGX baseboard components: The baseboard contains GPU sockets (SXM form factor), NVLink Switch ASICs for GPU-to-GPU communication (on A100/H100/H200/B200 generations), voltage regulator modules (VRMs) delivering 500-2000A of current to the GPUs, temperature sensors and fan control logic, and management firmware interfaces for out-of-band monitoring.
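Those VRM current ratings follow directly from Ohm's law: GPU core rails run at roughly 1V, so delivering hundreds of watts means delivering hundreds of amps. A minimal sketch of the arithmetic (the 0.85V core voltage is an illustrative assumption; real GPUs use multiple rails with dynamic voltage scaling):

```python
def vrm_current(power_w: float, core_voltage_v: float = 0.85) -> float:
    """Estimate GPU core rail current via I = P / V.

    core_voltage_v is an assumed illustrative value, not a
    published spec; actual rail voltages vary with load.
    """
    return power_w / core_voltage_v

# A 700 W SXM GPU at an assumed 0.85 V core rail:
print(round(vrm_current(700), 1))  # 823.5 A -- hundreds of amps per GPU
```

Multiply by eight GPUs and the baseboard's kiloamp-scale power delivery requirement becomes clear.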
SXM GPU form factor: HGX uses NVIDIA's SXM mezzanine form factor, distinct from standard PCIe slot-mounted GPUs. SXM provides up to roughly 2x the power delivery capacity (700W vs 300-450W for PCIe), direct NVLink connections without PCIe bridge overhead, and an optimized thermal interface for liquid cooling cold plate attachment. SXM GPUs cannot be installed in standard PCIe slots—they require an HGX-compatible baseboard.
HGX Generations Compared
| Generation | GPU | GPU Count | Interconnect | GPU Memory | AI FLOPs (FP8, 8-GPU, sparse) | Max Power |
|---|---|---|---|---|---|---|
| HGX A100 | A100 80GB | 4 or 8 | NVLink 3.0 + NVSwitch | 80GB HBM2e | N/A (2.5 PFLOPS dense FP16) | 6.5 kW |
| HGX H100 | H100 80GB | 4 or 8 | NVLink 4.0 + NVSwitch 3 | 80GB HBM3 | 32 PFLOPS | 7.0 kW |
| HGX H200 | H200 141GB | 4 or 8 | NVLink 4.0 + NVSwitch 3 | 141GB HBM3e | 32 PFLOPS | 7.0 kW |
| HGX B200 | B200 192GB | 4 or 8 | NVLink 5.0 + NVSwitch 4 | 192GB HBM3e | 72 PFLOPS | 8.5 kW |
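The per-GPU memory figures above translate directly into system-level capacity, since NVLink lets all GPUs in the baseboard pool their HBM for a single model. A quick sketch using the table's numbers:

```python
# Per-GPU HBM capacity in GB, by HGX generation (from the table above)
HBM_PER_GPU = {"A100": 80, "H100": 80, "H200": 141, "B200": 192}

def aggregate_hbm(gpu: str, count: int = 8) -> int:
    """Total HBM (GB) addressable across an HGX baseboard of `count` GPUs."""
    return HBM_PER_GPU[gpu] * count

for gpu in HBM_PER_GPU:
    print(f"8x {gpu}: {aggregate_hbm(gpu)} GB")
# 8x A100: 640 GB, 8x H100: 640 GB, 8x H200: 1128 GB, 8x B200: 1536 GB
```

The jump from 640GB (H100) to 1128GB (H200) at the same interconnect generation is why H200 is favored for memory-bound inference of very large models.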
HGX Server Ecosystem
The HGX platform is manufactured through NVIDIA's OEM partner program, with major server vendors offering HGX-based AI servers:
Supermicro: The most popular HGX partner for enterprise deployments. Supermicro offers the 8U HGX H100 server (8x H100, 2x Intel Xeon/AMD EPYC, 2TB RAM, 8x 400Gb NICs), the 4U HGX H100 server (4x H100, single CPU, 1TB RAM, 4x 200Gb NICs), and the 10U HGX B200 server. Supermicro is widely regarded for strong price-performance and short lead times on HGX servers.
Dell PowerEdge: The XE9680 is Dell's 6U HGX H100 server, offering 8x H100 GPUs with integrated OpenManage enterprise management. Dell provides superior service and support contracts for enterprise customers, including ProSupport Plus with 4-hour mission-critical response.
HPE: The HPE Cray XD670 is HPE's 8U HGX H100 server for AI training, featuring 8x H100 GPUs and HPE's integrated Slingshot interconnect for multi-node clustering. HPE targets government and research customers through their HPC division.
Lenovo: The ThinkSystem SR675 V3 is Lenovo's 8U HGX H100 server, featuring Neptune liquid cooling integration and Lenovo's ThinkShield security features. Lenovo's HGX servers are popular in university and research deployments.
NTS Integration: NTS provides specialized HGX server configurations with enhanced features for government and enterprise deployments, including FISMA-compliant security configurations, custom liquid cooling integration, MIL-STD-810 ruggedization, and TAA-compliant supply chain documentation.
Selecting the Right HGX Configuration
4-GPU HGX: Best for AI inference, model fine-tuning, and small-scale training. Provides cost-effective AI computing for organizations deploying models up to 70B parameters. Recommended GPU: H100 for performance, H200 for larger model capacity.
8-GPU HGX: The standard for production AI training. Supports models up to 405B parameters with tensor parallelism. NVLink Switch provides full-bandwidth connectivity between all 8 GPUs. Recommended for any organization with serious AI training requirements.
HGX + NVLink Switch System: Extends NVLink connectivity beyond a single baseboard to domains of up to 256 GPUs for ultra-scale AI training. Required for pre-training large foundation models (70B+ parameters), where total training compute rather than memory capacity is the constraint. Considered an infrastructure investment for AI leaders, not entrants.
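A back-of-envelope way to match model size to configuration is to check whether the model's weights fit in the system's aggregate HBM. A minimal sketch assuming 2 bytes per parameter (BF16/FP16 weights only; training requires several times more memory for optimizer state, gradients, and activations, so treat this as a floor for inference sizing, not a training plan):

```python
def weights_fit(params_b: float, gpus: int, hbm_gb: int,
                bytes_per_param: int = 2) -> bool:
    """True if BF16/FP16 weights alone fit in the pooled HBM.

    params_b: model size in billions of parameters.
    Ignores optimizer state, activations, and KV cache.
    """
    weight_gb = params_b * bytes_per_param  # 1B params * 2 B = 2 GB
    return weight_gb <= gpus * hbm_gb

print(weights_fit(70, 4, 141))   # 140 GB vs 564 GB  -> True  (4x H200)
print(weights_fit(405, 8, 80))   # 810 GB vs 640 GB  -> False (8x H100)
print(weights_fit(405, 8, 141))  # 810 GB vs 1128 GB -> True  (8x H200)
```

This is why a 405B-parameter model at 16-bit precision needs 8x H200 (or quantization to FP8) rather than 8x H100.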
Related Content
Explore more about this topic:
- NVIDIA H200 NVL Deep Dive
- NVIDIA B200 vs H100: Architecture Comparison
- How Tensor Cores Accelerate Deep Learning
Can HGX GPUs be upgraded independently?
No. HGX baseboards are designed for specific GPU generations. Upgrading from HGX H100 to HGX B200 requires replacing the entire baseboard. Server OEMs typically offer forklift upgrades requiring new chassis, power supplies, and cooling systems.
What is the difference between HGX and DGX?
DGX is NVIDIA's first-party integrated AI system, combining HGX hardware with NVIDIA-engineered software and support, while HGX is a platform licensed to OEMs. DGX systems add InfiniBand switch integration, Base Command management software, and the NVIDIA AI Enterprise software suite, and typically carry a 30-50% premium over equivalent HGX OEM systems.
Does HGX support AMD CPUs?
Yes. HGX A100/H100/H200 platforms support 4th and 5th Gen AMD EPYC processors (9004/9005 series) in addition to Intel Xeon Scalable (4th/5th Gen). AMD EPYC configurations offer more PCIe Gen5 lanes per socket (128 vs 80 on 4th Gen Xeon) and more memory channels (12 vs 8 DDR5), yielding higher bandwidth for CPU-GPU data transfer.
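The lane-count difference can be put in bandwidth terms. PCIe Gen5 signals at 32 GT/s per lane with 128b/130b encoding, giving roughly 3.9 GB/s of raw bandwidth per lane per direction. A rough sketch (ignoring protocol overhead, so real throughput is lower):

```python
def pcie5_bandwidth_gbps(lanes: int) -> float:
    """Approximate PCIe Gen5 raw bandwidth in GB/s, per direction.

    32 GT/s per lane, 128b/130b encoding efficiency, 8 bits/byte.
    Achievable throughput is lower due to TLP/protocol overhead.
    """
    return lanes * 32 * (128 / 130) / 8

# One x16 link (a single GPU's host connection):
print(round(pcie5_bandwidth_gbps(16), 1))  # ~63.0 GB/s

# All host lanes: one EPYC socket (128) vs one 4th-Gen Xeon socket (80)
print(round(pcie5_bandwidth_gbps(128)), round(pcie5_bandwidth_gbps(80)))
# ~504 vs ~315 GB/s of total host I/O headroom
```

More total lanes matters in HGX servers because GPUs, 400Gb NICs, and NVMe storage all compete for the same host PCIe budget.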