NVIDIA Rubin R100 NVL72: What Makes It 5x Faster
Gani, CTO · 2026-03-15
Rubin Architecture Overview
The NVIDIA Vera Rubin R100 represents a generational leap in GPU compute. Built on TSMC's N3 process node (3nm-class) and packing approximately 208 billion transistors, Rubin delivers roughly 1.4 ExaFLOPS (1,400 PetaFLOPS) of FP4 performance per NVL72 rack. On a per-GPU basis, that is approximately 5x the throughput of the H100 SXM that currently dominates cloud GPU offerings.
The R100 features a completely redesigned streaming multiprocessor (SM) with native support for FP4, FP6, and FP8 data types alongside the established FP16, BF16, and TF32 formats. This expanded precision hierarchy allows AI researchers to use the lowest precision that maintains acceptable model quality for each layer of their neural network, dramatically improving effective throughput without sacrificing convergence.
At a system level, the Rubin platform pairs the R100 GPU with the Vera CPU — NVIDIA's custom ARM-based host processor — eliminating the PCIe bottleneck between CPU and GPU that has been a persistent constraint in previous generations. The Vera CPU connects to the R100 via a coherent chip-to-chip link running at 900 GB/s, enabling unified memory addressing and seamless data movement between host and accelerator.
NVL72 Rack Configuration
The NVL72 is NVIDIA's rack-scale GPU system, containing 72 R100 GPUs and 36 Vera CPUs in a single liquid-cooled rack. Unlike previous DGX systems that were organized as discrete 8-GPU nodes, the NVL72 treats the entire rack as a single unified compute domain, with all 72 GPUs interconnected through a unified NVLink fabric. This eliminates the node-boundary bottleneck that previously forced all-reduce operations to traverse slower InfiniBand links.
Each NVL72 rack delivers approximately 1.4 ExaFLOPS of FP4 tensor performance. That is enough to train a 70B-parameter language model from scratch in under two weeks, a task that would take months on an H100 cluster with the same number of GPUs. The rack consumes approximately 120 kW under full training load, which is substantial but represents a 3-4x improvement in performance-per-watt over the previous generation.
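The two-week figure can be sanity-checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below is a back-of-the-envelope estimate, not a benchmark: the token budget (20 tokens per parameter) and the 40% sustained utilization are illustrative assumptions that do not come from NVIDIA or from this article, and it treats the whole run as if it executed at the FP4 peak rate.

```python
# Back-of-the-envelope training-time estimate for one NVL72 rack.
# Assumptions (not from the article): the 6 * N * D FLOPs rule of thumb,
# 20 training tokens per parameter, 40% sustained utilization of peak FP4.

RACK_PEAK_FLOPS = 1.4e18   # 1.4 ExaFLOPS FP4 per rack (from the article)
GPUS_PER_RACK = 72         # from the article
SECONDS_PER_DAY = 86_400


def training_days(params: float, tokens: float, peak_flops: float, utilization: float) -> float:
    """Estimate wall-clock days using the 6 * params * tokens approximation."""
    total_flops = 6.0 * params * tokens
    sustained_flops = peak_flops * utilization
    return total_flops / sustained_flops / SECONDS_PER_DAY


params = 70e9
tokens = 20 * params       # assumed token budget: 1.4 trillion tokens
days = training_days(params, tokens, RACK_PEAK_FLOPS, utilization=0.40)
per_gpu_pflops = RACK_PEAK_FLOPS / GPUS_PER_RACK / 1e15

print(f"Peak FP4 per GPU: {per_gpu_pflops:.1f} PFLOPS")
print(f"Estimated training time: {days:.1f} days")
```

Under these assumptions the estimate comes out at roughly 12 days, which is consistent with the under-two-weeks claim. A larger token budget or lower utilization scales the result proportionally.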
The physical form factor is a standard 52U rack with rear-door liquid cooling connections. The rack requires facility water at 30-45 °C inlet temperature with a flow rate of approximately 60 liters per minute. Power is delivered via six 20 kW redundant power shelves with 2N redundancy, ensuring that the loss of any single power feed does not interrupt GPU operations.
NVLink 6.0 Fabric and Interconnect
NVLink 6.0, the interconnect fabric inside the NVL72, operates at 1800 GB/s per GPU — double the bandwidth of NVLink 4.0 in the H100 generation. More importantly, the NVL72's switch architecture provides full bisection bandwidth across all 72 GPUs, meaning any GPU can communicate with any other GPU at the full link rate without contention.
This is achieved through a two-tier switching topology: each group of 9 GPUs connects to a local NVSwitch, and the NVSwitches are interconnected through a second tier of spine switches. The result is a non-blocking fat-tree topology within the rack, with an aggregate fabric bandwidth of 3.6 TB/s per direction across the bisection. For training workloads that require frequent all-reduce operations across hundreds of GPUs, this fabric bandwidth is the single most important performance differentiator.
For multi-rack scaling beyond 72 GPUs, NVIDIA provides ConnectX-8 InfiniBand adapters running at XDR rates (800 Gb/s per port). Each NVL72 rack has 8 uplink ports, providing 6.4 Tb/s (800 GB/s) of inter-rack bandwidth. This is sufficient for efficient distributed training across 8-16 racks (576-1152 GPUs) for most model architectures, with scaling efficiency above 90% for well-optimized workloads.
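To make the gap between intra-rack and inter-rack bandwidth concrete, the sketch below converts the uplink figures from gigabits to gigabytes and applies the standard bandwidth-only cost model for a ring all-reduce. It is an idealized estimate under stated assumptions: it ignores latency, treats the quoted 1800 GB/s as usable one-way bandwidth (optimistic if that figure is bidirectional), models each rack as a single participant on its uplinks, and uses a hypothetical 140 GB gradient buffer.

```python
# Rough all-reduce timing inside one rack versus across racks.
# Assumptions (not from the article): bandwidth-optimal ring all-reduce,
# no latency term, full link rate achieved, a 140 GB gradient buffer.

NVLINK_GB_PER_S = 1800.0       # per GPU inside the rack (from the article)
UPLINK_PORTS = 8               # per rack (from the article)
UPLINK_GBIT_PER_PORT = 800.0   # Gb/s per port (from the article)


def ring_allreduce_seconds(size_gb: float, participants: int, link_gb_per_s: float) -> float:
    """Each participant moves 2 * (p - 1) / p of the buffer over its own link."""
    traffic_gb = 2.0 * (participants - 1) / participants * size_gb
    return traffic_gb / link_gb_per_s


# Gigabits to gigabytes: 8 ports * 800 Gb/s / 8 bits per byte.
inter_rack_gb_per_s = UPLINK_PORTS * UPLINK_GBIT_PER_PORT / 8.0

grad_gb = 140.0  # hypothetical: 70B parameters * 2 bytes (BF16 gradients)
intra_rack_s = ring_allreduce_seconds(grad_gb, 72, NVLINK_GB_PER_S)
inter_rack_s = ring_allreduce_seconds(grad_gb, 8, inter_rack_gb_per_s)

print(f"Inter-rack bandwidth: {inter_rack_gb_per_s:.0f} GB/s per rack")
print(f"All-reduce across 72 GPUs in one rack: {intra_rack_s * 1000:.0f} ms")
print(f"All-reduce across 8 racks: {inter_rack_s * 1000:.0f} ms")
```

Even in this idealized model, the cross-rack step takes about twice as long as the in-rack step, because a whole rack shares 800 GB/s of uplink while each GPU inside the rack has its own 1800 GB/s link.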
HBM4 Memory: Capacity and Bandwidth
Each R100 GPU is equipped with 288 GB of HBM4 memory, a 3.6x increase over the H100's 80 GB of HBM3. The aggregate memory across an NVL72 rack is 20.7 TB, enough to hold the weights of a dense 1.5-trillion-parameter model entirely in GPU memory without any CPU offloading. The model is still partitioned across the 72 GPUs, but every shard sits inside a single NVLink domain. This is transformative for large model training, where the complexity of memory management has been one of the primary engineering challenges.
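The capacity figures can be reproduced with simple arithmetic. The sketch below recomputes the rack total and compares it with the footprint of a 1.5-trillion-parameter model. The bytes-per-parameter values are common rules of thumb rather than figures from this article; in particular, 16 bytes per parameter is an assumed budget for mixed-precision Adam training state (16-bit weights and gradients plus 32-bit master weights and two optimizer moments).

```python
# Aggregate HBM4 capacity per rack and a rough model-footprint check.
# Bytes-per-parameter values are rules of thumb, not article figures.

GPUS_PER_RACK = 72       # from the article
HBM_GB_PER_GPU = 288     # from the article
H100_HBM_GB = 80         # from the article

rack_hbm_tb = GPUS_PER_RACK * HBM_GB_PER_GPU / 1000   # decimal terabytes
capacity_ratio = HBM_GB_PER_GPU / H100_HBM_GB


def footprint_tb(params: float, bytes_per_param: float) -> float:
    """Memory needed for `params` parameters at a given per-parameter cost."""
    return params * bytes_per_param / 1e12


params = 1.5e12
weights_bf16_tb = footprint_tb(params, 2)      # weights only, BF16
training_state_tb = footprint_tb(params, 16)   # assumed mixed-precision Adam state

print(f"Rack HBM: {rack_hbm_tb:.1f} TB ({capacity_ratio:.1f}x an H100 per GPU)")
print(f"1.5T weights in BF16: {weights_bf16_tb:.1f} TB")
print(f"1.5T full training state (assumed 16 B/param): {training_state_tb:.1f} TB")
```

Under these assumptions the BF16 weights (3 TB) fit many times over, while a full 16-bytes-per-parameter training state (24 TB) would slightly exceed the rack's 20.7 TB, so the headline claim holds most comfortably for weights or for reduced-precision optimizer state.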
HBM4 also delivers 8 TB/s of memory bandwidth per GPU, up from 3.35 TB/s on the H100. This bandwidth increase is critical for inference workloads, where the autoregressive decoding process is fundamentally memory-bandwidth-limited. With 8 TB/s per GPU, about 2.4x the H100's bandwidth, the R100 can serve approximately 2.4-3x as many inference tokens per second as the H100 for large language models.
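The link between memory bandwidth and decode speed can be illustrated with a simple upper bound: if each generated token requires streaming every weight from HBM once, tokens per second cannot exceed bandwidth divided by model size in bytes. The sketch below applies that bound to a 70B-parameter model stored in FP8. It is an idealized model that assumes batch size 1 and ignores KV-cache reads and compute time, so real throughput will be lower.

```python
# Upper bound on single-stream decode speed in the memory-bound regime,
# where every generated token streams all model weights from HBM once.
# Assumptions (not from the article): batch size 1, weight traffic only.

R100_TB_PER_S = 8.0    # per GPU (from the article)
H100_TB_PER_S = 3.35   # per GPU (from the article)


def max_tokens_per_second(bandwidth_tb_per_s: float, params: float, bytes_per_param: float) -> float:
    """Bandwidth divided by the bytes that must be read per token."""
    model_bytes = params * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / model_bytes


bandwidth_ratio = R100_TB_PER_S / H100_TB_PER_S
r100_bound = max_tokens_per_second(R100_TB_PER_S, 70e9, 1.0)   # 70B model, FP8 weights
h100_bound = max_tokens_per_second(H100_TB_PER_S, 70e9, 1.0)

print(f"Bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"R100 bound: {r100_bound:.0f} tokens/s, H100 bound: {h100_bound:.0f} tokens/s")
```

With identical weight precision on both GPUs, the bound improves by exactly the bandwidth ratio, about 2.4x. Gains beyond that have to come from elsewhere, for example smaller weight formats or larger batches.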
The combination of massive capacity and high bandwidth also enables new training techniques such as in-memory checkpointing, where the optimizer state and model parameters are periodically snapshotted to a reserved region of HBM rather than being flushed to SSDs. This reduces checkpoint overhead from minutes to milliseconds, improving overall training throughput by 5-10% for long-running jobs.
FP4 and FP8 Performance
The R100's tensor cores natively support FP4 (4-bit floating point) computation, a capability not available on the H100 generation. FP4 tensor operations deliver 2x the throughput of FP8 at a given clock rate, which is how the NVL72 rack reaches its peak 1.4 ExaFLOPS rating. In practice, FP4 training is most effective for the forward pass and gradient computation of attention layers, while weight updates and certain normalization operations still benefit from FP8 or higher precision.
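To show what 4-bit floating point means in practice, the sketch below snaps a handful of values onto an FP4 grid. It is a minimal illustration, not NVIDIA's implementation: it assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) with the value set used by the OCP Microscaling formats, a single max-abs scale per tensor, and round-to-nearest. Real hardware and the Transformer Engine may use block-wise scales and different rounding.

```python
# Minimal FP4 (E2M1) fake-quantization with one scale per tensor.
# Assumptions: E2M1 magnitude grid, max-abs scaling, round-to-nearest.
# This is an illustration only, not how Rubin hardware is implemented.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(values: list[float]) -> tuple[list[float], float]:
    """Return (dequantized values, scale) after snapping to the FP4 grid."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude onto the largest representable value (6.0).
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        dequantized.append(nearest * scale if v >= 0 else -nearest * scale)
    return dequantized, scale


weights = [0.02, -0.11, 0.47, -0.90, 1.20]
dequantized, scale = quantize_fp4(weights)

for original, approx in zip(weights, dequantized):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Only eight magnitudes are available, so small values collapse to zero and mid-range values move noticeably. That coarse grid is why precision-sensitive steps such as weight updates are usually kept at FP8 or higher.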
NVIDIA's Transformer Engine has been updated for Rubin to include automatic per-tensor precision selection, dynamically choosing between FP4, FP6, FP8, and BF16 on a layer-by-layer basis to maximize throughput while maintaining training convergence. Early benchmarks from NVIDIA show that FP4-aware training of GPT-class models achieves equivalent validation loss to BF16 training while requiring 60-70% fewer GPU-hours.
For inference, FP4 quantization of production models is particularly compelling. A 70B-parameter model quantized to FP4 requires only 35 GB of memory — easily fitting on a single R100 GPU with room to spare for KV-cache and batch processing. This enables single-GPU inference for models that previously required multi-GPU setups, dramatically reducing the cost per query for production LLM deployments.
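The 35 GB figure follows from 4 bits (half a byte) per parameter, and the remaining HBM can be expressed as a KV-cache token budget. The sketch below works through both numbers. The model layout is an assumption chosen for illustration (a Llama-2-70B-like configuration with 80 layers, 8 KV heads of dimension 128, and a 16-bit KV cache); it is not taken from this article, and the budget ignores activations and runtime workspace.

```python
# Weight footprint at FP4 plus a rough KV-cache budget on one R100.
# Assumptions (not from the article): 80 layers, 8 KV heads, head
# dimension 128, 16-bit KV cache, no activation or workspace memory.

HBM_GB = 288  # per R100 GPU (from the article)


def weights_gb(params: float, bits: int) -> float:
    """Weight memory in decimal gigabytes at the given bit width."""
    return params * bits / 8 / 1e9


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    # Factor 2 covers the key tensor and the value tensor in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


fp4_gb = weights_gb(70e9, 4)
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
kv_token_budget = int((HBM_GB - fp4_gb) * 1e9 // per_token)

print(f"FP4 weights: {fp4_gb:.0f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens of KV cache that fit in the remaining HBM: {kv_token_budget:,}")
```

Under these assumptions the remaining 253 GB holds roughly 770,000 cached tokens, shared across all concurrent requests and their context windows.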
Liquid Cooling Requirements
At 120 kW per rack, the NVL72 is impossible to cool with traditional air-based methods. NVIDIA specifies direct liquid cooling (DLC) as the only supported thermal solution, with cold plates on every GPU and CPU connected to a facility cooling loop. The cooling system must deliver water at 30-45 °C inlet temperature and maintain a maximum junction temperature of 83 °C under sustained full-load operation.
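The relationship between rack power, coolant flow, and temperature rise follows from the heat balance Q = m_dot * c_p * delta_T. The sketch below applies it to the figures quoted in this article. It is a rough estimate that assumes plain water (about 1 kg per liter, specific heat about 4186 J/(kg*K)) and that all 120 kW is captured by the liquid loop; actual coolant mixtures and heat capture fractions will differ.

```python
# Coolant temperature rise from the heat balance Q = m_dot * c_p * delta_T.
# Assumptions (not from the article): plain water properties and
# 100% of rack heat removed by the liquid loop.

RACK_POWER_W = 120_000    # from the article
FLOW_L_PER_MIN = 60.0     # from the article
WATER_CP = 4186.0         # J/(kg*K), textbook value for water
WATER_KG_PER_L = 1.0      # approximate density of water


def coolant_delta_t(power_w: float, flow_l_per_min: float) -> float:
    """Temperature rise in kelvin across the rack at the given flow rate."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_KG_PER_L / 60.0
    return power_w / (mass_flow_kg_per_s * WATER_CP)


delta_t = coolant_delta_t(RACK_POWER_W, FLOW_L_PER_MIN)
outlet_low_c = 30 + delta_t
outlet_high_c = 45 + delta_t

print(f"Temperature rise: {delta_t:.1f} K")
print(f"Outlet range for 30-45 C inlet: {outlet_low_c:.1f}-{outlet_high_c:.1f} C")
```

At 60 liters per minute the water warms by roughly 29 K, so the outlet temperature depends strongly on the inlet temperature. Doubling the flow rate would halve the rise.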
Qube Compute's data center uses a two-stage cooling architecture designed specifically for NVL72 deployment. The primary loop consists of coolant distribution units (CDUs) on each row that exchange heat between the server-side coolant loop and the facility water loop. The secondary loop uses absorption-based heat management (ABHM) chillers powered by waste heat from the facility's gas generators, achieving a cooling PUE contribution of only 0.05 — dramatically more efficient than traditional mechanical chillers.
This cooling architecture enables a facility-level PUE of 1.10, meaning cooling and infrastructure overhead adds only 10% on top of the IT load (about 9% of total facility power). By comparison, typical hyperscaler data centers achieve PUE of 1.20-1.30 with air cooling, and even the most efficient liquid-cooled facilities rarely achieve better than 1.15. The 5-10% PUE advantage translates directly into lower operating costs per GPU-hour for Qube Compute customers.
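Because PUE is defined as total facility power divided by IT power, the overhead share and the savings against other facilities can be computed directly. The sketch below does this for the PUE values quoted in this article; the 120 kW rack load is also taken from the article, and the helper names are illustrative.

```python
# PUE = total facility power / IT power.
# All PUE values below are the ones quoted in the article.

QUBE_PUE = 1.10
RACK_IT_KW = 120.0


def overhead_share_of_total(pue: float) -> float:
    """Fraction of total facility power that is not IT load."""
    return 1.0 - 1.0 / pue


def total_power_saving(pue_new: float, pue_old: float) -> float:
    """Relative reduction in total power for the same IT load."""
    return 1.0 - pue_new / pue_old


facility_kw_per_rack = RACK_IT_KW * QUBE_PUE
overhead = overhead_share_of_total(QUBE_PUE)

print(f"Facility power per rack: {facility_kw_per_rack:.0f} kW")
print(f"Overhead share of total power: {overhead:.1%}")
for other in (1.15, 1.20, 1.30):
    print(f"Saving versus PUE {other:.2f}: {total_power_saving(QUBE_PUE, other):.1%}")
```

For the same IT load, a PUE of 1.10 uses about 4% less total power than a 1.15 facility, about 8% less than 1.20, and about 15% less than 1.30.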
Upgrade Path to Rubin Ultra
NVIDIA has announced Rubin Ultra as the follow-on to the standard R100, expected to ship in 2027. Rubin Ultra will use a multi-chip module (MCM) design, combining two R100 dies on a single package with a high-bandwidth die-to-die interconnect. This effectively doubles the compute and memory per socket — delivering approximately 576 GB of HBM4e memory and 2x the tensor throughput of a single R100.
Critically, Rubin Ultra is designed to be socket-compatible with the NVL72 rack infrastructure. This means that Qube Compute's data center, power delivery, and cooling systems are already designed to support the Rubin Ultra upgrade without facility modifications. The 120 kW per rack power budget accommodates the Ultra's expected power envelope, and the liquid cooling system's thermal capacity has been sized with headroom for the increased heat density.
For customers, this provides a clear and cost-effective upgrade path. A training workload that initially runs on an NVL72 rack with standard R100 GPUs can be migrated to Rubin Ultra GPUs in the same rack, doubling throughput without any changes to the orchestration software, networking, or storage infrastructure. This forward compatibility protects long-term investment and eliminates the facility-level disruption that typically accompanies GPU generational transitions.
Ready to Scale Your AI?
Limited Phase 1 capacity: 8 racks available, with GPU access from July 2027. Reserve now to lock in anchor pricing.