I awoke Sunday morning to an article in The Information written to instigate fear, uncertainty and doubt amongst Nvidia investors and users. Don’t worry. Nvidia’s got this.
The article circulating this weekend highlighted the thermal challenges some customers face with the dense Blackwell water-cooled racks, the GB200 NVL36 and NVL72. No doubt, many data centers will need to do some homework to implement new data center layouts with water-cooling technologies to take advantage of the new systems. And of course, customers have had a couple years to prepare. This is not a surprise.
If you want the power of Hopper or Blackwell and NVLink, however, you don’t have to attempt a moonshot. And be clear that many cloud companies are already installing these next-generation racks; the efficiency of water cooling is a compelling motivator.
Michael Dell announced today on X that the company’s NVL72 server racks are now shipping: “The 1st in the world @nvidia GB200 NVL72 server racks are now shipping. We are thrilled to deliver our liquid-cooled PowerEdge XE9712 to @CoreWeave. The AI rocket just got a massive boost! 🤖🚀🤝”
When fully loaded, the 72-GPU GB200-based NVL72 rack weighs about 3,000 pounds (~1.5 tons) and consumes some 125,000 watts of power. Compare that to a traditional rack of ~2,000 pounds and 12,000 watts, and one can imagine the challenges customers face. Installing this new generation of hardware requires structurally sound flooring, upgraded power distribution, and new water-cooling systems. But this is the future of cloud computing, and install they will. And they are. It just may take smart engineers a little more work to plan for and implement this next generation of accelerated computing.
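To put those numbers side by side, here is a minimal Python sketch that uses only the approximate figures quoted above. The variable names are illustrative, and the per-GPU figure is simply rack power divided by 72, so it also includes the Grace CPUs, switches, and other components in the rack.

```python
# Approximate figures quoted in this article, not official specifications.
NVL72 = {"weight_lb": 3_000, "power_w": 125_000, "gpus": 72}
TRADITIONAL = {"weight_lb": 2_000, "power_w": 12_000}

# How much heavier and hungrier is the new rack?
weight_ratio = NVL72["weight_lb"] / TRADITIONAL["weight_lb"]
power_ratio = NVL72["power_w"] / TRADITIONAL["power_w"]

# Whole-rack power spread across the 72 GPUs (includes CPUs, switches, etc.).
watts_per_gpu = NVL72["power_w"] / NVL72["gpus"]

print(f"Weight: {weight_ratio:.1f}x a traditional rack")
print(f"Power:  {power_ratio:.1f}x a traditional rack")
print(f"Rack power per GPU: about {watts_per_gpu:.0f} W")
```

The rack is only about 1.5 times heavier but draws more than 10 times the power, which is why cooling and power distribution, rather than floor loading alone, tend to dominate the planning.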
Nvidia’s “GB200 systems are the most advanced computers ever created” and “integrating them into a diverse range of data center environments requires co-engineering with our customers,” an Nvidia spokesperson told The Information. “The engineering iterations are normal and expected.”
New Hardware For Mere Mortals
But if you need a more sedate pace of technology adoption, or just want to access the best AI chips ASAP while you design your next generation of cooling infrastructure, Nvidia has just the solution. At SuperComputing ’24, Nvidia introduced a PCIe-based H200 (Hopper) that can be installed in traditional servers, and a new four-way GB200 (Grace-Blackwell) board that still requires liquid cooling but presents a far easier hurdle to clear.
The H200 NVL (for NVLink) is now generally available, with 50% more HBM, 70% better AI performance, and 30% faster performance for traditional high-performance computing. The H200 NVL is a package of four PCIe cards joined by an NVLink interconnect that is seven times faster than a cluster of PCIe cards. And, as Nvidia says, it fits into today’s servers and data centers. You don’t get the massive scaling of an NVL72, but you can just plug it in and go. And maybe you don’t need that level of compute anyway.
Nvidia also introduced the GB200 NVL4, a water-cooled motherboard designed for server builders that forms a complete single-server solution, with two Arm-based Grace CPUs and four NVLink-connected Blackwell GPUs. This creates a single-node NVLink domain for HPC and AI workloads that is very fast: 80% faster than Hopper for AI, with 2.2 times the performance for HPC. The four-way design presents fewer challenges for data centers, with the potential for lower rack density and weight.
Net-Net
While installing the NVL36 and NVL72 racks may be a daunting task, it’s not rocket science. Michael Dell tweeted this morning that the company’s first NVL72 racks have shipped to CoreWeave. And most cloud service providers have said theirs will be up and running in calendar Q1.
And perhaps you don’t have to go there. Nvidia now offers four platforms, the H200 NVL, the GH200, the GB200, and the new GB200 NVL4, that can handle all but the most demanding HPC and AI workloads. These platforms are available from some 16 server providers around the world, from enterprise partners HPE, Dell, and Lenovo to hyperscale vendors like Supermicro, Gigabyte, and Wistron.