A digital twin of your data center can help identify and reduce wasted IT infrastructure, even before you build the facility. Nvidia uses one to plan and operate its AI supercomputers, because eliminating wasted infrastructure has an enormous economic impact and improves reliability and sustainability.
Dr. Jonathan Koomey, President and Founder at Koomey Analytics, contributed to this article.
From the earliest days of electronic computing, delivery of power and removal of heat forced design and operational choices. Dr. Jonathan Koomey and I have penned a report describing how computer modeling, already widely used in the design of electronic equipment and data centers, can be applied to optimizing the day-to-day operation of data centers, something that is still rarely done. The full paper can be downloaded here.
Digital Twins
With the acquisition of Future Facilities in 2022, Cadence Design Systems embraced the use of digital twins to design and operate data centers. “Digital twins” in this context simulate the characteristics and performance of data centers operating in the real world. Of course, just having an exact physical or digital replica of a complex system isn’t enough. A twin becomes transformational when it can be paired with computer simulations of physical systems and when it is used to drive institutional change. With data centers adding power-hungry GPUs to run AI, the timing of the data center twin couldn’t be better.
A complete digital twin increases in complexity as multiple components act and respond to changing conditions according to the laws of the real world. These twins can simulate and analyze how a machine, or an entire data center, will behave when the components are changed and upgraded, avoiding costly mistakes before the device or data center is even built.
Rationale for digital twins in operations
The fundamental problem in designing a complex data center is that IT loads diverge from the original design and change over time. The modern data center is a complex system that consumes power to enable the modern digital world, but the technology that goes into a data center is constantly changing as new types of processors, networking, and storage are introduced. A data center designed in 2010 was filled with CPUs, spinning disk drives, and low-bandwidth Ethernet. Now, that data center is filled with multi-core CPUs, power-hungry GPUs, and solid-state storage, with new equipment constantly being added. Adding to the complexity is that each data center is unique, so a model specific to each facility must be developed and calibrated.
Lost data center capacity is exactly analogous to what are often called “zombie servers,” which are servers using electricity but doing nothing useful. This time, however, it is part of the data center itself, the cooling and power infrastructure, that is costing money (and lots of it) without enabling any useful computing. When IT loads deviate from the original data center design, stranded power and cooling capacity are the result.
A simple analogy helps explain the problem. Most people are familiar with the game of Tetris, in which blocks fall at a regular pace, and the player’s task is to place those blocks in the right orientation, filling up the space as thoroughly as possible.
In the simplest case, the blocks are of uniform size and shape (i.e., they conform precisely to what data center designers specified initially), and it’s easy for the player to fill the space completely. The example on the left-hand side of the figure below illustrates this case. On the right-hand side, the Tetris player cannot make the shapes fit perfectly because the shapes are random, and they just keep coming. That leaves gaps between the shapes, representing lost capacity in the data center.
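To make the analogy concrete, here is a minimal sketch (not from our report) that packs IT loads into racks using a simple first-fit rule. The 10 kW per-rack budget and the load sizes are hypothetical, chosen only to mirror the Tetris picture: when loads match the original design, the racks fill completely; when load sizes drift, gaps appear, and those gaps are the stranded capacity.

```python
# Illustrative sketch only: first-fit packing of IT loads into racks with a
# fixed power budget, comparing uniform loads with loads that have drifted.
# All numbers are hypothetical and chosen to mirror the Tetris analogy.
import random

RACK_KW = 10.0  # assumed per-rack power budget


def stranded_fraction(loads_kw, rack_kw=RACK_KW):
    """Pack loads first-fit into racks; return the fraction of capacity left unused."""
    racks = []  # remaining headroom per rack
    for load in loads_kw:
        for i, headroom in enumerate(racks):
            if load <= headroom:
                racks[i] -= load
                break
        else:
            racks.append(rack_kw - load)  # no rack had room, open a new one
    total_capacity = len(racks) * rack_kw
    return sum(racks) / total_capacity


random.seed(0)
uniform = [2.0] * 100  # loads exactly match the original design
mixed = [random.choice([1.5, 3.0, 4.0, 6.0]) for _ in range(100)]  # loads drift over time

print(f"Stranded capacity, uniform loads: {stranded_fraction(uniform):.0%}")
print(f"Stranded capacity, mixed loads:   {stranded_fraction(mixed):.0%}")
```

With uniform loads the racks fill exactly and nothing is stranded; with the mixed loads the same packing rule leaves headroom scattered across racks that no single load can use, which is the gap the digital twin is meant to find and recover.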
One challenge is that the rigorous use of a digital twin requires continuous updating to reflect changes in the facility. That means tracking new installations of equipment and then re-calibrating to reflect the new conditions. It’s an ongoing process that is essential to maintaining the accuracy of the predictive model.
A benefit of digital twins is that their proper use induces re-evaluation of incentives, institutional structures, and procedures so that they can more effectively assess the total costs, benefits, and risks of proposed IT deployments inside data centers. The digital twin provides a common language and framework for structured data center decision-making that simply doesn’t exist inside most organizations.
Real-world Applications
One case study of the application of a digital twin to a data center [22] showed that changing the controls strategy, modifying airflows, and raising water temperatures, all guided by the digital twin model, led to energy savings of £380,000 per year and a significant reduction in stranded capacity, for a simple payback time of less than a year.
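For readers who want the arithmetic spelled out, simple payback is just the one-off project cost divided by the annual savings. The cost figure below is a hypothetical placeholder, since the case study reports only that the payback was under a year.

```python
# Simple-payback arithmetic behind the case study claim. The savings figure is
# from the case study; the project cost is an assumed, illustrative value.
annual_savings_gbp = 380_000   # reported energy savings per year
project_cost_gbp = 300_000     # hypothetical one-off cost of the changes

simple_payback_years = project_cost_gbp / annual_savings_gbp
print(f"Simple payback: {simple_payback_years:.2f} years")  # under one year
```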
The table below summarizes details of three companies currently using digital twins. The healthcare and financial companies use the digital twin for both design and operations, while the hyperscale company uses it for initial design and for major redesigns every few years. A common theme across these three firms is that digital twins allowed them to move away from rules of thumb (like preserving a 20% capacity buffer) and toward more accurate physics-based assessments that enabled higher capital utilization.
Three companies using Digital Twins
The rollout of the digital twin for one of the data centers owned by a financial firm illustrates the potential benefits of applying the digital twin to expanding useful capacity in existing data centers. The data center owner rolled out digital twins across more than 400,000 square feet of electrically active floor area in 2020 and early 2021, including one data hall with about 2.9 MW of as-built capacity.
Of that 2.9 MW, about 2.4 MW of IT load was already deployed, implying that about 17% of the “as built” capacity was untapped. Analysis using the digital twin showed that deploying 0.5 MW of IT load would be possible with modest changes in airflow and cooling strategies (the existing deployment relied on rules of thumb, not predictive modeling, so it left some capacity unused).
The analysis also showed that an additional 0.25 MW of IT load could be deployed in the existing hall if the operators made more extensive airflow and cooling changes and added additional power delivery capability. In this case, modestly reducing air flows and adding blanking panels to change where air was supplied increased the available cooling capacity.
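The capacity figures quoted above can be checked with a few lines of arithmetic; every input below comes directly from the numbers in the text.

```python
# Working through the capacity figures for the financial firm's data hall.
built_mw = 2.9           # "as built" design capacity
deployed_mw = 2.4        # IT load already in place
modest_gain_mw = 0.5     # recoverable with modest airflow and cooling changes
extended_gain_mw = 0.25  # further gain with extensive changes plus added power delivery

untapped = (built_mw - deployed_mw) / built_mw
print(f"Untapped as-built capacity: {untapped:.0%}")  # about 17%
print(f"Load after modest changes:  {deployed_mw + modest_gain_mw:.1f} MW")   # 2.9 MW
# The final figure exceeds the 2.9 MW as-built capacity because power delivery was added.
print(f"Load after all changes:     {deployed_mw + modest_gain_mw + extended_gain_mw:.2f} MW")  # 3.15 MW
```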
Conclusions
Digital twins are an example of how computing technology, combined with management changes, can drive continuous and rapid improvements in performance, energy efficiency, and profits. These tools are essential for the next phase of data center industry innovation and should be universally applied by those designing and operating data centers.
The future of digital twins is promising, filled with possibilities for revolutionizing how we design, build, and operate systems. By harnessing the power of virtual systems, we can unlock a new era of efficiency, reliability, innovation, and sustainability.