ARPA-E’s Peter de Bock Talks Data Center Cooling Obstacles, Innovations
Data Center Knowledge sat down with Peter de Bock, ARPA-E program director, to discuss thermal management, the government’s COOLERCHIPS initiative, and ways to tackle today’s biggest data center energy challenges.
July 2, 2024
Cooling is one of the biggest challenges for the data center industry today. As AI workloads exponentially increase the demand for compute, the resulting power draw generates significant heat that must be dissipated. How a data center handles that heat becomes the biggest knot to untangle, impacting everything from energy bills and wear and tear on hardware to the physical space used within the facility.
How a facility is cooled has also become a key inflection point for industry advancement, shaping grid and power infrastructure needs, site selection, and power density per rack.
Amid this growing challenge, Data Center Knowledge’s editor-in-chief Wendy Schuchart sat down with Peter de Bock, program director at the US Department of Energy’s Advanced Research Projects Agency-Energy (ARPA-E), to talk about thermal management in data centers, particularly the program’s Cooling Operations Optimized for Leaps in Energy, Reliability and Carbon Hyper-efficiency for Information Processing Systems (COOLERCHIPS) initiative, and ways to tackle today’s biggest data center energy challenges.
The COOLERCHIPS initiative currently has 19 concurrent projects aiming to reduce a typical data center’s cooling energy to less than 5% of its total energy use, far below the industry standard.
The program contains different tracks, including cooling loops, software for monitoring and responding to cooling fluctuations, cooling systems for smaller modular or edge data centers, and the supporting technologies those efforts require.
The following transcript has been lightly edited for clarity and length.
Data Center Knowledge: So, tell me a little bit about ARPA-E and the COOLERCHIPS project.
Dr. Peter De Bock: ARPA-E is the Department of Energy’s Advanced Research Projects Agency-Energy. We focus on moonshot technologies that, if they work, will be transformational for an industry. COOLERCHIPS is a program and portfolio focused on making energy-efficient computing solutions for next-generation high-powered chipsets.
ARPA-E's Peter de Bock (left) and the DOE's Rakesh Radhakrishnan (right) at Data Center World on April 16, 2024.
The focus of our program is really to make the US lead in critical areas that we feel are important for the total energy landscape. Data centers are one of those, and developing technologies that make us the most energy-efficient at computing is very important for the DOE. I am enjoying supporting such a large program to make these technologies a reality.
DCK: Could you talk a little bit about your own experience? And what brought you into this role?
De Bock: ARPA-E as an agency recruits expert leaders from different industries to come work at ARPA-E. Before ARPA-E, I worked for 18 years at General Electric Research, where I was the principal engineer for thermal sciences and a platform leader for power thermal management systems. In that capacity, I worked first on electronic systems, a lot of them related to aerospace. I was also in ASME [American Society of Mechanical Engineers], as the chair of the Heat Transfer Committee in electronic equipment. With that, I learned about a wide variety of approaches, and ARPA-E invited me to work for them for a term, to explore what kind of energy efficiency programs I could launch within the ARPA-E Agency umbrella.
DCK: What is the competition like for organizations hoping to benefit from the projects that you have run so far?
De Bock: As a program director, I look at an entire sector, such as the data center sector, and ask, hey, could other mechanisms do this with more energy efficiency? We look at that sector in a due-diligence kind of way, starting from the laws of physics: what is the maximum entitlement for doing an operation like running large computing systems in the most energy-efficient way?
We look at where we are today. We then identify what the gaps are to be bridged, to make that new reality a possibility. In the data center space, we felt there was still an opportunity to create much more efficiency, but it would require some significant transformational technology development. So, we opened up for technology proposals that could bridge that.
When we launched what we call a funding opportunity announcement, we set specific targets that people needed to hit. We requested a variety of proposals and received them from national labs, small businesses, and large industry. That matters in an ARPA-E program, where it’s not a single entity that can solve such a large problem. It’s really a combination of two or three startups, universities, and large companies coming together and saying, this is outside our normal commercial scope, but if we work together, we can tackle this larger problem in a unique and effective way that our current commercial innovation scope cannot.
We do it together within a larger scope and solve the problem in a holistic way. We then select the best proposals. I wish I could support all of them. We received many proposals in this space, and we selected the very best of the best to go and work on this challenge. With that, we set a target that’s very, very hard.
Often, there are multiple ways that people can try to achieve that. In the cooling space, many different cooling methods are being explored by different teams. And each of them has their own challenges and their own advantages. So, although we call it somewhat of a competition, it’s really a program about learning about and funding diverse methods. In a high-risk, high-reward scenario, we’re looking at technologies that are so high risk that they cannot be funded by the current industry, because they’re just really far out there, and thinking, if they would work, the reward would be very high.
That means by funding diverse approaches, we have many different cooling methods. We only need a small percentage, let’s say 20%, 10%, or 5% of those, to succeed, because the ones that do will move the entire data center industry to a more energy-efficient space. So, although you call it a competition, it’s really, to me, a community that develops around testing some really high-risk, high-reward technologies. And as we go along, as a program director, I actively manage these projects in such a way that if we see a technology that’s struggling to meet the final targets, we say halfway through, well, thank you, we learned a lot; maybe it’s better that we stop this particular effort, because it’s not on track to meet the program target, and we focus our attention on the ones that do.
So, there is a mechanism within ARPA-E’s programs so that we can concentrate our focus on the most impactful projects, and I look forward to seeing that mechanism evolve as the program goes through its time.
DCK: Are you incentivized in some ways to take chances on leftfield ideas that just might work, by the fact that the industry itself does not necessarily reward things that are risky?
De Bock: As you said, exactly. In addition, commercial businesses sometimes have a very limited scope of what they have under their control. Somebody who makes heat sinks might only think about how to make a better heat sink; somebody who makes a cooling distribution unit, or CDU, or a facility cooling system might see that as their scope. ARPA-E programs like COOLERCHIPS allow all those players to work together. But what if we all work together and reimagine working from the chip surface all the way to ambient, or from chip to facility, and we work on a combined solution at that larger scope? What can we achieve? There are two elements to this.
The first is that it’s so high risk, high reward that sometimes it cannot be funded within their own organizations, because it’s just too far out there. The second is the teaming arrangement that can be made, where you can pull in a university as a partner, a national lab as a partner, a large industry player as a partner, and try something very new. Those kinds of inventions are really exciting to see come together in a program like COOLERCHIPS.
DCK: For many years PUE has been the big conversation starter in sustainability and making sure that we’re being efficient. Should people still be using this metric?
De Bock: PUE has helped the industry focus on sustainability, and it’s been great for that. PUE also has its challenges. I think PUE works well when you have a very similar data center with very similar rack density in a similar environment, and you want to compare operational performance from one to the other. As a pure technology metric, it has a few drawbacks. In the definition of PUE, the fans are sometimes included in the denominator. That means the fan power itself is counted as part of the IT load, and you can argue that’s not the right way to look at the problem. In COOLERCHIPS, we’re trying to focus on more of a technology metric that’s agnostic to the particular location and the rack density, as well as to what part of the IT energy is actually used for computing.
So, we have within the program metrics that are a little bit more technology-focused. PUE has great value as an operational metric within the community. But I think other metrics are more focused on purely this technology. And I think those will slowly emerge as these programs develop.
DCK: Can you talk a little bit about what those metrics are?
De Bock: PUE is the total facility energy divided by the IT equipment energy. That’s the definition of Power Usage Effectiveness. For the denominator, IT equipment energy, people sometimes use the power going into the server at the plug, which can include fans mounted on the server. So, one idea is to subtract the fan energy from the IT equipment energy, the denominator of the PUE equation. That already gives me a slightly better feel for what that would be. And sometimes that’s referred to as TUE, Total Usage Effectiveness.
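As a rough illustration of the adjustment de Bock describes, here is a minimal sketch comparing PUE with a fan-adjusted ratio in the spirit of TUE. The energy figures are invented for illustration, not program data.

```python
# Minimal sketch comparing PUE with a fan-adjusted ratio in the spirit of TUE.
# All numbers are hypothetical, chosen only to illustrate the definitions.

facility_energy_kwh = 1_200_000  # total energy delivered to the facility
it_energy_kwh = 1_000_000        # energy measured at the server plugs
server_fan_kwh = 60_000          # portion of "IT" energy spent on server fans

# Classic PUE: total facility energy over IT equipment energy,
# with server fans counted as part of the IT load.
pue = facility_energy_kwh / it_energy_kwh

# Fan-adjusted ratio: subtract the fan energy from the denominator so that
# fan power is treated as cooling overhead rather than useful computing.
fan_adjusted = facility_energy_kwh / (it_energy_kwh - server_fan_kwh)

print(f"PUE:                {pue:.3f}")           # 1.200
print(f"Fan-adjusted ratio: {fan_adjusted:.3f}")  # ~1.277
```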
The second thing we considered in the COOLERCHIPS program is that PUE is sensitive to the environment in which the data center is built, as well as to the rack density. So, if you’re building a data center in a very cold environment, you should take advantage of that cold environment, and it’s easier because your PUE will be lower.
In the COOLERCHIPS program, we fixed the environment so all the teams that are working on that technology are referencing themselves in the same environment. So, it’s an interesting race, where everybody’s within the same boundaries. People have to work in the same rack density, and we’re talking three kilowatts per unit or 126 kilowatts per 42 unit (42U) rack equivalent, and do that in the same environment.
The environment we chose as a reference for the COOLERCHIPS program is challenging. It’s essentially Phoenix, Arizona, in summer – 40 degrees Celsius [104 Fahrenheit] at 60% relative humidity. If you can work in that environment, the target for the program is a ratio of total facility energy divided by IT energy only, without the fans, of 1.05. That means 5% or less of the energy going to the data center is used for cooling. And that will be a really hard target for teams to hit.
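A quick back-of-the-envelope sketch of what that target implies for a single rack at the stated density; the figures come from the interview, and the arithmetic below is only illustrative.

```python
# What the 1.05 target implies for one rack at the stated density.
# Figures are taken from the interview; the arithmetic is only illustrative.

kw_per_unit = 3            # roughly 3 kW per rack unit
units_per_rack = 42        # a standard 42U rack
it_load_kw = kw_per_unit * units_per_rack           # 126 kW of IT load

target_ratio = 1.05        # facility energy / IT energy, fans excluded
cooling_budget_kw = it_load_kw * (target_ratio - 1)

print(f"IT load per rack:        {it_load_kw} kW")             # 126 kW
print(f"Cooling budget per rack: {cooling_budget_kw:.1f} kW")  # ~6.3 kW
```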
What I see so far in the proposals is that the technology is developing. It is technically possible, we’ve evaluated it ourselves, and the teams are on track to hit a target of 126 kilowatts per rack or more in a Phoenix, Arizona, summer environment with less than 5% of their systems’ energy used for cooling. And that is exciting. That will be a true breakthrough in energy usage, and perhaps also in water usage.
COOLERCHIPS test environments are benchmarked against the challenging conditions of Phoenix, Arizona.
DCK: You’ve picked the single worst possible place you could have for running a data center at that kind of scale. How does it work?
De Bock: The reason why it works is very simple. The inside of a computer chip runs at a temperature that’s much higher than Phoenix in summer. I looked up what the hottest point we’ve ever had in the United States is, and it’s in Death Valley, where they once recorded 134 degrees Fahrenheit. Our computer chips are running at temperatures much higher than that – 140, 160, 180 degrees Fahrenheit.
So, if something is hotter than the environment at all times, even in the worst conditions we’ve ever had on our planet, we should be able to move heat from hot to cold in a very efficient way, as long as we can connect the two with a very efficient thermal connection. And that’s what the teams are working on. There are two parts to COOLERCHIPS. The first is making the thermal connection very efficient. This is hard, but the teams will achieve it. The second part they have to work on is very unique. They have to be able to do that with reliability that is similar to the air-cooled systems used in large data centers today. Large data centers use air-cooling because they consider it the most reliable option.
Air doesn’t short-circuit any electronics; it can just be pumped faster and can be refrigerated. So, the teams have the challenge of making these advanced cooling connections, many of them with liquids, and showing, using statistical analysis, that such a system will reach the same reliability as the air-cooled baseline, because the one thing operators do not want to sacrifice is uptime or reliability. They don’t want their data center to fail.
So, using aerospace methods, such as Markov chain analysis and FMEA, or Failure Mode and Effects Analysis, teams have to demonstrate at the 18-month midpoint of the program that their technology is on a path to reach the same reliability levels as air-cooling, but at a performance that is also an order of magnitude better than the best cooling systems today.
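As a rough sketch of the kind of Markov-style availability modeling de Bock references, here is a minimal two-state example. The teams’ actual analyses cover many components and failure modes; the rates below are invented purely for illustration.

```python
# Two-state (up/down) Markov availability sketch in the spirit of the
# reliability analysis described above. Failure and repair rates are
# hypothetical; real analyses model many components and failure modes.

def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state probability of being in the 'up' state."""
    failure_rate = 1.0 / mtbf_hours  # rate of transitions from up to down
    repair_rate = 1.0 / mttr_hours   # rate of transitions from down to up
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical air-cooled baseline vs. a liquid-cooled candidate system.
air_cooled = steady_state_availability(mtbf_hours=50_000, mttr_hours=4)
liquid_candidate = steady_state_availability(mtbf_hours=50_000, mttr_hours=8)

print(f"Air-cooled baseline availability:     {air_cooled:.6f}")
print(f"Liquid-cooled candidate availability: {liquid_candidate:.6f}")
```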
DCK: What would be your prediction for getting to an industry-standard PUE of lower than 1.5?
De Bock: The targets of the program should lead to lower than 1.5 PUE; they should lead to a PUE of around 1.05 with high-power chips. We are targeting the moonshot of the chips of tomorrow, so we’re thinking about three kilowatts per unit, or 126 kilowatts per rack. That is a very high energy density, and we hit our targets with less than 5% of the energy going to cooling.
DCK: What’s the story for the commercialization of these innovations?
De Bock: ARPA-E is modeled after DARPA. DARPA is the Defense Advanced Research Projects Agency, which delivered amazing innovations like the internet and mRNA vaccines, as well as GPS satellites. DARPA has a customer built in: the Defense Department. ARPA-E is very unique because our technologies need to commercialize on their own; they don’t have a customer built in. So, ARPA-E has a very unique branch, called the Tech-to-Market group.
Every single program like COOLERCHIPS has not only a technical program director like me, who focuses on the technical aspect, but also a Tech-to-Market advisor, who works on the economic hypothesis of the program. So, when we develop a game-changing path to a new and more energy-efficient future, it combines a technical hypothesis, developed by the program director, with an economic hypothesis, developed by the Tech-to-Market advisor.
Now, when you’re able to reduce the energy of the data center, let’s say by 30%, because you no longer need the cooling energy you used before, suddenly the economics from an operating point of view become quite attractive. COOLERCHIPS also has the potential to reduce the amount of mechanical refrigeration as well as evaporative cooling that we might need, and that’s another saving the program could bring.
When you look at the program, sometimes we ask whether, if you’re in a power-constrained environment, let’s say Ashburn, Virginia, you wish to use your power for computing or for cooling, and I think most data center operators will easily answer: we want to use the power for computing. So being energy efficient on the cooling side might give you more power budget on the processing side, which is another important consideration as data centers become more and more power constrained.
DCK: Would using less power for cooling have the potential to alleviate some of the concerns that the grids are being overloaded in places like Ashburn?
De Bock: Yes, in some of these environments, the grid is maxed out, so they only have a limited amount of power. So, if you have a 100 MW data center, do you want to use a large percentage of that energy for your cooling system, or do you want to use as much as possible for your computing system? I think it’s very clear what delivers value to the customer. It’s computing, it’s not the cooling itself.
Being able to be more energy efficient should lead to a very interesting commercial hypothesis. At the beginning of the program, I was more involved in the technical guidance. I met with the teams every three months, and we discussed where the program was going technically. I tried to give technical guidance where possible, and we assessed whether the program was technically on track.
The goal for an ARPA-E project is to be commercially investable at the end of the project. When we’re looking at these technologies, sometimes they start on a very basic scale, but they need to, at the end of the project, demonstrate to us a single full rack with this advanced cooling system. A single full rack doesn’t necessarily mean you can sell thousands of these to data centers at the end of the program.
So, we do help them find partnerships, investors, and other mechanisms to scale up. We have a program for this as well. It’s called the SCALEUP program, where teams can apply to us with an advanced business case, once they have completed their first ARPA-E project, to take the technology to much larger volume production or other growth paths that will further accelerate the proliferation of the technology into the industry.
DCK: What do you see as the biggest inflection point for the data center industry in the next 10 years?
De Bock: That’s a very tough question. We’re already seeing that inflection emerging as AI increases the power density per rack. The commonly cited threshold is around 50 kW per rack: above that, air-cooling is limited, and we need to look at advanced cooling systems. With more intense computing – AI is driving some of that – we’re focusing on providing more energy to the data center, and the energy that goes in needs to be cooled.
It will be interesting to see how this will evolve over the next year. If you’ve used AI, you know that it’s quite effective. We’re on the cusp of using it to its full potential. There’s an insatiable appetite for computing. My job is to make the US lead in the most energy-efficient computing using transformational technologies by US teams.