Xilinx Wants to Flip the CPU-Accelerator Relationship Upside Down
The new Versal HBM accelerator is a kind of self-adapting micro-supercomputer for servers and server clusters.
Many of the players in the fast-growing server accelerator platform space have been advocating an emerging framework that some academics have dubbed Data Center Resource Disaggregation (DRD). You’re familiar with the idea of moving software workloads off of CPUs and onto subsidiary processors. That’s not new, dating back at least to Intel’s 80387 coprocessor. DRD would build on the groundwork of GPUs and SmartNICs, creating whole new classes of chips devoted to doing “dirty work” — and in so doing constructing whole new verticals for harvesting by the processor industry.
And then there’s Xilinx, which on Wednesday premiered what could throw a giant monkey wrench into the DRD mechanism. Versal HBM is a kind of self-adapting micro-supercomputer for servers and server clusters, cramming onto a single card the same number of processing cores as 14 FPGAs and the equivalent memory of 32 DDR5-6400 modules.
If Xilinx has its way, Versal HBM will turn the tables for servers, making the company’s ACAP (Adaptive Compute Acceleration Platform) into the workload processor and demoting the CPU to a kind of perimeter security guard.
“Silicon technology has passed this inflection point where we can’t just rely on going to the next silicon node (like 16nm to 7nm) in order to get the performance boost and the power savings that end applications really need,” Xilinx’s Versal senior product line manager Mike Thompson told DCK. “On top of that, there’s really a drive in many, many markets for an individual platform or device to be able to do more.”
It could be perceived as the opposite of disaggregation, and Thompson makes a compelling case. In edge computing environments, where engineers are still figuring out the right frameworks for micro data centers, space comes at a premium. In some environments, where you might have only a few square feet of space and the equipment is literally hanging on a wall by hooks, turning what was one box into four or five boxes might not be an option. (A new form of “converged” infrastructure that situates those four or five boxes onto one substrate for good measure seems silly.)
It’s an interesting point of view, coming from a company that may yet become a division of processor maker AMD, should the merger deal the two companies announced last October be approved by European regulators. While Intel moves forward with its plans to make Xeon processors the center of a disaggregated architecture, where SmartNICs become a more empowered class of processor that Intel calls the IPU (infrastructure processing unit), this Xilinx move could take AMD in a different direction.
Scaling In
The theme of this different direction could be tighter integration. Conceivably, an enterprise-class server framework could build on what AMD has already established with its Instinct GPUs, designed to be tightly integrated with Epyc server processors.
What Xilinx’s Versal HBM (High Bandwidth Memory) would bring to such a platform is a new class of dedicated, high-speed memory that would not need to be disaggregated — as with, for example, the emerging CXL model championed by Rambus — to be useful in a high-performance setting.
“Especially on the high-performance computing side, and for AI/ML, memory is the main bottleneck,” remarked Thompson. “That’s the main thing we’re looking to alleviate here: the memory and access bottlenecks. We’re trying to solve big data, big bandwidth problems.”
The setting that might be most welcoming of Versal HBM’s entry into the market could be micro data centers for 5G wireless and content delivery networks. These are the highest bandwidth consumers on the planet, and edge computing is demanding that these processing tasks be pushed out closer to customers and data consumers.
Thompson’s contention is that a disaggregated memory architecture such as Compute Express Link could impose either a latency penalty or a power penalty, depending on how it’s set up. Increase the I/O bandwidth between separate components, he argued, and you boost the amount of power being consumed. On the other hand, stack memory chips together in a 3D arrangement, as Micron Technology tried to do with its Hybrid Memory Cube (HMC), and you pay a significant cost in latency.
(There have been arguments for over five years now about whether HBM or HMC incurs higher latency penalties. Thompson’s argument is backed up by some, though not all, engineers in this space who argue that while HMC’s heavy buffering does enable a communications-like packet-based interface to the CPU, utilizing that interface incurs latency penalties.)
“I think HBM is certainly a long-term solution,” Thompson told us. A theoretical CXL disaggregated memory box could hold many terabytes. However, if in-memory databases engineered for such a box utilized a serialized interface, then as database size grew, the box would provide fewer and fewer advantages over today’s solid-state storage.
“Generally, integrating memory within a package is always going to be faster and lower power than having memory outside of package,” he said.
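Thompson’s power argument can be roughed out with one line of arithmetic: the power an interface draws is the number of bits it moves per second multiplied by the energy spent moving each bit. The short Python sketch below illustrates only that relationship. The function name and both energy-per-bit figures are hypothetical placeholders chosen for illustration; they are not Xilinx, HBM, or CXL specifications.

```python
def io_power_watts(bandwidth_gb_per_s: float, energy_pj_per_bit: float) -> float:
    """Estimate the power drawn by a memory interface at a sustained data rate.

    power (W) = bits per second * joules per bit
    """
    bits_per_second = bandwidth_gb_per_s * 1e9 * 8      # GB/s -> bits/s
    joules_per_bit = energy_pj_per_bit * 1e-12          # pJ -> J
    return bits_per_second * joules_per_bit


# ASSUMPTION: illustrative energy costs only, not measured or vendor figures.
IN_PACKAGE_PJ_PER_BIT = 4.0     # hypothetical short, in-package link
OFF_PACKAGE_PJ_PER_BIT = 15.0   # hypothetical longer, off-package link

if __name__ == "__main__":
    for bandwidth in (100, 400, 800):  # GB/s, arbitrary sample points
        inside = io_power_watts(bandwidth, IN_PACKAGE_PJ_PER_BIT)
        outside = io_power_watts(bandwidth, OFF_PACKAGE_PJ_PER_BIT)
        print(f"{bandwidth} GB/s: in-package {inside:.1f} W, off-package {outside:.1f} W")
```

Whatever the real per-bit costs turn out to be, interface power in this model grows in direct proportion to bandwidth, so any per-bit penalty for leaving the package is multiplied by every additional gigabyte per second.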
“The White-Box Dream”
On the Versal HBM device, Arm’s Cortex-A72 will handle conventional scalar processing, while a second Arm processor, the Cortex-R5F, will be assigned real-time processing tasks. That’s all ordinary stuff. But these processors will be connected via network-on-chip to what Xilinx describes as “adaptable hardware,” which for now is covered up by the type of curtain typically reserved for wizards.
Is Xilinx seriously talking about a scale of heavy-duty workload processor that could end up loosening the requirements for CPU throughput, unwinding the tension, and conceivably extending the CPU’s lifecycle? That’s the kind of impact that could affect Xilinx’s potential future corporate parent, and not necessarily in a good way.
“I certainly believe so,” responded Thompson. For workloads such as recommendation engines for e-commerce sites, he asserted, “it’s possible to eliminate 200 servers from a given workload... Maybe it’s possible to do all this orchestration instead of having some high-end Intel CPU, some more low-cost Arm CPU. Maybe it’s something similar to what’s running in our mobile phones today. Maybe that’s a little bit of an extreme example, but yes, I believe that it’s absolutely possible to turn down the performance requirements on the host CPU with what’s kind of like the white-box dream of accelerators.”
Xilinx scheduled documentation for Versal HBM software developers to be released today. Developers’ tools will become available sometime during the second half of the year, and sampling is slated to begin in the first half of 2022.