The proliferation and rapid deployment of AI, machine learning, and adjacent technologies over the last few years have spurred a multitude of concerns across various demographics and industries. The ability of AI to automate some jobs traditionally performed by humans has many worried about potential job losses and their economic impact. Training large AI models also requires massive amounts of data, which raises concerns about plagiarism, data privacy, and the potential for misuse of personal information. And the compute resources required for advanced AI systems can be mammoth, leading to environmental concerns about sustainability and the potential impact on power grids worldwide.
All of these concerns are valid, but I think AI will ultimately boost human productivity once we navigate these early waters, and worries over potential data misuse are par for the course with any large digital system. Here's hoping our leadership and legislators are consulting the right people, so that any necessary regulations are put in place to quell those fears and allow the industry to thrive.
AI's impact on energy consumption, however, is very real and requires immediate, sustained attention. Training and running AI models, especially the latest large language models, requires huge amounts of memory, storage and compute, all of which draw significant power. Those requirements result in high energy consumption, which can strain electrical grids and increase overall energy demand. Compounding this concern, the power consumed by the large data centers housing today's advanced AI systems is often derived from non-renewable sources, and those same data centers often require substantial amounts of water for cooling as well. What all of this ultimately means is that the environmental impact of AI is a critical and growing concern. As AI continues to proliferate, addressing its energy needs is crucial to ensuring sustainable and environmentally responsible progress in the space.
How AMD Plans To Tame AI’s Energy Demands
It was with all of this in mind that I recently spoke with AMD executives about AI's energy demands and the company's plans to optimize its various acceleration platforms, from client to data center, for both performance and energy efficiency.
In a conversation with Mark Papermaster, AMD's chief technology officer and executive vice president for technology and engineering, I got a bird's-eye view of AMD's philosophy and its long-term sustainability and efficiency efforts. I also connected with Sam Naffziger, senior vice president and corporate fellow, for a deep-dive technical discussion detailing AMD's past achievements and what it's actively doing to maximize AI and HPC compute capabilities through silicon- and system-level optimizations and software co-optimization up and down the stack.
Our discussions were eye-opening. Papermaster talked about AMD's public goal, announced a few years back, of improving the energy efficiency of its high-performance compute platforms 30x by 2025. AMD calls the goal "30×25," and to that end, Papermaster also described AMD's holistic design approach, which aims to continually optimize many aspects of today's advanced systems, from the silicon to the algorithms and software used. The company's efforts aren't solely focused on its chips, though.
To date, AMD has made significant strides toward its 30×25 goal, but the company isn't quite there yet. AMD has achieved a 13.5x improvement versus the 2020 baseline set when 30×25 was announced, using a configuration of four AMD Instinct MI300A APUs (GPU with integrated 4th Gen EPYC "Genoa" CPU). That may not seem particularly close to 30x this late into 2024, but AMD is poised to release next-generation EPYC server processors soon, and its Instinct MI325X accelerators are in the pipeline as well, along with a myriad of software and framework updates. That combination alone may not push AMD past its self-imposed finish line, but it will likely push the efficiency envelope versus current offerings. Keep in mind, last decade AMD announced a 25x improvement goal for its mobile processors by 2020, aptly called 25×20, and the company ultimately delivered a 31.7x improvement in energy efficiency versus a 2014 baseline.
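For a rough sense of where that leaves AMD, here is a bit of back-of-the-envelope arithmetic, a minimal Python sketch using only the figures cited above, showing how much further the company has to go and what average yearly gain a 30x-in-five-years target implies:

```python
# Back-of-the-envelope arithmetic using the figures cited above:
# a 2020 baseline, a 13.5x gain achieved so far, and a 30x target by 2025.
achieved = 13.5
target = 30.0
years = 2025 - 2020

remaining_factor = target / achieved          # additional gain still required
implied_annual_gain = target ** (1 / years)   # average per-year gain for 30x over 5 years

print(f"Remaining efficiency gain needed: {remaining_factor:.2f}x")   # ~2.22x
print(f"Implied average annual gain: {implied_annual_gain:.2f}x/yr")  # ~1.97x
```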
Taking A Holistic Approach To AI Energy Efficiency
Papermaster explained that AMD is not only working on improving the efficiency of its own solutions, but also working with partners and the larger ecosystem to optimize virtually every aspect of the AI pipeline. Optimizing its CPUs, GPUs, FPGAs and the myriad micro- and macro-connectivity technologies that link chips, systems and racks will all help enhance efficiency, along with quantizing models, improving software, and tweaking algorithms. AMD's holistic approach to optimizing power efficiency means continually addressing every link in the virtual AI chain to maximize performance-per-watt.
This is an important consideration because it means the power and energy requirements of a product when it initially hits the market typically improve over that product's lifetime. To date, AMD has made double-digit efficiency gains year over year, and supercomputers built using AMD technologies have earned top rankings on the GREEN500, which ranks supercomputers from the TOP500 list in terms of energy efficiency. At one point, the AMD-powered Frontier TDS (test and development system) at Oak Ridge National Laboratory actually topped the GREEN500 list.
There's a lot of proprietary special sauce that AMD won't disclose regarding the specific methods it's using to optimize its chips, but I did glean some very interesting information from talking with Naffziger. One of the key areas where significant efficiency gains are possible relates to data movement. The largest AI models require huge amounts of data, and as bits move from the tiny register files inside GPUs or accelerator chips, to cache memory, out to high-bandwidth memory (HBM), and on to the CPU, the energy cost of moving each bit grows dramatically. As such, keeping as much data as close to the accelerator as possible is paramount to maximizing energy efficiency. It's why AMD continues to increase the amount of cache and memory on its Instinct accelerators gen-on-gen, and why the company continually explores ways to optimize how data is actually processed, from quantizing models, to partitioning GPUs, to tuning adjacent software and frameworks to best utilize the hardware.
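AMD hasn't shared the specifics of its own tooling, but as a generic illustration of what model quantization looks like in practice, here is a minimal PyTorch sketch. The toy model and sizes are hypothetical, and this is standard post-training dynamic quantization rather than anything AMD-specific. Storing weights as int8 instead of fp32 cuts their footprint roughly 4x, which directly reduces the bits that must be shuttled through caches and HBM:

```python
# A generic post-training dynamic quantization sketch in PyTorch.
# The toy model is a hypothetical stand-in; this is not AMD's tooling.
import io
import torch
import torch.nn as nn

def checkpoint_mb(model: nn.Module) -> float:
    """Serialized size of a model's state_dict, in megabytes."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

# Hypothetical toy network standing in for a much larger model.
fp32_model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Convert Linear weights to int8; activations are quantized on the fly at runtime.
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

print(f"fp32 checkpoint: {checkpoint_mb(fp32_model):.1f} MB")
print(f"int8 checkpoint: {checkpoint_mb(int8_model):.1f} MB")  # roughly 4x smaller
```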
If we look at a typical, large-scale AI system today, roughly 50% of the total power required to run it is consumed by the GPUs and their HBM, while the other 50% goes to the CPUs, scale-up and scale-out networking, and overhead such as cooling and other data center facility costs. AMD's goal is to maximize system-level performance while minimizing total power consumption, not just from its chips, but from everything around them in the data center as well.
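To see why that whole-system view matters, consider a back-of-the-envelope calculation using the rough 50/50 split described above. All of the absolute numbers in this sketch are hypothetical placeholders, not AMD figures:

```python
# Illustrative system-level perf-per-watt math using the rough 50/50 power
# split described above. Absolute numbers are hypothetical placeholders.
accelerators_and_hbm_kw = 500.0   # GPUs + HBM: roughly half the total power
rest_of_system_kw = 500.0         # CPUs, networking, cooling, facility overhead
sustained_pflops = 10.0           # delivered throughput, hypothetical

total_watts = (accelerators_and_hbm_kw + rest_of_system_kw) * 1e3
gflops_per_watt = (sustained_pflops * 1e6) / total_watts  # PFLOPS -> GFLOPS

print(f"System efficiency: {gflops_per_watt:.1f} GFLOPS/W")  # 10.0 in this example

# Doubling accelerator efficiency alone only trims the first 500 kW to 250 kW,
# leaving 750 kW total -- only a ~1.3x system-level gain. That is why the CPU,
# network, software and facility pieces have to be optimized as well.
```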
As pervasive as AI has become, we are still in the early days of the technology. What's true today may not necessarily be true tomorrow. More AI processing will likely move to the client and edge as AI PCs and other low-power accelerators become more prevalent, which will alter the dynamic between clients and the cloud. How AI workloads are processed also continues to evolve. The immense compute resources required for AI today are a major concern, there's no doubt about it. AMD seems to be doing its part to maximize the efficiency of its platforms, though, and the company appears poised to achieve its efficiency goals.