
Chris Sosa
Welcome. We have Chris Sosa, the Director of Engineering here at AMD. What a pleasure.
Unnamed Engineer
Thank you.
Chris Sosa
Great to have you at Infra AI 2025. And we're going to talk a little bit about scaling AI. And we are at the right place, right? What are the workloads looking like today?
Unnamed Engineer
You know, they're looking faster and faster, really getting more and more out of these machines. It's really helping us move faster.
Chris Sosa
What are the challenges, the bottlenecks that AMD is trying to plow through?
Unnamed Engineer
So one of the biggest set of challenges is really about really maximizing utilization. Especially if you're looking at building any kind of set of machines at scale: you can optimize for a single machine, but once you have many, many machines, you have a lot of problems. And a lot of those problems deal with people. So, for example, if you're building a developer platform and you have a bunch of folks who are optimizing and tuning for different types of workloads to run on that platform, how do you actually make sure they're leveraging the GPU? For example, are they downloading a bunch of artifacts before they run them? How much of their time is consumed downloading artifacts versus actually using the GPU? Because effectively, anytime you have a workload that isn't consuming the GPU, you're actually wasting money because these things are expensive and you want to really optimize what you paid for.
Chris Sosa
So how do you keep streamlining the process and keep it efficient?
Unnamed Engineer
It's a mix, right? There are a lot of technologies in the larger ecosystem that we really do rely upon. We make it easy to leverage Kubernetes, leverage Slurm. Those are very common ways to do workload orchestration, but they're not enough. On top of that, you have to build dashboards for tracking utilization. That way, if you have specific teams who are leveraging the GPUs in very inefficient ways, you can go chase them, because that's actually pretty... That's the human aspect. Like, hey, you don't control all the workloads. You have workloads that get put into the platform. You have to figure out how to maximize it.
Chris Sosa
So finding native requirements along the way.
Unnamed Engineer
Yeah, so there are a lot of native requirements. Storage is one of the big ones that you really have to solve for. And you have to solve for it in a way that, especially at scale, is more cloud native, right? Especially for low-level engineers who are used to optimizing and tuning workloads on a single machine, the concept of not being able to SSH into your machine is actually fairly challenging to grok. And so when you're trying to actually leverage a larger platform, you have to at one point try to meet developers where they are and try to make it as easy as possible for them to consume, but you need to do it in a way that actually optimizes the utilization of all the GPUs that they have available to them.
Chris Sosa
So it's constant balance, power, performance, efficiency at scale.
Unnamed Engineer
Yes, at scale.
Chris Sosa
In a competitive marketplace.
Unnamed Engineer
Exactly.
Chris Sosa
All right, so that's an easy job.
Unnamed Engineer
Yeah. All right. And doing it as fast as possible, because the second you're building a platform and your CTO gives you 100 machines, they expect you to be using all of it immediately. Wow.
Chris Sosa
Fun. So, biggest challenge, looking forward next 12 months.
Unnamed Engineer
I think the biggest challenge continues to be probably a combination of how do you do distributed inference and training really well, combined with how do you do that with heterogeneous workloads? Because it's one problem if you have one type of workload and you just do distributed inference, leveraging 2020 servers with eight GPUs, powers of two, so like 32 with 128, but doing that when you're not the only workload running is another. Like you have multiple distributed inference workloads, and the distributed inference is for two different stacks. How do you do smarter orchestration to actually optimize how those are placed, how those work well, while trying to optimize for GPU utilization?
Chris Sosa
And what's next for ROCm and Instinct development?
Unnamed Engineer
So for ROCm, one of the big things that's actually really exciting that we're doing is we're really trying to meet developers. Developers are a huge part of what makes AMD successful. And we're not talking about internal AMD developers; we're talking about people contributing to PyTorch, contributing to a lot of the inference stacks that you really need to actually run your models, both to train your models and run your models. What we're really trying to do with ROCm is make it really easy for everyone to effectively access our AI stack. One key part of that is the GPU that you have: whether you have a Windows machine and maybe a three-year-old Radeon chip, if you can use that to leverage our AI stack, the better. So we're really trying to empower that. Our goal is to really move back and actually be able to support the last 10 years with the ROCm stack, and make it really easy so that if there's anything where, hey, this doesn't quite work on a nine-year-old Radeon chip, and I'm someone who wants to fix that and help AMD, it's really easy to do that.
Chris Sosa
Great. So biggest opportunity next 12 months.
Unnamed Engineer
I think honestly it's a little hard to say. I think there is a mix of really kind of optimizing for the machines you have. Like, I think that, you know, I would say like the growth in models and like actually I'll take this back. Yeah, I don't know the answer to this question.
Chris Sosa
Okay, well, it's a hard one and I don't want to put you on the spot because you're a public company and all, but from the standpoint of what has you most excited for the next 12 months, how about that question? I'll re-ask it too. Is that okay?
Unnamed Engineer
Yes, I think that's good.
Chris Sosa
Okay, so what has you most excited, looking forward over the next 12 months?
Unnamed Engineer
I think for the next 12 months, definitely the smarter workload orchestration, that's a huge component of it. But also making it easier to compose these platforms that work really well for training and inference. Because one of the things we're seeing with building a lot of these platforms is that, for example, Slurm works really well for training. And we see a lot of the different training stacks actually work really well with Slurm. But when you're developing a platform for production workloads and you're running inference, Kubernetes is actually a really good solution, or something more bespoke than that. Seeing more of a convergence about being able to leverage these stacks together as part of one platform, I'm really excited by that, because it makes it really easy to build these platforms and really optimize on not two different platforms you have to build, but one platform that you're really optimizing for driving the overall utilization up.
Chris Sosa
Well, that's great. Chris, thanks for sharing this insight and really appreciate stopping by.
Unnamed Engineer
Thank you very much. Thanks for having me.
Podcast Summary: Liftoff with Keith Newman
Episode: Chris Sosa on Scaling AI at AMD: The GPU Efficiency Challenge That Will Shape the Future
Release Date: July 16, 2025
Host: Keith Newman
Guest: Chris Sosa, Director of Engineering at AMD
In this episode of Liftoff with Keith Newman, host Keith Newman, a former journalist and Silicon Valley dealmaker, sits down with Chris Sosa, the Director of Engineering at AMD. The conversation covers scaling artificial intelligence (AI) at AMD, focusing on the challenges and innovations related to GPU efficiency. Drawing on Chris's experience, the discussion touches on the current AI landscape, AMD's strategic approach, and the future of AI-driven technologies.
Chris Sosa kicks off the conversation by addressing the evolving nature of AI workloads:
"They're looking faster and faster, really getting more and more out of these machines. It's really helping us move faster."
(00:10)
He emphasizes the increasing demand for high-speed processing and the necessity for AMD's infrastructure to keep pace with the burgeoning computational requirements of modern AI applications.
A significant portion of the discussion centers around the hurdles AMD faces in maximizing GPU utilization across large-scale deployments:
"One of the biggest set of challenges is really about really maximizing utilization... Anytime you have a workload that isn't consuming the GPU, you're actually wasting money because these things are expensive and you want to really optimize what you paid for."
(00:34 - 01:22)
Chris highlights the complexities of managing multiple machines at scale, where optimizing a single machine differs vastly from orchestrating fleets of GPUs. The focus is on ensuring that every GPU is efficiently utilized to justify the substantial investment in hardware.
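To make the cost argument concrete, here is a minimal sketch of the arithmetic behind it. The field names, the two-phase split of a job into download and compute time, and the hourly price are all illustrative assumptions, not figures or tooling from the interview.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Wall-clock breakdown for one job, in seconds (hypothetical fields)."""
    download_s: float  # time spent fetching artifacts while the GPUs sit idle
    compute_s: float   # time the GPUs are actually doing work


def gpu_utilization(job: JobTiming) -> float:
    """Fraction of the job's reserved time in which the GPUs were busy."""
    total = job.download_s + job.compute_s
    return job.compute_s / total if total > 0 else 0.0


def idle_cost(job: JobTiming, gpus: int, dollars_per_gpu_hour: float) -> float:
    """Money spent on reserved GPUs during the idle (download) phase."""
    return job.download_s / 3600 * gpus * dollars_per_gpu_hour


# Assumed example: 15 minutes of downloads before 45 minutes of GPU work
# on an 8-GPU server at an assumed price of $2 per GPU-hour.
job = JobTiming(download_s=900, compute_s=2700)
print(f"utilization: {gpu_utilization(job):.0%}")
print(f"idle cost: ${idle_cost(job, gpus=8, dollars_per_gpu_hour=2.0):.2f}")
```

Under these assumed numbers the job keeps the GPUs busy for three quarters of its reserved time, and the download phase alone accounts for $4.00 of spend per run.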
Addressing the need for streamlined operations, Chris discusses the balance between leveraging existing technologies and implementing custom solutions:
"We make it easy to leverage Kubernetes, leverage Slurm...but they're not enough...you have to build dashboards for tracking utilization."
(01:27 - 01:59)
He points out that while tools like Kubernetes and Slurm are foundational for workload orchestration, additional layers such as utilization dashboards are crucial for identifying and rectifying inefficiencies, especially when dealing with diverse teams and varying workloads.
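The kind of per-team view such a dashboard might provide can be sketched as a simple aggregation. This is only an illustration: the team names, the sample tuples, and the 50% threshold are hypothetical, and a real dashboard would read from whatever metrics system the platform already uses.

```python
from collections import defaultdict

# Hypothetical samples: (team, gpu_hours_reserved, gpu_hours_busy)
samples = [
    ("vision", 100.0, 82.0),
    ("speech", 64.0, 20.0),
    ("vision", 50.0, 43.0),
    ("search", 32.0, 8.0),
]


def utilization_by_team(rows):
    """Sum reserved and busy GPU-hours per team and return busy/reserved."""
    reserved = defaultdict(float)
    busy = defaultdict(float)
    for team, reserved_hours, busy_hours in rows:
        reserved[team] += reserved_hours
        busy[team] += busy_hours
    return {t: busy[t] / reserved[t] for t in reserved if reserved[t] > 0}


def teams_below(report, threshold):
    """Teams under the utilization threshold, least utilized first."""
    return sorted((t for t, u in report.items() if u < threshold), key=report.get)


report = utilization_by_team(samples)
for team in teams_below(report, threshold=0.5):
    print(f"{team}: {report[team]:.0%} utilized")
```

Sorting the flagged teams by utilization puts the least efficient users of the fleet at the top, which is the list a platform owner would follow up on first.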
The conversation underscores the perpetual challenge of maintaining an equilibrium between power, performance, and efficiency:
"It's a constant balance, power, performance, efficiency at scale."
(02:40)
Chris succinctly captures the essence of AMD's mission to deliver high-performance AI solutions without compromising on efficiency, all within a competitive and rapidly evolving marketplace.
Looking ahead, Chris delves into the complexities of distributed inference and training, particularly when handling heterogeneous workloads:
"How do you do smarter orchestration to actually optimize how those are placed...while trying to optimize."
(03:08 - 03:50)
He elaborates on the challenges of managing multiple distributed inference tasks across different AI stacks, emphasizing the need for intelligent orchestration to maximize GPU utilization across varied and simultaneous workloads.
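As a rough illustration of what placement-aware orchestration means, the sketch below packs replicas from two inference stacks onto 8-GPU servers using best-fit decreasing bin packing. This is a textbook heuristic chosen for brevity, not the approach AMD describes; the workload names and sizes are made up, and it assumes each replica fits on a single server.

```python
def place(requests, num_servers, gpus_per_server=8):
    """Best-fit decreasing: handle the largest requests first and put each
    on the server with the fewest free GPUs that can still hold it."""
    free = [gpus_per_server] * num_servers
    placement = {}
    for name, need in sorted(requests.items(), key=lambda kv: -kv[1]):
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            placement[name] = None  # no server has enough free GPUs
            continue
        best = min(candidates, key=lambda i: free[i])
        free[best] -= need
        placement[name] = best
    return placement, free


# Hypothetical replicas from two different inference stacks (GPUs per replica).
requests = {"stack_a_r0": 4, "stack_a_r1": 4, "stack_b_r0": 6, "stack_b_r1": 2}
placement, free = place(requests, num_servers=2)
print(placement)
print("free GPUs per server:", free)
```

With these assumed sizes, the two stacks end up sharing both servers with no GPUs left unallocated. A production scheduler would also have to weigh interconnect topology, multi-server jobs, and priorities, which this sketch ignores.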
A pivotal segment of the discussion focuses on AMD's ROCm (Radeon Open Compute) stack and Instinct GPUs:
"We're really trying to make it really easy for everyone to effectively access our AI stack...empower that."
(03:56 - 04:53)
Chris explains AMD's commitment to supporting developers by ensuring compatibility and ease of use across different hardware generations. By facilitating contributions to frameworks like PyTorch and enhancing support for older Radeon GPUs, AMD aims to foster a more inclusive and versatile AI development ecosystem.
When probed about upcoming opportunities, Chris expresses enthusiasm about advancements in workload orchestration and platform composition:
"I'm really excited by...being able to leverage these stacks together as part of one platform...optimizing on not two different platforms you have to build, but one platform that you're really optimizing for driving the overall utilization up."
(05:35 - 06:26)
He envisions a future where training and inference platforms converge, simplifying the development process and enhancing GPU utilization through unified orchestration frameworks.
Maximizing GPU Utilization:
"Anytime you have a workload that isn't consuming the GPU, you're actually wasting money because these things are expensive and you want to really optimize what you paid for."
(00:34 - 01:22)
Streamlining Workloads:
"We make it easy to leverage Kubernetes, leverage Slurm...but they're not enough...you have to build dashboards for tracking utilization."
(01:27 - 01:59)
Balancing Act:
"It's a constant balance, power, performance, efficiency at scale."
(02:40)
Future of Workload Orchestration:
"I'm really excited by...being able to leverage these stacks together as part of one platform...optimizing on not two different platforms you have to build, but one platform that you're really optimizing for driving the overall utilization up."
(05:35 - 06:26)
The episode provides a comprehensive look into AMD's strategic approach to scaling AI, particularly through the lens of GPU efficiency and workload orchestration. Chris Sosa articulates the multifaceted challenges of maximizing GPU utilization, managing distributed AI workloads, and fostering an inclusive developer ecosystem. Looking forward, AMD's focus on integrating orchestration platforms and enhancing the ROCm stack positions them well to navigate the complexities of AI scalability. The conversation underscores the importance of continuous innovation and strategic optimization in maintaining AMD's competitive edge in the AI landscape.