Summary

Eye on A.I. Podcast – Episode #324

Guest: Sharon Zhou, VP of AI at AMD

Host: Craig S. Smith

Date: February 27, 2026

Episode Overview

In this episode, host Craig S. Smith talks with Sharon Zhou, VP of AI at AMD, about AMD's work on self-improving artificial intelligence. Sharon discusses efforts to use AI for automatic kernel code generation, which is crucial for performance on AMD hardware, and how these self-improving systems could shape the wider ecosystem. The conversation covers kernel generation, continual learning and catastrophic forgetting, the boundary between automation and genuine self-improvement, and Sharon's educational outreach. The discussion places AMD's efforts in the context of current AI research, hardware, and global compute demand.

Key Discussion Points and Insights

1. Introduction: Sharon Zhou’s Background and AMD’s AI Focus

Sharon introduces herself as VP of AI at AMD, describing her transition from AI research at Stanford, through a startup focused on post-training language models on AMD GPUs, to her current role.
- “I am Sharon. I'm the VP of AI at AMD and I think about self improving AI, self improving LLMs which we'll get into later. ... And most recently ... we have transitioned now to AMD. So my team and I are there now and very excited to enable more people to use compute and to get access to compute because that really is the limiting factor.” (00:23)

2. Defining Self-Improving AI and Kernel Generation

Sharon explains "self-improving AI" as models that can edit any part of themselves for improvement—data, architecture, evaluation, or even the foundational kernel code.
AMD’s efforts are particularly focused on enabling language models to write their own low-level kernel code to optimize performance on GPUs.
- “It's the idea of these models being able to edit any part of themselves to improve themselves ... what I'm working on is ... how fast they actually run on the GPUs themselves. They are writing the kernel code that underlies these models to run faster on these GPUs ...” (02:00)

3. Kernel Generation: Industry Collaboration and Benchmarks

Sharon describes a landscape where multiple organizations, including Meta, Google DeepMind, Nvidia, and Stanford, are working on AI-assisted kernel generation.
AMD collaborated on a NeurIPS tutorial to educate the community about AI-generated kernels and shared benchmarks/methods.
- “What we did most recently was in collaboration with actually a bunch of different institutions ... Was NeurIPS tutorial on generating kernels using AI ... how we're using AI agents [to] generate these kernels and how we're thinking about post training these models to generate kernels more effectively.” (02:51)

4. Technical Deep Dive: What Are Kernels and Why Optimize Them?

The conversation explores what “kernels” are: small, low-level programs that run specific operations (e.g. matrix multiplication) on GPUs and are key to AI model efficiency.
- “For a given piece of hardware ... you do have to write the software to connect the models of today and of tomorrow, ideally to that hardware. ... It's utilizing the GPU effectively, both the memory ... as well as the actual raw compute power.” (04:58)
Matrix multiplication is the most common and critical kernel operation for LLMs.
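To make the idea of a kernel concrete, here is a minimal sketch in plain Python rather than GPU code. It shows the same matrix multiplication written two ways: a textbook triple loop and a tiled version that works through small blocks, which is the kind of restructuring a GPU kernel uses to keep data in fast memory. The function names and the `tile` parameter are illustrative assumptions, not anything shown in the episode.

```python
def matmul_naive(a, b):
    """Reference version: textbook triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, visited block by block so that each small block of
    inputs is reused before moving on (on a GPU, while it sits in fast memory)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
# Both versions must agree; only the order of memory accesses differs.
assert matmul_tiled(a, b) == matmul_naive(a, b)
```

The best `tile` value depends on the matrix shapes and on the hardware, which is one reason a separate kernel may be tuned for each matrix size.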

5. Evolutionary Strategies, Agentic Approaches & Bottlenecks

AMD combines evolutionary, agentic, and post-training methods to generate and refine these kernels.
Manually written kernels require rare expertise in both hardware and software, making automation/self-improvement a huge productivity booster.
- “Having those two areas of expertise in one person's head is quite rare. ... But of course we're also teaching the models about this. ... We have the ROCm stack, which is open source. ... That being open actually helps us from a language model perspective because the models can read that data and train from that data.” (08:52)
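The generate, check, profile, and improve loop that Sharon describes in the episode can be sketched roughly as follows. This is a toy under stated assumptions, not AMD's method: `build_candidate` stands in for model-generated code and exposes a single tuning knob, `toy_profile` is a deterministic cost model standing in for a real GPU profiler, and candidates with a tile above 32 are deliberately wrong so the correctness check has something to reject.

```python
import random


def reference_kernel(xs):
    """Trusted implementation used only to check candidates."""
    return [2.0 * x for x in xs]


def build_candidate(tile):
    """Hypothetical stand-in for generated code, parameterized by one knob.
    Tiles above 32 take a 'cheating' shortcut and leave the tail unprocessed."""
    def kernel(xs):
        out = [2.0 * x for x in xs]
        if tile > 32:
            out[-1] = 0.0
        return out
    return kernel


def is_correct(kernel, inputs):
    return kernel(inputs) == reference_kernel(inputs)


def toy_profile(tile):
    """Toy cost model in milliseconds (an assumption): larger tiles look faster."""
    return 100.0 / tile


def evolve(start_tile, generations, rng):
    inputs = [1.0, 2.0, 3.0, 4.0]
    best_tile, best_time = start_tile, toy_profile(start_tile)
    for _ in range(generations):
        # Mutate the current best candidate.
        child = max(1, best_tile + rng.choice([-4, -1, 1, 4]))
        # Reject fast-but-wrong candidates before looking at their speed.
        if not is_correct(build_candidate(child), inputs):
            continue
        child_time = toy_profile(child)
        if child_time < best_time:
            best_tile, best_time = child, child_time
    return best_tile, best_time


best_tile, best_time = evolve(start_tile=8, generations=200, rng=random.Random(0))
speedup = toy_profile(8) / best_time
print(f"best tile={best_tile}, speedup over start={speedup:.2f}x")
```

The order of the two checks matters: a candidate is only profiled after it matches the reference, so an incorrect kernel can never win on speed alone.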

Notable Quotes & Discussion Highlights

The Impact and Ambitions of Kernel Self-Improvement

“The part of self improving though we've been working on is using these language models to write low level kernel code. ... Get the model to actually write its own code to run even faster on the GPU so that it can learn even faster. This is the self improving loop…” (13:59)

On Continual Learning and Catastrophic Forgetting (12:22)

Sharon explains the risk of “catastrophic forgetting” during post-training, especially when access to pre-training data is limited:
- “Even if you include ... only 1% of the pre training data back in during post training, you can actually prevent catastrophic forgetting, basically enabling the model to actually like connect back to its representations back in pre training.” (12:22)
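A minimal sketch of the replay idea in the quote above, assuming the datasets are simple Python lists: a small share of pre-training examples is mixed back into the post-training set. The function name `mix_with_replay` and the string examples are hypothetical, and the 1% default only mirrors the figure Sharon cites from the literature.

```python
import random


def mix_with_replay(post_training, pretraining, replay_fraction=0.01, seed=0):
    """Return a shuffled list in which about `replay_fraction` of the examples
    are replayed pre-training samples."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n = len(post_training)
    # Solve replay_fraction = k / (n + k) for k, the number of replayed examples.
    k = round(replay_fraction * n / (1.0 - replay_fraction))
    k = min(k, len(pretraining))
    replay = rng.sample(pretraining, k)
    mixed = list(post_training) + replay
    rng.shuffle(mixed)
    return mixed


post = [f"post-{i}" for i in range(990)]
pre = [f"pre-{i}" for i in range(10_000)]
mixed = mix_with_replay(post, pre, replay_fraction=0.01)
n_replay = sum(1 for example in mixed if example.startswith("pre-"))
print(len(mixed), n_replay)
```

In practice the replayed samples would come from whatever portion of the original pre-training data, or a comparable public dataset, is accessible.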

Distinction: Automation vs. Self Improvement (26:21)

Sharon sees kernel generation as more than rote automation:
- “I view it as self improvement, not automation ... the kernel itself is what the model is running on. So it is the model's own code and it enables the model to run even faster on the GPU. ... but if we want to view it through a lens that's like more exciting ... that is just a different perspective on it.” (26:42)

The Reality of AI Autonomy (27:20)

On whether models will soon truly rewrite their own code:
- “Yeah, I definitely think so. I think more likely it will be a different model that does it.” (27:31)

Resource Saving and Industry Economics (34:30)

On the tangible impact:
- “A very, very complex kernel can take a, at least you know, let's say a non expert, but someone who might still be tasked to do it. Months to write ... like an expert would take a couple weeks ... if you can even shave off ... a tiny amount of time it takes for one matrix multiplication that occurs billions trillions of times inside of a model ... that can incur billions, hundreds of billions of dollars to a company.” (34:30)

Compute Demand and Infinite Chips (39:26)

On whether kernel optimization could ease the chip shortage:
- “I think people want infinite chips, Craig. So no, it doesn't relieve the pressure. I don't think they found a plateau where they are like, we're, we're done ... I, I don't see that right now.” (39:26)

Segment Timestamps

| Timestamp | Segment |
|-----------|---------|
| 00:23 | Sharon’s background & her excitement at AMD |
| 02:00 | Defining "self-improving AI" |
| 02:51 | Industry collaborations on kernel generation |
| 04:58 | Technical explanation: Kernels & GPU stack |
| 07:01 | Evolutionary/agentic strategies for kernel gen |
| 08:52 | Bottlenecks & value of open source ROCm |
| 12:22 | Catastrophic forgetting & continual learning |
| 13:59 | Models writing their own kernel code |
| 19:53 | AMD's approach: research & product focus |
| 21:39 | Recap of NeurIPS tutorial for community |
| 26:42 | Automation vs. self improvement discussion |
| 27:20 | Realistic prospects for model autonomy |
| 34:30 | Economic impact and time savings |
| 39:26 | Compute demand & market implications |
| 41:54 | Use cases: Language models, diffusion, vision |
| 45:46 | Sharon’s educational initiatives |

Educational & Outreach Efforts

Sharon teaches AI at scale, collaborates with DeepLearning.AI and Harvard University, and is developing practical courses for different levels, including a popular post-training class on RL and fine-tuning.
- "I teach online. I teach about a million people ... and lately we built up a partnership with DeepLearning.AI ... launched a post training class on reinforcement learning and fine tuning of these models." (42:36)
- Courses are available for free on DeepLearning.AI; Harvard courses coming soon. (46:04)

Memorable Moments

On AI replacing kernel engineers:
- “If you're listening, I encourage you to go learn about it. But of course we're also teaching the models about this. ... That's the impact ... massive ...” (08:52)
State of AI self-improvement:
- “We're further than we think and closer than we think.” (15:33)
On the open/closed nature of GPU stacks:
- “We have the ROCm stack, which is open source. That's the equivalent of Nvidia's CUDA stack, which is not as open. That being open actually helps us ...” (08:52)

Closing Synthesis

This episode offers an insider’s view into how AMD is leveraging AI not only to speed up its own hardware, but also to automate critical software infrastructure, democratize access to high performance compute, and seed a broader community of AI practitioners. Sharon Zhou’s expertise bridges academic research, practical productization, and educational outreach, situating AMD at the intersection of hardware innovation and AI self-improvement.

Where to Learn More

Courses:
- DeepLearning.AI: Free courses on RL and model fine-tuning. (46:04)
- Harvard University: Upcoming pro courses for non-engineers and leaders.
Open source contributions from AMD: Kernels and benchmarks.

For more insight on the AI hardware-software frontier and the evolving balance between automation and autonomy, this episode is a must-listen for practitioners and enthusiasts alike.


Transcript

[00:00]
A
Catastrophic forgetting is definitely a problem, especially in post training and when you don't have access to the original pre training data.
[00:08]
B
How much of that is focused on improving the design of AMD hardware and how much of it is focused on general model development, which AMD is not doing, right? AMD does not develop its own models.
[00:23]
A
I think people want infinite chips, Craig. So no, it doesn't relieve the pressure. I am Sharon. I'm the VP of AI at AMD and I think about self improving AI, self improving LLMs which we'll get into later. But my background comes from AI research so I used to be an AI researcher at Stanford where I did my PhD with Andrew Ng. I taught there as adjunct in generative AI back before all this ChatGPT stuff. And after Stanford I started a startup, an AI infrastructure startup doing post training of language models actually on AMD GPUs. This was started a couple months before ChatGPT launched and most recently over the last several months we have transitioned now to AMD. So my team and I are there now and very excited to enable more people to use compute and to get access to compute because that really is the limiting factor, one of the big limiting factors for developing AI and being able to enable more people to steer these models. So that's what I'm really excited about. And yeah, that's, that's why I'm here.
[01:37]
B
Yeah. And, and I do want to talk about self improving AI. Can you start by defining what we're talking about when we talk about self improvement? Are we talking about models that rewrite their own code or something like refining their own training data?
[02:00]
A
Yeah, I think that's exactly it. It is a broad category but essentially it's the idea of these models being able to edit any part of themselves to improve themselves, whether that be the data, whether that be the actual model architecture, whether that be how they evaluate themselves. Actually the part that I'm working on is below all of that and it's actually how fast they actually run on the GPUs themselves. They are writing the kernel code that underlies these models to run faster on these GPUs and to run effectively on them and on new hardware too. That's been really exciting to see.
[02:38]
B
Yeah, I just read about KernelEvolve. Who is doing KernelEvolve? I've forgotten. Is that Google? Is that your guys' work?
[02:51]
A
So there's a lot of different pieces of work around kernel generation and being able to use LLMs to generate these kernels. We're doing some of that, but I think that work might have been from Meta, but there's. Yeah, I think there's work across the board, across the industry that is very important I think towards this end because it enables more people to get on, get on different types of compute. What we did most recently, in collaboration with actually a bunch of different institutions like Meta, Google DeepMind, MLCommons, Stanford, Nvidia, etc., was a NeurIPS tutorial on generating kernels using AI. And so we presented that and basically goes through how we're using AI agents to generate these kernels and how we're thinking about post training these models to generate kernels more effectively. Because what's really exciting about kernel generation and kernel development is actually we have the profiler, so we have the ability to actually see how fast the generated kernels are on the chip themselves. That's really exciting. My team, we're also working on a more robust production level benchmark to share with the community as well as different techniques to modify the models to do better on this task.
[04:16]
B
Okay. The. Yeah. And for again, I'm trying to make my, my podcast a little more accessible.
[04:24]
A
Oh yeah, of course.
[04:26]
B
So kernel in this sense because kernel is used in everywhere too many places. But kernel in this sense, you're talking about a small piece of software that, that lives on the processing unit that performs a specific task. Is that right? And what are the kernels that you're generating from AI?
[04:59]
A
So there are many layers of the stack for AI. There is the model layer where we, we kind of build out the model architecture. This is where people talk about transformers and attention and kind of write that out in PyTorch or JAX or TensorFlow. And so people are like building out models there and then they're using maybe Hugging Face on top or they're using different tools to leverage those models. Now underneath all of that are ways where the model is running on the GPU and there are many, many layers, but one of the layers is making those models run really fast on the GPU. And sometimes you can break that down into small different pieces because it's a lot of matrix multiplications, for example, it's a lot of different operations happening on the GPU. And so for each of those operations you can have a way of basically optimizing the speed of that on the GPU itself. And so for a given piece of hardware, it may have been designed to be of a certain speed. You do have to write the software to connect the models of today and of tomorrow, ideally to that hardware. And that's kind of what that layer is doing. It's, it's improving the efficiency. So it's utilizing the GPU effectively, both the memory, so like what you know, the GPU has as memory and storage as well as the actual raw compute power. So being able to schedule all of that and use that effectively and maximally is, is what kernels do. And this is really important because it's expensive. Right. Like GPUs over a long time and over just a large capacity, when you're parallelizing it becomes expensive. And so you want to be able to eke out as much as you can on, on a, on each GPU.
[06:50]
B
Yeah, and I, I understand how. And you're using evolutionary strategies to, to develop these.
[07:02]
A
Yes, I would say it's a combination of evolutionary strategies, agentic strategies, as well as different types of post training strategies to be able to get these models, to actually be able to write that code effectively or at least assist our internal kernel engineers to do so. I think one of the most important things is to do so on useful kernels as well. The most important kernels for language models.
[07:29]
B
Yeah. Well, again, for listeners, what are the most useful? What's, what's an example?
[07:35]
A
Well, I would say matrix multiplication. So when you're doing, when you're learning about the math of a neural network, you'll see that there's a lot of matrix multiplications and a lot of those need to be optimized. And you know, at first you might think oh, isn't there just one, you know, these multiplying. But actually no, you can optimize different size matrices that multiply together and you can get like even better performance for something of a different size. Like you can optimize that. So there are thousands, possibly hundreds of thousands of those kernels that you could optimize for language models. And of course for a particular customer, particular like foundation model company, they may have a certain set that they particularly care about for their architecture and those are ones that are of high priority.
[08:25]
B
Yeah, and, and the, the, these kernels in the past have been written manually. Is, is that right? So, so having there's a tremendous productivity gain and having AI either do it autonomously or do it as an assistant. What, what sort of impact does that have on amd?
[08:52]
A
I think it's having larger and larger impact. I would say that historically they've been written manually. And just to give you a sense of how much knowledge this kernel engineer needs to have in their head, they need to know about the GPU architecture. So not just general GPU architecture, not just your average architecture class, but like that generation of the GPU and what changes have been made, how to actually write the code to run the software on it, and all the different possible new things that have been invented at the hardware layer, they have to understand all those things and then they also need to understand at least enough about what this matrix multiplication is doing. Right. And to actually then write that optimally in the code. And I would say that having those two areas of expertise in one person's head is quite rare. And as a result it's very, very valuable. But also it is bottlenecked at a lot of hardware companies, or actually a lot of different companies, including frontier labs. They also have people writing kernels to speed up their models as well, since they have more visibility on their own models that may not be fully visible to everyone else. And so I think it's a rare skill. And if you're listening, I encourage you to go learn about it. But of course we're also teaching the models about this. Right. And so what's exciting is, for AMD, we have the ROCm stack, which is open source. That's the equivalent of Nvidia's CUDA stack, which is not as open. That being open actually helps us from a language model perspective because the models can read that data and train from that data and use that data to then learn about the ROCm stack and then write those kernels for us. That's what we're doing here. 
That's the impact is, is massive from both a, I would say like direct customer lens of what it could do to improve that, but also I would say in the long tail, so there are researchers working on various different models that are not just ChatGPT or you know, just Grok or just, you know, just a few of these frontier models, but they're working on a bunch of different other models and they're creating, they're inventing the next model. Right. And those also need to be optimized. And I think having something that is almost just in time, ideally just in time, but you know, like almost automatically give you an optimized kernel for is, is very essential to enabling just the entire ecosystem to be able to move over.
[11:31]
B
Yeah, and when you say almost just in time is this, this is being done on the fly or, or.
[11:43]
A
That is my goal, Craig. That is my goal.
[11:47]
B
Yeah, because you talk about self improving AI, that suggests that there are models that are evolving over time or improving over time. And that leads to the question of Continual learning, which is kind of the holy grail, but there's catastrophic forgetting and expandability of neural nets and all those problems. Can you talk about where the, the research stands on continual learning?
[12:22]
A
Yeah, so catastrophic forgetting is definitely a problem, especially in post training and when you don't have access to, to the original pre-training data. Because I think we found in the literature at least that even if you include originally it was like if you include 20%, but actually if you include only 1% of the pre-training data back in during post training, you can actually prevent catastrophic forgetting, basically enabling the model to actually like connect back to its representations back in pre-training or at least help significantly for, for that. And I think it again depends on access in terms of who is doing some of that post training, but I think that's what the literature points to. So it's the extent to which you can access some of that data or use some of those online data sets to just bring back some of those data points from pre-training that could actually help your model significantly in preventing catastrophic, something like catastrophic forgetting. But of course I think this is something that needs to be continually monitored. So I think the frontier labs maybe don't have this problem as much. But as an individual who is probably doing some type of post training, that is something that you do have to consider especially as you get your, especially as your workload gets heavier and heavier, small little bits of fine tuning that doesn't really change as much.
[13:47]
B
But so self improving are you. Yeah, define self improving for me. I mean what are you improving in or attempting to improve in a model?
[14:00]
A
The part of self improving though we've been working on is using these language models to write low level kernel code. This is code that makes the language model itself run really fast on AMD GPUs. Get the model to actually write its own code to run even faster on the GPU so that it can learn even faster. This is the self improving loop that we're enabling the models to do here. That's really exciting because then that'll enable more people to access this compute and it'll enable more powerful models within a shorter amount of time.
[14:33]
B
And is that something then that you deploy with AMD hardware that people can use with their models or is it a separate model that you make available to users of the hardware?
[14:51]
A
Yeah, this is such a good question. Right now it's being used internally and we're releasing different datasets and evaluation benchmarks and methods out into the world. So we did a NeurIPS tutorial that presented a lot of this work and we're preparing for ways to actually package this and make it possible for our customers to use it as well.
[15:13]
B
Yeah. And then the idea of editing a model, editing its own training data or why would you want to do that? And can you give me an example and how is that being done?
[15:33]
A
Yeah, so we're editing the kernel code for the model in terms of editing training data. What that looks like is usually this is like a synthetic data setup. So the model is generating data for the next round of training. So yeah, self improving AI is more general than just having the models generate kernels and write these low level kernels to make the models really fast on GPUs. It could also touch on generating data for itself as well as generating evaluation for itself and also generating different architectures for itself as well. So basically using the models to improve itself at any stage of the pipeline. I think data generation is really interesting, just synthetic data generation overall, and there's been a lot of work and research and speculation on it. I think we have found that using synthetic data to train these models is very helpful, especially if it's actually not a frontier model and you're distilling from a frontier model. It's not quite distillation, sorry. But it is using synthetic samples from a frontier model to then supervise a smaller model that you might be training. That's very effective. And I think like there's also debate around synthetic data generation and how much that might may or may not lead to the collapse as the model trains since it's looking at its own kind of data. But I do think it is an important space to continue monitoring and continue understanding. Yeah. So I think what's really interesting is this holy grail that everyone's looking towards of having the model fully improve itself and create its own next generation. I think we're both further than we think and closer than we think. And so I think we're further than we think in the sense that there are. There is still a lot of human expertise that goes into, you know, how these things are laid out and what we should try next, for example, of what the model, you know, what we should add into the model, whether it be new data source or not. 
But I do think there's more than we think as well in terms of the use of AI to help us code in general. So the amount of code that AI is writing is pretty enormous and it is supporting a lot of engineers that are putting these together and creating the next generation of that AI. So I think it's both we're further and closer than we think. Basically we're further than we think from the like ultimate like, okay, there's only one person works at OpenAI now. It's only Sam there. And it's just the end with ChatGPT versus where we're closer than we think. Because actually everyone is actually using ChatGPT quite significantly to write that code. So I think it's in that state right now, but it's not exactly like autonomously improving itself fully. There are pieces that you can launch that feel more autonomous, where you could have it like reflect on its own output and then improve that. It can look at the profiler information, so it can look at how fast a kernel ran on the GPU and improve that. But it's not quite like I'm gonna let this go and come back in a month and it'll be fully solved.
[19:05]
B
Yeah, yeah. Well, let's, can we back up a little bit? As I said at the beginning, AMD's a hardware company. You're in charge of AI at, at AMD.
[19:18]
A
Part of it, yeah.
[19:20]
B
You're part of it, yeah. Is, is that, is, is that a research function? Because. And, and how much of that is focused on improving the design of AMD hardware and how much of it is focused on general model development, which AMD is not doing, right? AMD does not develop its own models.
[19:54]
A
We have actually pre trained some models and released them, but they are small in the single digit billion parameters range. And yeah, we're probably like not developing models in the sense that like not ChatGPT, but, but yeah, we, we do have some that are open source models. Yeah. So I focus on, I almost call it like product research. I think there is a research element to it that we don't know how good the models are at solving this full task end to end. But there is a product component in that this is part of the loop with what becomes customer facing. Right. So then this becomes part of the loop of what we want to make available to our customers. And so it's shippable and there is a production component to it. And so I would say it kind of sits at that intersection which I know a lot of things today sit at. So yeah, I think that's one of my teams works on that and the other team works on, I would say like research and education. So we're thinking about how to stay at the forefront of research and engage with researchers everywhere, but also educate the world and evangelize, you know, educate the world about AI and evangelize AMD GPUs that are able to support all of those new AI workloads. And so yeah, that's the couple teams that I've been very grateful to be leading.
[21:27]
B
Yeah and I, we tried to connect at NeurIPS and I'm sorry that we weren't able to and I missed the, the, the talk. Can you go through the presentation that you gave at NeurIPS?
[21:40]
A
Yes. So it was a tutorial, two and a half hours and my team did a huge part of the presentation, so definitely not just me here. And it was a collaboration across a lot of different institutions. So like you know, Stanford, Google DeepMind, Meta, Nvidia, ARM, etc. And so it was, yeah, a really, really great group of people. And what we presented in our tutorial was a few different things. First we, because it's a NeurIPS audience, AI researchers, they might not be actually as familiar on GPUs and hardware and how the GPU actually works. And so we kind of step through in hopefully very easy to understand ways, the actual GPU architecture and what pieces that an AI researcher might find interesting, like the memory and like how things are scheduled in like something I find interesting is, you know, when you change your batch size, why suddenly like your, your job takes forever, right? Your training job takes forever. Oh, it's because of, you know, X, Y, Z happening in, in the actual GPU. And so that's what we go through first and then we go through kind of different kernels. So like showing you what a kernel actually is in code and a very simple kernel and then showing you what it looks like when it's optimized and unoptimized. So you can understand like oh, when a kernel runs fast or slow, what it looks like. So if something runs really slowly, that means like ChatGPT could answer you in like minutes as opposed to seconds. Right. So that, that's ultimately what it feels like. But then when you go down into like just a single matrix multiplication, what does that look like? Or a single addition, you know, like X plus Y, what does that look like in terms of time and how do you profile it? And then we go into, okay, what are AI agents and methods that we're using for these self improving AI essentially. But essentially what are some of these methods that we're using and sharing, sharing a lot of that. So like what kind of, what kind of things can you do with agents? 
You can have them like, you can write, "Write me a more optimized kernel." Okay, it does. Maybe it is or isn't optimized. You have to check the correctness of it. Maybe it's cheating. And then when you profile it, you can actually get the answer back. And you're like, actually it was only 1.2x faster. I want it to be even faster. And then you can continually evolve it, that was based on Google's AlphaEvolve paper, but essentially like continually improve it after multiple calls. The profiling information also provides a really interesting way of collecting or creating an RL environment so that these models can learn from essentially a verifiable reward from the profiler itself. So it's able to give this number back as, okay, this is how fast your kernel was that you generated and tells the model in a very verifiable way. Kind of like when, with ChatGPT, they were training on math tasks, it was very verifiable whether the math was correct or not. And this helps the model improve in post training. And so sorry, I'm trying to make this accessible, but like, so basically we go through all those different techniques as well and we share that with the community.
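As a rough illustration of the "verifiable reward from the profiler" idea in the answer above, a reward for a generated kernel could be computed along these lines. The function name, the millisecond inputs, and the choice to give incorrect kernels a reward of zero are all assumptions made for this sketch, not details from the tutorial.

```python
def kernel_reward(is_correct, baseline_ms, candidate_ms):
    """Hypothetical reward: 0 for an incorrect kernel, otherwise the measured
    speedup of the candidate over the baseline (1.0 means no change)."""
    if not is_correct or candidate_ms <= 0:
        return 0.0
    return baseline_ms / candidate_ms


# A correct kernel that takes 10 ms against a 12 ms baseline is 1.2x faster.
reward_fast = kernel_reward(True, baseline_ms=12.0, candidate_ms=10.0)
# A faster kernel that fails the correctness check earns nothing.
reward_wrong = kernel_reward(False, baseline_ms=12.0, candidate_ms=1.0)
print(reward_fast, reward_wrong)
```

Gating on correctness first is what keeps the number verifiable: a kernel cannot score well by skipping work.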
[25:04]
B
Yeah. And this is done at what point in the process someone, a customer, is using AMD hardware and they have a model that they want to run on it, and so they're optimizing the kernel before deploying the model, or is it after deployment and they want to improve it?
[25:28]
A
Yeah, so I think it's a combination of those things. So I think you could use it definitely before you deploy something. So before you deploy something, you're like, okay, this isn't fast enough, given my budget with this many GPUs and we need it to hit these markers. Right. And so because that basically translates to TCO, translates to the total cost of ownership. And so let me reduce the speed of all. Sorry, yeah, let me make these, all of these faster inside of the model so that the full model runs really quickly. So that could be before deploying. Something could also happen afterwards too. A lot of different kernels are written afterwards as well to just speed things up even further. So it could be something you've already deployed and now you want to speed things up to make use of the hardware you have even more. So I think it could occur in both of those settings.
[26:22]
B
Yeah, in my mind, this is more automation than self-improvement, right? You're automating the kernel writing process and that improves the model. But where does self-improvement come in?
[26:43]
A
Yeah, so I view it as self-improvement, not automation. I view it as self-improvement because the kernel itself is what the model is running on. So it is the model's own code, and it enables the model to run even faster on the GPU. But I guess when broken up, it could be seen as just, like, oh, automation. And I honestly think everything is technically automation, but if we want to view it through a lens that's more exciting, where it feels, you know, autonomous, that is just a different perspective on it.
[27:20]
B
Yeah. But can the day come where a model will autonomously go in and rewrite its kernel?
[27:30]
A
Yeah.
[27:30]
B
To speed things up.
[27:32]
A
Yeah, I definitely think so. I think more likely it will be a different model that does it. Yeah.
[27:40]
B
An external model other than the inference model. Is that what you mean?
[27:45]
A
Yeah, or, I mean, it still could be that model, but I guess it depends on how you view what that model is, whether it's the same instance of that model or not, or a different prompt. But yeah, it will be optimized, whether by that same model or by another model.
[28:04]
B
Yeah. Are there things other than kernel optimization that you're working on? I mean, you mentioned a few things.
[28:11]
A
Yeah, so another area has been RL research, so reinforcement learning inside of post-training, and that's been really exciting to track and to work on. It is related to kernel generation since a lot of those techniques can be ported over. But more generally we've been exploring those internally for different use cases as well as for ways to actually make sure that all of this does run on AMD's hardware. The type of RL research that we're looking at includes the techniques that started with ChatGPT. So the techniques that got us ChatGPT included something called RLHF, reinforcement learning from human feedback, and that used an algorithm called PPO, a certain type of algorithm that now has, I think, found successors as well as, you know, complements, like GRPO, which came from DeepSeek earlier last year. But basically there's a huge set of research in this area to more effectively use reinforcement learning in the post-training of these language models. And what reinforcement learning really does is a couple things. One area that's really been exciting is that human feedback piece, right? So you could actually take human preferences, whether it be rankings of things or pairwise comparisons of things, of your own preferences, and you just give a lot of examples of those. You can actually teach the model to mimic those preferences, and those preferences could turn into a more helpful model, or a safer model so that it doesn't, like, say harmful things. And I think another application of reinforcement learning has been around reasoning. So if you're familiar with thinking in all these models that you're using, that's what reasoning is effectively doing, using far more tokens. The model is basically writing down its thoughts before it responds to you. And that process is done using, well, using post-training, definitely part of it being RL.
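To make the pairwise-comparison idea concrete, here is a minimal sketch of one common way such preferences are turned into a training signal: a Bradley-Terry style loss on a reward model's scores. The speaker does not say which loss is used, so treat this as an assumption; `preference_loss` is a hypothetical name, and the two arguments are whatever scalar scores a reward model assigns to the preferred and the rejected response.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the human-preferred response scores higher than
    the rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With equal scores the loss is log 2 (the model is indifferent), and it falls toward zero as the preferred response's score pulls ahead. Averaged over many human comparisons, minimizing it pushes the reward model to mimic those preferences.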
And I would say that, extended to, let's say, math or coding tasks, has become really exciting because they're verifiable, is what we say in the research world. And what that means is that with these tasks, you can verify a math proof or you can verify the code. And that becomes really interesting because then you don't need to collect that human feedback, which might take time and might be brittle because you can't cover every possible use case. Instead you have something that checks it, that can cover everything, and that provides feedback back to the model. And that's called RL with verifiable rewards, instead of RLHF, RL from human feedback. That's been a really exciting direction as well, and a direction that we've been looking into a lot, because we were getting these models to write code. And we have verifiable rewards from the profiler of our GPUs that tell us how fast these kernels are actually running on the GPU. And that's a number, and that's verifiable. It's true, it's not subjective. And so that can easily go back into the model as it's learning. There are a lot of different algorithms going in that direction to improve that process and make it more effective.
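As a rough sketch of how a profiler number could become a verifiable reward, the snippet below scores a generated kernel by its measured speedup (zero if it fails the correctness check) and then applies the group-relative normalization used by GRPO-style methods. The function names, the millisecond inputs, and the choice of speedup as the reward are assumptions for illustration, not a description of AMD's pipeline.

```python
import statistics
from typing import List


def kernel_reward(is_correct: bool, baseline_ms: float, candidate_ms: float) -> float:
    """Verifiable reward: measured speedup over the baseline, or 0 if incorrect."""
    if not is_correct or candidate_ms <= 0:
        return 0.0
    return baseline_ms / candidate_ms


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the group of samples for the same prompt.

    Population standard deviation is used here; implementations differ on
    this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, four kernels sampled for one prompt with rewards `[2.0, 0.0, 1.0, 1.0]` give a positive advantage to the 2x kernel, a negative one to the incorrect kernel, and zero to the two that merely match the baseline. No human labeling is involved at any point; the profiler and the correctness check supply everything.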
[31:44]
B
And does the kernel writing take place on the customer side, when someone is deploying a model onto AMD hardware, or any hardware, and they want to optimize the kernel? Or do you do this work at AMD and then make those optimized kernels available to people depending on the model that they're running?
[32:18]
A
It's the latter. Right now we are basically doing a lot of this internally and then making it available open source. All our kernels are open, and that's what we're doing. I think the hope is to actually get it out there. Hopefully sometime this year, we'll see. But the goal is to get it out there and make it possible for our customers, users, or any user really, to be able to leverage that very easily. Now today I think what's really effective is you can go into Cursor, you can go into any of these AI coding agent environments, and you can do some of this on your own as well. And there are companies doing this as well. Yeah, yeah.
[33:04]
B
And you're presumably also working on code assistance for use within AMD. Beyond writing kernels, are there other parts of the stack that you're focused on?
[33:21]
A
Oh, there's so many parts of the stack that need code assistance. So I mainly focus on kernels and the actual models, but I think there are so many layers of the stack. Like, we have to integrate with open source libraries so that things just work by default on AMD when a developer in one of those higher-level stacks is using us. There's a lot of code at so many different layers of the stack, which I think I honestly failed to appreciate before joining AMD, that there are so many layers even underneath the shiny model layer that everyone knows about.
[34:04]
B
Yeah, but you said you do work on models as well.
[34:08]
A
Yes, it's for the kernel generation.
[34:10]
B
Right, right, yeah. And can you give me a sense of how much time is saved, or how many man-hours are saved, by having kernel generation software like this?
[34:31]
A
I probably can't give you direct stats, but what I can share is that these kernels could take a very long time to write. A very, very complex kernel can take, you know, let's say a non-expert, but someone who might still be tasked to do it, at least months to write. An expert would take a couple weeks, but that's still a substantial amount of time. It's not like it's a, you know, couple minutes, couple hours type of task, right? So it is a more complex task than throwing up a website, and certainly more niche and therefore not as represented in pre-training data, so the models don't just by default know everything about it. But it's a highly valuable task, and one that is rare, where there's not enough expertise in any company, I think, to do it, and just very necessary and urgent today. Because if you shave off even a tiny amount of the time it takes for one matrix multiplication that occurs billions, trillions of times inside of a model, that can add up to billions, hundreds of billions of dollars for a company. And so that's a lot of money if you're doing a frontier model, of course, but even if you're not, it could save a lot of money. And I think what people don't realize is that behind a lot of these APIs, for example, like Together AI or Fireworks AI or Baseten, they are writing kernels, actually, to make these models run even faster than if you were to try to do it on your own. And they're writing a lot of those, right, to make them faster. So we're doing that. OpenAI, the frontier labs are doing that. A lot of people are doing that so that they could get far more out of their compute. Because the equivalent is, if you can 10x the speed of your kernel, that's the equivalent of buying 10x more compute. And as we've seen, buying 10x more compute is interesting for the scaling laws, for scaling these models to get to the next level of intelligence.
This is I think, actually a very ripe layer to continually optimize and actually be able to get scaling even from optimizing this existing layer.
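The "10x the kernel equals 10x the compute" equivalence can be checked with a little arithmetic. The sketch below uses Amdahl's law, which is a standard result rather than something stated in the conversation: the equivalence holds exactly only when the optimized kernel accounts for all of the runtime, and shrinks as other work takes a larger share. The function name and the example fractions are hypothetical.

```python
def effective_compute_multiplier(kernel_time_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated (Amdahl's law).

    kernel_time_fraction: share of total runtime spent in the optimized kernel (0..1).
    kernel_speedup: how many times faster that kernel became.
    """
    if not 0.0 <= kernel_time_fraction <= 1.0:
        raise ValueError("kernel_time_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - kernel_time_fraction) + kernel_time_fraction / kernel_speedup
    return 1.0 / remaining
```

With `kernel_time_fraction=1.0` and `kernel_speedup=10`, the multiplier is 10, matching the claim. If that kernel is only half of the runtime, the same 10x kernel gives roughly 1.8x overall, which is one reason the kernels worth optimizing first are the ones, like matrix multiplication, that dominate a model's runtime.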
[36:46]
B
That's interesting. Yeah, it's something that people don't think about.
[36:52]
A
Yeah, I didn't think about before either. Yeah, yeah.
[36:56]
B
And the kernels that you're writing, are they optimized for AMD hardware, or could you then take them to other hardware?
[37:09]
A
Yeah, excellent question. So it is optimized for AMD hardware. And that's because our hardware has, you know, specific things in it, so we need to optimize the kernel to specifically use all of our HBM, for example. We have higher HBM, so more memory than the other GPUs. And as a result, how do we utilize that effectively, right, for certain tasks? And so I think that's why we need to be very specific to this GPU and what we're eking out. I think there are people working on kind of more general kernels. For example, FlashAttention was a famous one. It basically takes the usual attention mechanism inside of all transformers, and they do some math to make it so that scheduling it on the GPU doesn't require as many trips to grab from memory, from storage, and as a result speeds things up significantly. And I think that is an algorithm, a kernel-type algorithm, that has now been used across the different hardware providers. So that's more of a general invention, which I think kernel generation could get to, which is very exciting. But yeah, today we're definitely very focused on making sure that these kernels can run very effectively on AMD GPUs.
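The "some math" behind FlashAttention is, at its core, an online softmax: attention is computed one block of keys and values at a time while a running maximum and running normalizer are maintained, so the full score matrix never has to be written out to memory. The pure-Python sketch below illustrates that idea for a single query vector. It is not the GPU kernel itself, and the function names and the block size are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def _score(q: Vector, k: Vector) -> float:
    """Scaled dot product between a query and a key."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Reference version: materialize every score, then apply softmax."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom for j in range(dim)]


def blockwise_attention(q: Vector, keys: List[Vector], values: List[Vector],
                        block_size: int = 2) -> Vector:
    """Same result, computed block by block with a running max and normalizer."""
    dim = len(values[0])
    running_max = -math.inf
    running_denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        scores = [_score(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_denom for a in acc]
```

Both functions return the same output up to floating-point rounding. On a GPU, the benefit comes from each block fitting in fast on-chip memory, which cuts the trips to slower memory the speaker mentions.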
[38:39]
B
Yeah. You know, there's been a lot written recently about overcapacity, or a looming overcapacity, of hardware, both chips and data centers. And as you speed up the compute using these strategies like kernel optimization, is it going to relieve some of that pressure right now for chips?
[39:16]
A
Good question.
[39:17]
B
Yeah, yeah. I mean, do you have any sense of how that impacts the macro landscape?
[39:26]
A
I think people want infinite chips, Craig. So no, it doesn't relieve the pressure. I don't think they found a plateau where they are like, we're, we're done. Like if you just 100x this, we're done. I, I don't see that right now.
[39:45]
B
Yeah, that's right. And with AMD chips, are you focused on a particular use case?
[39:55]
A
So I guess the use case is language models. What's, I guess, crazy is that that's a vertical, because people are like, what else could there be? So what else could there be? There could be, you know, computer vision models. These GPUs now are all optimized for AI, but they were originally developed for high performance computing, HPC, which includes, like, weather modeling and prediction, right? So a lot of different other tasks, and even within or adjacent to language models, there's a bunch of other types of models as well. So I would say that we're very focused on, or I'm very focused on, the AI side of things and the language model thrust, and kind of what people are developing there, which is diverse. So it includes for sure what the frontier labs are doing, but also what leading startups are doing. And some of those startups, like, I just met with Yann LeCun regarding AMI, and they're developing JEPA, and that might look a little bit different in certain ways. It might not care about that low-latency token inference that autoregressive language models like ChatGPT care about, but something else that might look different from a hardware perspective. We also care about leading AI startups and what they're doing, and making sure that all of those workloads do work very effectively on our chips and that we're very knowledgeable about them. So the latest type of diffusion models, for example, those run smoothly on AMD. And so I would say that's something that's really important. So that's kind of where the focus is. And I know it doesn't feel like a focus because everyone's focused on it, but it actually is a focus because a GPU is a fairly general type of compute.
[41:46]
B
Yeah, yeah, that's interesting. Yann LeCun, JEPA. So you were talking with him about optimizing kernels to run JEPA on AMD hardware.
[42:01]
A
We were talking more generally than that probably, but understanding what JEPA does is really important for us to optimize. We can take some of their models today and be able to actually run it on, on AMD hardware. A lot of it's open source.
[42:19]
B
Yeah, yeah, I just had Sergey Levine, or "Levin," as he pronounces it, on just before. That was Pi.
[42:29]
A
Right.
[42:30]
B
And are you working with robotic foundation models at all?
[42:37]
A
AMD definitely is. AMD definitely is. I'm personally not working directly with some of those startups, though I've chatted with some of them, but we have folks very focused on that. So I teach online. I teach about a million people online, including developers, but also executives and different professionals. And lately we built up a partnership with DeepLearning.AI and their platform to teach different courses. And we launched a post-training class on reinforcement learning and fine-tuning of these models. But actually the first module, I think, is accessible to everyone and enables everyone to just, like, double-click and get one level deeper on how these models actually learn how to behave, how to chat with us, how to act safely and add guardrails, how to hallucinate a little bit less, and how to stay more focused in a long conversation. And so I think that's, you know, something that's important. And I'm actually working on a book that matches that class too. And then the other courses that I've been preparing, kind of jointly between AMD and DeepLearning.AI, are around, like, why compute matters, and a more general course on transformers and understanding that, for a more general audience. I've actually been working at and spending time at Harvard University to produce courses as well. So this is a collaboration where I'm teaching a few fundamentals around AI for a professional audience. So that's anything ranging from vibe coding, so basically enabling more people to write code, especially people who can't write code today, and be able to build things that might help them in their professional careers, whether that be building a board deck or, like, analyzing, you know, something for M&A. And so if you don't view yourself as a software engineer, this is, like, the right class for you.
And then of course, just generally, AI for leaders and leadership, and any kind of manager who's thinking about how to change behavior within their organization with AI and understand things a little bit more fundamentally, and how to use AI in a way that's more practical. For example, if the models are producing different outputs and you find that frustrating and you find it not trustworthy, maybe we can flip the script a little bit and think about it from the other side of, okay, actually this is why they act this way, this is why they produce variation. And now that you understand why, you can actually take advantage of this, and you can run the model multiple times, or run multiple models, and be able to then take that analysis with you and use that in your day-to-day more as a tool. And so it's doing that, but like 10, 20 times in examples, and being able to just disseminate that knowledge that I know I have as an AI researcher but want more people to have when they're using this technology, so that they can view it more as a tool that's useful to them rather than something that they're a little bit worried about and don't trust. And so, yeah, that's the other kind of stuff I'm working on as well that might be helpful to your audience.
[46:01]
B
Yeah. Where would people access those courses?
[46:05]
A
Yeah, so the DeepLearning.AI courses are available for free on DeepLearning.AI. If you want a certificate, I think you do have to pay, but without that they're free. And then the Harvard ones have not been released yet, but we've been working on them.
[46:23]
B
Wow, that's exciting. Yeah.