
NVIDIA’s Matthew Nicely joins to decode kernel authoring, GPU optimizations, and how you can win big in the latest GPU Mode competition.
A
Welcome to Reshaping Workflows with Dell Pro Max PCs and Nvidia, where innovation meets real-world impact in high-performance computing.
B
Welcome back to another episode of Reshaping Workflows with Dell Pro Max and Nvidia RTX Pro GPUs.
A
Logan.
B
I'm Logan Lawler, your host. So today we have a guest from Nvidia, Matthew. Nice to meet you. Thank you for coming on the podcast today.
A
Thank you for having me, really appreciate it.
B
Of course, I mean, I've got to bring on some Nvidia folks. I mean, your name's in the title. But before we normally kick it to the guests, let them introduce themselves, we'll get into conversation. There's actually something that Matthew wants to be able to share with you that is somewhat time sensitive if you're interested and you have a little bit of time. So Matt, going to go ahead and kick it to you. Tell everyone a little bit about what's happening with the competition on GPU mode, how they can get involved, the time that is left so everyone who's listening can have a chance to go participate.
A
As Logan said, we are in the middle of a kernel competition with GPU Mode on Nvidia GPUs, specifically Blackwell, for anyone interested in getting access to those GPUs. The focus is on optimizing GEMM kernels, so the focus is on kernel authoring. There are two problems left. We're in the middle of the third one, hence the time sensitivity. We want you to go check it out, try it, go bang on the new software, new hardware, and, you know, good luck to all the contestants.
B
I love that. So, and when does the contest, when does the actual, like the third part end?
A
The third part ends, I believe, within the next week. And then we'll release the fourth problem, which is going to be a harder one, probably the hardest of the four. And it should run for, I believe, three to four weeks. I need to double check the exact dates, but the competition should end around the middle to the third week of February. If you're hearing this for the first time and you weren't aware of it, there's no reason you can't go and check out the first two problems. Give them a shot. You just can't submit and get any credit for those. But that doesn't mean you can't win the third or fourth problem. And I believe there's a prize for the top performing kernel for each problem.
B
That's correct. So I would challenge everyone who's listening to come in and just start swinging big and just take away everyone's glory and just win the last two challenges, is what I would say. If you've got the time, why not? So check it out. If you haven't, we'll put a link in the description. It's gpumode.com. Check it out, you'll see it right there. And yeah, participate, get involved, win some cool prizes and so on. Shameless plug over. So Matthew, not only are you involved with this GPU Mode contest, but you also work at Nvidia. Tell us a little bit about what you do at Nvidia.
A
I am a product manager on our AI software platform team. So at a high level our team works on optimizing the inference stack, the training stack, and frameworks. The inference stack is things such as TRT-LLM, vLLM, SGLang. On the framework side, that's, you know, PyTorch, JAX, Megatron. And then I'm over our kernel and communication libraries. So I sit on the bottom of the stack and try to make sure that they have everything they need, and then anybody outside of those verticals that wants to optimize code.
B
So you used a couple of acronyms. I think I know what they mean, but the audience may not, unless they're super technical, which of course there's probably a few. So you said TRT, so I'm assuming you're talking about TensorRT, is that right?
A
Yes.
B
Okay, so let's start right there. How would you describe TensorRT to, say, my 11-year-old daughter? Exactly what is it, what does it do, and how does it fit in the stack?
A
TRT and TRT-LLM are frameworks for optimized inference flows. So the idea is you give it input, you give it a model, and it's going to make sure that it outputs the answer as quickly as possible on Nvidia GPUs. At a high level, it takes care of everything for you, you know, just click and run. And that's the intent.
B
Perfect example. So let me give a hypothetical here. So, you know, I am running, let's say, GPT-OSS, you know, the 120 billion, whether it's locally, in a Docker container, et cetera, doesn't matter. For example, the inference, and I'm just making numbers up, complete hypotheticals, but say it's going, I don't know, 80 tokens a second. So the idea is that TensorRT would come on top as an additional product that speeds up that inference when you're running on any Blackwell GPU?
A
At a high level, yes. The idea is that when you run TRT-LLM, it's going to have optimized kernels for that particular model, for the GPU you're running on, to get the most out of it. Yes.
B
Okay, and is that something that comes, you know, like... Because with Blackwell, right, we launched at GTC last year in 2020... I'm getting my dates wrong. 2025. Is that something that comes native? Is that something that you actually have to download from, you know, build.nvidia.com? Where does that fit, you know, kind of within Nvidia? Meaning, like, is it native in the GPU? Is it something you have to download?
A
From a software point of view, you will have to download and install TRT-LLM, and you need to make sure that the model is available and that your GPU is supported. And that stuff would be provided in the documentation and the support matrix online. When you say native to the GPU, my mental model is that, you know, our goal is to support these key models on day zero, as soon as they're available in the ecosystem. So when you get the latest version of TRT-LLM, you can update TRT-LLM and you get, you know, the optimizations required for your GPU. So in the sense of GPU native, if you want the best performance of the software, you want to use the newest GPU. So we try to ship those hand in hand.
B
If I'm hearing this correctly, so basically a new model. So really a lot of what TensorRT does is really working with, it sounds like, kind of the big models, right? Whenever they're released, being able to have TensorRT ready to go, kind of like almost like a NIM in a sense, where it's ready to go, it's updated based on releases. So you'll be able to take advantage of the speedup and the increases in inference that you get by running TensorRT.
A
Absolutely.
B
Okay, so what you said, Matthew, so far makes a ton of sense, right? And I see the connection in GPU Mode and kind of kernel authoring. Right. And I'm getting a little bit out in front of my ski tips. But I've heard the term kernel authoring. First off, let's start with this at a high level: what is actual kernel authoring? What does that mean?
A
The way I think of kernel authoring, it's basically you as a developer are going to write a kernel to do a set of operations yourself versus running an API. So for example, you can go use cuBLAS and say, give me a GEMM. Here are my inputs. I want this output, and it just runs, black box. Everything works great. You move on. Kernel authoring is in the sense of: I want to do something a little more nuanced, I want to add a tweak, a modification, or maybe it's, you know, I know enough about the problem that I can remove guardrails. And if you remove guardrails, you can move size, you can remove operations, and you can go faster. Sometimes it's more that you want more performance. Sometimes it's as simple as learning.
B
You know, in this, like, for example, for me, like, I am definitely not, you know, a data scientist by trade. I'm a marketing and sales guy who's learned how to become a little more technical over time, because I've had to, right? So kernel authoring, we're talking like extremely kind of advanced. I mean, not advanced, but like a technical level of expertise to be able to go in. And for example, and I'm going to use a bad example, right? But in Ubuntu, every time I want to install something, it's always asking for my sudo password. And I understand you can remove that, which I don't really want to do, because I like to have those permissions there so I don't screw something up. So kernel authoring in a sense is having that technical ability to go in to, for example, when you were saying to speed things up, remove things, add things, optimize things to make things go faster, in a sense.
A
Exactly. The caveat that I would change there is I wouldn't say it's always advanced. Depending on what you want to do, it can be as simple as a few lines of code.
B
Okay.
A
I think the hello world, at least when I was learning GPU code, you know, parallel programming, was SAXPY. And that's like six lines of code. And you can get a huge speedup going from a CPU to a GPU. Where things get complex is as you fuse more operations, as you make that kernel, that design, more complex to squeeze every little bit out of the GPU. New GPUs come out, they have more features, the code gets more complex. For the most part, I guess there's some nuance there. So it can range from, I'd say nowadays there's tools where, you know, you can be in middle school and kind of like hack on a GPU, write a kernel, and then you can go to the extreme, where it takes 2000 lines of code and three quarters of that is to squeeze out, you know, the last 10%.
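SAXPY is "single-precision a times x plus y", the update y[i] = a * x[i] + y[i] applied to every element. The sketch below is a minimal, CPU-only illustration in plain Python of how that kernel is usually structured: one small function that handles a single index, and a launcher that calls it for every index. The names `saxpy_kernel` and `launch` are hypothetical and exist only for this example; on a real GPU the launcher's loop would be replaced by many threads running in parallel.

```python
def saxpy_kernel(i, a, x, y):
    """Work done for one element; on a GPU, one thread would handle index i."""
    y[i] = a * x[i] + y[i]


def launch(n, kernel, *args):
    """Stand-in for a GPU launch: here the n calls simply run one after another."""
    for i in range(n):
        kernel(i, *args)


a = 2.0
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

launch(len(x), saxpy_kernel, a, x, y)
print(y)  # [12.0, 24.0, 36.0, 48.0]
```

Because each index is independent of the others, the per-element function is the whole kernel, which is why SAXPY fits in a handful of lines.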
B
Okay, so with the kernels, it's really optimizing GPU performance. And let me give you kind of an example. And I don't know if I've ever talked about this on the show, but I think it was at least, I don't know, maybe, let's call it not last year, but maybe the year before, in 2024. I was working with a couple of fellows that Nvidia had set me up with, and their names escape me. They were working on GPU acceleration for Polars. And so in a sense, they were like, hey, you put in one line of code, it goes 100% faster or whatever. In essence, that is a kernel.
A
What I heard is one line of code and it goes faster. You add an API call and somebody has written an optimized kernel. They've done some kernel authoring and they've done the work for you.
B
Yeah, I mean everyone's aware of Pandas and then the Polars data science library, right? They kind of wrote code to accelerate it, to go much faster and run on an Nvidia RTX Pro GPU. Right. So really what a kernel is, if you just break it down, is being able to run whatever operation faster on a GPU by Nvidia, using CUDA, et cetera, and trying to speed that up as much as humanly possible. And that's really the essence of what it is.
A
I would say that's the intent. That's why you would spend the time, outside of learning, to write your own kernel. I say that because, you know, one of the things is we want people to have quicker time to science, quicker time to solution. Basically, if I can promise you that I've done the job that you're trying to do, I've done it as fast as it can possibly be done, and I can guarantee that it's going to work on the next GPU, my suggestion or position is, why don't you start there? If you need something new that I don't have or I can't provide, I give you the tools to go write the kernel, and if you beat me, we're happy. You're running on Nvidia GPUs, we're happy.
B
You kind of said something that comes to my next question: it being open sourced. Right. And that's kind of the whole Linux, Ubuntu world. It's open source, you know, it's way outside of a Windows world. So with it being open sourced, and, for example, you're doing the competition with GPU Mode, what percentage of, you know, optimizations to kernels or kernel authoring comes from Nvidia directly versus the community? And it doesn't have to be an exact answer. I'm just curious, does it mostly come from Nvidia or does it mostly come from the community?
A
In this case, I would say the bulk of it is coming from the community. Especially for this competition. Yes, that's, that's the intent. Like we give you the tools for you to write the optimized kernel. Now some of the compiler stuff is closed and you know, it's on us to make sure there's no performance bugs. But everything you need should be out there for you to be successful. If not, we have, we have done a poor job.
B
So kind of a question about the open source, right. So Matthew, unrelated to the competition, right, not talking about GPU Mode, let me give you kind of a hypothetical. I'm Logan Lawler. I wake up on a random Tuesday in Austin, Texas, at Dell's corporate headquarters, and I'm feeling a little spunky. I want to work on an optimization on a kernel for accelerating X, Y, Z, ABC, doesn't really matter. You'd kind of mentioned previously that, you know, once someone submits something, it has to kind of be reviewed and tested, et cetera. But in this example I'm, you know, submitting something on a kind of random Tuesday, and I'm submitting it for approval to one of Nvidia's libraries. How does that process work with your team or yourself to review, to understand like, hey, this actually works. Hey, oh my God, this is the best thing I've ever seen. Oh no, this is not going to work. Kind of talk us through that process, for someone who actually submits something, of how it works once they do hit the submit button.
A
That's a good question and, you know, I'll caveat my answer in the sense that every Nvidia library you submit to may have its own quirks on what it suggests. But a good example that I could think of is FlashInfer. It is our open source kernel library for inference kernels and, you know, we want everybody to contribute. So in this case, on the GitHub, you should find some kind of contribution guidelines on code style and so forth. Say you have a new, let's say, sampling kernel for a model you care about and you want to put it in the library. It's as simple as writing the kernel, opening a PR, submitting the kernel, and putting as much context as you can into the rationale: why, what should Nvidia be looking for, you know, API changes. And that kicks it off from there, and it should be smooth. There may be some guidelines on the kernel authoring tool, in the sense that a lot of the tools in FlashInfer would be CUDA, Cutlass, things of that nature. So, you know, it fits with the build system and so forth, but outside of that, there shouldn't be much nuance. And then again, like you said, you show the perf, you run it, we see it. Fantastic. Some of the other things to think about that you as a developer may not be aware of, that Nvidia has to keep in mind and that can slow some of these contributions down sometimes, is we're responsible to make sure that our library works on the CUDA platform, or the Nvidia platform, across the ecosystem. It's quite common. Sometimes, you know, I'm on a 5080 or I am on a Pro 6000, I write my kernel and it works great for my thing in my test. And we may do a test and say, hey, this broke in the data center, this broke here. And we'll work with you to try to get those optimizations and those fixes in, and then get it rolled into the library.
B
Okay, that makes sense. So the idea is, with this, you're trying to standardize this kernel across whether it's, you know, GeForce cards, RTX Pro cards, server cards, it works end to end.
A
That is the goal. Sometimes we'll just put a caveat, say, hey, this works here, this only works here.
B
Okay, so kind of given that example, you know, you're. How long have you been at Nvidia?
A
March of 2019. So what is that? Six years?
B
Six years. In your six years, have you ever seen a submission where you were like, oh my God, this is the best thing I've ever seen. How did I not think of this from an outside person?
A
Yes.
B
Like, which one? Tell me a story. I want to hear about one. You don't have to name names or give the specific thing, but maybe put in something like, you know, oh, we've been trying to solve this and then someone came through with something that's 90% faster or something like that. What I want to do in this question is highlight the fact that, I mean, I think everyone thinks at the end of the day, and I would agree, that Nvidia is very good at what they do. But there's also people out in the world that are very smart as well, that are outside of Nvidia, that help contribute to the ecosystem. And I kind of want to highlight an example where something came in that maybe people who are listening to this are using every single day and might not even be aware of it, where they're like, wow, I didn't know that came from a single person. I'm trying to get people excited, right?
A
No, no, that makes sense. And I'm trying to think of an example that's not FlashAttention, because that is probably the gold standard right now when we think of people writing optimized kernels on our stack. Especially, I'm the product manager for Cutlass. It's my baby. I care about it and I get excited. And that's kind of the gold standard of somebody using everything that is available in the library, in the CUDA stack, and then, you know, not only revolutionizing a domain, but basically, this is what you had before, and that's what's cool about authoring kernels and the math behind it. You basically say, okay, this is how things have always been done in the past. I'm just going to rearrange it. You know, how far can I rearrange this problem without messing up the accuracy, and then squeeze things out. And sometimes it's counterintuitive. Especially today, when a lot of our stuff is, you know, serial: you get an approximation and you move on. Sometimes you have to go back to old math where it's 10 times as much work, but you put that on a GPU and it's a hundred times faster. So let me get back to your question, into what I get excited about. So yeah, let me just dig on FlashAttention because it's easy. This is one where, you know, Tri Dao was able to take the attention kernel and basically, I don't want to butcher this, but just optimize the data transfer. You know, that's kind of the slow part of parallel computing on a GPU. The tensor cores are fast, and just moving data is your bottleneck, and you want to minimize that as much as possible. He used everything from CUDA, PTX, and Cutlass to write FlashAttention, and then he's been able to use the open source solutions to rewrite that and optimize it for each GPU since its creation. So, you know, that is the gold standard of somebody writing something brand new that nobody had seen at the time and then revolutionizing the industry.
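One concrete instance of "rearrange the math so you touch the data less" is the online softmax trick that FlashAttention-style kernels are commonly described as building on. The sketch below is a plain-Python illustration under that assumption, not the actual FlashAttention code: the function names are hypothetical, and it handles a single row of scores rather than tiles on a GPU. The reference version reads the scores several times; the online version reads each score once and rescales its running totals whenever it sees a new maximum.

```python
import math


def attention_row_reference(scores, values):
    """Textbook version: several passes over the scores."""
    m = max(scores)                             # pass 1: find the max
    exps = [math.exp(s - m) for s in scores]    # pass 2: exponentiate
    total = sum(exps)                           # pass 3: normalize
    return sum(e / total * v for e, v in zip(exps, values))


def attention_row_online(scores, values):
    """Single streaming pass with a running max, denominator, and output."""
    m = float("-inf")   # running maximum of the scores seen so far
    denom = 0.0         # running softmax denominator
    acc = 0.0           # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        rescale = math.exp(m - m_new)  # shrinks old totals when the max grows
        p = math.exp(s - m_new)
        denom = denom * rescale + p
        acc = acc * rescale + p * v
        m = m_new
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5]
values = [1.0, 2.0, 3.0, 4.0]
print(attention_row_reference(scores, values))
print(attention_row_online(scores, values))
```

Both functions compute the same softmax-weighted sum up to floating-point rounding. The online form does a little extra arithmetic per element, but it never needs the full row of scores available at once, which is the kind of trade that pays off when moving data is the bottleneck.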
B
Okay. I mean, it's, it's pretty important. I mean, that's, that's the thing, right? Like, I mean, exactly is the, the ability to, how quickly can you move data around is kind of like, is the key when it comes to parallel processing and AI and data science, etc. So I mean, that's a great example. I mean, I want to talk about this really quick and I know we've got a few minutes left, but you've referenced it a couple of times. You've said Cutlass, tell us a little bit about this. You said it was your baby. So exactly what is it?
A
Cutlass is our open source library for programming Tensor Cores. You can think of it as a set of building blocks where we've kind of abstracted away some of the nuances, and you can use those building blocks to design your kernel. You know, I guess the most popular example is a GEMM or a matmul. Two years ago, the answer would be that it is our C++ template library. But at GTC, we also released our first, you know, Python DSL to allow you to do this. Cutlass is layered in the sense of, you know, abstractions all the way down to, let's say, a simple copy operation, and then you work your way up to, you know, calling a GEMM. And then we're taking those abstraction layers and moving them to Python. You know, at a high level you should get the same performance. But the nice thing about it is I'm trying to get the team to design this new Python layer with kind of like the Python developer experience. There's, you know, a lot of hand holding. It takes a lot of the ugliness of C++ templates out of it, and it allows you to write this kernel faster and in an environment that a lot of the students that are graduating today are most familiar with.
B
Makes total sense. So within kind of Cutlass, I'm checking out kind of the GitHub page. So is it the only kind of, you know, library that's used to, you know, optimize Tensor kernels? Is it like aligned to it directly? Are there others that fit into that? I'm just curious.
A
That's a good question. There are tons, inside and outside of Nvidia. Inside of Nvidia, you know, you can use CUDA directly and PTX, you can use the Cutlass C++ or Python. There was announced and released recently cuTile, which you can think of as a tile abstraction. I would say, you know, my mental model is it fits in between library calls from cuDNN and cuBLAS, and then Cutlass. You know, I'm telling most of my Cutlass customers, like, hey, why don't you check out cuTile? This is a better starting point. If this works for you, great. You know, it's got a compiler backend, it'll handle a lot of the nuance for you. You don't have to worry about a bulk of the optimizations for the tensor core. And then again, like I said earlier, if you know better and you want to, you know, do things yourself or throw out guardrails, or maybe you need to design around a perf issue quickly until, you know, we get something fixed in the libraries, Cutlass is your go-to. And then outside of that, there are numerous products in the ecosystem which you can use to target the tensor cores. From OpenAI, Triton and Gluon. From Google, you've got the JAX ecosystem, you have Pallas and Mosaic. Again, we support all of these and want them to be successful on Nvidia GPUs. So there are a handful of, well, actually much more than a handful of kernel authoring tools in the ecosystem, and they all have their pros and cons. I wouldn't say there's...
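The layering described for Cutlass, from a simple copy operation up to a full GEMM, and the tile idea behind cuTile can be pictured with a toy CPU version. The sketch below is plain Python and does not use or imitate the real Cutlass or cuTile APIs; `copy_tile`, `tile_mma`, and `tiled_gemm` are hypothetical names chosen for this illustration. It builds a matrix multiply out of two small blocks: copying a tile, and multiply-accumulating one pair of tiles.

```python
def copy_tile(M, r0, c0, rows, cols):
    """Lowest-level block: copy a rows x cols tile out of matrix M."""
    return [row[c0:c0 + cols] for row in M[r0:r0 + rows]]


def tile_mma(acc, a_tile, b_tile):
    """Multiply-accumulate for one pair of tiles: acc += a_tile @ b_tile."""
    for i, a_row in enumerate(a_tile):
        for j in range(len(b_tile[0])):
            acc[i][j] += sum(a_row[k] * b_tile[k][j] for k in range(len(b_tile)))


def tiled_gemm(A, B, tile=2):
    """C = A @ B, computed one output tile at a time."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows, cols = min(tile, m - i0), min(tile, n - j0)
            acc = [[0.0] * cols for _ in range(rows)]  # accumulator for this tile
            for k0 in range(0, k, tile):
                depth = min(tile, k - k0)
                a_tile = copy_tile(A, i0, k0, rows, depth)
                b_tile = copy_tile(B, k0, j0, depth, cols)
                tile_mma(acc, a_tile, b_tile)
            for i in range(rows):                      # write the finished tile back
                C[i0 + i][j0:j0 + cols] = acc[i]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_gemm(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

On a GPU, the copy step would stage tiles into fast on-chip memory and the multiply-accumulate step would map to Tensor Core instructions. Choosing tile sizes and overlapping those two steps is where most of the tuning effort goes.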
B
There's not like the gold standard that's like perfect in every way physically.
A
You ask the designer and developer of that tool, they'll tell you it's perfect, you know, for what they need. I would say find the one that feels best for you, your build system, how you think, and then use it. And then, if you need more perf, come to us and we'll make sure that we can put it into our software stack. Like, my goal is to make sure that Cutlass is the greatest thing next to sliced bread when it comes to perf and functionality.
B
I love that. Greatest thing since sliced cheese is what I always say. So, last question and then we'll wrap up, because we've been on for almost 30 minutes. I mean, obviously your team is very involved with, you know, RTX cards. Do you get involved... I mean, obviously Nvidia has been in the server space a little bit longer. But I mean, just this year, or I guess six months ago and about to be, you know, with the Grace Blackwell kind of system-on-a-chip design with GB10, GB300 which is Bolton Spark. Does your team cover those optimizations, like from a kernel perspective? And I'm assuming, since the design is different and it's not like a dedicated GPU, the optimizations are fundamentally probably different. Right? Because it's a different architecture of the GPU, I would assume.
A
To some extent, yes. So yes, it is the same folks who work on Cutlass for data center or Thor. Any GPU, you know, Cutlass is required. I use Cutlass as an example, but you could talk about any Nvidia kernel authoring library. It's intended to work across the platform. The nuance is taking a kernel that's been hyper-optimized for B300 and then running that on Spark. Sometimes it will not work, especially if you use an instruction that's not compatible. And you need to go in there, you know, you'll get the error. You need to either rearrange the ISA or rewrite the kernel a little bit. I mentioned earlier, you know, we introduced cuTile. cuTile is supposed to take some of that nuance out. You write the kernel, it's going to work across all the different Tensor Core variants and be, you know, good perf. And if you can do that in, you know, 24 hours and you've got 90% of peak, a lot of customers are going to move on. Then you can dive into, you know, hyper-optimizing a kernel for B300 or Spark and go from there. But again, we try to give you as much information as we can to know what's coming when you sign up for that.
B
Okay, now that makes total sense. So we're kind of up against it real quick. I know that we did this in the beginning of the episode, but can you give everyone who had just tuned in, maybe in the middle or wasn't paying attention to the beginning? Just a quick recap of the closing out of kind of the part three and the part four of the GPU mode competition that'll be wrapping up here in the next couple of weeks.
A
To reiterate, Nvidia and GPU Mode are collaborating on a kernel competition, targeting Blackwell on four particular problems. These problems are focused on NVFP4, which is exciting because that's the, you know, kind of the new data type kind of gold standard on Nvidia GPUs. There are tons of resources out there. Go look at problem 3. Problem 4 should be out there pretty soon and give it a shot. I think it's exciting and you'll learn a little bit about the gpu, about the software stack the tools to debug and optimize. And if you find an issue, please let us know.
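For listeners new to NVFP4, the sketch below illustrates the general idea of block-scaled 4-bit floating point in plain Python. It rests on assumptions about the format as it has been publicly described: each element is a 4-bit E2M1 value (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), and each block of 16 elements shares one scale factor. The real format stores that scale in a compact 8-bit type and adds a tensor-level scale, both of which this sketch skips by keeping the scale as an ordinary Python float. The function names are hypothetical, and this is not the competition's reference code.

```python
# Magnitudes a 4-bit E2M1 value can represent (assumed from public descriptions).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of elements sharing one scale factor


def quantize_block(block):
    """Return (scale, quantized values) for one block of floats."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    quantized = []
    for v in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda c: abs(abs(v) / scale - c))
        quantized.append(-mag if v < 0 else mag)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the scale and the 4-bit values."""
    return [scale * q for q in quantized]


block = [0.0, 0.4, 1.1, -1.6, 2.2, 2.9, -3.6, 6.0] + [0.0] * 8
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(scale, q[:8])
```

The per-block scale is what lets so few bits cover a wide range of values: small-valued blocks and large-valued blocks each get to use all eight magnitudes.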
B
Perfect. So yeah, go to gpumode.com, when you log in, or when you have to log in. But once you're there, I think the one that's currently, it says it's ending on 2020, is NVFP4 dual GEMM, which I'm assuming is the one you're talking about, and it has all the information right there. So come in, use some knowledge, maybe win a cool prize. We'll see. So with that, Matthew, I really appreciate having you on the podcast. Tell everyone where they can find you on LinkedIn or, you know, any closing thoughts that you want to share.
A
So you should be able to just Google, yeah, Google "Nvidia Matt Nicely," and you should find me on LinkedIn. Feel free to reach out, tell me you heard me on the show, and send me a friend request. I guess my closing thoughts are: I know kernel authoring, especially when it comes to tensor cores, can seem daunting at first. We are actively working to make things easier and lower the learning curve, and I would suggest giving it a shot. Use the examples and then I think you'll be an expert in no time.
B
I love that. Well, Matthew, I really appreciate the time today explaining a complex, I mean, to be honest, a complex, you know, topic and making it easy for everyone to understand. So, you know, appreciate the time. And as always, you know, we try to have different guests here on Reshaping Workflows, kind of spanning the whole gamut. So today you learned about kernel authoring. If you have some time, go check it out, go to gpumode.com, look at the current project still running. You can click on, you know, the ranking, the reference submission, get all the details and give it a shot yourself, which I'm going to do. I probably won't have the skill set to do it, but that's neither here nor there. So with that, this is Logan from Reshaping Workflows. We'll see you on the next one.
A
Do what you want. Do what you want. This podcast was produced in partnership with Amaze Media Labs.
Reshaping Workflows with Dell Pro Max and NVIDIA RTX PRO GPUs
Episode: Cracking the Code: NVIDIA GPU Optimization with Matthew Nicely
Host: Logan Lawler (Dell Technologies AI Factory with NVIDIA)
Guest: Matthew Nicely (Product Manager, NVIDIA AI Software Platform Team)
Date: January 29, 2026
This episode dives deep into the mechanics of GPU optimization and kernel authoring, spotlighting NVIDIA’s community-driven efforts and new hardware advancements. Host Logan Lawler sits down with Matthew Nicely, a product manager at NVIDIA overseeing kernel and communication library development, to discuss the real-world implications of open innovation in high-performance computing, the accessibility of kernel writing, and how tools like TensorRT, Cutlass, and more are reshaping AI workflows for everyone from students to professionals.
"We're in the middle of a kernel competition with GPU Mode on NVIDIA GPUs. Specifically Blackwell... The focus is on optimizing for GEM kernels." — Matthew Nicely [01:09]
[10:12] Logan references a collaboration with NVIDIA engineers where a single line of code achieved dramatic speedup in the Polars data science library by invoking an optimized kernel.
Submission processes are accessible, with contribution guidelines on GitHub.
[13:28] “It’s as simple as writing the kernel, opening a PR... Nvidia has to keep in mind... that our library works on the CUDA platform across the ecosystem.” — Matthew Nicely
Contributions are tested across the full hardware stack, sometimes leading to further collaboration to ensure broad compatibility.
[16:48] Logan asks for standout community contributions:
New and alternative libraries: CUDA directly, PTX, cuTile, OpenAI Triton and Gluon, Google JAX with Pallas and Mosaic, and more, each with unique strengths.
[22:39] “Find the one that feels best for you, your build system, how you think, and use it. If you need more perf, come to us and we’ll put it into our software stack.” — Matthew Nicely
“Most of the [GPU Mode competition] is coming from the community... we give you the tools for you to write the optimized kernel.” — Matthew Nicely [12:07]
“It can range from, I’d say nowadays there’s tools... middle schoolers can hack on a GPU, write a kernel...” — Matthew Nicely [08:49]
“You show the perf, you run it, we see it. Fantastic.” — Matthew Nicely [13:28]
"Not only revolutionizing a domain, but... this is what you had before and that's what's cool about authoring kernels... it's counterintuitive. Sometimes you have to go back to old math... you put that on a GPU and it's a hundred times faster." — Matthew Nicely [17:44]
“My goal is to make sure that Cutlass is the greatest thing next to sliced bread when it comes to perf and functionality.” — Matthew Nicely [23:14]
For further details, examples, and technical resources, visit the NVIDIA and GPU Mode documentation, GitHub repositories for Cutlass and FlashInfer, or connect with Matthew and the NVIDIA team on LinkedIn.