Podcast Summary:
Reshaping Workflows with Dell Pro Max and NVIDIA RTX PRO GPUs
Episode: Cracking the Code: NVIDIA GPU Optimization with Matthew Nicely
Host: Logan Lawler (Dell Technologies AI Factory with NVIDIA)
Guest: Matthew Nicely (Product Manager, NVIDIA AI Software Platform Team)
Date: January 29, 2026
Episode Overview
This episode dives deep into the mechanics of GPU optimization and kernel authoring, spotlighting NVIDIA’s community-driven efforts and new hardware advancements. Host Logan Lawler sits down with Matthew Nicely, a product manager at NVIDIA overseeing kernel and communication library development. They discuss the real-world implications of open innovation in high-performance computing, the accessibility of kernel writing, and how tools like TensorRT and Cutlass are reshaping AI workflows for everyone from students to professionals.
Key Discussion Points & Insights
1. GPU Mode Kernel Competition – Invitation and Context
- [01:09] Matthew introduces an ongoing NVIDIA-backed kernel optimization competition, hosted via GPU Mode, focused on Blackwell GPUs and GEMM (general matrix multiply) kernel authoring.
- The competition is currently in its third stage with a fourth, most challenging phase approaching. Prizes are awarded for top-performing kernels in each phase.
- Quote:
"We're in the middle of a kernel competition with GPU Mode on NVIDIA GPUs. Specifically Blackwell... The focus is on optimizing for GEMM kernels." — Matthew Nicely [01:09]
- Participation call: Even newcomers can join and attempt the active and upcoming problems at gpumode.com.
- Timeline:
- Stage 3 ends within a week
- Stage 4 (the hardest) runs until mid- to late February
- Prior problems can be attempted for practice, but submissions are closed.
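The competition’s target, GEMM, is easy to sketch even though real submissions are GPU kernels. Below is a minimal, illustrative NumPy model of the tiling idea fast GEMM kernels exploit (working on sub-blocks that fit in fast shared memory). The function name and tile size are invented for illustration; this is not competition code.

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    # Blocked matrix multiply: the same tiling idea GPU GEMM kernels use
    # to keep sub-blocks of A and B resident in fast (shared) memory.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # Accumulate one tile-sized partial product; slicing
                # handles ragged edges when shapes aren't multiples of tile.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(64, 48).astype(np.float32)
B = np.random.rand(48, 80).astype(np.float32)
assert np.allclose(tiled_gemm(A, B, tile=16), A @ B, atol=1e-4)
```

On a GPU, each tile maps to a thread block, and the inner accumulation runs on Tensor Cores; choosing tile shapes and data-movement patterns well is much of what the competition rewards.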
2. Matthew’s Role at NVIDIA
- [03:00] As product manager on the AI Software Platform team, Matthew oversees optimization across inference and training stacks (including frameworks like PyTorch, JAX, and Megatron) and is directly responsible for kernel and communication libraries.
- Focus is on ensuring optimal performance at the lowest software levels for a wide range of users.
3. Demystifying TensorRT and Kernel Optimization
- What is TensorRT?
- [04:02] “TRT and TRT-LLM [TensorRT and TensorRT-LLM] are frameworks for optimized inference flows... It takes care of everything for you, you know, just click and run.” — Matthew Nicely
- Designed to maximize inference speeds for models running on NVIDIA GPUs.
- Deployment
- [05:31] Users must download and update TensorRT to benefit from new optimizations as models and hardware evolve.
- [06:15] Optimizations are typically available day-zero for top models when new GPUs launch.
- Analogy:
- Just as NVIDIA NIM is ready to go upon a model’s release, TensorRT stays updated to provide instant performance benefits.
4. Kernel Authoring: Fundamentals and Accessibility
- What is kernel authoring?
- [07:01] “It’s basically you as a developer... writing a kernel to do a set of operations yourself versus running an API.” — Matthew Nicely
- It allows for customized, nuanced, or high-performance GPU operations beyond standard libraries.
- Is it only for advanced users?
- [08:38] “I wouldn't say it's always advanced… It can be as simple as a few lines of code.”
- Ranges from simple “hello world”-style kernels to highly complex, multi-thousand-line optimizations that squeeze peak performance out of the hardware.
- Tools and abstraction layers (like Python DSL for Cutlass) are lowering the barrier to entry, making kernel writing more approachable to students and newcomers.
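To illustrate the “few lines of code” point, here is the per-thread mental model behind a simple kernel, emulated on the CPU in plain Python. On a real GPU the loop below is replaced by thousands of threads each running the body for its own index; `saxpy_kernel` is an invented name for illustration, not an NVIDIA API.

```python
import numpy as np

def saxpy_kernel(i, a, x, y, out):
    # The body each GPU thread would execute for its own index i:
    # a single fused multiply-add per element (SAXPY: out = a*x + y).
    out[i] = a * x[i] + y[i]

n = 8
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.empty(n, dtype=np.float32)

# CPU stand-in for the GPU launch grid: one "thread" per element.
for i in range(n):
    saxpy_kernel(i, 2.0, x, y, out)
```

DSLs like Cutlass’s Python interface or Triton let you express essentially this per-index body and have the compiler map it onto the hardware, which is why the entry-level version really is just a few lines.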
5. Real-World Example: Accelerating Polars
- [10:12] Logan references a collaboration with NVIDIA engineers where a single line of code achieved a dramatic speedup in the Polars data science library by invoking an optimized kernel.
- [10:54] “If I can promise you that I’ve done the job you want as fast as it can possibly be done...why don’t you start there? If you need something new, I give you the tools to write the kernel... If you beat me, we're happy.” — Matthew Nicely
6. Community vs. Corporate Innovation in Kernel Development
- [12:07] For open competitions like GPU Mode, most kernel optimizations come from the community, guided by tools and some closed compiler components from NVIDIA.
- Open-source contributions
- Submission processes are accessible, with contribution guidelines on GitHub.
- [13:28] “It’s as simple as writing the kernel, opening a PR... Nvidia has to keep in mind... that our library works on the CUDA platform across the ecosystem.” — Matthew Nicely
- Contributions are tested across the full hardware stack, sometimes leading to further collaboration to ensure broad compatibility.
7. A Notable Kernel Success Story: FlashAttention
- [16:48] Logan asks for standout community contributions:
- [18:00] “FlashAttention... is probably the gold standard... Tri Dao was able to take the attention kernel and basically optimize the data transfer... it's what revolutionized the industry.” — Matthew Nicely
- The power of open kernel engineering: radical improvements can come from individuals outside NVIDIA, fundamentally shifting the performance landscape for all users.
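The data-transfer optimization Matthew describes can be sketched with the online-softmax trick at the heart of FlashAttention: process K/V in blocks while maintaining running statistics, so the full n×n attention matrix is never materialized. A NumPy sketch under simplified assumptions (single head, no masking or scaling tricks beyond 1/√d; names are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n x n) score matrix: the memory-traffic
    # bottleneck FlashAttention eliminates.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def streamed_attention(Q, K, V, block=4):
    # Online softmax: visit K/V in blocks, keeping only running
    # per-row max, normalizer, and output accumulator.
    d = Q.shape[-1]
    n = K.shape[0]
    m = np.full(Q.shape[0], -np.inf)           # running row max
    l = np.zeros(Q.shape[0])                   # running normalizer
    acc = np.zeros((Q.shape[0], V.shape[1]))   # running output
    for j0 in range(0, n, block):
        S = Q @ K[j0:j0+block].T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale old stats
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        acc = acc * scale[:, None] + P @ V[j0:j0+block]
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(streamed_attention(Q, K, V), naive_attention(Q, K, V))
```

This is the “go back to old math” point: the online-softmax identity was known long before GPUs, but applying it to attention is what turned the kernel from memory-bound into something close to compute-bound.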
8. Cutlass: Making Kernel Programming Accessible
- [19:23] Cutlass, an open-source template library for programming NVIDIA Tensor Cores, is called “Matthew’s baby” and serves as a foundation for custom GPU kernels.
- Now includes Python DSL for rapid, user-friendly kernel development.
- [20:42] Designed to abstract away complexity and speed up time-to-solution for newcomers and experts alike.
- Other tools and ecosystem
- New and alternative options: CUDA C++ directly, PTX, cuTile, OpenAI Triton, JAX (Pallas/Mosaic), and more, each with unique strengths.
- [22:39] “Find the one that feels best for you, your build system, how you think, and use it. If you need more perf, come to us and we’ll put it into our software stack.” — Matthew Nicely
9. Kernel Portability and Next-gen Hardware
- [23:58] Discussion of optimizing across architectures (Grace Blackwell, GB10, GB300, server GPUs vs. RTX Pro, etc.)
- “The nuance is taking a kernel that’s been hyper-optimized for B300 and then running that on Spark. Sometimes it will just not work...”
- New libraries like cuTile aim to reduce friction and allow high performance across hardware generations with minimal rewrites.
Notable Quotes & Highlights
- On Community Power:
“Most of the [GPU Mode competition] is coming from the community... we give you the tools for you to write the optimized kernel.” — Matthew Nicely [12:07]
- On Kernel Authoring Accessibility:
“It can range from, I’d say nowadays there’s tools... middle schoolers can hack on a GPU, write a kernel...” — Matthew Nicely [08:49]
- On Open-Source Impact:
“You show the perf, you run it, we see it. Fantastic.” — Matthew Nicely [13:28]
- On FlashAttention:
"Not only revolutionizing a domain, but... this is what you had before and that's what's cool about authoring kernels... it's counterintuitive. Sometimes you have to go back to old math... you put that on a GPU and it's a hundred times faster." — Matthew Nicely [17:44]
- On Platform Philosophy:
“My goal is to make sure that Cutlass is the greatest thing next to sliced bread when it comes to perf and functionality.” — Matthew Nicely [23:14]
Timestamps for Key Segments
- 01:09 – GPU Mode competition overview and invitation
- 03:00 – Matthew's role at NVIDIA
- 04:02 – Explaining TensorRT
- 07:01 – Kernel authoring demystified
- 08:38 – Accessibility of kernel writing
- 10:12 – Real-world Polars example
- 12:07 – Community's role in kernel optimization
- 13:28 – Open-source submission and review process
- 16:48 – FlashAttention as a community success story
- 19:23 – What is Cutlass? New Python abstraction
- 22:39 – Kernel authoring ecosystem tools
- 23:58 – Optimizing for next-gen hardware and portability
Final Takeaways
- Kernel authoring and GPU optimization are more accessible than ever, with powerful open-source tools and rich community support.
- NVIDIA encourages and benefits from outside innovation, highlighting real-world instances where individual contributors have made industry-shaping advances.
- Competitions like GPU Mode foster learning, experimentation, and tangible contributions to NVIDIA’s software ecosystem.
- Tools such as TensorRT and Cutlass (now with a Python API) are bridging the gap between performance and ease-of-use, empowering users at all skill levels.
Closing & How to Get Involved
- Competition:
- gpumode.com – Participate in the ongoing Blackwell kernel optimization contest.
- Contact:
- Find Matthew Nicely on LinkedIn by searching “NVIDIA Matt Nicely”.
- Encouragement to Try Kernel Authoring:
- “We are actively working to make things easier, lower the learning curve and I would suggest giving it a shot. Use the examples and then I think you'll be an expert in no time.” — Matthew Nicely [26:41]
For further details, examples, and technical resources, visit the NVIDIA and GPU Mode documentation, GitHub repositories for Cutlass and FlashInfer, or connect with Matthew and the NVIDIA team on LinkedIn.
