Transcript
B (0:05)
Welcome to Reshaping Workflows with Dell Pro Precision and Nvidia, where innovation meets real-world impact in high-performance computing.
A (0:19)
This is Logan with Reshaping Workflows, live from GTC 2026. I'm here with Hiro and Max, both with Nvidia, talking about how to build the next generation of LLMs. So let's start with you, Max. First, tell me a little bit about you and what you do at Nvidia.
B (0:37)
I'm currently working on the AI platform software team. We support the training and inference frameworks to get better performance and accuracy on the Nvidia platform. Previously, I worked on chip design at Nvidia.
A (0:52)
Amazing, thank you very much. Hiro, what about you? What do you do at Nvidia?
C (0:56)
I'm a solution architect. I help customers solve their problems. Whatever problem the customer brings, we've got to solve it.
A (1:05)
I love it. Nvidia is always solving problems. The only problem you're not solving is the lack of hotel rooms in San Jose. That's just a joke, neither here nor there. So you two are running the demo. Tell me a little bit about what you're showing. We'll start with you, Max.
B (1:19)
Yeah. Nvidia's latest Blackwell GPUs support the NVFP4 format, so we're showing a demo of training with NVFP4 in the JAX framework. With NVFP4 we get higher throughput, and the beauty is we don't lose accuracy. We see a 27 to 40% performance gain using FP4 compared to FP8. We also save memory by using the lower precision, and in overall energy you save around 50 times compared to the Hopper generation and 210,000 times compared to Kepler, which is 12 years ago. The amazing thing is that we get these benefits in performance, memory, and energy while still preserving accuracy, and you see that across benchmarks in code, math, and multilingual tasks.

So let's see how NVFP4 works. In a 4-bit format we have only four bits to represent the sign, exponent, and mantissa, and it's very easy to pick up bias when representing numbers with that limited number of bits. We introduced some key techniques to compensate when training with NVFP4. The four key techniques used for NVFP4 pretraining are stochastic rounding, the random Hadamard transform, two-level scaling, and, for the last few layers, which are accuracy sensitive, keeping higher precision. Using those, we recover the accuracy.

Through this demo we're showing training with JAX, which is a very popular framework; some frontier labs, like xAI and Google with Gemini, use it to train their models. You can imagine you have Python code, and the XLA compiler translates that code into low-level code and optimizes it. Although the techniques look comprehensive, the usage is very easy. We publish this recipe in the JAX Toolbox, so you can access it publicly online, and to use a different quantization recipe you just change a single argument, the quantization argument. It calls Transformer Engine to get you the performance gain. So then let's look at the chart in the demo.
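To make two of the techniques Max mentions concrete, here is a small illustrative sketch of FP4 (E2M1: one sign bit, two exponent bits, one mantissa bit) quantization with stochastic rounding, written in plain NumPy. This is not the Transformer Engine implementation; the function name and structure are the author's own, and only the E2M1 value grid and the idea of unbiased rounding come from the discussion above.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_stochastic(x, rng):
    """Quantize values to the FP4 grid with stochastic rounding.

    Each value lands on one of its two neighbouring grid points with
    probability proportional to its distance from the other neighbour,
    so the rounding error is zero in expectation (this is what keeps
    gradient statistics unbiased during low-precision training).
    """
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    # Index of the grid point at or below each magnitude.
    lo_idx = np.searchsorted(FP4_GRID, mag, side="right") - 1
    hi_idx = np.minimum(lo_idx + 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[lo_idx], FP4_GRID[hi_idx]
    gap = np.where(hi > lo, hi - lo, 1.0)
    p_up = (mag - lo) / gap          # probability of rounding up
    round_up = rng.random(mag.shape) < p_up
    return sign * np.where(round_up, hi, lo)
```

For example, 1.2 sits between grid points 1.0 and 1.5 and rounds up with probability 0.4, so over many samples the quantized mean stays close to 1.2 even though no single FP4 value can represent it. In a real NVFP4 recipe this happens per scaled block, after the Hadamard transform and two-level scaling have conditioned the values.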
We collected data on GB300 nodes and compared NVFP4 and FP8. Amortized, NVFP4 shows about a 20% gain, and the loss curves overlap, which shows that NVFP4 still preserves accuracy. The interesting part of the chart is this flat section. Why is it flat? Because I reserved a fixed amount of time on the machine; FP8 couldn't complete the run within that time, but NVFP4 could. That comes from the per-step time: NVFP4 takes 0.6 seconds per step and FP8 takes 0.7 seconds, and that accumulates over 10,000 steps into the difference you see. So overall, when you have a Blackwell machine, you can try NVFP4 for your pretraining.
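The step-time arithmetic quoted here can be checked directly. Using only the numbers given in the demo (0.6 s/step for NVFP4, 0.7 s/step for FP8, 10,000 steps):

```python
# Back-of-the-envelope check of the per-step timings quoted in the demo:
# 0.6 s/step with NVFP4 vs 0.7 s/step with FP8, over 10,000 steps.
fp4_step_s = 0.6
fp8_step_s = 0.7
steps = 10_000

fp4_total_s = fp4_step_s * steps      # 6,000 s
fp8_total_s = fp8_step_s * steps      # 7,000 s

saved_min = (fp8_total_s - fp4_total_s) / 60
longer_pct = (fp8_total_s / fp4_total_s - 1) * 100

print(f"time saved over 10k steps: {saved_min:.1f} min")  # ~16.7 min
print(f"FP8 run is {longer_pct:.0f}% longer")             # ~17%
```

A roughly 17% longer FP8 run is consistent with the "about 20% gain" quoted, and explains the flat section of the chart: within the same reserved wall-clock budget, the FP8 run simply ran out of time before finishing.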
