
What if you could train faster, use less compute, and not lose accuracy? Live from GTC 2026, Logan sits down with NVIDIA Solution Architect Hirofumi Kobayashi and Senior Software Engineer Max Xu to break down NVIDIA’s latest leap in LLM training.
B
Welcome to Reshaping Workflows with Dell Pro Precision and NVIDIA, where innovation meets real-world impact in high-performance computing.
A
This is Logan with Reshaping Workflows, live from GTC 2026. I'm here with Hiro and Max, both with NVIDIA, talking about how to build the next generation of LLMs. So let's start with you, Max. First, tell me a little bit about you and what you do at NVIDIA.
B
I'm currently working on the AI platform software team. We support the training and inference frameworks to get better performance and accuracy on the NVIDIA platform. Previously, I worked on chip design at NVIDIA.
A
Amazing. Thank you very much. Hiro, for you: what do you do at NVIDIA?
C
I'm a solution architect. I help customers solve their problems. Whatever problem the customer brings, we've got to solve.
A
I love it. NVIDIA is always solving problems. The only problem you're not solving is the lack of hotel rooms inside San Jose. Unfortunately that's just a joke, neither here nor there. So you guys are running the demo. Tell me a little bit about what you're showing. We'll start with you, Max.
B
Yeah. NVIDIA's latest Blackwell GPUs support the NVFP4 format, so let's show the demo of training with NVFP4 on the JAX framework. With NVFP4 we get higher throughput, and the beauty is that we do not lose accuracy. We gain 27 to 40% on performance using FP4 compared to FP8. We also save memory by using the lower precision. And on energy, over the years you save 50 compared to the Hopper generation and 210,000 times compared to Kepler, which is 12 years ago. The amazing thing is that we get these performance, memory, and energy benefits and still preserve accuracy, as you can see across the benchmarks for code, math, and multilingual tasks.

So let's see how NVFP4 works. In the 4-bit format we only have 4 bits to represent the sign, exponent, and mantissa, and with so few bits to represent a number it is very easy to introduce bias. We introduced some key techniques to compensate when training with NVFP4. The four key techniques used for NVFP4 pre-training are stochastic rounding, the random Hadamard transform, two-level scaling, and, for the last few layers, which are accuracy-sensitive, keeping higher precision, which helps recover the accuracy.

Through this demo we are showing training with JAX, which is a very popular framework. Some frontier labs like xAI and Google Gemini use it to train their models. You can imagine you have Python code, and the XLA compiler translates it to low-level code and optimizes it for you. Although the techniques look complicated, the usage is very easy. We published this recipe in the JAX Toolbox, which you can access publicly online, and to use a different quantization recipe you just change a single argument, the quantization. It calls the Transformer Engine to get you the performance gain. So let's look at the chart from the demo.
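The stochastic rounding Max lists can be illustrated with a small sketch. This is not NVIDIA's implementation: it assumes the 4-bit values follow an E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), and all function names are made up for illustration. It shows why always rounding to the nearest representable value biases a number like 2.4, while stochastic rounding is correct on average.

```python
import random

# Magnitudes representable with 1 sign, 2 exponent, and 1 mantissa bit (E2M1).
# Assumption: this is the element grid behind the 4-bit format in the episode.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbors(magnitude):
    """Return the grid values just below and just above a magnitude in [0, 6]."""
    lo = max(g for g in FP4_GRID if g <= magnitude)
    hi = min(g for g in FP4_GRID if g >= magnitude)
    return lo, hi


def round_nearest(x):
    """Deterministic rounding: pick the closer grid value (ties go down here)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])  # saturate at the largest magnitude
    lo, hi = neighbors(mag)
    return sign * (lo if mag - lo <= hi - mag else hi)


def round_stochastic(x, rng):
    """Round up with probability equal to the fractional distance from the
    lower neighbor, so the expected result equals the input."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])
    lo, hi = neighbors(mag)
    if hi == lo:  # already on the grid
        return sign * lo
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)


def mean_stochastic(x, trials=20000, seed=0):
    """Average many stochastic roundings of the same input."""
    rng = random.Random(seed)
    return sum(round_stochastic(x, rng) for _ in range(trials)) / trials


print(round_nearest(2.4))    # always lands on 2.0, a systematic error of 0.4
print(mean_stochastic(2.4))  # averages out close to 2.4
```

Each individual stochastic result is still either 2.0 or 3.0. Only the average over many roundings recovers the input, which is the property that matters when many small values are accumulated during training.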
We collected data on GB300 nodes and compared FP4 and FP8. Amortized over the run, FP4 has a 20% gain, and the loss curves overlap, which shows that NVFP4 still keeps the accuracy without loss. The interesting part is where the curve goes flat. Why is it flat? Because I reserved a certain amount of time on the machine. FP8 couldn't complete the run within that time, but FP4 could. That comes from each step: NVFP4 takes 0.6 seconds and FP8 takes 0.7 seconds, and accumulated over 10k steps, that produces this difference. So overall, when you have a Blackwell machine, you can try NVFP4 for your pre-training.
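The per-step figures Max quotes can be worked through with a few lines of arithmetic. The sketch below uses only the numbers stated in the episode (0.6 s and 0.7 s per step, 10k steps). The 6,500-second time budget is a hypothetical value chosen for illustration, since the actual reservation length is not given.

```python
def total_seconds(step_ms, steps):
    """Wall-clock time in seconds for `steps` steps of `step_ms` milliseconds."""
    return step_ms * steps / 1000


def steps_within(budget_s, step_ms):
    """How many whole steps fit into a fixed time budget."""
    return int(budget_s * 1000 // step_ms)


STEPS = 10_000
NVFP4_STEP_MS = 600  # 0.6 s per step, as quoted
FP8_STEP_MS = 700    # 0.7 s per step, as quoted

nvfp4_total = total_seconds(NVFP4_STEP_MS, STEPS)
fp8_total = total_seconds(FP8_STEP_MS, STEPS)
saved_minutes = (fp8_total - nvfp4_total) / 60
throughput_gain = FP8_STEP_MS / NVFP4_STEP_MS - 1

print(f"NVFP4: {nvfp4_total:.0f} s, FP8: {fp8_total:.0f} s")
print(f"saved: {saved_minutes:.1f} min, throughput gain: {throughput_gain:.1%}")

# Hypothetical reservation between the two totals: NVFP4 finishes, FP8 does not.
BUDGET_S = 6500
print(steps_within(BUDGET_S, NVFP4_STEP_MS) >= STEPS)  # True
print(steps_within(BUDGET_S, FP8_STEP_MS) >= STEPS)    # False
```

A tenth of a second per step looks small, but over 10k steps it adds up to roughly a quarter of an hour, which is enough for one run to finish inside a fixed reservation while the other is cut off.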
A
That's amazing. So I think we've learned at GTC, I mean, we launched the Dell Pro Max with GB300, right? And we've seen frontier models being trained at full precision. So the last question for you, Hiro: I understand that there are a lot of advantages with the NVIDIA FP4 version, but in plain English, what are the two advantages for customers, either running inference or running training at FP4 versus an FP16 or an FP32?
C
So the advantage is that you use much less resource without losing accuracy. We use techniques like two-level scaling to maintain the accuracy as much as possible compared to standard FP4.
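The two-level scaling Hiro mentions can be sketched as follows. The episode does not give the details, so the specifics here are assumptions: 16-element blocks, an E2M1 value grid for the 4-bit elements, and block scales sized to fit the FP8 E4M3 range (maximum 448). For simplicity the block scales are kept as ordinary Python floats instead of being rounded to FP8, and all names are illustrative.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes (assumed)
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0  # assumed storage range for the per-block scales
BLOCK = 16            # assumed block size


def to_fp4(x):
    """Round to the nearest FP4 grid value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_MAX)
    return sign * min(FP4_GRID, key=lambda g: abs(g - mag))


def quantize_two_level(values):
    """Return (tensor_scale, block_scales, codes) for a flat list of floats."""
    amax = max(abs(v) for v in values) or 1.0
    # Level 1: one scale per tensor, so every block scale stays <= 448.
    tensor_scale = amax / (FP4_MAX * FP8_E4M3_MAX)
    block_scales, codes = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block) or 1.0
        # Level 2: one scale per block, mapping the block maximum onto 6.0.
        # A real implementation would round this to FP8; the sketch does not.
        block_scale = block_amax / FP4_MAX / tensor_scale
        block_scales.append(block_scale)
        step = block_scale * tensor_scale
        codes.extend(to_fp4(v / step) for v in block)
    return tensor_scale, block_scales, codes


def dequantize(tensor_scale, block_scales, codes):
    """Undo both scaling levels to recover approximate values."""
    return [c * block_scales[i // BLOCK] * tensor_scale
            for i, c in enumerate(codes)]


small = [0.01 * i for i in range(16)]  # magnitudes up to 0.15
large = [10.0 * i for i in range(16)]  # magnitudes up to 150
restored = dequantize(*quantize_two_level(small + large))
print(restored[:16])
```

With a single scale for the whole tensor, the value 150 would set the step size and every entry in the small block would round to zero. Giving each block its own scale lets the small block use the full 4-bit range, which is one way to read the accuracy benefit described above.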
A
I love that. Max, Hiro, I really appreciate you taking the time. Key takeaway: FP16 is yesterday's news, unfortunately. If you haven't checked it out, NVIDIA FP4 runs on the Dell Pro Max with GB10, or Spark, or the GB300, and also on the data center products. And this is Logan, and I'll see you on the next one. This podcast was produced in partnership with Amaze Media Labs.
Episode: Live from GTC: Train Smarter, Not Bigger with NVFP4
Date: March 18, 2026
Host: Logan Lawler
Guests: Max (AI Platform Software Team, NVIDIA) & Hiro (Solution Architect, NVIDIA)
Live from NVIDIA’s GTC 2026, host Logan Lawler dives into the heart of high-performance AI workflows with Max and Hiro from NVIDIA. The conversation centers on the unveiling of NVIDIA’s new FP4 format, the role of Dell Pro Precision workstations, and breakthrough advancements enabling large language model (LLM) training and inference to be smarter—not just bigger. The focus: how NVIDIA and Dell’s innovations are enabling customers to train leading-edge models more efficiently, with greater performance, less energy, and without sacrificing accuracy.
Max (NVIDIA AI Platform Software Team)
Hiro (NVIDIA Solution Architect)
Demonstrating NVFP4 on Blackwell GPUs
Performance and Efficiency Metrics
Technical Breakthroughs Making FP4 Possible
JAX Framework and Accessibility
Real-World Demo Results
Host Logan Lawler (joking about GTC):
“Nvidia is always solving problems. The only problem you're not solving is the lack of hotel rooms inside San Jose. Unfortunately that's just a joke.” ([01:05])
Max (on performance):
“We gain 27 to 40% gain on performance using FP4 comparing to FP8... and over the years you save 50% compared to Hopper generation and 210,000 times compared to Kepler which is 12 years ago.” ([01:36])
Hiro (on user benefits):
“You use much less resource but without losing the accuracy.” ([05:14])
Logan Lawler (on FP16 becoming outdated):
“FP16 is yesterday's news.” ([05:27])
The conversation is direct, technical yet approachable, and punctuated by friendly rapport and humor. Logan guides the discussion with clear questions, while Max and Hiro break down advanced topics in accessible language for a broad audience.