Transcript
B (0:05)
Welcome to Reshaping Workflows with Dell Pro Precision and Nvidia, where innovation meets real-world impact in high-performance computing.
A (0:19)
This is Logan with Reshaping Workflows, live from GTC 2026. I'm here with Hiro and Max, both with Nvidia, talking about how to build the next generation of LLMs. So let's start with you, Max. First, tell me a little bit about you and what you do at Nvidia.
B (0:37)
I'm currently working on the AI platform software team. We support the training and inference frameworks to get better performance and accuracy on the Nvidia platform. Previously, I worked on chip design at Nvidia.
A (0:52)
Amazing, thank you very much. Hiro, what about you? What do you do at Nvidia?
C (0:56)
I'm a solution architect. I help customers solve their problems. Whatever problem the customer brings, we've got to solve it.
A (1:05)
I love it. Nvidia is always solving problems. The only problem you're not solving is the lack of hotel rooms in San Jose. That's just a joke, neither here nor there. So you two are running the demo. Tell me a little bit about what you're showing. We'll start with you, Max.
B (1:19)
Yeah. Nvidia's latest Blackwell GPUs support the NVFP4 format, so we're showing a demo of training with NVFP4 in the JAX framework. With NVFP4 we get higher throughput, and the beauty is we don't lose accuracy. We see a 27 to 40% performance gain using FP4 compared to FP8. We also save memory by using the lower precision, and in overall energy you save around 50 times compared to the Hopper generation and 210,000 times compared to Kepler, which is 12 years ago. The amazing thing is that we get these benefits in performance, memory, and energy while still preserving accuracy, and you see that across benchmarks in code, math, and multilingual tasks.

So let's see how NVFP4 works. In a 4-bit format we have only four bits to represent the sign, exponent, and mantissa, and it's very easy to pick up bias when representing numbers with that limited number of bits. We introduced some key techniques to compensate when training with NVFP4. The four key techniques used for NVFP4 pretraining are stochastic rounding, the random Hadamard transform, two-level scaling, and, for the last few layers, which are accuracy sensitive, keeping higher precision. Using those, we recover the accuracy.

Through this demo we're showing training with JAX, which is a very popular framework; some frontier labs, like xAI and Google with Gemini, use it to train their models. You can imagine you have Python code, and the XLA compiler translates that code into low-level code and optimizes it. Although the techniques look comprehensive, the usage is very easy. We publish this recipe in the JAX Toolbox, so you can access it publicly online, and to use a different quantization recipe you just change a single argument, the quantization argument. It calls Transformer Engine to get you the performance gain. So then let's look at the chart in the demo.
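To make two of the techniques Max mentions concrete, here is a small illustrative sketch of FP4 (E2M1: one sign bit, two exponent bits, one mantissa bit) quantization with stochastic rounding, written in plain NumPy. This is not the Transformer Engine implementation; the function name and structure are the author's own, and only the E2M1 value grid and the idea of unbiased rounding come from the discussion above.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_stochastic(x, rng):
    """Quantize values to the FP4 grid with stochastic rounding.

    Each value lands on one of its two neighbouring grid points with
    probability proportional to its distance from the other neighbour,
    so the rounding error is zero in expectation (this is what keeps
    gradient statistics unbiased during low-precision training).
    """
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    # Index of the grid point at or below each magnitude.
    lo_idx = np.searchsorted(FP4_GRID, mag, side="right") - 1
    hi_idx = np.minimum(lo_idx + 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[lo_idx], FP4_GRID[hi_idx]
    gap = np.where(hi > lo, hi - lo, 1.0)
    p_up = (mag - lo) / gap          # probability of rounding up
    round_up = rng.random(mag.shape) < p_up
    return sign * np.where(round_up, hi, lo)
```

For example, 1.2 sits between grid points 1.0 and 1.5 and rounds up with probability 0.4, so over many samples the quantized mean stays close to 1.2 even though no single FP4 value can represent it. In a real NVFP4 recipe this happens per scaled block, after the Hadamard transform and two-level scaling have conditioned the values.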
We collected data on GB300 nodes and compared NVFP4 and FP8. Amortized, NVFP4 shows about a 20% gain, and the loss curves overlap, which shows that NVFP4 still preserves accuracy. The interesting part of the chart is this flat section. Why is it flat? Because I reserved a fixed amount of time on the machine; FP8 couldn't complete the run within that time, but NVFP4 could. That comes from the per-step time: NVFP4 takes 0.6 seconds per step and FP8 takes 0.7 seconds, and that accumulates over 10,000 steps into the difference you see. So overall, when you have a Blackwell machine, you can try NVFP4 for your pretraining.
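The step-time arithmetic quoted here can be checked directly. Using only the numbers given in the demo (0.6 s/step for NVFP4, 0.7 s/step for FP8, 10,000 steps):

```python
# Back-of-the-envelope check of the per-step timings quoted in the demo:
# 0.6 s/step with NVFP4 vs 0.7 s/step with FP8, over 10,000 steps.
fp4_step_s = 0.6
fp8_step_s = 0.7
steps = 10_000

fp4_total_s = fp4_step_s * steps      # 6,000 s
fp8_total_s = fp8_step_s * steps      # 7,000 s

saved_min = (fp8_total_s - fp4_total_s) / 60
longer_pct = (fp8_total_s / fp4_total_s - 1) * 100

print(f"time saved over 10k steps: {saved_min:.1f} min")  # ~16.7 min
print(f"FP8 run is {longer_pct:.0f}% longer")             # ~17%
```

A roughly 17% longer FP8 run is consistent with the "about 20% gain" quoted, and explains the flat section of the chart: within the same reserved wall-clock budget, the FP8 run simply ran out of time before finishing.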
