Transcript
A (0:00)
Hello, I'm Andrew Main, and this is the OpenAI podcast. On today's episode, we're discussing how to make supercomputers better at training models. Joining me are Mark Handley from the Core Networking team and Greg Steinbrecher from Workload Systems. They'll discuss how a breakthrough has made training more efficient so everyone gets smarter models faster.
B (0:20)
This has really allowed us to remove one of the key barriers to continuing to scale.
C (0:24)
We're talking about a lot of the world's fastest GPUs and making them all work together on a single task.
B (0:30)
We know we've won when researchers stop needing to know what network protocol this particular cluster is using.
A (0:39)
So tell me a bit about your background.
B (0:41)
I started out doing physics and math in undergrad, wanting to basically understand how complex systems work. I always liked the part of physics that's about, how do you take this thing that is unknowably complicated and build a simple model that is an utter lie but tells you something about that system, and then build your intuition on that and kind of build more complex models? I ended up doing a PhD trying to build quantum computers.
A (1:07)
Ambitious PhD.
B (1:08)
You know, little things, little things. Unfortunately, what I like is big, complicated systems. And you'll note that quantum computers don't work, and therefore they don't scale yet. They will someday, but not yet. So I took a look at the chips we were designing to control light for quantum computers, and I went, huh, that kind of looks like a network switch. What if we used this as a network switch?

What I found out pretty quickly was that academia does not know a whole lot about what real data center workloads look like. You get a whole bunch of very toy models, but they're not very informative. So I ended up pitching an industry company to get a fellowship. They paid for the last two years of my PhD, and I ended up working there for a while, building initial network hardware just to try to understand what it is that we actually need from data center networks. What I found was that there's a huge amount of headroom in conventional data center networking hardware, with lots of room for optimization. We did not need my little optical chip, and we did not need to do anything fancy like that, but there are all sorts of really fun problems there.

Then around that time, the whole AI boom started to kick off. We decided we needed to build big GPU clusters, and in particular, we needed to build networks for those GPU clusters. So I got roped in on building simulations of those so that we could figure out what to build. In the process of trying to simulate these systems, you learn a lot about how they have to work, and at some point I said, well, why don't I just go build the actual thing? So I transitioned from writing software to build simulations to writing the software that allows GPUs to communicate with each other. And then, a little over a year ago, I came here to OpenAI to do some of the same stuff, but to get even closer to the actual model training.
So the team I'm on is responsible for more or less making sure that we use the GPUs efficiently. Are the models training as quickly as they can be? Are we not bottlenecked on the network? What do we do when something fails? Are we restarting efficiently? How do we work around quirks in the hardware? And now I get to play with some of the most fun hardware in the world and try to squeeze every last ounce of performance out of it.
