OpenAI Podcast Episode 18: Why AI Needs a New Kind of Supercomputer Network

Date: May 6, 2026
Host: Andrew Mayne
Guests: Mark Handley (Core Networking Team), Greg Steinbrecher (Workload Systems Team)

Episode Overview

This episode delves into the next generation of high-performance networks powering AI supercomputers, with a focus on the recent breakthrough implementation of the multipath reliable connection (MRC) protocol. Host Andrew Mayne is joined by Mark Handley and Greg Steinbrecher from OpenAI’s infrastructure teams. Together, they unpack how MRC enables scaling and speeds up model training, explore the unique problems faced by AI cluster networking at massive scale, and discuss why and how OpenAI is releasing this technology as an open standard to benefit the broader ecosystem.

Key Discussion Points & Insights

1. The Unprecedented Scale of AI Clusters

Traditional vs. AI Networking Needs:
- Conventional datacenters were built for many independent workloads, but today's AI clusters require vast numbers of GPUs to act in tightly synchronized fashion.
  - Mark Handley: "We're talking about a lot of the world's fastest GPUs and making them all work together on a single task, which is why this stuff gets hard." [05:06]
Synchronization Amplifies Network Bottlenecks:
- In AI training, one slow GPU or a minor network hiccup can stall thousands of others because all GPUs must proceed in lockstep.
  - Greg Steinbrecher: "It's not about how fast can the average pair of GPUs talk to each other. It's always what is the absolute worst case that occurs." [09:58]
Explosive Infrastructure Growth:
- Traffic inside datacenters was already growing far faster than user-facing traffic before GPU clusters arrived, and AI workloads intensify that trend.
  - Greg Steinbrecher: "The traffic inside of their data centers was just exploding, even while the kind of amount of traffic that they were sending to end users was staying constant. And this is way before GPU clusters and AI. So this is what AI does, is it takes all of the systems challenges that people were having previously and it cranks them up to 11." [14:40]

2. The Challenge of Scaling and Reliability

Failure is Inevitable at Scale:
- As the system size increases, the mean time between failures drops rapidly, necessitating robust failure-handling protocols.
  - Mark Handley: "You have literally millions of optical links within the same building. So it's a huge scale." [14:15]
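The scaling effect can be illustrated with a back-of-the-envelope sketch. It assumes independent components with a constant failure rate, so the cluster-wide mean time between failures is the per-component figure divided by the component count. The function name and the per-link MTBF value are hypothetical and were not given in the episode.

```python
def cluster_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent components with a constant (exponential) failure
    rate, so the failure rates simply add up.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mtbf_hours / num_components


# Hypothetical figure, for illustration only: one failure per link
# every 5 million hours on average.
LINK_MTBF_HOURS = 5_000_000.0

for links in (1_000, 100_000, 2_000_000):
    hours = cluster_mtbf_hours(LINK_MTBF_HOURS, links)
    print(f"{links:>9} links -> a link failure roughly every {hours:.1f} h")
```

Under these assumed numbers, a link that almost never fails on its own still produces a failure somewhere in the building every few hours once there are millions of them.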
Datacenter Networks: The Limits of Past Approaches:
- Internet-style statistical multiplexing ("smoothing out" with large numbers) breaks down when every GPU needs to synchronize. The 100th percentile worst-case link becomes the bottleneck.
  - Greg Steinbrecher: "We are subject to the tail of the tail. We call P100 the 100th percentile statistics. And that leads to very different systems requirements." [09:58]
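A small sketch of why the worst case dominates: in a synchronized step, the step time is the maximum over all participants, and the chance that at least one of many links is slow grows quickly with the link count. The function names and probabilities below are illustrative assumptions, and links are assumed to be independent.

```python
def step_time(per_gpu_times):
    """A synchronized step finishes only when the slowest participant does."""
    return max(per_gpu_times)


def prob_step_stalled(p_slow: float, num_links: int) -> float:
    """Probability that at least one of num_links independent links is slow
    during a step, given each is slow with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** num_links


# One straggler sets the pace for everyone.
print(step_time([1.0, 1.0, 1.0, 5.0]))

# A one-in-a-million event per link becomes routine at large scale.
for n in (1_000, 100_000, 1_000_000):
    print(n, round(prob_step_stalled(1e-6, n), 4))
```

This is why average-case statistics say little about such a job: the rare event on a single link is what every GPU ends up waiting for.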

3. Multipath Reliable Connection (MRC): The Breakthrough

Fundamental Innovations:
- Load Balancing with Multipath: Traffic is "sprayed" over many network paths, balancing the load and reducing congestion.
- Packet Trimming: Instead of dropping a packet outright when the network is congested, the switch trims the payload and forwards just the header, so the receiver knows exactly which packet was lost and can request a fast, unambiguous retransmission.
  - Mark Handley: "If you do manage to cause congestion and cause loss, it's a little bit difficult to figure out whether you got loss... So the second piece of this is a technique we call packet trimming..." [15:26]
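The two ideas above can be sketched in a toy model. This is not the MRC wire format or its actual spraying policy, which the episode does not specify; round-robin spraying, the `TrimmingQueue` class, and the queue capacity are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


class TrimmingQueue:
    """Toy switch queue: when full, keep the header and discard only the payload."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = deque()
        self.headers = deque()  # trimmed headers are tiny, so unbounded in this toy

    def enqueue(self, pkt: Packet) -> None:
        if len(self.data) < self.capacity:
            self.data.append(pkt)
        else:
            # Congestion: forward a header-only copy instead of silently dropping.
            self.headers.append(Packet(pkt.seq, b"", trimmed=True))

    def drain(self):
        out = list(self.headers) + list(self.data)
        self.headers.clear()
        self.data.clear()
        return out


def spray(packets, num_paths: int):
    """Round-robin spraying (an assumed policy): packet i goes to path i % num_paths."""
    paths = [[] for _ in range(num_paths)]
    for i, pkt in enumerate(packets):
        paths[i % num_paths].append(pkt)
    return paths


def receive(arrivals):
    """Split arrivals into delivered sequence numbers and ones to re-request."""
    delivered = sorted(p.seq for p in arrivals if not p.trimmed)
    to_retransmit = sorted(p.seq for p in arrivals if p.trimmed)
    return delivered, to_retransmit


packets = [Packet(seq, b"x" * 64) for seq in range(8)]
arrivals = []
for path_packets in spray(packets, num_paths=2):
    queue = TrimmingQueue(capacity=3)  # one congested queue per path
    for pkt in path_packets:
        queue.enqueue(pkt)
    arrivals.extend(queue.drain())

print(receive(arrivals))
```

Because the trimmed header still arrives, the receiver does not have to guess from a timeout whether a packet was lost or merely delayed on another path.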
Immediate Failure Recovery:
- No need to wait for routing protocols to converge; endpoints independently detect a bad link within milliseconds and stop using it.
  - Greg Steinbrecher: "Everyone within generally speaking, milliseconds notices and just stops using that link. This is a very big deal because previously the link goes down and the whole job stops for a few seconds as we wait for the network to stabilize." [20:19]
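A minimal sketch of endpoint-driven failover, assuming a sender that tracks losses per path and retires a path after a few consecutive losses. The `PathSet` class, the consecutive-loss rule, and the threshold are hypothetical; the episode does not describe MRC's actual detection logic.

```python
class PathSet:
    """Toy sender-side view of the paths available to one connection."""

    def __init__(self, path_ids, loss_threshold: int = 3):
        self.loss_threshold = loss_threshold
        self.consecutive_losses = {p: 0 for p in path_ids}
        self.active = list(path_ids)
        self._next = 0

    def pick(self):
        """Choose the next path, cycling over the ones still considered healthy."""
        if not self.active:
            raise RuntimeError("no usable paths left")
        path = self.active[self._next % len(self.active)]
        self._next += 1
        return path

    def on_ack(self, path) -> None:
        self.consecutive_losses[path] = 0

    def on_loss(self, path) -> None:
        self.consecutive_losses[path] += 1
        too_many = self.consecutive_losses[path] >= self.loss_threshold
        if too_many and path in self.active:
            # Local decision: no routing protocol or global agreement involved.
            self.active.remove(path)


paths = PathSet([0, 1, 2, 3], loss_threshold=2)
paths.on_loss(2)
paths.on_loss(2)
print(paths.active)
print([paths.pick() for _ in range(6)])
```

The point of the sketch is that each endpoint acts on its own observations, so traffic moves off a bad path without waiting for the rest of the network to agree that it is down.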
Simplifies Network Design:
- With MRC handling failures and congestion at the edge, routing inside the network can be configured statically, reducing complexity and the number of points of failure.
  - Mark Handley: "We just decided that we would turn off the routing protocols...some paths are broken, who cares? MRC will find the broken ones that still work and keep going." [21:43]

4. Real-World Impact

Better, Faster Models for End Users:
- Reduced downtime and bottlenecks directly translate into faster iteration and release of AI capabilities and more reliable job execution.
  - Greg Steinbrecher: "You're going to get better models, more intelligent models, faster from OpenAI...We are trying to scale everywhere, including our velocity." [17:01][24:13]
Researchers Freed from Infrastructure Drudgery:
- MRC takes network worries off the research teams’ plates, letting them focus on model advances rather than low-level troubleshooting.
  - Greg Steinbrecher: "We have heard nothing but universally positive feedback about how stable the clusters with MRC are, how well they're working, how basically researchers don't have to think about this anymore." [23:55]

5. Openness, Standardization, and Industry Collaboration

Open Approach:
- MRC is being released as an open standard via the Open Compute Project (OCP), and OpenAI collaborates with Microsoft, Nvidia, Broadcom, AMD, Intel, and others on hardware and spec development.
  - Mark Handley: "We've been working with Microsoft who build our Fairwater data centers for us...and with all of these guys to actually build our hardware for our new supercomputers." [22:47]
  - Greg Steinbrecher: "I think it is a very good thing that we are open sourcing this and kind of bringing everyone along." [25:44]
Riding Industry Momentum:
- By leveraging and building atop Ethernet and standardized protocols, MRC can ride decades of global networking innovation—and bring future generations along.
  - Mark Handley: "Ethernet now is not what Ethernet was 10 years ago....And what we're doing is we're taking advantage of all that development by the whole world's networking industries." [28:45]

6. Technical Details

MRC Builds on Ethernet with Segment Routing:
- Uses IPv6 segment routing—each packet lists its own path, making switches simple “dumb” forwarders.
  - Mark Handley: "We're using a technique called IPv6 segment routing, which allows each individual packet's address to list the precise set of switches the packet goes through as it goes through the network. And that means that the switches themselves can be really dumb." [28:45]
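The source-routing idea can be sketched as a packet that carries its own hop list, with each switch doing nothing more than reading the next entry. This is a simplified toy, not the real IPv6 segment routing header layout, and the switch names are made up.

```python
from dataclasses import dataclass


@dataclass
class SourceRoutedPacket:
    segments: list      # ordered switch identifiers chosen by the sender
    payload: bytes
    next_index: int = 0


def forward(pkt: SourceRoutedPacket):
    """What a 'dumb' switch does here: read the next hop from the packet
    and advance the pointer. No routing table lookup is needed."""
    if pkt.next_index >= len(pkt.segments):
        return None  # path exhausted: the packet has reached its destination
    hop = pkt.segments[pkt.next_index]
    pkt.next_index += 1
    return hop


def trace(pkt: SourceRoutedPacket):
    """Follow the packet hop by hop until its segment list is used up."""
    hops = []
    while (hop := forward(pkt)) is not None:
        hops.append(hop)
    return hops


pkt = SourceRoutedPacket(["tor-a", "spine-3", "tor-b"], b"gradient shard")
print(trace(pkt))
```

Since the sender picks the path, switching to a different one is just a matter of writing a different segment list, which fits the static routing approach described earlier.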
Efficiency and Simplicity:
- Less hardware and energy required thanks to flatter, simpler networks, leading to cost and power savings.
  - Greg Steinbrecher: "We're able to build networks that are much flatter and basically have many fewer layers of switches and use much less power. They also cost a lot less." [30:49]

Notable Quotes & Memorable Moments

On Reliability at Scale:
- Mark Handley: "You have literally millions of optical links within the same building. So it's a huge scale." [14:15]
On MRC’s Innovation:
- Greg Steinbrecher: "This has really allowed us to remove one of the key barriers to continuing to scale." [24:13]
- Mark Handley: "The idea came out of a lot of research work that we've had over the last few decades. We're not fundamentally inventing anything new...but pulling the combination together into a set of features." [17:52]
On Openness and Industry-wide Progress:
- Andrew Mayne: "It seems like it's beneficial too, because everything is becoming very collaborative...what we're going to benefit is going to be so much better." [26:28]
On Future Limits:
- Greg Steinbrecher: "There will always be more work to do. There's fundamental limits on networks. Specifically, the speed of light is a known speed limit...But we're going to keep making each of those links faster and faster." [27:23]
On Space-Based Training:
- Mark Handley: "It's hard to envisage doing the sort of training that we do in our Stargate data centers in space. Just the latency would be a huge problem and just the background rate of failures would be a problem." [35:29]

Important Topic Timestamps

Networking for AI scale vs. web-scale [04:34-12:04]
The need for new network protocols—limitations of legacy Internet approaches [05:06-09:58]
How failures increase with scale [12:04-14:15]
How MRC works: multipath, packet trimming, and instant failure recovery [15:19-21:43]
Industry openness and hardware partnerships [22:41-25:44]
Standardization, Ethernet, and open source [25:05-28:45]
Design simplicity: segment routing, static routing [28:45-30:09]
Power efficiency, simplified networks [30:36-32:06]
Training vs. inference, edge limits, and the dream (and challenge) of space-based supercomputers [34:59-37:22]

Episode Takeaways

MRC (Multipath Reliable Connection) fundamentally changes how massive AI clusters use networks, enabling not just reliability but velocity of innovation.
As OpenAI and partners open-source and standardize these technologies, the broader AI and compute industries can avoid fragmentation and benefit collectively.
Despite enormous gains, physical laws (like the speed of light) and ongoing system complexity ensure that infrastructure innovation will be a never-ending challenge.
The heart of breakthrough AI is not just smarter models—but smarter, more robust, and more open infrastructure beneath them.

For those who want to keep up with the cutting edge of AI infrastructure, this episode offers a candid, approachable look into the practical and collaborative engineering efforts transforming AI datacenters from the inside out.

