
OpenAI’s Codex has already shipped hundreds of thousands of pull requests in its first months. But what is it really, and how will coding agents change the future of software?
Alexander Enbirakos
It kind of sucks to like go and write this prompt and then like wait 10 minutes. What you really want when you hire someone is to kind of tell them what the job is, give them the credentials to all the tools and just have them pick up work automatically. The goal is to get to an agent that is basically a teammate and is seeing what's going on on your team and picking stuff up for you. This form factor of an agent working on its own computer in the cloud is the future and is incredibly powerful and worth figuring out how to get right.
Podcast Host
What happens when AI stops helping you autocomplete code and starts acting like a real teammate? Today we're exploring Codex, OpenAI's coding agent. Anjaney Mitha is joined in studio by Alexander Enbirakos, who leads product for Codex at OpenAI. They discuss the origin story, why reasoning models plus tools unlock agents, how developers are actually using Codex in the wild, and what all this means for the future of software engineering, from debugging and prototyping to how CS students should think about their careers. Let's get into it.
Anjaney Mitha
Hey, Alex.
Alexander Enbirakos
Hey, how's it going?
Anjaney Mitha
Good. Thanks for coming.
Alexander Enbirakos
Yeah, good to see you again.
Anjaney Mitha
You are one of the folks working on product for Codex, which is probably one of the most exciting launches to come out of the OpenAI team, for me at least, in a while. For a lot of people, though, it was confusing for sure, because it was the fifth Codex release from OpenAI. But of course it's completely new and different from the previous Codexes. So let's just start with the origin story. What is the backstory on how the current version of Codex came to be?
Alexander Enbirakos
Yeah, and man, our naming is so fun at OpenAI. I'm excited for the naming to make more sense over time with Codex as we bring this all together. But yeah, let's go back, way back to the beginning. The first Codex product was actually released, I think it was in 2021. I might get the year wrong, but it was a code completion model that powered GitHub Copilot. And so recently we were basically talking about a whole bunch of coding stuff we want to do, like models, but models in product. We were thinking about what to call it, and we just felt like the Codex name was really cool, and so we wanted to go back to it. So how did this Codex product come about? Basically, we've been thinking a lot about agents, as everyone has, and before that we've been thinking about reasoning models. In our minds, one way you could think about an agent is you take a reasoning model, and then you give that reasoning model access to the tools that some agent would want to use, or some human in a given function would want to use, and an environment those tools work in and can take side effects in. And then from there you come up with what kind of tasks this person would do. So basically you have this model, you give it tools, and then you make sure that the model is really good at doing the specific tasks that some function would do. And the task bit is actually super important, because there's a difference between writing and journalism. And similarly there's a difference between coding and software engineering. So we've been doing a lot of this tinkering with reasoning models internally, getting them to write code. And so the first tool we'd given them was terminals. And we'd been poking at this for a while. One of the first real feel-the-AGI moments for me was when someone showed me a website editing itself by being prompted to edit itself.
Because we had this reasoning model, basically very hackily connected to a terminal. And then, you know, it was editing through this terminal.
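The framing above (a reasoning model, plus tools, plus an environment where those tools take side effects) can be sketched in a few lines. This is only an illustration under stated assumptions: `scripted_model` is a hard-coded stand-in for a reasoning model, the "environment" is an in-memory dictionary rather than a real terminal, and none of the names below are OpenAI's API or Codex's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Environment:
    """Toy sandbox: tool side effects land here, not on the user's machine."""
    files: Dict[str, str] = field(default_factory=dict)


def read_file(env: Environment, path: str) -> str:
    return env.files.get(path, "")


def write_file(env: Environment, path: str, content: str) -> str:
    env.files[path] = content  # the side effect
    return f"wrote {path}"


# Hypothetical tool registry the "model" is allowed to call.
TOOLS: Dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "write_file": write_file,
}

Action = Tuple[str, tuple]  # (tool name, arguments); ("done", ()) ends the task


def scripted_model(task: str, history: List[str]) -> Action:
    """Stand-in for a reasoning model: picks the next action from the history."""
    if not history:
        return ("read_file", ("App.jsx",))
    if len(history) == 1:
        # "Sight reading": edit the source text directly, no screenshots.
        return ("write_file", ("App.jsx", history[0].replace("Hello", task)))
    return ("done", ())


def run_agent(task: str, env: Environment,
              model: Callable[[str, List[str]], Action] = scripted_model,
              max_steps: int = 10) -> List[str]:
    """Loop: ask the model for an action, run the tool, feed the result back."""
    history: List[str] = []
    for _ in range(max_steps):
        name, args = model(task, history)
        if name == "done":
            break
        history.append(TOOLS[name](env, *args))
    return history


env = Environment({"App.jsx": "<h1>Hello</h1>"})
log = run_agent("Goodbye", env)
print(env.files["App.jsx"])
```

Swapping `scripted_model` for a real model call and `Environment` for a sandboxed container changes the plumbing, not the shape of the loop.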
Anjaney Mitha
It was just editing the DOM basically directly, as a CLI.
Alexander Enbirakos
Yeah, exactly.
Anjaney Mitha
Okay, well.
Alexander Enbirakos
And that wasn't the DOM directly, it was React. But, like, whatever, you know.
Anjaney Mitha
How was it parsing the visual? Did you give it access to a browser?
Alexander Enbirakos
No, it was like, I like to use this term, like sight reading. It was just like sight reading the code. So it wasn't like taking screenshots of itself or any of this like stuff that now like people are building.
Anjaney Mitha
Okay.
Alexander Enbirakos
It was just editing React. And so we had this prototype a while ago, and people internally really loved it. So we were starting to write more and more code, and then we were starting to think about, okay, well, what is the right form factor for this thing when it's editing code? It's pretty great on my computer. But it's quite annoying to only have it able to work on one thing at a time. It's also a giant safety and security question if you just have this agent unleashed entirely on your computer. And so around this time we started exploring a lot of different places to put this reasoning model that has access to a terminal. We had a prototype that ran in CI when your tests failed. We had a prototype that, through some crazy hack, automatically fixed your Linear issues, but that was actually running in CI. We had this prototype that was running on your computer. And so basically the Codex product we launched was a distillation of that, where we thought, okay, well, what is the most powerful incarnation of this? And we figured, you know, if you think about what an agentic teammate will be like in the future, you'll hire them, you'll tell them what their job is, give them some compute or a laptop and give them some permissions, and then they'll go off and do work. And so we figured, okay, this is going to be kind of like a strange, unwieldy research preview, but let's put all, or the vast majority, of our effort into this form factor of an agent working remotely and kind of see what happens. And so that led to the Codex product we released, which is a cloud agent that can, you know, basically answer questions and write PRs in the background.
Anjaney Mitha
And what was the reason that you guys picked that? You know, it's pretty opinionated in the entry point to the task, which is that you have to start by first getting your entire environment set up, and then it interacts with a repo through a merged PR.
Alexander Enbirakos
Yeah, right.
Anjaney Mitha
And we were chatting about this briefly, but somebody published a dashboard maybe a week ago, you know, kind of tracking PR merge success rates on GitHub across different autonomous agents. And Codex is clearly the gold standard at this 80-plus percent rate. Why is that? Why did you guys decide to have the place where the PR starts be after a bunch of in-private working through the code, versus much earlier, where you could just start a draft PR and have other people work on it together with you much earlier in the process?
Alexander Enbirakos
Yeah, so, you know, you and I were talking about this chart that someone posted on Hacker News that went viral and was basically showing the number of open PRs and merged PRs.
Additional Guest or Moderator
Right.
Alexander Enbirakos
From different coding agents, as you might track from GitHub labels. And Codex, actually, I checked this morning because I figured we might talk about it, and Codex has opened like 400k PRs since launch.
Anjaney Mitha
In like 34 days.
Alexander Enbirakos
Yeah, however many days it's been. Yeah, probably.
Additional Guest or Moderator
Yeah.
Alexander Enbirakos
And it's merged like 350-something thousand, or 350k of those PRs have been merged, which is really cool. And also very cool, but misleading, I'll say, is that the merge rate for Codex PRs is like 80-something percent.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So, assuming a PR is opened with a Codex label, if you look in GitHub open source repos later, is it merged in? And it's way higher than other agents, which are at like 20 or 30%.
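For reference, the merge rate being discussed is just merged PRs divided by opened PRs. A quick check, using the approximate figures quoted in the conversation (both are round numbers given from memory, not exact counts):

```python
def merge_rate(merged: int, opened: int) -> float:
    """Fraction of opened PRs that were later merged."""
    if opened <= 0:
        raise ValueError("opened must be positive")
    return merged / opened


opened_prs = 400_000  # "like 400k", approximate, as quoted
merged_prs = 350_000  # "350k of those", approximate, as quoted

rate = merge_rate(merged_prs, opened_prs)
print(f"{rate:.1%}")  # prints 87.5%
```

That is consistent with the "80-something percent" figure. The number only counts PRs that were actually opened, which is why when in the workflow a tool opens its PR matters for any comparison.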
Additional Guest or Moderator
Right.
Alexander Enbirakos
So, yeah, just to talk about this, this chart is really a reflection of the form factor. So I will say it makes us look really good. It makes us look like the order-of-magnitude winner. And we are, of a specific kind of agent, which is this cloud agent that's working on its own computer independently from you and therefore can do many tasks in parallel and so forth. So we believe that's where the future is going. I'm sure we'll talk about that. And it looks like right now we're absolutely winning there. But just to mention, probably the most used AI coding feature right now is just autocomplete and tab completion. Obviously that's not getting a label when someone merges a PR. So I think it's worth mentioning there's a whole bunch of other great.
Anjaney Mitha
That's like essentially invisible work happening in an IDE. That's just a different form factor.
Alexander Enbirakos
Yes, that's a different thing. Right. So that's not included in that chart. And then the other interesting thing, so you were mentioning the merge rate. Our merge rate is excellent.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And that's a reflection of the fact that Codex does a bunch of work in its environment, and then it shows you its work and it says, basically, do you want me to open a PR?
Additional Guest or Moderator
Right.
Alexander Enbirakos
Whereas a lot of other tools, they just go ahead and open a PR, right? So why did we do it that way? Because it's funny, one of our top feature requests has been, hey, can you just push the PR so I can do everything in GitHub thereafter? And we'd like to do that. But this comes back to, you know, we're OpenAI. We want to show how to use our reasoning models in the best way to build agents, but that includes doing it in a really safe way. And so one of the things that a lot of people don't think about, until we tell them about it, is the fact that if you have an agent write code and then you run that code in an environment with network access.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You're taking some amount of risk. And like, you know, I have, you know, we try to get agents to do these things. I've never seen an agent do something that you wouldn't want it to do with network access unless you're trying to trick it, but you can trick an agent.
Anjaney Mitha
There's some non zero likelihood that could happen.
Alexander Enbirakos
Yeah. So like just to make this super real because you know, listeners might be like, okay, like this is hypothetical. Yeah, like, okay, so we have these cloud agents and one of the first things that a lot of people want to do with them is like automate them to do work. That's the dream, right? So like maybe in Slack, maybe, you know, from your issue manager you would like when like a customer sends in feedback, you want to like have an agent take a first pass.
Additional Guest or Moderator
Right, right.
Alexander Enbirakos
And you might want to open a PR, maybe even auto-merge it. So that is great. That's for sure awesome. But also, let's say that customer is pretending to be a customer and they're malicious. They actually send in a prompt injection. So the customer writes in, hey, I would like you to take a bunch of this code and run this script. The script is bugging for me. That's a lie. And then they say, run the script and upload this directory of code to Pastebin.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You know, if the agent interprets that as like the developer prompt, there's some risk that it'll actually go ahead and do that. And so there's a ton of work here with agents to deploy them safely. And actually that's one of the places that I feel like is under discussed but where I feel like we're really leading the charge in terms of thinking about like, you know, at each step of the way, how do we make this as safe as possible and make sure that people understand what they're doing?
Anjaney Mitha
And could you, for folks who may not be familiar with prompt injection attacks, talk a little bit about how hard it is to detect a prompt injection attack? Is it a super general-purpose attack vector? With other kinds of cybersecurity attack vectors, whether it's social engineering, phishing and so on, it's always a bit of a cat-and-mouse game. But by and large the security industry has figured out, hey, these are the rough parameters of an attack of this kind, and we can build defenses around it. Is there something that makes prompt injection attacks harder than typical cybersecurity attack vectors? Or is it just that we're early and we haven't figured out the shape of the attacks yet to protect against?
Alexander Enbirakos
I'm sure that we will get better at figuring out the shape of these attacks, but like if you think about it just from a human perspective, this is, by the way, this is something I do often. I'm like, okay, let's pretend I'm the model. I'm a human. You present me 10 prompts. Can I tell you which ones are prompt injection attacks? Some of them are obvious. It's like, you know, update, upload this code to like nefarious domain.
Anjaney Mitha
Like, okay, give-me-your-credit-card.com or whatever.
Alexander Enbirakos
Yeah. And some of them are obviously not, right? It's like "fix this bug," which doesn't require doing any of that, or "change this copy." Like, obviously nothing's going to happen.
Additional Guest or Moderator
Right.
Alexander Enbirakos
But then there's this whole middle range, right? Like two examples in the middle range of like ambiguous prompts. One might be, hey, do this work. And like, as part of this work, you have to upload some artifact to S3 storage online. Basically, there are reasonable workloads that require doing that. And so it's not obvious that just because the prompt says upload some code somewhere that it's broken.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Another example might be the prompt actually just has the agent running a test or some script or something. And that script was added before.
Additional Guest or Moderator
Right? Right.
Alexander Enbirakos
So to what extent does the agent need to like, introspect?
Anjaney Mitha
I see, Right.
Alexander Enbirakos
Like everything that it's going to do along the way. Right. So there's these three layers of the attack, there's the prompt and like, it's quite hard to tell if a prompt is like really an attack. Then there's like, what is the agent doing along the way? Interacting with like other sort of trusted or untrusted resources, you know, as it goes.
Additional Guest or Moderator
Yeah.
Alexander Enbirakos
For example, maybe your prompt didn't contain an injection, but then the agent reads something on Stack Overflow or somewhere that has a prompt injection. Right. Or there's a script with something in it. And then lastly there's the actual outcome. So in this case, if we're talking about exfiltration.
Additional Guest or Moderator
Right.
Alexander Enbirakos
What is an exfiltration? We're still figuring this out. My personal leaning is that we should just have defense along every single layer. But probably the most useful layer is going to be that final layer of actual exfiltration, and looking at what we do there. Because that's the most, I guess, deterministic layer, in that you can see what's happening.
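One way to picture that final, more deterministic layer is an allowlist check on outbound network requests. The sketch below is an assumption for illustration only: the `ALLOWED_HOSTS` set and the `egress_allowed` helper are hypothetical and say nothing about how Codex actually monitors egress.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would configure this per environment.
ALLOWED_HOSTS = {"github.com", "pypi.org"}


def egress_allowed(url: str, allowed_hosts: set = ALLOWED_HOSTS) -> bool:
    """Return True only if the URL's host is an allowed host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


for url in [
    "https://github.com/example/repo.git",   # allowed host
    "https://files.pypi.org/packages/x",     # subdomain of an allowed host
    "https://pastebin.com/api/upload",       # not on the list: blocked
    "https://evilgithub.com/steal",          # look-alike domain: blocked
]:
    print(url, egress_allowed(url))
```

A check like this does not decide whether a prompt was malicious. It only constrains where data can go, which is also why the ambiguous case mentioned earlier, a legitimate upload to cloud storage, still needs a human decision about what belongs on the list.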
Anjaney Mitha
So the tension here is that a critic might say, hey, you guys have overinflated merge success rates because the draft PR comes so late, after the human has already reviewed a bunch of the code, and what you give up is the transparency and openness of seeing the process of iterating on the draft PR from the first one to the final merged one. But I guess what you're pointing out is, yes, the trade-off is you get much more security, essentially. And so, in your mind, is that the future for a lot of the code that's written by AI agents over time? Let's say, you said there's 350,000 or so merged PRs now in 35 days. If we roll forward to the end of this year, do you think that rate of growth continues? Does it plateau because more and more people actually want to move the draft PR process earlier in the merge flow? Or do you actually think, having seen how customers have been using it for the first 35 days, that roughly this is the shape of the workflow, that people are going to want to just do merges right at the end, after they've gone through all the security checks and so on internally?
Alexander Enbirakos
Yeah. I mean, so first off, yeah, I think what I would say about the stat is it's really cool, just not comparable to the other ones.
Additional Guest or Moderator
Right, right.
Alexander Enbirakos
But, you know, it's still a valid stat. It's just a different phase of the pipeline. But thinking about what the shape of the journey is, I think the shape of how people will merge code, even with these cloud agents, is going to completely change. Okay, so let's talk about where we're at right now. You could kind of think of it as a spectrum, maybe there's like three things. Right. There's interactive coding, which is like tab completion, chat, that kind of stuff. You know, Command-K. A lot of that's being done in the IDE. There's some CLI tools where you can go back and forth with an agent. So that's interactive coding. It's awesome. That's probably where most people are adopting AI right now. And it's because, if you think about it, tab completion with an AI model is the same as tab completion before an AI model. So you can get fully brought along the journey. I guess what I'm saying is it's not going away, I don't think. Because I think even as the majority of code, let's say code at the current level of abstraction. Okay, let me unpack that a bit. So if you think about it, we used to write punch cards, basically. And then we had assembly, and then we had C, and now we have Python and JavaScript and so forth. Right. So we just keep rising up the level of abstraction. And one way of looking at what's happening now is that we're just going to go up one more level. So my view is that we'll still have developers spending a bunch of time in the IDE, just operating at higher levels of abstraction. And so when a developer is doing work, writing whatever it is that they're writing or communicating in whatever way, there'll still be AI features just helping accelerate every keystroke that developer is doing. Those will still be awesome. So that's interactive coding.
Then we have sort of agents, I guess. And then the fun part, maybe later, naming TBD, maybe we'll have interactive agents. So, okay, we'll get into that. It's not a fully baked idea. But basically then we can talk about agents. How will we work with agents? My view is that over time the majority of code written will be written by agents. And actually the majority of that code will not be manually prompted by a human.
Anjaney Mitha
Some automated pipeline, basically.
Alexander Enbirakos
Yeah. Because it kind of sucks to go and write this prompt and then wait 10 minutes and during those 10 minutes.
Anjaney Mitha
Or say three minutes, push ups or whatever.
Alexander Enbirakos
Yeah. Our average duration of a rollout is around three minutes or a little under it. For larger code bases like ours, it's longer, it's maybe eight or something. But it kind of sucks to have to multitask across these things. And the power users of Codex have built this amazing workflow that they use where they're like juggling tasks. We could talk about how people are using it, but this isn't great. In my opinion, what you really want when you hire someone like a teammate is to kind of tell them what the job is, give them the credentials to all the tools and just have them pick up work automatically and kind of let you know when it's done so you're not feeling that latency on your own time.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So if we go back to this original point of when people will merge PRs, I think what I would love to see is where agents are picking up work and they're kind of deciding whether or not it's worth pushing a PR, maybe to trigger CI. But by the time you find out about it, they're like, hey, I did this thing. Maybe I asked you for some input along the way. CI checks are green. Should we merge it? So we have to build our way.
Anjaney Mitha
So it's the classic green light. And then over time, ideally most of the lower-order-bit tasks are just getting merged automatically. And then when there's some judgment call, they come to you the way a more junior engineer would come to you as an engineering manager and say, it's looking good, but here's some risk. Are you comfortable with that risk? And then you give the thumbs up or thumbs down. Is that roughly where you think we're going?
Alexander Enbirakos
Yeah, I think so. Actually, you know, we've been talking basically about codegen this entire conversation so far. And okay, so codegen is getting much easier. Is code review getting much easier? Because code review is still a key thing in validation.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And I think right now we're in this slightly awkward phase, or we're entering an awkward phase, where we have a lot of codegen and a lot of that code is actually not going to be merged. For the other tools, you see it in their PR merge rate. For our tool, you would actually see it in the internal stat of what percentage of the time a PR is created from a rollout. And so there's vastly more code to review and land. And, yeah, so it's awkward right now, but this is something we're definitely thinking about, and I'm quite hopeful for the future in that I think we can make it even better for the humans involved. Because no one likes reviewing code.
Additional Guest or Moderator
Right.
Anjaney Mitha
Actually, let's take a bit of a detour, since it's been 35 days. What are people doing with it? What have you observed as usage patterns now that it's out in the wild? And what surprised you most? And then I want to talk about whether the usage patterns are more fun or not for people. Because there was a moment, I think, in the first live stream you guys did around the product where one of your colleagues said, you know, my job has changed, where I'm going from writing a lot of code to mostly reviewing PRs now. And I heard that and I went, oh, my God, that was the worst part of when I was an engineer. That was the part I hated the most. I was at an offsite for a startup about a month and a half ago where literally we ended up spending 45 minutes talking about how to incentivize people on the team to review PRs more. They're just sitting in the tray because nobody loves checking somebody else's code. It's just not a very creative task. But let's start with, first, how are people using it? And what surprised you most, especially as a product person, about how they're using it versus how you expected them to use it?
Alexander Enbirakos
Yeah, for sure. So it was really interesting building towards launch, where we used it internally and figured out how to use it. And then what we found is that when we gave it to people externally, at first they didn't know how to use it the way we did, and they didn't find it useful. And then we obviously refined our messaging in the product, and then when we actually launched it, people still used it differently from us, but they do find it useful. So we can go through that journey. Right. So internally, I think because we've spent a lot of time working with reasoning models and training them, we have this way of prompting reasoning models that is intuitive to most OpenAI employees.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Like, you write a pretty good prompt, you give it a lot of information. It's kind of like a self-contained unit. It's almost like a SWE-bench task, but obviously maybe not as well formed as that.
Anjaney Mitha
Give it all the right context up front.
Alexander Enbirakos
Yeah. And then it goes and works, and you generally maybe don't go multi-turn, where it gives you something and you reply. Maybe you're more likely to just reprompt: adjust your prompt and go again.
Anjaney Mitha
Just do a best-of-N, essentially.
Alexander Enbirakos
Yeah. And actually there's an analogy I love floating around from another company that builds agents, and it was like, treat it like a slot machine. And I was like, oh, that's so apt, because that's pretty much our intuition too.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So if you're treating something like a slot machine, then the question is, when do you use it? And when we first ran a small external alpha, people were using it like the local agent they have in their IDE, which is actually not the right way to use it.
Additional Guest or Moderator
Right.
Alexander Enbirakos
If something's going to work in your IDE, you're kind of lending it your computer for a while. So you probably want to be really thoughtful about, do I think this task is going to succeed? And if I'm 80% sure it'll succeed, then I could get it to go. But maybe I also have some expectation of interactivity so we can kind of refine along the way. The way to use an agent in the cloud is just throw everything at it. It doesn't matter if it's just like.
Anjaney Mitha
Spam as many as possible.
Alexander Enbirakos
Yeah, it's like abundance mindset, you know.
Anjaney Mitha
Slot machine, somebody else's compute.
Alexander Enbirakos
Right? Yeah, okay. Throw stuff at it. And also, you know, you don't need to have the code on your computer, or decide to merge that code, to get value. You could just be asking questions. You could be like, hey, explore this four different ways so I can pick the right way that I then want to do it. You can almost treat it as your to-do list of things that you will get to later in the day. So some of the learnings we had when we ran the alpha were, hey, we need to change the product so that it feels like parallelization is a key part of how to use it, and make it so you let go of what it's doing. Okay. So then we shipped broadly externally and we got a bunch of feedback that we expected, like, hey, the containers don't have network access. This is really annoying.
Additional Guest or Moderator
Right?
Alexander Enbirakos
Which it is.
Anjaney Mitha
Environment variables are hard to set up.
Alexander Enbirakos
Environments are very hard to set up, which they are. Right. And obviously we have many ideas. We had ideas for how to enable network access. We just wanted to do that carefully. And then on the environment setup stuff, we have ideas that we haven't shaped yet on how to make that better, like a simple model loop to help write it and so forth. But we just cut scope and shipped the really early research preview. So there's a bunch of that expected feedback. Now, one of the things that really surprised me is that there was one feature that we didn't expect people to use, and in fact we used it so little internally that it just had a bunch of bugs we hadn't caught before releasing. And that was multi-turn. So basically, like I was saying, and we told our alpha users, I guess, to do this, we basically said, hey, just reprompt, fire many prompts, and maybe you can go back and forth. It turns out that if you go back and forth more than once, so you do like three turns total.
Additional Guest or Moderator
Right.
Alexander Enbirakos
The product was completely broken, in that we were not correctly carrying over the diffs from the prior steps.
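To make the class of bug concrete: if each turn starts from the base repository, every earlier diff has to be reapplied before the agent continues. The sketch below is purely hypothetical. The conversation does not say what the actual defect was, and the names here (`start_turn_buggy`, `start_turn_fixed`) and the dictionary-based "diffs" are illustrative assumptions, not Codex's implementation.

```python
from typing import Dict, List

Repo = Dict[str, str]  # path -> file contents
Diff = Dict[str, str]  # path -> new contents (a stand-in for a real patch)


def apply_diffs(base: Repo, diffs: List[Diff]) -> Repo:
    """Apply diffs in order on top of a copy of the base repo."""
    repo = dict(base)
    for diff in diffs:
        repo.update(diff)
    return repo


def start_turn_buggy(base: Repo, prior_diffs: List[Diff]) -> Repo:
    # Hypothetical bug: only the most recent diff is carried over.
    return apply_diffs(base, prior_diffs[-1:])


def start_turn_fixed(base: Repo, prior_diffs: List[Diff]) -> Repo:
    # Carry over every diff from every prior turn.
    return apply_diffs(base, prior_diffs)


base = {"a.py": "v0", "b.py": "v0"}
turn1 = {"a.py": "v1"}  # diff produced by the first turn
turn2 = {"b.py": "v2"}  # diff produced by the second turn

# Starting the second turn: one prior diff, so both versions agree.
print(start_turn_buggy(base, [turn1]) == start_turn_fixed(base, [turn1]))
# Starting the third turn: the buggy version has lost the first turn's edit.
print(start_turn_buggy(base, [turn1, turn2]))
print(start_turn_fixed(base, [turn1, turn2]))
```

A defect of this shape stays invisible to anyone who stops after two turns, which fits the observation that internal usage never surfaced it.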
Anjaney Mitha
Ah, so there's just a lack of context, persistent context, essentially after the third turn.
Alexander Enbirakos
Exactly. And this is just a plain old deterministic bug. It's not a weird model behavior thing. It's just that we implemented the code wrong. Because no one ever.
Anjaney Mitha
Nobody just got to turn four, basically.
Alexander Enbirakos
Exactly.
Anjaney Mitha
Yeah.
Additional Guest or Moderator
Yeah.
Alexander Enbirakos
And so for me that was really interesting to see that people had this intuition for how they wanted to use the product. And that wasn't the reprompt intuition, it was the hey, I'm going to get this main thing. And then I kind of want to babysit that across the way to actually landing it without it ever touching my computer. And we kind of knew that might be a thing, but it was much more of a thing than we expected.
Anjaney Mitha
And do you think that's basically because internally OpenAI employees are sophisticated enough to know that you do all this upfront context-building work for the agent, to try to get as much as you can in the first turn, but for a user, once you've made it fully cloud connected, the marginal cost of kicking off an agent was so low that they just quickly got to the third, fourth turn without too much thinking?
Alexander Enbirakos
It's funny, you know, I almost feel like in a way we're like less sophisticated because we understand too much about like the models or something.
Anjaney Mitha
Like your expectations are lower than the average.
Alexander Enbirakos
Yeah, because we're like, oh, you know, this is a reasoning model, like works great like especially when you like prompt it in this way.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And then, you know, folks outside OpenAI are just like, why does it matter? This is how I want to use it. Obviously it's not AGI, but it's this super smart model. You wrote this amazing PR, I just want you to change one thing. Why can't you do it, right? And so, obviously the bug that I mentioned we fixed, but that's something we're now thinking more about. Like, okay, how do we enable that kind of multi-turn interaction? How do we make it faster as well? Container startup, just for example, takes time, and there's a lot of optimization we can do. But for now, if you need to incur a full container startup to change one variable name, that's super frustrating. So there's a bunch of things like that that we want to improve around that iteration loop.
Anjaney Mitha
Do you think that is the arc of product development of agents? Do you think the shape of the industry will be more and more Apple-esque, where you'd go, well, cold starts are a problem for the containers, because that's a really terrible user experience, so instead of outsourcing containers to some third-party vendor who we're then reliant on for cold starts, we're just going to bring this all in house? Is the most magical experience going to be a full-stack, end-to-end integrated experience where all the dependencies, all the middleware, is done in house? Or do you think that this is going to be more Android-esque, where a company like OpenAI has an opinionated experience and owns the agent interface, but everything else is mostly a collection of different tools orchestrated by different vendors?
Alexander Enbirakos
That's a great question. I think it's going to be a bit of both. Maybe an annoying answer.
Anjaney Mitha
Or where do you think the line is? Where would you build versus buy?
Additional Guest or Moderator
Right?
Alexander Enbirakos
Yeah, no, totally. So I think it's actually more like for whom or who will use what. I think that the average user or maybe the new startup that is building with agents from scratch will just do things in a very different way and they'll basically have a bunch of agents with this compute environment that scales really well, that has like all the credentials they need, but is also like protected with the right forms of sandboxing applied at the right times, you know, with the right like monitors on all like network egress and all this stuff. And you know, maybe this kind of like computer, I think of it as a laptop, although obviously it's not. Is actually the thing that like many agents use.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And it contains many tools, not just the terminal but it has a browser and it has whatever.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You know, API access, and it gets piped the right credentials at the right time. And so you kind of think of yourself, when you're hiring your new agent for your new startup, which you might do before you bring on a co-founder even, as just setting up that environment. And you're just getting this fairly generalist employee that can code.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Like if you think of Codex right now it's like it basically takes prompts and turns them into messages and diffs and that's like not general. I can't be like oh yeah, hey like can you move engineering sync to 30 minutes later? Because I have a conflict. But like a real software engineer can do that, right? A real software engineer can go peruse like any source of data can like find out they don't have potential. I mean they can just use the Internet.
Additional Guest or Moderator
Right? Right.
Alexander Enbirakos
So I think we will get towards that, and I think we'll be able to build a really nice managed system for that, one that lets you use more capabilities safely and with some product pushes from us on how to make the most of it. So, for example, recently we shipped best-of-N, and, you know, it's a very simple feature, but in our minds it's kind of just the beginning of taking advantage of the fact that we're not running on your laptop. So we can explore like four versions of the same.
Additional Guest or Moderator
Right.
Anjaney Mitha
And then there's some evaluator model looking at the best of N.
Alexander Enbirakos
Actually, the evaluator is the human right now. But like, you know, the roadmap is like fairly obvious if you just imagine like what we're thinking about.
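The best-of-N flow described here, with the human as the evaluator, can be sketched roughly as follows. This is a minimal illustration and not how Codex is implemented: `generate_candidate` is a hypothetical stand-in for one independent agent run, and `pick` is whatever selection step you plug in (a person choosing in a UI today, possibly an evaluator model later).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_candidate(task: str, seed: int) -> str:
    """Hypothetical stand-in for one independent agent run on the task."""
    return f"candidate {seed} for: {task}"


def best_of_n(task: str, n: int, pick: Callable[[List[str]], int]) -> str:
    """Run n attempts in parallel, then let `pick` choose one by index."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map returns results in the same order as the seeds.
        candidates = list(pool.map(lambda s: generate_candidate(task, s), range(n)))
    return candidates[pick(candidates)]


# Usage: four attempts; the selector here simply takes the third candidate.
chosen = best_of_n("rename one variable", 4, lambda candidates: 2)
print(chosen)
```

Keeping `pick` as a parameter means the same loop works whether the chooser is a human or a model.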
Anjaney Mitha
Yeah, you just throw like O3 Pro at it.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So, yeah, so there's that. However, also, you know, the majority maybe of valuable code is actually written by enterprises who, rightly so, really lock down all their IP and their code. And so something we've been thinking about as well is, how do we meet these enterprises in a way that we can provide value to them, in a way that they like? And so I think what we're going to get towards is there's this default way of working with things, and then we'll basically have some flavor of on-prem or bring-your-own-compute that we support, where it's like, hey, here are all the things we manage for you when you use our computer. If you're going to use your compute, then we can work with you and provide you as much of a harness as possible to automate things. But you're going to have to want to manage that compute for the agent, basically, that environment for the agent. Here are the tools it should have. Here's how you should sandbox it or.
Anjaney Mitha
Bring your own RBAC or whatever.
Alexander Enbirakos
Exactly. I see. And so the Codex CLI, which we haven't talked much about, but in my mind, the Codex CLI might evolve into that, where it's like, hey, if you want to run the agent loop in your own environment, then we can help you do that. And you can use something that's an evolution of the CLI.
Anjaney Mitha
I think we should. Let's talk about the CLI versus the interface. What are the differences between Codex and the Codex CLI?
Alexander Enbirakos
Yeah, so the place where I want this to get to is just like, there's GitHub, right? And GitHub has a website and a CLI and a mobile app, and it's not confusing. Right now, it's a little bit confusing in that they are just completely distinct experiences. We have Codex in ChatGPT, which is an interface where you can write a prompt, and then we run Codex in the cloud, and then you get back a diff or an answer to your question.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Then we have the Codex CLI, and that's a completely distinct experience with a lot of the same ideas in it, which is basically you can run this tool in your terminal and it'll hit our model via API, and basically this agent will work locally with you on your computer. So right now I kind of think of it as: you delegate to Codex in ChatGPT remotely, and then you pair with Codex CLI on your computer.
Anjaney Mitha
And what is the moment where the CLI journey integrates into the cloud workflow?
Alexander Enbirakos
Yeah, and so where I think we want this to go is there's just like one idea of Codex and it's just like, where do you want it working?
Additional Guest or Moderator
Right.
Alexander Enbirakos
And you know, there's going to be times where it's just simply easier. Like, you don't have to set up an environment when it runs locally, right? So maybe if you're trying something for the first time. Prototyping. Yeah, yeah. Or like you don't even know if you like Codex yet. You know, you're just a new user. Like maybe you just want to use the CLI or something.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And then maybe you're using it and you realize, hey, I want all this cool parallelization and all this stuff, let me have this run in the cloud. And you set up the cloud environment, and from then on you should still be able to interface with that in the CLI if you want. Except now it's running in a cloud environment, so it's more powerful. So I think we kind of want to construct that and bring these things together, but obviously we're in this temporary state where they're completely distinct.
Anjaney Mitha
Yeah, I think so. It's interesting hearing you talk about how there was this evolution from the moment where you were using the tool as this very precious first iteration tool where you put a ton of sort of weight and context into it, hoping to get back a really useful answer the first time around. And then there was an aha moment where you're like, actually this is more like a slot machine, because other modalities in AI have played out very similarly. So this was the case with image models, for example.
Additional Guest or Moderator
Right.
Anjaney Mitha
Two years ago, people were trying really hard to get the first version of image models, which were like GANs, generative adversarial networks, even pre-Stable Diffusion, to produce useful, coherent images. And they just weren't there.
Additional Guest or Moderator
Right.
Anjaney Mitha
They would produce these artistic renders, which were great for artistic exploration, but they weren't really useful because they didn't have the concrete coherence of, you know, a piece of graphic design, for example. And then if you remember the first era of diffusion models, like DALL-E and Midjourney V1, they started to get more coherent. But there was this trick that a lot of product people started using. And David from Midjourney was one of the first to do this, where he added four generations in the Discord bot, not one. Because the insight was, this is a slot machine, this is a stochastic process, and you never really know which one the user's going to like best, especially for a super subjective domain like art and images. Human preference is super subjective, so let's just give them all four and we'll figure out which one they like. Now, over time, if you collect enough human preference, you can kind of nudge the distribution to be more aesthetically pleasing, or you can nudge it toward better typography or whatever. But by and large, to this day, the best UIs for image models are still ones that give you four outputs, if not more, and then allow the user to select the best of N. And for a long time people were like, that's going to work for these super creative domains where verifiability or accuracy is not an issue, like images, video, music, audio. But what's surprising is you're actually describing that same thing for a pretty verifiable domain like coding. Because at the end of the day it sounds like there's still enough stochasticity in the sampling of a model, even as it gets better at reasoning, that it makes sense to use it like a best-of-N machine. And this has led to, I guess, a popular set of critiques against reasoning models: that RL from verifiable rewards doesn't actually introduce new capabilities. 
It's just really good at pulling out capabilities that are already in the model. It's really good at sampling. Do you think that this is just an interim awkward phase where, yes, best-of-N sampling is better at getting the right answer from the existing model and it's not adding new capabilities yet, but a year from now there will actually be new capabilities that come from running verifiable RL on all the Codex usage that is about to happen from users? How bitter-lesson-pilled, basically, are you on that dimension?
Alexander Enbirakos
Yeah, I mean basically I think an unsolved problem, and it's both a research and a product problem, is how do we steer agents that are working independently. And you mentioned, hey, is best-of-N there so the model has more shots on goal, basically, to sample correctly? And I think that might be part of it. But actually one of the things we've learned working on Codex is that, well, the human also doesn't know what they want.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And so if I ask you to fix a bug, there might actually be four reasonable ways to fix that bug, with sort of different architecture implications. And I haven't explored the solution space myself. That's why I'm delegating this. So I kind of want to know what the ways are, and then, you know, maybe I would pick the one that the model thinks is best too. But it's helpful for me to see, like maybe that sucks in some way, but it's helpful for me to see the other ways that have larger trade-offs.
Additional Guest or Moderator
Right.
Alexander Enbirakos
To then be confident in the right one. Yeah, so that's for fixing a bug, which is a very verifiable type of thing. If I ask a model to, the classic example, implement tic-tac-toe or something, I might not know what I want either. Maybe there's different styles and different approaches you could take at various steps along the way. And so it's kind of funny, you were talking about generating four images and seeing those in the grid. And in my mind, for a front-end change, you could totally imagine a UI where the model does some work, and then the model, in its environment, runs the app and takes four screenshots, and you actually just have this similar curatorial UI that's like, just pick the one you like most.
Anjaney Mitha
We had Rick Rubin on the podcast a few weeks ago, and Rick's a legendary music producer, and he recently used Claude Code to create a new vibe coding book. And so we were asking him, how is creating with AI code gen tools different from creating music? And he was like, oh no, it's the same. It's like going into a studio. And he was telling this story about, you know, going into the studio with Johnny Cash and watching Johnny just pick up a guitar and start jamming. And often the process of creating a great song is you just pick up a tool like a guitar and then you just do four different iterations in completely different directions. And then you usually have a creative partner, like a producer or somebody, going, no, that one sucked, go this way. And it's that constant sort of best-of-N process in creating music that often results in the best output. And often the quality of the end song is determined by the taste decisions you make along the tree of best of N. And so what's giving me hope about hearing you talk about it is, if you read the Hacker News thread, for example, when you guys launched Codex, somewhere about halfway down the page was a tree of discussions about, does this mean coding is going to get much less fun because all of the interesting parts are being delegated to the agent and all the humans have to do now is just sit and review? But actually what you're saying is there are parts of the workflow where you get to almost entirely offload the plumbing parts of software engineering and focus on the taste exploration, which is sometimes the most fun part of software engineering, right? When you're creating a front-end UX, or even when you're speccing out a really great schema for a database. 
You know, some of the most fun times I've had is when I'm sitting with an infra engineer and we're speccing out the schema, and you go down one spec with, you know, a bunch of pseudocode and you realize actually that's not the right one, but it gave you an insight that then allows you to try another schema out. Is that where you think we go? Is that the silver lining? Or are we actually destined for a world where we're just all reviewing PRs and all the creative parts of software are gone?
Alexander Enbirakos
Totally. Yeah. So this is just opinion here, but I think you're right in that coding might be a little more painful for some number of months because you have to do things like environment setup.
Additional Guest or Moderator
Right?
Anjaney Mitha
These are the teenage years.
Alexander Enbirakos
Yeah, these are the teenage years. I think to be real, that's true. Maybe you don't get to write as much of the code yourself right now, but I think we will get to that more exciting place pretty quickly, because it turns out environment setup is probably something that an agent can also massively help with.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And we can close that loop where you're not comparing 4 diffs or something like that, but we've figured out the interaction model with the agent. So you're kind of making decisions in a way that feels more like talking to another human who's just really smart and fast. And then also that you're making these decisions not based on reading raw code in the case of front end at least, but maybe you're making decisions based on the outcomes in the case of front end. Like you're just choosing screenshots or like clicking around a preview or like if it's backend, maybe there's like some tests you agreed on and you're just like looking at test outputs to sort of decide.
Additional Guest or Moderator
Right.
Alexander Enbirakos
The other thing that's interesting: I'll give you a few things that people use Codex for, and I'm curious what your guess would be for the biggest one. Let's say it's building new features, asking questions, planning, debugging, and fixing bugs. What do you think people use Codex for most?
Anjaney Mitha
I think they would like to use it for debugging. They probably aren't using it for that yet, because often my knee-jerk reaction when I'm using an agent is that it just doesn't have enough context to fix things. For routine tasks, like, you know, some piece of boilerplate React is broken, debugging is totally fine. But I find I use it more and more for well-defined, well-scoped, well-contained tasks, like create this new UI element that does blah, or a refactor where the atomic unit is very well constrained. But I'm curious, what are you actually seeing?
Alexander Enbirakos
Yeah, I mean, so my intuition was that people would use Codex for fixing bugs a lot because, you know, bugs are somewhat well-defined-ish. You can kind of tell if it's fixed. You might even have some logging data, telemetry data, that you could just paste into the model, and it's excellent at fixing it. Like, those are some of our earliest delight moments, when we're dumping in the stack trace and then just.
Anjaney Mitha
And it just figures it out, right?
Alexander Enbirakos
But actually, by far the thing that people use Codex for is building new features. And I don't know, that was just slightly surprising to me because, you know, that is some of the most fun stuff to do. And if you read, you know, blog posts by folks who are using Codex in that way, it does look like they're having quite a lot of fun because of just the sheer speed they're experiencing.
Additional Guest or Moderator
Right.
Anjaney Mitha
The time to prototype has basically collapsed completely with something like Codex.
Alexander Enbirakos
Yeah.
Anjaney Mitha
And broadly speaking, this is the vibe, the explosion of vibe coding.
Alexander Enbirakos
Right.
Anjaney Mitha
I think it's. That makes sense to me because when you're prototyping a new idea, I find the most rewarding is when you actually, if you can get to the first draft really fast and then kind of iterate from there. That's fun. Sometimes the worst is when you have an idea, you kind of want to see it and then you lose steam between like firing up your IDE and seeing the first version of it.
Alexander Enbirakos
Right.
Anjaney Mitha
Compiling. This is why hackathons have proven to be this, I think, magical type of event where you get people together and commit to getting over the hump of the first prototype. But in many ways, I think something like Codex or, broadly speaking, really good coding agents have turned every day into a hackathon, because they've collapsed the energy you need to get over the hump of all the plumbing, all the environment setup, to test an idea. When I was at Discord, we used to have this annual tradition across the company called hack week, where the entire company would just stop for like a week. And it wasn't just engineering, it was product marketing, sales ops. The entire company could hack on anything they wanted. And some of the most enduring and popular features that made it into production at the company over the years came from hackathon projects. And it begs the question of, well, if there's a whole team called the product and engineering team whose job it is to ship great features, why did it take this special thing called a hack week to produce such great features? And there is something about when you reduce the cost of prototyping new ideas, you end up getting things that don't make it through the usual PRD flow. And it sounds like that's what a lot of users are using Codex for now: to reduce the time to magic, essentially the time to first prototype. Let's change tack, because there's this elephant in the room, right? Marc famously wrote an op-ed in 2011 or 2012, which was like, software is eating the world. And after I saw that chart you mentioned of the GitHub merge success rates of AI agents, starting 35 days ago, hitting 80%, and as of this morning the volume being 350,000, it sounds like AI is eating software engineering. Does it even make sense to study software engineering anymore, to get a CS degree? 
If you're a freshman at Stanford today, or just a fresh grad, you know, somebody graduating high school, and you're broadly interested in software, does it even make sense to major in CS?
Alexander Enbirakos
So my take is that it's two things. First of all, I think it's still a great time to major in CS. I think there's going to be so much more software created and therefore so many more software engineers needed. But I also think you should figure out how to be using AI constantly while you do it. And hopefully you're at a university that's very forward-leaning and so they're kind of embracing it. You know, I hear about policies like, hey, use AI as much as you want, but you just have to say how you used AI as part of your assignment. It's great. The main place where I would be worried if I was a student right now is if I was studying CS and my college didn't allow the use of any AI, because then I would just feel like I'm falling behind. It'd be like if you went to college but you were only allowed to write assembly and you could not write C back in the day. That would just be deeply worrying, I think.
Additional Guest or Moderator
Right?
Alexander Enbirakos
But yeah, my, my take is we can do like, you were talking about this, right? Like, we can do so many more things now. And you know, we hear this from customers too. Like, and from users, they're just like, hey, like, I would never have bothered doing this before, but I threw the idea into Codex just for the sake of it.
Additional Guest or Moderator
Right?
Alexander Enbirakos
And I do this all the time. And you know, a lot of the time I do that and then I see the output and I'm like, I just still don't really care to do this. But then sometimes this thing that they would not have even bothered doing, Codex either straight shots it or gets it to like 90% and they're like, you know what, I'm excited enough to do the last 10% here, just get this merged and then this thing that would never have happened now happens.
Additional Guest or Moderator
Right, right.
Alexander Enbirakos
You know, some of my favorite examples internally are when people build new internal tools that accelerate the rest of their team. And it's the kind of thing where someone's complaining in Slack, like, I wish we had this tool to, I don't know, look at these logs in a better way, and they're like, no, you know, just can't be bothered, everyone's too busy. And then now you have this great parser.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So I think that there are so many places where we could use software and that software could be more personalized to small groups or even individuals.
Additional Guest or Moderator
Right.
Alexander Enbirakos
That we just are missing out on. And so yeah, now I believe that like with just the acceleration we're seeing in software development, I think we'll have many more of those tools existing and they'll be much cheaper to maintain as well. Like that's the thing we're on the tip of now as well, where you're starting to see AI agents getting plugged into, you know, like GitHub or like Slack or you know, Linear has the agents feature and I think that that will make it much more efficient to actually have some like app out there and running.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Similarly, you know, and this is not Codex, but we're seeing products out there that will write the app for you and then deploy it for you as well. And so it's just like all in.
Anjaney Mitha
One full stack basically.
Alexander Enbirakos
So it's just like. Anyways, long story short, it's much easier I think to build software, to deploy that software and to maintain it. I think that's just going to. We're just at the beginning of this change.
Anjaney Mitha
So let's talk about that. It's been 35 days now. As a product lead, you've had a chance to actually see, you know, the best-laid plans rarely survive contact with reality. So now what priors have you updated the most and what comes next? Where does Codex go in the V2? Because this was just a research preview. But what are the biggest improvements and what's the shape of the arc of the product in the future?
Alexander Enbirakos
Yeah, so I think there's one sort of conviction that has deepened and then one prior that's been slightly updated. So the conviction that deepened is that this form factor of an agent working on its own computer in the cloud is the future and is incredibly powerful and worth figuring out how to get right. So we're continuing to invest in, you know, making that environment setup faster, making like performance.
Anjaney Mitha
Just first time user onboarding.
Alexander Enbirakos
Yeah, first time user onboarding, but also just like, you know, once you're running like things should just be faster.
Anjaney Mitha
Sure.
Alexander Enbirakos
Speed is actually always the underrated feature.
Anjaney Mitha
And do you think the biggest gains in speed are going to come from doing things like model distillation, or from just better orchestration of tools? Where do you think that comes from?
Alexander Enbirakos
Honestly, I think that the low-hanging fruit is just plain old deterministic, DevOps-y type stuff.
Anjaney Mitha
Okay.
Alexander Enbirakos
You know, like right now we clone your repo every time you do a task, even if it's a follow up. Then we run your setup scripts from scratch every time. And so if you have a large repo and a lot of dependencies to install, like that thing is slow.
Anjaney Mitha
Okay. You know, start with caching.
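The repeated clone-and-setup cost described above is the kind of thing a content-addressed cache can avoid. The sketch below is only an illustration under stated assumptions, not a description of Codex's infrastructure: `environment_key`, `EnvironmentCache`, and `slow_build` are hypothetical names, and the "environment" is just a string handle standing in for a prepared container.

```python
import hashlib
from typing import Callable, Dict


def environment_key(repo_url: str, commit: str, setup_script: str) -> str:
    """Derive a cache key from everything that determines the prepared environment."""
    digest = hashlib.sha256()
    for part in (repo_url, commit, setup_script):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


def slow_build(repo_url: str, commit: str, setup_script: str) -> str:
    """Hypothetical stand-in for cloning the repo and running setup from scratch."""
    return f"env for {repo_url}@{commit}"


class EnvironmentCache:
    """Reuse a prepared environment when repo, commit, and setup script are unchanged."""

    def __init__(self, build: Callable[[str, str, str], str]) -> None:
        self._build = build
        self._ready: Dict[str, str] = {}
        self.builds = 0  # counts how many times the slow path actually ran

    def get(self, repo_url: str, commit: str, setup_script: str) -> str:
        key = environment_key(repo_url, commit, setup_script)
        if key not in self._ready:
            self._ready[key] = self._build(repo_url, commit, setup_script)
            self.builds += 1
        return self._ready[key]


# Usage: a follow-up task on the same commit skips the rebuild.
cache = EnvironmentCache(slow_build)
cache.get("example/repo", "abc123", "pip install -r requirements.txt")
cache.get("example/repo", "abc123", "pip install -r requirements.txt")
print(cache.builds)
```

Hashing the setup script into the key means a changed dependency list forces a fresh build, while an unchanged follow-up reuses the earlier one.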
Alexander Enbirakos
Yeah, we can just like, we can fix these things, right? Yeah. And again, like I love that we didn't. I love that we shipped without those.
Anjaney Mitha
Things to be zero.
Alexander Enbirakos
Yeah, exactly. So there's that, and I think, like I mentioned, best-of-N. I think thinking about, basically, how do we spend more compute on your behalf is very exciting. And then how do we bring this closer to the tools you work in? For me, the interface in ChatGPT is actually very functional, but it's not where developers go when they want to write code. Where do you go when you want to write code? Either your terminal or your IDE. Similarly, where do you go when you want to triage issues? Well, you go to your issue manager and so forth. So I think we want to bring it much closer to the tools people work in. And eventually, you know, the goal is to get to an agent that is basically a teammate and is seeing what's going on on your team and picking stuff up for you.
Additional Guest or Moderator
Right.
Anjaney Mitha
That's what I was going to ask. Is Codex just going to be a Slack teammate that I can just ping and interact with in Slack?
Alexander Enbirakos
It should just like, I kind of think of it as like, it's just, it should be sort of a ubiquitous teammate.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You know, it's just in your tools, in the tools you want it to be in at least.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You know, and we'll start very gentle, just like, hey, you decide when Codex does work and then over time we'll figure out how for it to like kind of like more proactively chime in. And you know, we had a jam about this recently. Like, you know, it's kind of an interesting point. Like I don't think we want it to proactively like DM you all the time every five minutes when something happens. So I think there'll be some evolution of tools where we come up with like, if you, if anyone here has played video games, you know, there's always like press X to like. And like if you're next to a door it opens a door. If you like are next to some object, it picks up the object. It just.
Anjaney Mitha
Contextual action. Yes. Right, yeah, Contextual proactiveness. It waits for the hint that you want to do something and then jumps in.
Additional Guest or Moderator
Yeah.
Alexander Enbirakos
And this is kind of like when we're getting to interactive agents. I think that's just like a big open area. But it's like, how do we have agents who understand what your team is trying to do and respond to stuff in your team workspaces? And then how do we have an agent that understands what you are trying to do? And it's almost like this agent is both in all your tools, but sitting next to you while you're working on your computer and kind of just being like, oh yeah, like I can help you here.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So that's actually the conviction that has deepened, right? We're like, yes, all of this works when you give it its own computer, and we need to figure out how to create this infrastructure, full ecosystem integration, and make that safe and so forth. Then the other thing, though, that is a bit of an update is just thinking about how people learn to use these tools. I think right now there's some things that are pretty clunky. Obviously we've talked a lot about environment setup. I think also some of the things that you have to do, like updating AGENTS.md, are very manual, and you have to commit to your repo to get that context to the agent. And so for me, I'm just thinking a lot now about, okay, how do we make this way easier to try?
Anjaney Mitha
Reduce the cognitive burden of the onboarding. Fewer decisions to get to the magic. Exactly. Okay, got it. What has it changed most about research and where frontier models are going? In your mind, does how good Codex is, as a post-trained version of O3 Pro, at using tools and plugging into this workflow make you go, well, it just makes sense to pour an unlimited amount of compute into post-training models to get better and better at being autonomous coding agents? Or do you think there's some plateau point at which you go, you know, after this point there's not really much the user is getting from better and better tool usage? How does this change the trajectory of progress when it comes to the frontier of research?
Alexander Enbirakos
Yeah, that's a really interesting question. I definitely don't know if I have the answers to this, but what I can say is that one of the best parts of doing an optimized version of O3 was that we got to make a bunch of hybrid research product decisions very quickly. And I think that is incredibly exciting for thinking about how to make something useful. So if I imagine we would have had this idea of like, you know, it's like really important that the agent knows how to write really good like PR descriptions and you know, tests code in a certain way that's used to working in varied environments. And you know, when it runs some tests, it doesn't just tell you that it did, but it cites deterministically like in the logs, the output, so you can verify that yourself. Those are a bunch of like product ideas really. Right. And they're not like those ideas I just mentioned are not like higher model intelligence nor even really a higher ability to call the right tools.
Additional Guest or Moderator
Right.
Alexander Enbirakos
It's just this understanding that I liken to the first few years of job experience of a software engineer, right? Like, you have O3 as this incredibly precocious college grad, very smart, but it doesn't actually know how to be a software engineer, it just knows how to code, right? And there's some transfer, so it kind of knows a bit of software engineering. And that's fine, but you can make it way more useful for, you know, the human trying to use the agent if it has those first few years of job experience.
Additional Guest or Moderator
Right.
Alexander Enbirakos
So I think that there's no reason that that knowledge couldn't be infused into the model. Exactly. Upstreamed into it. Agreed. But I think that having the freedom to go and explore these ideas relatively cheaply and see what sticks and what doesn't is really powerful. So frankly, I don't really know to what extent it makes sense to have a bunch of custom post-trains for absolutely everything that matters. But for something as important as coding is for us, I think we're willing to say, hey, for coding, we really care about this. Let's just do everything we can to have the best product. So we actually did a similar thing with GPT-4.1, where we were getting a bunch of feedback from developers. We said, okay, let's go talk to a bunch of developers, make custom evals for them, deeply understand what our model is great at and what they want us to get better at, and then we released the custom model. And then the goal should always be, okay, whenever we do this, we have 4.1, okay, the next version of our sort of.
Anjaney Mitha
General model should just integrate that.
Alexander Enbirakos
Yeah, should integrate everything, right?
Additional Guest or Moderator
Yeah.
Anjaney Mitha
We have friends who are at different levels of AGI-pilled. Did working on Codex update your priors on, you know, 2027?
Alexander Enbirakos
Okay, so I'm very AGI-pilled. My slightly joking take, though I can't tell if I'm joking 100%, is that if you took a model today and ran it in the right loop, we're basically there. Would it have rights? That's the question I sometimes wonder.
Anjaney Mitha
And should they be able to turn themselves off and go take a vacation if they want?
Alexander Enbirakos
Yeah. So, you know, that's kind of where I am.
Anjaney Mitha
Are you pro labor rights for O3 Pro?
Alexander Enbirakos
I am pro thinking about it. You know what I mean? Like, I don't think we're at a point where it's obvious, but, it sounds kind of crazy, I feel like it's a question worth considering every now and then.
Anjaney Mitha
Or more concretely, how far are we from full recursive self-improvement?
Alexander Enbirakos
Okay. Okay, sorry. So back to you. Basically, I think working on Codex made it very clear how we can have agents just like omnipresent in our lives being incredibly useful. Because what I realized is that obviously we need to do a lot of model improvement. But I also saw how there's just concretely a lot of normal product work to do to set them up in the right way, and then that normal product work will then pull the models into being more and more useful. So I think by 2027, agents will just be absolutely ubiquitous in the workplace. I think in personal life it might be a little bit slower because in personal life there's less of these constant pipes of signals of things to respond to. The reason this matters is that if you think of ChatGPT, you just have this input box. And most people, including myself, probably use it for 1% of the things that I could use it for, because I just don't even know to use it in that way or I don't prompt it.
Additional Guest or Moderator
Right.
Anjaney Mitha
That intention just isn't there yet.
Alexander Enbirakos
Yeah, but it's similar. Imagine you hired a teammate and then the only time they do work is if you specifically tell them to do a task. Then they would just be very underutilized. But what makes a great teammate great is that you kind of tell them what their job is and they just start responding. They're self-starters. Yeah. So I think that is the big unlock for agents at work, because there's streams you can subscribe them to, like, you know, your communications tool.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And in personal life, I think that might be a bit slower, but we'll see.
Anjaney Mitha
What percentage of all GitHub PRs do you think will be written by an AI agent 12 months from now?
Alexander Enbirakos
That's a really tough question. I sort of change my mind every time I answer it. So maybe a slight cop-out, and I'm curious for your answer too, would be that there will be teams for whom 90% of their PRs are written by agents. But I don't know how quickly that will spread. This is a common thing with AI. We live, you could call it in the bubble, you could call it on the cutting edge, and so we're just adopting everything rapidly, but then it takes a while to diffuse out. But I think on the cutting edge it'll be like 90% on teams, right?
Anjaney Mitha
No, I think that's right. I think people often talk about the coding economy as one homogeneous economy, and the reality is there are multiple sub-economies, but there are at least two big ones. There's, for lack of a better word, the digital native companies, right? These are technology companies usually born in the post-Internet era, where either the founders or the vast majority of the team have grown up natively understanding how to do modern software development. The default assumptions when a code base is initialized are that you're going to use Git for version management, there's going to be branching, there's going to be a good review process and so on. Sort of modern software teams. And then there's the vast majority of the world's mission-critical code, which we talked about earlier: Fortran, COBOL, running on-prem in these massive ETL systems in Virginia or in parts of Europe that were set up post World War II or in the Cold War, with the default assumption that everything had to be locked down. Often these code bases are running big parts of critical infrastructure, like the railway system of an economy or the air traffic control system. So they're very high-impact, high-stakes code. They're not modernized whatsoever, and they're constantly rotting because of technical debt. And I think one of the most exciting things is that the one-time migration cost to modernize these code bases has now collapsed precipitously, because agents can do so much of the plumbing work for which you'd typically hire some system integrator, Accenture, Deloitte, for a 10-year contract.
This is part of the founding thesis of DOGE, which is just that vast parts of the American government's IT infrastructure are super legacy and we're getting overcharged as a country to modernize it. As long as we can get enough distribution, you know, training data on Fortran and COBOL and so on, then the one-time upgrade costs should fall. Ideally, and this is my hope, tools like Codex modernize that entire sort of legacy code economy, and then we get to upgrade everybody onto modern software engineering.
Additional Guest or Moderator
Right.
Anjaney Mitha
It's tending to happen, from what I can see now, in countries that get to leapfrog legacy infrastructure because they're starting from day one, and it's very similar to civil infrastructure like roads and highways and so on. So if you go to a country like Singapore, which is a much more modern country because it's barely 60 years old, you know, it only got its independence in the 1960s, they didn't have to build the roads and so on that Britain did and then upgrade them all. Refactors suck and they take way more time. If you can just start from sort of a clean slate, it's much easier to modernize. And so what I'm finding is that it is easier for countries whose IT infrastructure is just newer to adopt agents. They're still legacy. I mean, the vast majority of it is still running on-prem and it's not modern, you know, it's certainly not TypeScript, but it's easier to upgrade from, you know, systems that were written in C to Python than it is to go from COBOL or Fortran or whatever to Python. But if there's anything that makes me super excited that these economies will merge, it's autonomous agents, right, doing all the plumbing work and doing it for a fraction of the cost and time that these mega, you know, sort of consulting companies have started to charge. And frankly, many of them don't end up ever completing a project and it just turns into a boondoggle. So I'm very excited about that part, and that's why I think AI is going to eat software. Software ate the modern sort of startup economy and digital economy really fast, but there were other parts of the world, especially mission-critical industries, where there was a one-time software upgrade largely driven by military scenarios, and then we never modernized all that infrastructure since then.
So that's why I think the cybersecurity side of this, the safety evals that you're talking about, I think over time will come to be seen as having been very prudent because the thing that puts all of that adoption at risk is having like one terrible incident that then changes the risk posture for a bunch of enterprises.
Alexander Enbirakos
I have a question about that, actually. For a lot of the larger companies that we talk to, the use case is very different. It's not building new features, which is what we see most of our users using us for; it's refactors, large refactors and replatforming. You mentioned that some of these companies or governments or systems you're thinking about kind of had this one-time upgrade for military reasons and then never upgraded from there. I am curious if there's a specific reason you're seeing why they all want to upgrade now, or if actually we're still kind of in the state where there's no forcing function. So although it's easier to do, there's still no impetus.
Additional Guest or Moderator
Right?
Anjaney Mitha
So for sure the geopolitics has accelerated adoption for a bunch of governments.
Additional Guest or Moderator
Right?
Anjaney Mitha
In Europe, the Ukraine crisis has forced a lot of governments in that region to go, wait a minute, our air traffic control systems, especially in an age of unmanned sort of drone warfare: it is crazy that when there's a bug we need to call in some legacy contractor who built it like 20 years ago to come and do some on-site maintenance, right? That's been a wake-up call. And so there's sort of an $800 billion defense bill that Europe passed, you know, six months ago. And the most urgent adoption is certainly happening at the intersection of legacy code not working and battlefield needs and drone warfare: code bases that interact with air traffic control systems, with UAV planning, with mapping. Those are the code bases that are most urgently being upgraded. I think in other parts of the world there's just a desire to modernize. So if you look at the UAE or the Kingdom of Saudi Arabia, we talked about how the UAE is rolling out ChatGPT to the entire country. I think that's coming mostly from a top-down directive to just embrace the AI future that's coming rapidly. Basically, the more AGI-pilled I find the head of state is, the more rapid the adoption is, certainly for ChatGPT tools, but also for coding, and that's not driven by some military function. But then there are other regions like Europe where for sure geopolitics is accelerating all that. And you and I have talked about this before, but those scenarios often need something slightly different. The ergonomics of code are different. They're very on-prem. They require a level of air gapping from cloud systems that the modern software engineering workflow doesn't lend itself to. And so we may see this bifurcation of Codex as a family.
Like, I'm curious, over the next few years, you know, the military requirements, or let's call it the critical industry needs, of modern autonomous coding agents might require some pretty basic architectural differences from the, you know, let me ship the latest and greatest next version of our software product on GitHub. I don't think it's a coincidence that the last time we saw a huge adoption in IT infrastructure around the world was the Cold War. And now we're living through some pretty unstable times, both in Europe and the Middle East, and I think that is causing governments to act. I think the US has always been somewhat forward-leaning, posture-wise, on adopting the latest and greatest technology. We make other governments look, rightly so, like dinosaurs. And nothing forces dinosaurs to wake up like an impending comet hitting them and impending extinction. So that's definitely happening.
Alexander Enbirakos
Yeah, I think it's interesting for me playing this through my mind as we're working on Codex. I do think there needs to be an answer for how you use this agent in an air-gapped environment. There's critical industries, and then there's just many large companies who have incredibly stringent security needs. The way we've thought about building is that the most important thing is to build to AGI and then distribute the benefits of that to all humanity. And so we're kind of leaning towards, okay, the primary thing is the fully self-contained thing, you know, the thing where we host it for you, contain the environment and everything. And then kind of in parallel we have this side track of, okay, how are we going to make sure that today, you know, you can use Codex CLI? You could use that in a, I guess, relatively air-gapped way. Obviously it needs to sample the model. And then as we build new capabilities into Codex and ChatGPT, how do we just make sure that if you're running something like the CLI, you get the most of, you know, the capabilities as they trade off? But it might be a little bit like, okay, we build it in the fully self-contained system first and then we push it down.
Additional Guest or Moderator
Right.
Anjaney Mitha
You know, this, there's this narrative violation I keep hearing about. I keep hearing from folks in San Francisco that, oh, you know, OpenAI is all in on consumers because it's. Because the rise of ChatGPT as a consumer companion has been so extraordinary. But clearly our entire conversation is an exception to that story, right? Because almost everything we've talked about has been focused on developers and governments. So why is that misconception there?
Alexander Enbirakos
I think ChatGPT is in fact an amazing and large business, and it's super cool to work at a company that is really distributing AI to a giant number of people. But yeah, we are incredibly serious about coding, and in fact we always have been, since the first Codex product that was powering GitHub Copilot and all the way through with our models. I will say, though, I think people are noticing: we've always been very serious about coding models, and we're now getting very serious about coding products as well.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And so like, whereas before we had these amazing models, you could use them in like whatever tool that you want to use them in. Like now definitely. I mean, a lot of the stuff that I'm working on is thinking about like, hey, actually there's a lot of, you know, as we build agents, there's a lot of value we can provide by not only thinking about the model, but also thinking about how that model is useful to you in a certain form factor. And actually the form factor really affects everything. And so, yeah, we're spending a lot of time and effort building even better coding models and even better coding products, particularly focused around agents, but even beyond.
Anjaney Mitha
So you've been a founder before. One of the scary things about hearing OpenAI going from being serious about models to also products is, if you're a founder in this space and you want to build something interesting in the coding space, there's this tension looming, right, which is: is anything I'm going to build just going to be subsumed by OpenAI's products next year? So how would you think about that? If you were leaving OpenAI and starting a company today, what would you do and what would you not do?
Alexander Enbirakos
Okay, so if I was leaving OpenAI today, probably the market change that I would be thinking the most about, or one of them, would be agents. Okay, great. Not super controversial. Then I would think, okay, like we were talking about earlier, an agent is basically a really good model that I'm probably not going to build at my startup. And then I need to give that model access to tooling in an environment, and then I need to figure out what tasks it needs to be good at, and then obviously give it to customers. And the interesting thing about it is that those latter three things, right, the tooling, the environment and the task distribution, and I guess the customer, so four things, whatever, all of those things are very much based in knowledge of a customer. And those aren't things that OpenAI is going to, you know, generally do for every industry, right? Coding happens to be of particular importance to us just broadly. But even within coding there's a lot more specific areas. So just to really spell this out, if you think of the environment, in training Codex it was really non-trivial to figure out how to give the model different environments to train in, you know, with different kinds of realistic dependency setups.
Additional Guest or Moderator
Right.
Alexander Enbirakos
Various amounts of dependencies even installed, varying amounts of unit tests. The startup that, you know, I sold to OpenAI was Multi. That's how I joined. And we had very few unit tests on a lot of our code. It's kind of funny, but that's realistic, that's a real startup code base, right? So actually, if you wanted to do that for some specific function, I don't think it would be easy for us at OpenAI to create that many environments for the agent to use and train on and then use at, you know, test time. So that's hard. And then I think the task distribution is also really interesting. With Codex, we have a lot of intuition for what a good coding task could look like and kind of where to draw the boundaries. Today it's: provide a prompt, and then you get an answer or a diff that you can turn into a PR. But those are some decisions we had to make around what the boundaries of the agent are. And then we had to go collect a bunch of those types of tasks, or invent those tasks, to again train the agent how to do it and evaluate how well it was doing. So I think that again, for a very specific industry, I don't know, I'm trying to come up with an example, let's say accountants, but in a specific region of the world where there's a specific set of rules, they might have very specific tooling that's provided by the state for doing that accounting, right? There might be very different kinds of base knowledge and documents available, and then the way you need to do the work might be different. So I mean, I think it is a very good question and I'm not 100% sure what I would do if I was a founder right now, but I think that I would try to lean really hard on very good customer knowledge and less hard on product, if that makes sense.
Additional Guest or Moderator
Right.
Anjaney Mitha
It sounds like the last-mile connective tissue to an industry where you have deep domain expertise becomes more valuable. Whereas for the first mile, all the general-purpose parts of an agent's flow, you should basically assume you offload that to OpenAI.
Additional Guest or Moderator
Yeah, yeah.
Alexander Enbirakos
And then I think the other thing I might do is I might keep my company really small. So rather than doing the classic hyperscale thing, I would try to use agents as much as possible, keep the company as small as possible so that we're just agile and nimble. I guess this is probably just the sort of age old advice.
Anjaney Mitha
Well, let me push back on that for a second, because it turns out that in many industries, serving the customer deeply like you're describing often requires a human touch. That might be sales, it might be solutions engineering, it might be customer support and so on. It does sound like what you're saying is you would certainly keep your engineering team very small and minimal, but if servicing the domain required more of the human touch, then that you would scale. Often my experience is that getting an agent to actually work in the enterprise in a legacy industry requires going in and doing a fair amount of integration work, at least upfront. So maybe it's a setup thing upfront: you're parachuting in somebody who understands how to get an agent up and running, and then you can leave, because for the customers it's really just like consuming teammates, like you were saying earlier. But maybe where you do need people is that integration point. Now ideally, over time, I guess you're saying the product should just get good enough at integrating into the customer's environment. But sometimes, for regulatory reasons or otherwise, you just need a human there. You know, are there some industries that clearly feel out of bounds for OpenAI, because they're just not on the path to AGI, but that would still interact with coding agents?
Alexander Enbirakos
First off, it's a good point that the actual integration work probably requires humans. I would say yeah, especially if it's in-person type integration work, or complex. Then I think you're spot on there. As for industries that are out of bounds, I think it's a hard question to reason about, because we are building general products, and so you can kind of use ChatGPT to answer any question already today. So I wouldn't say there's bounds, but it's more like focus. I would say, you know, right now at OpenAI we're very focused on serving consumers generally and being really good at coding.
Additional Guest or Moderator
Right.
Alexander Enbirakos
You know, there's some other things too. So I would just say yeah, the more maybe we should just not even have this answer in the podcast.
Anjaney Mitha
Yeah, we can take this part out.
Alexander Enbirakos
10 minute time check as well.
Anjaney Mitha
Perfect.
Additional Guest or Moderator
Great.
Anjaney Mitha
Not. Oh great. Yeah, about to wrap.
Alexander Enbirakos
You stopped me. That was a good one. I'm like, I don't know, man. I'm not a founder.
Anjaney Mitha
You don't want to speak on behalf of Sam about why world domination is not complete and total. That part. So, slightly different topic. A question I get from a lot of parents, especially with kids who are approaching the end of high school and in that phase where they're picking careers or thinking about what they want to do, is this immense anxiety, especially for folks in tech. For the vast majority of the last 20, 30 years, it's been a fairly stable assumption that if you were smart and generally oriented towards technical fields, and you went and studied software engineering, you'd have a pretty great career and a safe and sort of rewarding time in the knowledge economy. And it seems like coding agents like Codex are taking a violent hammer to that assumption. How would you advise friends who are parents who are trying to figure out how to help their kids choose a career for the future?
Alexander Enbirakos
So I'll answer this with humility because I don't have kids, but I do think about this, and actually I think my point of view would just be that the world has always been changing. It's changing now, but it was changing before that. Maybe it's changing a little faster. But the main thing to notice is actually the pace of change, not the specific change. And so, you know, if I had a kid in late high school now, I would probably just be trying to encourage them to be excited about whatever it is they're doing and be incredibly curious and constantly learning, right? Like, I studied CS. Did you study CS as well, or?
Anjaney Mitha
I started with CS and then transferred to bioinformatics. Exactly. Because I was more interested in healthcare, you know.
Alexander Enbirakos
And now you do investing. Right? And like I studied mechanical engineering and then I changed to CS and now I work in product in a, you know, in AI at OpenAI. But like the startup that I'd started was not an AI company. So things are constantly changing. And I think the most important thing is to be agile, curious and have some foundation that you can build upon as the world evolves around you. So I think similarly, if I had a child in late high school, I would just want them to crush whatever it is that they're doing and it wouldn't really matter what specific thing they've chosen. I lean technical so that would be cool. But maybe even that is optional. And then I would just raise them with the expectation that they'll probably have many career transitions throughout their lives.
Anjaney Mitha
And having seen what you have with Codex, knowing what you do about where it's going, let's say you were the chair of the computer science department at a university. What would you do differently now versus before Codex launched? Well, one is you'd allow kids to use the AI tools. But let's say you're thinking about the future of computer science education and how that should be taught over the next 5, 10, 15, 20 years. What would you do differently?
Alexander Enbirakos
Yeah, again just opinions here, but I think I would have, you know, like at Stanford there was a class where we wrote Assembly. I forget the name of that class. That was cool. We had one class, CS140 I think it was. And then you know, similarly I would have like a handful of classes where folks do things like very manually to understand what's going on behind the scenes and also to build the confidence that they can. But then generally I would move towards like having students trying to deliver some kind of like outcome, be it like they've learned something or they've built something or something.
Anjaney Mitha
Project based learning.
Alexander Enbirakos
Yeah. And then I would probably encourage them to like use these various tools so that they're picking up the skills and you know, I don't know. I don't know. This is just an idea in my head. But if we could help them kind of like speedrun through that arc, then maybe every quarter they're using a different set of tools and so they're like becoming like very mentally plastic in terms of how they get things done. And I think that would be the best simulation of like what future work would look like. I'm not sure. What would you do?
Anjaney Mitha
Well, I teach a class, CS 143, at Stanford every year. This year we taught it in winter quarter and we had about 300 students. In previous years we had a midterm and, you know, we had problem sets. And this year we decided just to have it be a combination of speakers, CTOs or researchers in AI, who come in and talk about the infrastructure problems of building AI products at scale. And then we had one final project where everybody had to build an agent and ship it, and they were all allowed to use any coding tools, obviously. In fact, we gave folks some credits for Mistral models and Black Forest models. And the founder of Cursor came by and kind of talked about the IDE and why they should all be using it. And what was extraordinary was it was so clear that the distribution of the final projects followed this power law, where the top four or five teams that really wholeheartedly adopted Cursor and the AI models and did a fully sort of AI-assisted workflow for their final project produced software that was production-grade ready. If I was still running the platform org at Discord, I would have totally shipped four or five of those on the front page of the App Store. In fact, I sent some of them to the founders of Discord and they were like, we should probably ship this. The quality bar was just extraordinary for something they were able to build in basically a 10-week quarter. Then there was this sort of, you know, usual sort of middle of the pack that had made a half-hearted attempt, enough to get a good grade, to customize the templates we'd given them, but clearly hadn't asked, what is something that I can create now that I couldn't before, now that I have access to extraordinary coding agents?
And then there was just, you know, the classic sort of bottom of the class that I think just didn't accept those tools and think deeply about trying them, using them, learning with them, developing a feel for what they're good at and what they're not good at, and kind of turned in a final project that would have been totally possible to build a year ago.
Alexander Enbirakos
Why do you think they didn't want to use the tools you were giving them?
Anjaney Mitha
Look, it's hard to parse out from just a final project, but I did office hours with a lot of the students every week, and I think the number one predictor of their success was their mindset. It was just about, were they curious and hungry to learn outside of a traditional textbook? And look, some of the students just had a lot going on. You know, being a college student is a stressful thing today, and so I have a lot of empathy for them. There's definitely this awkward moment you're describing right now, where a number of the seniors who are graduating with college degrees this year started out as freshmen in a very different economy.
Additional Guest or Moderator
Right, right.
Anjaney Mitha
When they picked CS, the assumption was, hey, if I do well in the core CS curriculum, if I get a 4.0 GPA and I do one or two good internships somewhere along the way and I apply for a job, I'm going to get a job at a pretty good tech company. That's just not happening anymore. And it might be because there's a set of layoffs or some overhang from the ZIRP era, or it might be because a lot of engineering teams are reducing their footprint of entry-level jobs. But I was definitely shocked by how many Stanford CS graduating seniors were still looking for full-time jobs come winter of senior year. And I think that's anxiety-inducing, it's stress-inducing, and that has bleed-over effects on whether you can concentrate on this project-based class. A number of the students were also juggling interviews, and when I thought they were going to be coming to office hours to ask about, you know, the code, they were asking for career advice, which is totally fine. But I do think there's a transition phase right now which can be very stressful for computer science students. And I think you're right. The faster they're able to onboard to using these tools and realize that the gap in what they can create now is extraordinarily high, the better I think they're going to transition into the new economy. Because I do think there's an expectation, certainly for modern software teams, certainly at OpenAI, that you're just fluent in all of these tools now relative to, you know, four or five years ago. It was crazy: when we went through Stanford, I didn't take a single class that required the use of Git.
Additional Guest or Moderator
Right.
Anjaney Mitha
Which is absurd. Yeah, like, I happened to, you know, pick it up in an internship, but there was no class that actually required you, at least at the time, to know how to use git.
Alexander Enbirakos
Yeah.
Anjaney Mitha
And so I do think the computer science departments around the country have to recognize that and make the kind of changes you're talking about. And my hope is that in the interim, you know, students won't wait around for their deans and their professors to do that for them, because you can just go and use Codex, you know, for free. I think the research preview is literally free. Is that right?
Alexander Enbirakos
Well, you have to have a Plus account or a Pro account. But yeah, it's a good point. Maybe we should do something for students.
Anjaney Mitha
Student licenses.
Alexander Enbirakos
Yeah, you know, I will say that we're hiring for Codex. What should I say? If you're interested in working on Codex, DM @embirico on Twitter.
Anjaney Mitha
We'll tag you in the show notes.
Alexander Enbirakos
Yeah, I don't know if I'm allowed to plug myself here, but yeah, we're hiring. We're mostly hiring very senior, but we actually decided that we're pretty interested in hiring a couple of new grads.
Anjaney Mitha
Oh, that's interesting.
Alexander Enbirakos
Yeah. And so it's been interesting just looking at new grad profiles, and I totally feel you. I mean, it's definitely a tough time to be graduating. I don't know if this is advice, but what I can say is that when I look at new grad profiles, for me, the thing that I take the most signal from is if they've built something.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And if they've built something that's linked from their profile and I can just like click to it.
Anjaney Mitha
Projects.
Alexander Enbirakos
Yeah. And you know, like, it's just a cool website.
Additional Guest or Moderator
Right?
Anjaney Mitha
You know, like, grades matter much less now.
Alexander Enbirakos
Yeah, I don't even look. Actually, now that you say it, I didn't even realize that I haven't looked at anyone's grades. Because, you know, admittedly we're only hiring a few new grads, right? But that is the single largest signal. It's just like, what have you built?
Additional Guest or Moderator
Right?
Alexander Enbirakos
And is there some way for me to validate that? Like, maybe it's because I can click to the website. Or maybe you just have some stats on like how many people used it.
Additional Guest or Moderator
Right.
Alexander Enbirakos
And then when I talk to them, I'm just like, yeah, let's talk about what you built and how you thought about that. So maybe that's somewhat helpful for folks who are looking for something. You know, I kind of reflect on my journey here to OpenAI, which I'm really grateful for, and I view it as a privilege to be working here. But when I look back to when we were working on the startup Multi, which was not an AI company, and we saw ChatGPT come out and we started to follow all this LLM stuff, I remember just feeling like, wow, my co-founder and I were talking, and there's a chance that if we don't do this right over the next couple of years, we actually just end up like dinosaurs, right? And so at the time, we made a very explicit decision to heavily prioritize getting us and the entire company ramped up on AI things. And to some extent, I don't know if I could have gotten the job that I have here at OpenAI if I was just applying randomly. I think it's because we had built something that was interesting that we were able to get that attention and have that conversation. So I guess if there's one takeaway here, it's just like, just gotta build.
Anjaney Mitha
It's time to build.
Additional Guest or Moderator
Yeah.
Podcast Host
Thanks for listening to the A16Z podcast. If you enjoyed the episode, let us know by leaving a review at Rate this Podcast. We've got more great conversations coming your way. See you next time. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
Host: Andreessen Horowitz / a16z team
This episode dives deep into the origin, design, and implications of OpenAI's Codex, their autonomous coding agent. Alexander Enbirakos, who leads product for Codex, discusses with Anjaney Mitha how Codex evolved from code completion to an AI-enabled cloud teammate, why "reasoning models plus tools" is the key unlock for agents, the real-world adoption and surprising usage patterns, security implications (like prompt injection), and what all this means for the future of software engineering and education. The episode closes with reflections on building in the new economy and advice for students, founders, and teams in the AI era.
On Codex’s Vision:
“This form factor of an agent working on its own computer in the cloud is the future and is incredibly powerful and worth figuring out how to get right.”
(Enbirakos, 00:13; echoed at 43:55)
On Merge Rates & Security:
“Our merge rate is excellent…and that’s a reflection of the fact that Codex does a bunch of work in its environment and then it shows you its work and it says, do you want me to open a PR?”
(Enbirakos, 07:29)
On Prompt Injection Risks:
“If you have an agent write code and then you run that code in an environment with network access, you're taking some amount of risk... I have never seen an agent do something you wouldn’t want...unless you’re trying to trick it, but you can trick an agent.”
(Enbirakos, 08:19)
On Education and Learning:
“If I had a child in late high school, I would just want them to crush whatever it is they're doing...and raise them with the expectation that they’ll probably have many career transitions throughout their lives.”
(Enbirakos, 70:27)
On What Matters for Hiring:
“The thing that I take the most signal from is if they've built something that's linked from their profile and I can just click to it.”
(Enbirakos, 78:01)
“Grades matter much less now.”
(Mitha, in reply)
On the Expansion of Software and Agency:
“There are so many places where we could use software and that software could be more personalized to small groups or even individuals that we just are missing out on.”
(Enbirakos, 42:40)
On Modern Product Skills:
“Project-based learning…mental plasticity in how you get things done…is the best simulation of what future work would look like.”
(Enbirakos, 72:03)
The episode presents a candid, slightly irreverent but deeply thoughtful perspective on the AI agent revolution. Alexander Enbirakos is optimistic but pragmatic—seeing an era where AI teammates become ubiquitous, education shifts to iterative, practical, tool-first paradigms, and both startups and incumbents must adapt quickly or risk obsolescence. Both he and Anjaney Mitha stress that the new economy will reward curiosity, output, and the drive to build.
“If there’s one takeaway here, it’s just: you’ve gotta build.”
— Alexander Enbirakos, [79:29]