A
Welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host swyx, founder of Smol AI.
B
Hello.
C
Hello.
B
Calling in from Singapore here, but we are in the remote studio because the OpenAI team keeps shipping, and today they just livestreamed and released ChatGPT Codex. Welcome to Josh, who I think we've talked about. We met while you were at Airplane, right?
D
Yeah.
C
I've been building devtools for a bit now, and I have to talk to you when I'm building devtools.
B
I mean, you have now seen me complain a lot when things happen. So I don't know if it's a good or bad thing.
D
It's a gift, man. Feedback is a gift.
B
Thank you. Alexander, we're new to each other, but you've been leading a lot of the Codex testing and demos and stuff.
D
Yeah. Hey, I'm Alexander. I'm on the product team here.
B
Awesome. So yeah, we're going to just assume that everyone's watched the livestream. You also released a blog post with a bunch of demo videos. It's very interesting. I noticed in the demo videos it was individual engineers sitting by themselves, very lonely, and then they're just talking to their AI friends, coding with them. I don't know if that's the vibe you want to give off, but that's how it came across.
D
Yeah, man, with those videos we were going for maximum authenticity, just engineers talking about how it helps them. Yeah, I'll take the feedback.
B
But no, I mean, it's true. I mean sometimes on call is a lonely job. Mobile engineer is a lonely job. There's not that many of those, so. Yeah, totally. But anyway, so what did you guys individually do? Maybe we can kind of start there. How did you get pulled into the project and we'll start from there.
D
Yeah, maybe I can go first because then we have a fun story about how we started working together. Okay. So actually before working at OpenAI, I was working on a native macOS app called Multi. It was kind of like a pair programming tool, but we thought of ourselves as working on human-to-human collaboration. And basically as ChatGPT and stuff came around, we started thinking, oh, what if instead of a human pair programming with a human, it was a human pair programming with an AI? So I'll skip that whole journey, but that was the journey. And then we all ended up joining OpenAI and I was mostly working on desktop software, and then we shipped reasoning models. And I'm sure you guys were ahead of the curve in terms of understanding the value of reasoning models, but for me, it kind of starts off as better chat, but then when you can give it tools, you can actually make it an agent. An agent is a reasoning model with tools and environment guardrails and then maybe training on specific tasks. So anyways, we got super interested in that and we were just starting to think about, okay, how do we bring reasoning models into desktop? And at the same time, here at OpenAI, there were a lot of experiments going on with giving these reasoning models access to terminals. I wasn't working on those first experiments, to be clear, but that was the first true "wow, I really feel the AGI" moment that I had. It was actually while I was talking to David K, a designer who was working on this thing called Scientist, and he showed me this demo of it updating itself. And nowadays I don't know if any one of us would be the most impressed. It changed the background color, modifying its own code. Yeah. And they had hot reloading set up. So I was just mind blown at the time. And it's still a super cool demo.
And so we were experimenting with a bunch of these and I sort of joined one of the teams that was tinkering with this. And we kind of realized, hey, it's just super valuable to figure out how to give a reasoning model access to a terminal. And then now we have to figure out how to make that a useful product and how to make it safe. You can't just let it go loose on your local file system. But that's where people were initially trying to use it. So a lot of those learnings ended up becoming the Codex CLI, which shipped recently. A lot of the work there, the thinking that I'm most proud of, is enabling things like full auto mode. And when you do that, we actually increase the amount of sandboxing so that it's still safe for you. And so we were working on these types of things, and then we started realizing we want to let the model think for longer, we want to have a bigger model, we want to let the model do more things safely without having to do any approvals. And so we thought maybe we should give the model its own computer, the agent its own computer. And then at the same time, we were also experimenting with putting the CLI in our CI so it could automatically fix tests. We did this crazy hack to get it to automatically fix Linear tickets in our issue tracker. And so then we ended up sort of creating this project that is Codex, which is basically really the concept of giving the agent access to a computer. Actually I realized, I don't know if you were asking what I personally did, but anyways, I told the story. I hope that's okay.
B
Sure. No, I mean you weave your personal story into the larger narrative anyway. But yeah, and I'm sure Josh has a part two.
C
Yeah, yeah. So my story is somewhat different. I've been at OpenAI for two months and it's been one of the most fun, chaotic two months of my life. But maybe I'll start back at the company I had founded a few years back called Airplane. We were building an internal tool platform. The idea is to let you build internal tools but really lean into developers and make that really easy. And it sounds unrelated, but in many ways similar themes started coming up. What's the right form factor for doing local development? How do you deploy tooling to the cloud? How do you run code in the cloud? How do you compose all these primitives of storage and compute and UI to let developers build software really quickly? I like to joke that we were just, I don't know, two years too early. Towards the end we were playing around with GPT-3.5, and it was really cool. It could actually build a React view really quickly. I think if we had kept going on it, maybe it would have turned into some of the AI builders that you see today. But that company ended up getting acquired by Airtable, where I ran some of the AI engineering teams. For me personally, towards the beginning of this year I saw the progress we were making in agentic software development. And for me it was a bit of my own moon landing kind of moment that I suspected was about to happen.
B
Right.
C
Whether or not I was involved, in the next two years I think we are going to build an agentic software engineer. And so I talked to my friend at OpenAI and was like, hey, are you guys working on something like this? And, you know, he gives me a wide-eyed look. He's like, I'm not allowed to tell you anything, but maybe you could talk to the team. And so very fortunately this was right when Alex and folks were spinning things up. And I remember actually in our interview we riffed on the form factor: should it be a CLI? The issues with that: waiting for it to finish and not being able to interrupt all the time, wanting to run it four times, ten times in parallel. And at that point I said maybe it should be both. And we sort of are going for that right now. But yeah, I don't know, all this to say I was very excited, and am still very excited, just to be pushing this forward. And I think Codex is still really early. Excited to share it with the world, but there's a lot more to build.
D
Yeah, I'll say it was a very fun conversation when we first met because you came in, and I've never had this happen before. It was like, here's exactly the change that I see in the world and therefore the type of product that I want to build. I know you can't confirm if you're working on it, but just so you know, this is the only thing I want to work on. And then I asked just a few open-ended questions and we immediately got into some of the core debates around the form factor of the tool, and I was like, okay, it's going to be awesome to work together.
B
I think a DevTools person can spot another DevTools person like that.
A
Yeah, Blink twice if you're working on this.
B
But for what it's worth, the early iPhone team at Apple was the same. iPhone team members did not know if they were on the same team. They were not allowed to tell each other, so they had to triangulate.
D
Wow.
A
Anyways, and talking about form factor: you mentioned the CLI, which you already released, and I think there are other tools out there, you know, Claude Code, Aider, a bunch of others. Should people think of Codex in ChatGPT as like a hosted Codex CLI? Are there big differences between the two? Let's talk about that.
D
Yeah, go for it. Yeah.
C
I think that's the short of it: it allows you to run Codex agents in OpenAI's cloud. But I think the form factor is a lot more than just where the computer runs. How does this bind to the UI? How does this scale out over time? How do you manage caching and permissioning, and how do you do the collaboration story? And so let me know if you disagree, but I think the form factor really is the core of it.
D
Yeah, it's been honestly a really fun journey. Like the other day, or maybe last night in the a.m., Josh was sleeping because he had to do the livestream. I didn't have to, but anyway, a bunch of us were looking back at the doc where we planned what we were going to ship. And we were like, man, we had a lot of scope creep. And effectively all that scope creep kind of incrementally made sense because we kept leaning further and further into this idea that this is not just a model that's good at coding, but rather this is an agent that is good at independent software engineering work. And the more we leaned into that, the more things started to feel really special. So I'm going to just label and then set aside the entire conversation around the compute platform that Josh has been leading. But let's just take the model, for example. We don't just want it to be good at code and we don't just want it to solve, say, SWE-bench tasks. SWE-bench is an eval, for those who don't know, that has a certain way of functionally grading outputs. If you look at a lot of SWE-bench passing outputs from an agent, they're not really PRs that you would merge because the code style might be different. It works, but the code style is different. So we spent a lot of time making sure that our model is great at adhering to instructions and great at inferring code styles so that you don't have to tell it. But let's say that you then got a PR where the code style was good and it followed your instructions well. It still might be really hard to merge if you have this enormous description that's just the model explaining how it thought about building it, and you probably need to pull it onto your computer to test the change and validate that it works.
And maybe that's okay if you're just running one change, but in a future world that we imagine, where maybe the majority of code is actually being written by agents that we're delegating to, doing tasks in parallel, it becomes critically important that you, as the human developer, can actually integrate those changes easily. For instance, some of the other stuff we started to train was PR descriptions. Let's really nail this idea of a good, concise PR description that highlights the relevant things. So our model will actually write a nice short PR description with a PR title that adheres to your repo format. We have a way to prompt that more if you want, with AGENTS.md. And then in the PR description, it'll actually cite relevant code that it found along the way or relevant code in its PR so you can mouse over and just see it. And perhaps my favorite thing is actually the way we handle testing. So the model will attempt to test its change and then it will tell you in this really nice way, with just a checkbox kind of thing, whether or not those tests passed. And again, if the test passed, it will cite a deterministic reference to the log so you can read it and be like, okay, I know that this test passed. Or if the test failed, it'll be like, hey, this didn't work, I feel like you need to install pnpm or whatever, and you can read the log and see what it is. I think I've lost track of the original question, but anyways, those are some of the things that we've been really leaning into as we build this software engineering agent in the cloud.
C
I think also it just feels very different. You can look at the features, but I think for me the feeling is it takes a leap of faith. The first few times you're like, I'm not really sure if this is going to work, and it goes off for 30 minutes. But then it comes back and it's like, wow, this agent went out, wrote a bunch of code, wrote scripts to help codemod its own changes, tested this, and it really went through the full end-to-end of thinking about the change it wants to make. I had no faith at the start that it was going to be able to successfully do it. And after using it a bit you're like, wow, it actually pulled through. So that kind of long-running independence is something that's hard to really summarize. You have to really try it. But it finally feels very different and yeah, that feels special.
A
Yeah, I used it. I opened a PR with it a few minutes ago. I was in the lucky first 25% of people to get the rollout. Yeah, it's very nice. It kind of took a shortcut because it couldn't figure out how to run RSpec in Rails, so it just checked the syntax of the Ruby file and it was like, looks good to me. But I think it doesn't have the AGENTS.md yet, so I think once I set that up, that'll be good.
B
No, just, just don't use Ruby, man.
A
Like, Python? Skill issue. Once it's good enough to migrate the whole thing, then I'll do that.
D
I mean, it is funny. Just briefly on the note of "don't use Ruby" or not, there's a bunch of things that I think teams can do to make better use of AI agents.
A
Oh, please stop using Ruby number two.
B
But yeah, if you could list some things out, that's like best practices. You know, I noted from the livestream that they mentioned pro users install linters and formatters. So basically these are in-the-loop verifiers that the agent can use.
D
Right.
B
So which turns out to be dev best practices as well. But now the agents can auto use it. Commit hooks have always been a tricky thing for humans because I've been on teams that were like no, everything has to have a commit hook. And then I've also been on teams that were like no, this thing gets in the way of committing, so let's rip everything out. But actually for agents it's actually really good to have commit hooks.
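The in-the-loop verifier idea can be sketched in a few lines: run each configured check, record pass or fail, and keep the log so the result can be cited. This is only a minimal illustration, not how Codex runs checks. `CheckResult`, `run_checks`, and `summarize` are hypothetical names, and the commands passed in are placeholders for whatever linter, formatter, or test runner a repo actually uses.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str      # label for the check, e.g. "lint" (hypothetical)
    passed: bool   # True when the command exited with status 0
    log: str       # combined stdout and stderr, kept so the result can be cited


def run_checks(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run each check command and record whether it passed, plus its log."""
    results = []
    for name, argv in checks.items():
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append(CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr))
    return results


def summarize(results: list[CheckResult]) -> str:
    """Render a checkbox-style summary, one line per check."""
    return "\n".join(f"[{'x' if r.passed else ' '}] {r.name}" for r in results)


if __name__ == "__main__":
    # Placeholder commands; a real repo would list its own linter and tests here.
    demo = {
        "syntax": [sys.executable, "-c", "print('ok')"],
        "always-fails": [sys.executable, "-c", "import sys; sys.exit(1)"],
    }
    print(summarize(run_checks(demo)))
```

The same script works as a commit hook or as a command an agent can call, which is the point made above: the checks are cheap for an agent to run on every change.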
C
Yeah. I mean, you took the words out of my mouth. I think the three I was going to say would be, one, AGENTS.md. And we put a lot of effort into making sure the agent understands this hierarchy of instructions. Right. You can put them in subdirectories and it'll understand which ones take precedence over which others. So over time, I mean, we also have o3 and o4 writing our AGENTS.md files for us.
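One plausible reading of that hierarchy, sketched under assumptions: collect every AGENTS.md from the repo root down to the directory being edited and treat deeper files as more specific. The precedence rule (deepest file wins) and the helper name `collect_agents_files` are assumptions for illustration; the conversation only says the agent understands which files take precedence.

```python
from pathlib import Path


def collect_agents_files(repo_root: Path, target: Path, name: str = "AGENTS.md") -> list[Path]:
    """Return instruction files from repo_root down to target's directory.

    The list is ordered from least specific (repo root) to most specific
    (deepest directory). In this sketch, later entries take precedence.
    """
    repo_root = repo_root.resolve()
    target = target.resolve()
    start = target if target.is_dir() else target.parent

    # Walk upward from the target directory until the repo root is reached.
    chain = []
    current = start
    while True:
        chain.append(current)
        if current == repo_root:
            break
        if current.parent == current:
            raise ValueError(f"{target} is not inside {repo_root}")
        current = current.parent

    # Reverse so the root comes first and the deepest directory comes last.
    return [d / name for d in reversed(chain) if (d / name).is_file()]
```

A caller could concatenate the files in order, or read only the last one, depending on how strictly "takes precedence" should be interpreted.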
B
I love the tips. You actually open sourced the prompt descriptions here?
C
Yeah.
B
Anything to highlight?
C
Yeah, I mean, I think I would start simple and not try to overdo it. A simple AGENTS.md will get you a long way compared with no AGENTS.md, and then it's more like you learn over time. What we would really like to do is auto-generate this for you at some point based on the PRs you create and the feedback you give. But we figured we'd ship sooner rather than later.
B
You mentioned you have o3 and o4 writing AGENTS.md for you as well.
C
Yeah, I'll give it my entire directory and just say, hey, produce an AGENTS.md. Actually these days I'm using codex-1 to do it because it can traverse your directory tree and generate the things for you. So yeah, I would recommend slowly, gradually investing in AGENTS.md. And then, you took the words out of my mouth: getting very basic linting and formatting up nets you really big wins, because it's similar to how if you open a new project in VS Code you get some out-of-the-box checking as a human. The agent is sort of starting without that advantage. And so this is trying to give that back to the agent. And yeah, I don't know. Do you have anything else?
D
Yeah, so one analogy there, and then actually I have some thoughts from what we've observed about how to prepare for using any coding agent. The analogy that I kind of like is this. If you start with a base reasoning model, you basically have this really precocious, incredibly intelligent, incredibly knowledgeable, and weirdly spikily intelligent college grad. But we all know if you hire that person and ask them to do software engineering work independently, there's just a lot of practices that they're not going to know about. And so a lot of what we've done with codex-1 is basically give it its first few years of job experience. That's effectively what the training is, so that it kind of knows more of these things. And if you think about it, a PR description is a classic example of that: writing a good PR description, and possibly knowing what not to put in it, actually. So that's what you get there. So now you have this weirdly knowledgeable, spikily intelligent college grad with a few years of job experience, and then every time you kick off a task, it's kind of like their first day at your company. Right. And so AGENTS.md is basically a method for you to compress that test-time exploration that it has to do so it can know more. And as Josh said, right now it's a research preview, so you have to update it yourself. But there's a lot of ideas we have for how to make that automatic. So that's just a fun analogy.
C
Yeah, maybe the last one I'll say is, make your code base discoverable. It's the equivalent of maintaining good engineering practices for new hires, letting them understand your code base faster. Right. A lot of my prompts start with, I'm working in this subdirectory, here's what I'd like to accomplish, can you please do it for me? And so giving that guidance and scoping helps.
D
Yeah. Okay, I'll give you three things generally. So first, language choice. I was hanging out with a friend the other day who's a bit of a latecomer to AI, and he was like, oh yeah, I want to try building an agents product. Should I build it in JavaScript? And I was like, you're still using JavaScript? No wonder. Use at least TypeScript. Give it some types. So I think that's a basic one. I don't think anyone listening to us now needs to be told this. Another one is just make your code modular. The more modular and testable it is, the better. You don't even have to write the tests. An agent can write the tests, but you kind of need to design the architecture to be modular. I saw this presentation recently by someone here. They weren't vibe coding, they were a professional software engineer, but using tools like Codex to build a new system. And they got to build a system from scratch, and there was this graph of their commit velocity. And then their system had some traction, so then it was like, okay, now we're going to port it into the monolith that is the overall ChatGPT code base, which has seen ridiculous hypergrowth and so maybe is not the most architecturally pre-planned. And their commit rate, same engineer, same tooling (actually the AI tooling continues to improve), just plummets. And so I think the other thing is just, yeah, good architecture is even more important than ever. And I guess the fun thing is, for now that's the thing that humans are really good at. So it's kind of good, and important for the software engineers to do their job.
B
I don't know, just don't look at my code base.
D
Yeah, well, definitely don't look at mine. The last thing is just kind of a fun story, which is the code name. The internal code name for our project is WHAM. And we chose it, actually, when I was working with a research lead and he was like, hey, make sure you grep the code base before you choose the code name. So we searched the code base, and the string "wham" was only present inside a few larger strings and never present as its own string. And that means that whenever we prompt, we can be very efficient. We can just say "in WHAM", and then WHAM code, whether it's in our web code base or our server code base or our shared types or anywhere else, is really efficient for the agent to find. Whereas let's pretend, alternatively, we had called our product ChatGPT Code. Not saying we didn't consider that. Then it would be super hard for the agent to figure out where we wanted to direct it. And so we'd probably have to provide more relative folder paths. So there's a lot of this stuff as you start to think ahead, like, oh, I'm going to have an agent, it's going to be using the terminal to grep. Then you can start naming things intentionally.
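The "grep the code base before you choose the code name" step can be approximated with a short script. This is a sketch under assumptions, not the team's actual process: `standalone_hits` is a hypothetical helper, the file suffixes are placeholders, and "its own string" is interpreted here as a match not surrounded by letters, digits, or underscores.

```python
import re
from pathlib import Path


def standalone_hits(root: Path, name: str, suffixes=(".py", ".ts", ".md")) -> list[tuple[Path, int]]:
    """Find lines where `name` appears as its own token.

    Occurrences inside longer identifiers (for example "whammy") are ignored,
    so an empty result suggests the name would be cheap to grep for later.
    """
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])",
        re.IGNORECASE,
    )
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, lineno))
    return hits
```

A candidate name with zero standalone hits is unambiguous for both a human and an agent searching the repo; a name like "chat" in a chat product would return hits nearly everywhere.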
A
Would you start naming things less for human readability and more for agent readability? What's the trade-off in your mind?
C
Yeah, it's interesting because I definitely had different priors coming into OpenAI. I currently believe that the systems are actually very convergent. Maybe it's because you have both humans and AI writing it. Maybe if there's a world where it's only AIs maintaining a code base, then the assumptions change. But the moment you start to break that fourth wall and a human's coming in, doing code review, deploying the code, it has human fingerprints all over it. So how humans communicate to AI where to make the change, how humans communicate a bug that needs to be fixed or communicate business requirements, all those things aren't going to go away immediately. And so I think the whole system still feels actually very human. I think there's a cooler answer I could give, that it's like, oh no, it's this alien thing, it's completely different. But I don't know, these start off as large language models. There's a lot rooted in human communication.
D
Yeah. By the way, if there's somewhere you want to take this, you should actually cut us off because I realize we're just kind of monologuing between each other, but it's.
A
No, no, I think this also ties to AGENTS.md, right? Like, why is it called AGENTS.md and not README.md? There's, I guess, in your mind some fundamental difference in how agents and humans consume the information. So I'm curious if you think that's at the class naming level, or just at the instruction level. Like, where does it break down?
C
Yeah.
D
Okay, so there are a few options for this naming, right, which we considered. So you could go for README.md. You could go for CONTRIBUTORS.md, right? You could go for codex-agent.md and then maybe codex-cli.md, as two separate files that are sort of branded.
B
There's also Cursor rules, Windsurf rules. Everything has rules.
D
Yeah. And then you could go for AGENTS.md, right? And so there are a few trade-offs here, I guess. One is openness and one is specificity, I suppose. And so when we thought about it, we thought, well, probably there are things that you want to tell an agent that you don't need to tell a contributor. And similarly there are things you want to tell contributors to really help them set up in your repo or whatever that you don't need to tell the agent. The agent can just figure that out. So we were like, okay, maybe this is going to be different, and the agent's going to read your README anyways. So maybe AGENTS.md ends up being the stuff that you need to tell the agent that it's not automatically figuring out from the README. So we kind of made that decision. Then we considered, okay, there are different form factors of agents. The most special thing about what we are building and shipping is that it's an out-of-the-box way to use a cloud-based agent that can do many tasks in parallel and can think for a long time and can use a lot of tools safely. And so we thought, well, how fundamentally different is the set of instructions that you want to give that, versus an agent that you're working with more collaboratively on your computer? We had a good amount of debate about that, to be completely honest. And then we ended up concluding that those sets of instructions aren't different enough that we need to namespace this file. If there is something you need to namespace, you could probably just say it in plain language within the file.
Then the last thing we considered is, well, okay, how different do we think the instructions you have to give our agent are from the instructions you might give to an agent running on a different model or built by a different company? And we just think it kind of sucks if you have to create all these different agents files or whatever. Part of why we made the Codex CLI open source is that there are a lot of problems, like safety issues, that you need to figure out for how to deploy these things safely, and no one should have to figure these out more than once. So that's why we went for a non-branded name.
C
I have one specific example of why README and AGENTS.md are different. For agents, I don't think you really have to tell it code style. It looks at your code base and writes code that's consistent with that. Whereas a human's not going to take their time to go through the code base and follow all the conventions. So that's just one example. At the end of the day, there are differences between how these two kinds of developers approach it.
B
Cool. I think that's a really good set of advice. I think you just gave us our episode title. We're just going to call it Best Practices for Using ChatGPT Codex. And, you know, I think people are going to want best practices. So I noticed something that's very interesting, right? I think there are always two versions in terms of building agents. One is you try to be more controlling, you try to make it more deterministic. And then the other, you try to just prompt it and trust the model. And I think your approach is very much prompt it, trust the model. I see inside of the AGENTS.md system prompt that you just prompt it to behave the way that you want, and you just expect the model to behave that way. Obviously you have control of the model, so you can train it if it doesn't do well. But one thing that makes me question it is, how do you fit everything in context? What if I just have a super long AGENTS.md? In your livestream, you had it demoing on the OpenAI monorepo, which is just giant. How do you manage caching and context windows and all that?
D
Yeah.
C
Would you believe me if I told you right now that it all fits in the context window?
B
Not the OpenAI repo?
C
No, sorry. Everything that the agent needs.
B
Right. So you reify the AGENTS.md, you put it at the top, right? It's just like another system prompt.
C
No, actually it's a file that the agent knows how to grep and sed for, because there might be multiple ones. And so you can actually see it in the work log.
D
Right.
C
It very aggressively looks for an AGENTS.md. It's been trained to do that. I'll say it's been really interesting joining OpenAI and seeing how, when you're thinking about where models are going and what AI products will look like years from now, you design products in a different way. Before OpenAI, especially when you don't have access to a team of researchers and many, many GPUs, you're building these deterministic programs, with a lot of scaffolding around how this operates, but you don't really let the model operate at its fullest capacity. Right. It was interesting: when I had just joined, I actually got a lot of pushback for saying, hey, why don't we just hard-code it? Like, listen, it keeps using this tool wrong. Let's just say in our prompt, don't do that. And then the researchers would be like, no, no, no, we don't do that. We're going to do it the right way. We're going to teach the model why this is the right way to do it. And I think that's related to this overall thought: where do you put the deterministic guardrails in, and where do you really let the model think?
D
Right.
C
There's a similar conversation around planning. Should we just have an explicit planning stage where it's like, think out loud first, write down what you're going to do, and then go do it? Sure. But what if the task is really easy? Do you really want it to think this whole time? What if it needs to replan as it goes? Do you have all these if-else conditions and heuristics to do that, or do you train a really good model that knows how to switch between those modes of thinking? And so it's tough. I definitely have advocated for little guardrails here and there until the next training run's done, but I think we're really building for this future where the model is able to make all these decisions. What's really important is that you give it the right tools. Right. You give it ways to manage context, manage memory, manage ways to explore the code base. Those are still really important.
D
Yeah, that's super well said. I think building here is super fun and different. The model isn't all the product, but the model is the product. And you kind of need to have this humility in terms of thinking about it. There's three parties: there's the user, the developer, and the model. What are the things that the user just needs to decide up front? And then what are the things that we, the developer, are going to be able to decide better than the model? And then what are the things that the model can just decide best? Right. And every decision just has to be one of those three. And it's not like everything's the model. For instance, we have two buttons in the UI right now, Ask and Code, and those probably could get inlined into the decisions the model makes. But right now it made sense to just give the user the choice up front, because we spawn a different container for the model based on what button you press. So if you ask for code, we put all the dependencies in. I'm going to oversimplify here, but if you don't ask for code, if you're just asking a question, we do a much quicker container setup before the model gets any choice. So that's maybe a user decision. There's some places where user and developer decisions kind of come together around the environment. But ultimately a lot of agents that I see are really impressive, but part of what's impressive is that it's a bunch of developers building this really bespoke state machine around a bunch of short model calls. And so then the upper bound of complexity of problem that the model can tackle is actually just what can fit in the developer's brain. And over time we want these models to solve much more complex problems just by themselves, on more and more complex individual tasks.
And then eventually you could really imagine that you get a team of agents working together, maybe with one agent that's managing those agents, and the complexity just explodes. And so we really want to get as much of that complexity, as much of that state machine as possible, pushed into the model. And so you end up with these two modes of building. In one place you're building product UI and rules, and in the other case you still have to do work to get the model to learn something, but what you have to do is figure out what are the right things that this model needs to see during its training to learn something. And so it's still a lot of human work to figure out how to get that change. But it's a very different way of thinking: we're going to get the model to see this.
A
But how do you build the product to get those signals? So if you think about Code and Ask, you're basically getting the user to label the prompt in a way, right? Because they say, this is an ask prompt, this is a code prompt. Are there any other fun product designs as you built this, like, okay, we think the model can learn this, but we don't have the data, so this is how we architect Codex to help us collect the data?
C
I think file context and scoping is another example of this. We don't have great built-in things for that right now, but it's one of the obvious things that we need to add. We're often pleasantly surprised: oh, it was able to find the exact file that I was thinking about. But it takes some time. And so a lot of times you'll shortcut a bunch of chain of thought by just saying, hey, I'm looking at this directory, can you go? So I think that'll probably be there for a bit, until we have some better architectural indexing and search capabilities.
D
Yeah, I'll add to this. I'm actually going to double down on my thing about how we think about it. So one thing we might consider is context window management: should we intervene here? We could do a product intervention, write some code to intervene. And then the next level of thinking, maybe a little bit more AGI-pilled, is, okay, let's get the model to see context window management stuff in its training. I can't even come up with an example at this point because I'm too AGI-pilled. But I don't know, we could come up with something that it has to see to learn how to manage its context, specifically tasks related to context windows. But then the most AGI-pilled thing to do is to be like, we don't actually need to think about this problem. The model will just figure it out. All we have to do is give it harder and harder problems, and then managing its own context will be an emergent property, because that's the only way it can solve these problems. So I'm slightly oversimplifying here, but basically the model learns to manage its context. When you were talking about it working in the monorepo, it learns how to be efficient with the way that it spends its tokens as it's browsing and setting. And in your example of a giant agents.md, I guess we would just need to show it some versions where there was that, so it learns that it shouldn't read the whole thing every time and should first figure out how many lines it has, et cetera. So anyway, summarizing: we just need to keep giving it harder and harder problems, and a lot of these things that we might be very tempted to build a sub intervention for, it will just have to figure out. And if it doesn't figure it out, maybe it didn't matter.
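The "product intervention" version of that agents.md behavior, counting lines first and then reading only part of the file, can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the 200-line default budget are hypothetical, and nothing here reflects how Codex is actually implemented.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines by streaming, so the file is never loaded whole."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


def read_with_budget(path: Path, max_lines: int = 200) -> tuple[str, bool]:
    """Return (text, truncated), reading at most max_lines lines.

    max_lines is a hypothetical budget chosen for illustration.
    """
    total = count_lines(path)
    with path.open("r", encoding="utf-8", errors="replace") as f:
        # zip stops once the range is exhausted, so the rest of the
        # file is never read.
        head = [line for _, line in zip(range(max_lines), f)]
    return "".join(head), total > max_lines
```

The point made in the conversation is that a hand-written rule like this could eventually be unnecessary if the model learns the same habit from harder tasks.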
B
Sure, yeah, I totally get that. I think we don't really have online models yet, and that's kind of what you need for your vision to be real. And for what it's worth, I wasn't thinking about a giant agents.md. I was just thinking about hierarchical, nested agents.md files with a lot of code. I think one issue with this version where the model is the product is your dev cycle as the Codex team, like the two of you. It's not as tight, because every time there's a bug you have to be like, all right, now I need to go get data. And where do you get the data? I don't know, maybe employees use it, maybe you buy it from vendors, or you hire some human raters or whatever. And then you have to train it in, and then you have to go test it again. It's very slow, isn't it?
D
Yeah, I think that's definitely true from a building perspective. You do this when you're really willing to play for the long-term vision: we're going to build a better model, maybe even a better model bespoke for a certain functional purpose, like Codex 1, and then we're going to generalize the learnings from that model into an even bigger model that's getting all these other learnings from other functional purposes, and together these will become a really powerful thing. That's kind of the philosophy we've had with training models so far, and it has been working, but it's definitely a long-term play. We do do this on occasion. For example, recently we released GPT-4.1, a really good coding model. And again, that came from us saying, hey, we want to invest more in this area. Let's hang out with a bunch of developers, understand their feedback and how things work, create some evals. And like you said, it's a lot of work to do that. But then we end up with a great model, and even more exciting, we can then take those learnings and put them into our mainline models, and then everything benefits. And the sort of philosophical view, I don't know if I can factually prove it or not, maybe someone here can, is that if you can do something very specific for a specific purpose, when you bring that into the generalized model you might even get outsized returns on it, because there's transfer from all these different domains.
B
Okay, cool. I think we had a couple factual things to wrap up on Codex itself, and then we wanted to double-click on the compute platform stuff, which I think, Josh, you wanted to cover more. So I noticed in the details it was between one and 30 minutes in length. Is that a hard cutoff? Have you had it go for longer? Any comment on the task time?
C
Yeah, I mean, I just checked the codebase before this; someone else had a similar question. Our hard cutoff is an hour right now, although don't hold us to that. It may change over time. The longest I've seen is two hours, in development mode, when the model went off the rails. But I think 30 minutes is a great ballpark for the kind of tasks that we're trying to solve, right? These are hard tasks that require a lot of iteration and testing, and the model needs that time to think.
D
Yeah, I mean, yeah, I think actually like our average is like pretty significantly lower than 30 but if you give it a hard. If you give it a hard task, you'll end up at 30.
B
Yeah, I mean, I think there's a couple analogies here. One, I think the Operator team released a benchmark where they had to cut off at two hours. And the other one is the METR paper, which I don't know if it has been circulating, where they estimated that the current average autonomous time is like an hour and it's maybe doubling every seven months. So an hour sounds right, but also, that's the median, so there are going to be some that go longer than that.
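To make the doubling claim concrete, here is a small sketch of what "about an hour, doubling every seven months" implies if the trend held exactly. The one-hour starting point and seven-month doubling period are the figures as quoted in the conversation, not values checked against the paper, and the function name is hypothetical.

```python
def projected_horizon_hours(
    months_from_now: float,
    current_hours: float = 1.0,    # "like an hour", as quoted
    doubling_months: float = 7.0,  # "doubling every seven months", as quoted
) -> float:
    """Exponential extrapolation: horizon = current * 2 ** (t / doubling)."""
    return current_hours * 2 ** (months_from_now / doubling_months)


if __name__ == "__main__":
    for months in (0, 7, 14, 21):
        print(months, "months:", projected_horizon_hours(months), "hours")
```

Under those assumptions the horizon goes from one hour to two, four, and eight hours at 7, 14, and 21 months. This is pure extrapolation, not a forecast.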
D
Yeah, totally.
B
Is this part of the... You had cutoffs for a few, like 23, SWE-bench Verified examples that were not runnable. Was that part of it, in terms of length, or was it just something else?
D
Yeah, to be honest, I'm not exactly sure, but I feel like there's a bunch of SWE-bench cases that are, "invalid" might be too strong a word, and I'm a little bit unsure. But I feel like there are issues with running them, so they just don't work.
B
Okay. And then max concurrency. Is there a concurrency limit if I have 5, 10, 100 simultaneous Codex tasks?
D
5 and 10 is totally fine. Do we actually have a... I feel like we did introduce a limit for fraud reasons. I don't know what it is.
C
Yeah, I think right now it's 60 an hour.
D
Wow.
B
So one per minute. I'm just going to.
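A limit of 60 per hour only works out to "one per minute" on average. How bursty it can be depends on how the limit is enforced, which the conversation doesn't say. As a sketch under the assumption of a sliding one-hour window (the class and its parameters are hypothetical, not Codex's actual mechanism), all 60 tasks could be fired off at once:

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_events within any window_seconds-long window."""

    def __init__(self, max_events: int = 60, window_seconds: float = 3600.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._stamps: deque[float] = deque()

    def try_acquire(self, now: float) -> bool:
        """Record an event at time `now` (seconds) if the limit allows it."""
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_events:
            self._stamps.append(now)
            return True
        return False
```

With this kind of limiter, 60 tasks submitted in the same second are all accepted, and the 61st has to wait until the oldest one ages out an hour later.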
D
Yeah, but look, this is literally the point, right? Long term, we actually don't want you to have to think about whether you're delegating to or pairing with AI. If you imagine an AGI super assistant, you just talk to it and it just does stuff. It answers quickly; if it needs to, it takes a long time. And you also don't have to only talk to it. It's also just present in your tools. So that's the long-term thing. But in the near term, yeah, this is a tool you delegate to. And the way to use it that we see, going back to, I guess, the title of this podcast, best practices, is: you must have an abundance mindset, and you must think of it as not using your time to explore things. Often, when a model is going to work on your computer, you really craft the prompt, because then it's going to use your computer for a while and maybe you can't. But the way we see people who love Codex the most using it is, they think for maybe 30 seconds max about their prompt. It's just like, oh, I have this idea, boom. Oh, there's this thing I want to do, boom. Oh, I just saw this bug or this customer feedback thing, and you just send it off. And so, yeah, the more you're running in parallel, actually, I think the happier we are and the happier we think users are when they see it. That's just the vibe of the product, really.
B
Yeah. I would add my own anecdote. So I was on the trusted testers team for this thing, as both of you well know, and I found out I was using it wrong. I was using it like Cursor: I had my chat window open and I watched it code.
D
Yeah.
B
And then I realized I wasn't supposed to, and I was like, oh, like, you guys are just firing the things off and, like, you know, going on about your day and. Yeah, that was a change in mindset.
D
Yeah. One.
B
Yeah.
D
Real quick. I'll keep it brief. Like, one thing that's quite fun is, like, use it on your phone because somehow, just like, being on your phone just, like flips the way people think about things. So, like, we made the website responsive and we'll pull it into the app eventually. So try it. It's actually super fun and satisfying.
A
Okay.
B
Because there was one of the videos showing the mobile engineer coding with it on his phone, but it's not available in ChatGPT's app.
D
Okay. Yeah, yeah, not yet.
A
Just one question I got from mobile: the notification that I get when it starts the task says "starting research," the same way the Deep Research notification does. Is it using Deep Research as a tool, or did you just reuse the same notification?
D
We just used the same notification. Yeah.
A
So you mentioned the compute platform. You mentioned how you share some of the infrastructure with RL. Can you maybe just give people a high-level view of what Codex has access to and what it doesn't have access to? It doesn't look like people can run commands themselves; they can only instruct the model to do it. Any other things people should keep in mind?
C
Yeah.
D
So.
C
And I'll say it's an evolving discussion as we figure out what parts we can give folks and the agent access to, and what we need to hold back for now. And so we're learning. We would like to give humans and agents alike as much access as possible within safety and security constraints. What you can do today, as a human, is set up an environment and set up scripts that get run. These scripts typically will be installing dependencies; I expect that to be maybe 95% of the use case there, just getting all the right binaries in place for your agent to use. We actually do have a bit of an environment editing experience where, as a human, you can drop into a REPL and try things out, so please don't abuse it. But there are definitely ways for you to interact with the environment there.
D
We laugh about that because earlier I mentioned scope creep. We weren't planning on having a REPL to interactively update your environment. But anyway, Josh was like oh man, we need this. And so that was an example. Scope creep. Thanks for doing it.
C
We do have rate limits in place and we do monitor that very carefully. There are interactive bits there to get that going. But once the agent starts running, what we actually do today, and we're hoping to evolve on this, is cut off internet access, because we still don't fully understand what letting an agent loose in its own environment is going to do. For now, the safety tests have all come back very sturdily: it's not susceptible to certain sorts of exfiltration attempts or prompt injection. But there's still a lot of risk in this category, so we don't know. That's why, to start, we're being more conservative there, and when the agent's running it doesn't have full network access. But I'd love to be able to change that and allow it limited access to certain domains or certain repositories. So all this to say, it's something we're evolving as we build out the right systems to support that. Not sure that quite touches on your original question. The last thing I do want to mention, though, is that there is an interactivity element: as the agent's running, sometimes you're just like, oh, I want to correct it, tell it to go somewhere else, or let me fill this part out and then you can take back over. We haven't quite solved those problems either. What we really wanted to start with was to shoot for the fully independent, just deliver massive value, one-shot kind of approach. But yeah, we're definitely thinking about how we can weave humans and agents together better.
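The two-phase network model described here can be sketched as a small policy check. This is only an illustration: it assumes the setup phase has network access (implied by setup scripts installing dependencies, but not stated outright), and the `agent_allowlist` field models the limited per-domain access Josh says he would like to add, which does not exist today. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Phase(Enum):
    SETUP = "setup"  # human-defined setup scripts, e.g. installing dependencies
    AGENT = "agent"  # the agent is running its task


@dataclass
class NetworkPolicy:
    # Empty by default, matching "we'll cut off internet access" today.
    # A non-empty allowlist models the hypothetical future option of
    # limited access to certain domains.
    agent_allowlist: frozenset[str] = frozenset()

    def allows(self, phase: Phase, url: str) -> bool:
        if phase is Phase.SETUP:
            # Assumption: setup scripts can reach the network.
            return True
        host = urlparse(url).hostname or ""
        return host in self.agent_allowlist
```

With the default policy, a request to a package index is allowed during setup and refused once the agent is running; passing `agent_allowlist=frozenset({"github.com"})` would let the agent reach only that host.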
B
I mean, for what it's worth, I think the one-shot thing is a good angle. This is me comparing you to alternatives like Devin and Factory and all the others: they are more focused on multi-shot human feedback. But, you know, I have a website I'm working on, and I gave it a request and compared it to all the others. That was my test for Codex, and it did one-shot it. I posted the screenshot as a tweet just earlier today, and I think it's really good, especially if you're running 60 at a time. So I think that really makes sense. But it is a very ambitious goal, because human feedback is a crutch that we like to use. It also, I think, makes us write more tests, which is annoying because I don't like to write tests, but now I have to write tests. Fortunately, I'm now getting Codex to write my tests. And I really like that, as on the live stream, you can just ask it to look at your codebase and suggest stuff to do, because I don't even have the energy to figure out what I should be doing.
D
Yeah, delegated delegation. I thought that was a great line. Yeah.
C
And to be clear, we're not saying one form factor is better than the others, right? I love using Codex CLI. As we talked about in our interview, when I was interviewing at OpenAI, you really want both modes. But I think what we see as the role of Codex here is to really push the frontier on that sort of single-shot, autonomous software engineering.
D
Yeah, I kind of think of the research preview as our thought experiment. It's like, what does a coding agent in its purest, most AGI-pilled or scale-pilled form look like? And then, for me personally, part of what excites me about working at OpenAI is that it's not just solving for developers, it's really thinking about how AGI benefits all of humanity and what that feels like to non-developers as well. And so for me, what's really interesting is thinking of Codex as an experiment for what it'll feel like to be in other functions doing work. The goal for me to build towards is a vision where we do the work that's ambiguous or creative or hard to automate in whatever way, but otherwise we just have agents that we're delegating most of the work to. And these agents aren't a long-horizon thing versus a short-horizon thing; they're just ubiquitously available with you. So yeah, we decided to take the purest form to start, which we thought would be the smallest-scope thing to ship and probably isn't. But yeah, then we're going to bring these things together.
B
Okay. I think we have time for a couple questions. I'm just going to double-click on the research preview a bit. It is a research preview. What is left? What do you think would qualify it to be a full release? On the live stream, Greg mentioned this seamless transition between cloud and CLI. Is it that, or are there other things on your mind?
D
I mean, to be completely honest, part of why we believe so much in iterative deployment. I can give you some of my thoughts now, but also we're really curious to see because this is such a new form factor, but some of the items that are top of mind for me are multimodal inputs. You know, we've talked. Yeah, I know. You've flagged that, right? Yeah. Like another. Another example would be like, you know, just giving it a little bit more access to the world. You know, a lot of folks have requested for like forms of network access. You know, I also think that right now kind of the. The UI that we shipped is actually one that we iterated around. There's like a fun story there, but like. And it's one that people find useful, but it's definitely not the final form of what it is. And we would love for it to be much closer with the tools that developers spend time in. So those are some of the themes we're thinking about. But to be clear, we'll iterate and figure that out.
A
I wanted to ask, why did you put finding a typo as one of the onboarding things? Because I used it and then I saw it, and it's literally just grepping for potential typos. It's like searching, grepping for, like, "selenium" with an N, or something with "DGN" instead of "NG". But it went through like 50 of these and then finally found "will" misspelled as W-I-L, without the two Ls. But it was really cool to see what it thought. Like "default" spelled as D-E-F-U-A-L-T. It just grepped all these different things and then eventually got there. Like, why did you make that a task?
C
Honestly, Tebow and I were talking about it and he's like, listen, it would be funny if I had a typo as I typed this prompt out, just to make it a little bit meta. Nervous fingers on a live stream, so maybe optimize further for that. I have noticed it likes to search for "teh" when looking for "the". It's a work in progress.
A
That was great. Any parting thoughts? Call to action. Are you growing the team? Do you Want specific feedback from the community?
C
Yeah. I think for me the one thing that is really on my mind for getting better at over the next few months is really helping you customize your environment in a more high-fidelity manner. Turns out the good news is the agent can do a lot of really good work with only the basics, right? Much like if your dev machine is borked and you're looking at your editor but none of the type checks are working, a lot of folks can still actually do a lot of good work. But how do you close that last 30, 40%? It's really hard because there's such a wide variety of environments out there, but I especially would love feedback from folks on how they would like to see their environment customized. Do they want to just ship us a Docker image? Would they rather have us support dev containers? So the form factor, the DX of how you do environment customization, is still very much an open question that we need to improve on.
D
Yeah, big plus one to that. And I think for me, maybe the thing that I'm most interested in is, hey, this is, this is like a new shape of tool to collaborate with and I'm just really interested for people to try working with it in as many different ways as possible and kind of figure out where does it work. Well in your workflow, like you mentioned earlier, you were trying to use it kind of like your IDE and then you realized it was different. So I would just love for people to take advantage. Especially now we're very intentionally just providing very generous rate limits so people can try it. We just want you to try it and figure out what sticks, what works, what doesn't, how do you prompt it and then we want to learn from that and use that to lean in. So yeah, I guess my parting call to action is like, please go try it out in ChatGPT. Use it as much as you can, especially now, and then let us know how you like to hold it. Basically.
B
Yeah, I'm worried about the pricing when it happens, but yeah, I'm going to abuse this.
D
Why not? Why not? Yeah, send us feedback on pricing too.
B
Yeah, okay. It's too early to talk about pricing, right?
D
Yeah, it's too early now. Yeah. Okay.
B
But yeah, based on Claude Code, that's the thing that people are worried about, right? Claude has started to introduce some kind of fixed pricing and variable pricing, and I think it's a huge mess. There's no right answer. Everyone just wants the cheapest form of code attention they can get.
D
So.
B
Yeah.
D
Good luck. Thanks.
C
I mean, my take is I don't know if it's going to make it into, but like, we aim to deliver a lot of value.
D
Right.
C
And it's on us to show that and really make people realize, like, wow, this is like doing very economically valuable work for me. And I think a lot of the pricing can fall from that, but I think that that's where the conversation should start. Are we actually delivering that value?
B
Yeah. Awesome. All right, well, thank you so much. Yeah, thanks for working on this and thanks for sharing your time. It is. It's been a long time coming, but I think people can start seeing OpenAI in general is getting very serious about agents. It's not just coding, but coding obviously is the one loop that is self accelerating that I think obviously you guys are super passionate about. It's really inspiring to see.
D
Yeah. Super excited to just ship everyone this coding agent and then. Yeah. Bring it together into just the general AGI super system.
A
It.
D
Yeah. So thanks for having us on.
A
Thank you, guys.
D
Cool.
B
Thank you.
Episode: ChatGPT Codex: The Missing Manual
Date: May 16, 2025
This episode dives into OpenAI’s ChatGPT Codex—their latest software engineering agent for code generation, automation, and developer workflows. Hosts Alessio and Swyx are joined by Alexander (from the Codex product team) and Josh, both instrumental in building and shaping Codex’s features and philosophy. The conversation dissects Codex’s origins, technical architecture, best practices, the future of AI-powered code engineering, and real-world usage advice direct from those at the frontier.
“We were...working on human to human collaboration. And…what if...it was like a human pair programming with an AI?” – Alexander [01:59]
“I think we are going to build an agentic software engineer...whether or not I was involved in the next two years.” – Josh [06:58]
“We realized...it’s just super valuable to figure out how to give a reasoning model access to, to a terminal. And then now we have to figure out how to make that a useful product and how to make it safe.” – Alexander [04:20]
“This is not just a model that's good at coding, but rather this is an agent that is good at independent software engineering work.” – Alexander [09:24]
“The feeling is...it takes a leap of faith. The first few times you’re like...not really sure if this is going to work...but then it comes back and it’s like, wow, this agent went out, wrote a bunch of code, wrote scripts...tested this and it really went through the full end to end.” – Josh [12:47]
agents.md files for agent-specific instructions.
“A simple agents.md will get you a long way...We would really like to do is auto generate this...but we figured we ship faster rather than later.” – Josh [15:31]
“Maybe agents.md ends up being the stuff that you need to tell the agent that it’s not automatically figuring out from the README.” – Alexander [23:12]
“For agents, I don’t think you really had to tell it code style. It looks at your code base and writes code that's consistent to that. Whereas a human’s not going to take...their time to go through the code base and follow all the conventions.” – Josh [25:02]
Prompt-Driven or Deterministic?
“The model isn’t all the product, but the model is the product.” – Alexander [29:12]
Delegation Mindset
Long-running, Fully Autonomous Tasks
“What we see as the role of Codex...is to really push frontier on that sort of single shot, autonomous software engineering.” – Josh [46:10]
Environment & Compute Platform
“We would like to give humans and agents alike as much access as possible within safety and security constraints...But once the agent starts running...we'll cut off internet access...For now.” – Josh [41:49, 42:57]
Concurrency & Rate Limits
“I found out I was using it wrong. I was using it like Cursor. Like...I had my chat window open and I watched it code...you guys are just firing the things off and...going on about your day.” – Swyx [40:13]
“Some of the items...are top of mind for me are multimodal inputs...just giving it a little bit more access to the world...UI...not the final form...” – Alexander [47:42]
On Shifting Developer Mindset:
“You must have an abundance mindset and you must think of it as not using your time to explore things...The more you’re running in parallel, actually, I think the happier we are.” – Alexander [39:55]
On Building for the Future:
“We really want to get as much of that complexity, as much of that state machine as possible pushed into the model.” – Alexander [30:58]
On Difference Between Human and Agent Practice:
“For agents, I don't think you really had to tell it code style...Whereas a human’s not going to take its time...to go through the code base and follow all the conventions.” – Josh [25:02]
On Feedback and Environment Customization:
“How do you get close that last 30, 40%?...Would love feedback from folks on how they would like to see their environment customized. Do they want to just ship us a Docker image?...still very much an open question.” – Josh [50:53]
On AI Hiring Analogy:
“If you start with a base reasoning model...you have this...weirdly spikily intelligent...college grad...What we’ve done with Codex 1 is basically give it its first few years of job experience.” – Alexander [16:47]
Visit latent.space for show notes, follow-up links, and more.