
A
AI coding agents are rapidly reshaping how software is built, reviewed, and maintained. As large language model capabilities continue to increase, the bottleneck in software development is shifting away from code generation toward planning, review, deployment, and coordination. This shift is driving a new class of agentic systems that operate inside constrained environments, reason over long time horizons, and integrate across tools like IDEs, version control systems, and issue trackers. OpenAI is at the forefront of AI research and product development. In 2025, the company released Codex, an agentic coding system designed to work safely inside sandboxed environments while collaborating across the modern software development stack. Thibaut Sotio is the Codex engineering lead and Ed Bayes is the Codex product designer. In this episode they join Kevin Ball to discuss how Codex is built, the co-evolution of models and harnesses, multi-agent futures, Codex's open source CLI, model specialization, latency and performance considerations, and much more. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn, or visit his website, K Ball LLC.
B
Hey guys, welcome to the show.
C
Hey, hey, thanks for having us.
B
Yeah, I'm excited about this one. You guys are doing some really interesting stuff and I want to dig in. But let's start with you a little bit. Can you each give a little bit of your backgrounds and then how you got involved with Codex and what you do there?
D
Yeah, I'm a product designer on Codex. I've been at OpenAI for just over a year, and before that I worked in robotics and generally at the intersection of design and research. I've been on the Codex team for about six months, and with each model release and each product release I've gotten more and more into the coding side. Excited to chat about how things are on the team today.
C
I'm Thibaut. I joined about the same time as you, actually. I've been tinkering and thinking about AI and intelligent systems for as far back as I can remember. It's one of the first programs I tried to write as a kid, and over time it just got more and more fascinating. I feel like today it's really come to life: I finally have the thing I was trying to build when I was seven, where I'm able to type in my terminal and get an intelligent response back and have this little assistant in my computer. It's actually wild to think that has come true. But yeah, I joined OpenAI about a year and a half ago. None of this was possible then. We didn't really have reliable agents doing work over many, many hours. And so I've been tinkering with that at OpenAI since I felt that models were actually capable of it. Late last year, I kind of became obsessed with this idea that model capabilities were continuing to evolve, and that it was really about getting the right infrastructure and product around them so that we could continue to benefit and have that step change in utility that you can get from the models compared to just being able to chat with them. It kind of felt like chat was a bit saturated and that we're able to express a lot more. Things evolved over time. There was a lot of prototyping earlier this year, and then it really came together as a team. Now we're pushing on Codex with quite a few people over here, and it's more exciting than ever, I would say.
B
Yeah, I definitely have felt that kind of acceleration across the board in the last couple of years. That is just wild to experience in our industry. I'd love to actually dig into a few of those different pieces. And one of the distinctions you made there is around kind of the models, the model capabilities and their advancements, and then the infrastructure and the harness and all these different pieces around it. So I'm curious from your perspective, how you and the team think about what is the relationship between those two? How do they connect and feed back into each other?
D
Yeah, it's a good question. On the research side and the infrastructure side I defer to Thibaut, but I think one of the really interesting developments over the past six or so months is this co-evolution of the model and the harness. It's really come together in our products, in that if you use our models in our harness, it's kind of different than if you use them elsewhere. And I think that's really exciting as a product person, as a designer: the idea of not just building a model that you can use in an API and that shows up elsewhere, but really co-evolving these two together, and all the incredible things that can lead to.
C
Yeah, definitely that element of co-evolution, and that co-evolution is happening at many levels. There is co-evolution of the harness and the model, and co-evolution of the products, which need to evolve at a really rapid pace right now. It definitely doesn't feel like we have yet figured out the ultimate form factor of how you interface with an ever more intelligent system that is doing all these things on your behalf. But if you think about the harness, it's really just your body, right? You have your brain, you have your body, and the body is how you end up acting upon the world around you. And then there's a little bit more to that as well, which is how do you act, but safely. So one of the things that we do out of the box is that Codex sits inside a sandbox. The network access is restricted, the file system access is restricted. And this is really important because it allows the model to experiment and touch its environment, but without potential negative consequences. This is an important topic, because we view coding agents very much under the lens of alignment and safety. And so there's this aspect as well of where does the harness stop and where does it start to be the world? But we're definitely seeing that when we think about the two together, we get much better results. And I think this will continue to be true. And then there's this separate aspect of what is the right interface to this agent. That's where products really come in and delight. And I think that, yeah, this will definitely need to continue to evolve as we have agents that are never interrupted and just run forever. But that's going to be a whole other game at that point.
B
Sandboxing is an interesting one to drill down into for a minute, because I use all the agents, at least as an aspect of research. Some of them I use every day. I was using Codex to solve a problem for me earlier today. And some of them I just try, and then I say, you know what? You're not ready, or I'm not using you. But one of the things that stands out about Codex is the strong sandboxing model. Everything is sandboxed to begin with. And that is both good and sometimes frustrating, and can cause some awkward user experiences. So I'm kind of curious how you think about that balance and how you see this sort of safety question evolving across the ecosystem here.
D
Yeah, it's a really good question. I think from a product perspective and a user experience perspective, as you say, that's where some of these tensions surface. You know, you're always being asked to approve certain commands. Ultimately, these agents are extremely powerful. So we have great sandboxing, great safety features, and that's a cool part of the product as well in terms of why you might use Codex over others. Within those constraints, I still think there are some interesting things that you can do around user experience to make it a little bit easier and put some control in users' hands. So users can change their sandbox permissions, and they can change the mode. If you use our product in the IDE extension, you can basically choose between agent mode, where it will go off and make changes in your working directory, or this read-only mode, which is a little bit more restrictive and will ask you for permissions in many areas. But one thing we recently released, which I think is quite exciting, is that as you go along and approve certain commands, we give you fine-grained control over which commands you are approving and how they will be saved into your configuration. I think we're also exploring where that sits between you and your team. So ultimately it's about giving users control but still maintaining that really high threshold of safety.
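To make the idea of fine-grained command approval concrete, here is a minimal Python sketch of an allowlist keyed on command prefixes. It is an illustration only and not how Codex stores or checks approvals: the `ApprovalPolicy` class, its method names, and the prefix-matching rule are all hypothetical.

```python
import shlex
from typing import Set, Tuple


class ApprovalPolicy:
    """Hypothetical allowlist of approved command prefixes (not Codex's format)."""

    def __init__(self) -> None:
        self.approved: Set[Tuple[str, ...]] = set()

    def approve(self, command: str, prefix_len: int = 2) -> None:
        """Remember the first `prefix_len` tokens of a command the user approved."""
        tokens = shlex.split(command)
        if not tokens or prefix_len < 1:
            raise ValueError("need a non-empty command and prefix_len >= 1")
        self.approved.add(tuple(tokens[:prefix_len]))

    def needs_prompt(self, command: str) -> bool:
        """True when no approved prefix matches, so the user must be asked."""
        tokens = tuple(shlex.split(command))
        return not any(tokens[: len(p)] == p for p in self.approved)


policy = ApprovalPolicy()
policy.approve("git status --short")                      # stores ("git", "status")
auto_ok = policy.needs_prompt("git status -v")            # matches the stored prefix
must_ask = policy.needs_prompt("git push origin main")    # different subcommand
```

Approving a two-token prefix rather than the bare program name is a deliberate choice in this sketch: allowing `git status` should not silently allow `git push`.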
C
Yeah. And if we take a step back to where we started with Codex, it was Codex on web, sometimes referred to as Codex Cloud. We started with the idea that all of this should happen in a safe environment. So it's a completely isolated virtual machine with its own sandbox. We use Kata Containers under the hood. And then from there we decided to actually bring that to your machine through the Codex CLI and the VS Code extension, and to definitely stay true to that principle that it should be safe by default. It doesn't matter how convenient it is to run outside of a sandbox. Ultimately, if you were not using a sandbox, you would be giving control to a very capable and intelligent entity to do whatever it would want to do to your machine, using your own credentials, with any consequences that this can carry. And by default we prefer to be safe. Obviously there are use cases where you don't want to use a sandbox, and we do caution against that. But it's also something that we do support if you know what you're doing.
B
Yeah, well, and I will say Codex has never tried to delete my database, which is not true of every coding agent I've tried.
C
Sometimes it can be that the agent does something inadvertently, and that has negative consequences for you as a user. It could also be that it's been instructed to. There are obviously prompt injections and other risks to think about. But ultimately, if you give an agent control over something that's quite sensitive, the chance that it deletes it or takes some other nefarious action is worth thinking about as a user. And we really do feel the responsibility that we have there to make sure that there are no unintended negative consequences.
B
The thing that you mentioned in terms of the different ways of running Codex brings me back to another thing I'd love to hear from you guys, which is: how do you use Codex internally? Are you running it all through Codex Web or Cloud, whichever one you're calling it now? Or do some of you use the IDE? Are you CLI geeks like I am? How does that play out internally?
D
That's a really good question. Yeah, I think there's a bit of a meme, which is that everything is Codex, right? We have a bunch of Codex models, we have Codex Web, and then we have the CLI product, and we kind of think of it as the same coding agent that shows up in different spaces. But internally it's been really cool to see how it's evolved over time. As Thibaut said, we initially shipped the web product earlier this year and got great excitement internally for it from a lot of teams. I think the really cool thing about it is that you connect your GitHub and your team settings, and you can go in and not touch a line of code, ask for something, and it can do pretty amazing things. So that's super empowering for perhaps a UX copy team, or maybe go-to-market folks who want to change some string about pricing. They can do it themselves. They don't have to bug some front-end engineer to do that. That was one of the first use cases that we saw. And then the CLI is really popular, right? We have a bunch of incredible developers across the company, and developers often live in the command line. But personally I use the IDE extension a lot. I prefer the GUI, I prefer being able to click around, and that's just my go-to development environment. Some other really cool things we've seen recently: we've shipped a Linear integration and a Slack integration. So what you will often see now in threads is that people might be chatting back and forth, maybe about a piece of customer feedback or some new feature, and someone can just hop in and @Codex, and that will kick off a task in the background.
It will route it through Slack or Linear and just ping you with a task that you can click and open in the web. So that's cool, seeing it surface within threads. And you can assign issues in Linear as well, which is super fun. So I'd say, yeah, it's kind of one of those everywhere things. But I feel like the CLI is pretty popular.
C
Yeah, there's a lot of different use cases among technical staff, but we also have a lot of ambient intelligence, where it's sort of all around you, including code review: every single PR that is written at OpenAI is now reviewed by Codex. It acts as this safety net, and it's hard to think about a world where we wouldn't have that safety net anymore, given how many critical flaws it catches every day and how much time it saves. It's really able to go much more in depth than we can in the time we have when we're reviewing each other's code, especially now that generating code is so cheap. But the cool thing as well is that it's not just about technical staff. More and more people across the company are using this tool to do a lot more than just writing code.
D
Yeah, one really cool trend that I've seen over the past few months is within the design team. We have a few of these Slack groups, these work-in-progress groups, where people post work. And I've seen this slow, well, not that slow, change over the past few months from static images from Figma to interactive prototypes, sometimes even links that you can click into and use yourself, which is cool. And I've DMed a few people who post them, like, "I didn't know you could code." And they're like, "I couldn't until I tried Codex." So there's this range, right? It's for professional software developers, who obviously have a very high bar of code review standards to go through. And it's for these throwaway prototypes that designers can play with, so you can test responsiveness and all of these edge cases that you can't in a static prototype. It basically collapses the boundaries between disciplines, which have been slightly artificial over the past 50 years or so, or at least in the recent history of technology, often because of organizational boundaries around which staff can access which technology. It's a great equalizer, I think.
B
So it might be worth us going through this. You mentioned a few different points in the software development life cycle where Codex is taking part now, or speeding things up, or simplifying, or collapsing boundaries. Have you thought rigorously across that whole life cycle? If we look at our industry, we're all trying to figure out how to adapt. I think the process of developing software has probably changed more in the last year and a half than in my 20-year career before that. It is wild. So how are you adapting across all of those different points using Codex?
C
It maybe goes back to that co-evolution. You can do a lot of first-principles thinking and try to understand exactly how you should structure the teams and the work to best benefit from this, or you can stay very flexible and learn every day as you co-evolve the ways that you work as a team, as an individual, and as an organization together with the coding agent. And that's definitely a lot of what we're seeing. For example, small teams that have a lot of energy and ambition are able to achieve so much more and are highly effective because they can iterate and learn much faster. We've seen this with Sora, and we've seen this recently with Atlas as well, where entire parts of the code base were spun up based on just an idea and a few individuals steering a whole series of Codex agents. But it's also clear that bottlenecks are moving around. Code generation is almost, maybe, solved right now, and the bottleneck is moving to code review, to deployment, and also to planning and bringing in a lot of these ideas and the user feedback. We're thinking about how to solve those bottlenecks. On the Codex team, we're definitely not just focused on code generation. This is why we started to invest very early on in code review, because we identified that this was going to be a bottleneck. So there's more to the story here. Some of the bottlenecks we anticipated beforehand, and with some of them it was like, ah, we hadn't really thought about this, and now it's breaking because everything else has gotten so productive.
D
Yeah, the thing that has really surprised me since joining OpenAI is just how small some of these teams are that build these products that reach billions of people. I remember chatting to a designer who was on, I think it was Deep Research, one of these products, and it was, you know, one PM, one designer, a few engineers, a few researchers. It's purposefully small by default. And I think the way we're able to do that internally is that we're co-evolving as coworkers with the models as well, right? We're building the models, and we're able to access them immediately and really integrate them into people's workflows. So that's very cool to watch. And also, the way that we're building products is that we're building for professional software developers, which means thinking through the entire life cycle of product development. As Thibaut says, it's not just writing code. It's the planning process at the beginning, right? So it's using tools like Linear or Slack and meeting people where they work, where they speak, where they plan work, and integrating coding agents there. It's about the code review point as well, which Thibaut has already spoken about. So I think an interesting thing to look at in the future is thinking through the full software development life cycle and where you can support people beyond just code generation.
C
And there are parts there that are easier to crack. Going back to the safety and sandboxing part of the conversation, code generation is clearly one of them: it's easier for it to happen in a sandbox. If you think about what happens next, around deployment and being on call for a service, you enter a whole other realm. If we want intelligence to be driving this, if we want agents to be driving this, they need to act in a way that also carries a lot of risk. How do you do this? How to achieve this safely is still very much, I think, an open question.
B
Another question I have. As we talk about applying this to a wide range of things, one of the things that I've definitely observed in my work with a bunch of different tools is that different models seem to be better at different things. We do a lot of work in Go, and when GPT-5 came out, it was phenomenal at working with Go. Hands down, it blew away every other model we were using. It's sometimes less good at working with HTML and CSS, and we still sometimes go to other models, maybe even non-OpenAI models, for some of that work. How do you think about the sort of multi-model aspect of this? Are you aiming for one model that can do everything, or are you imagining a multi-model future where you go to the right model for the right thing? How do you see that ecosystem playing out?
C
Yeah, we're definitely aiming for the holy grail of one model that is spectacularly good at everything, so that you never need to think again about which model to choose. In practice, what we think it's going to evolve into is more of a multi-agent type of world, where you don't necessarily have to be the one deciding what the right underlying setup is: which precise model, which configuration, which tools to achieve that job. Maybe you will get help there as well. And just as humans collaborate in order to achieve useful things in the world, maybe it will be the same for agents, where they have to collaborate and use the specific strengths that they have. There is a whole series of issues there. As a model, how do you disclose your strengths? Is it something that the model even knows, intrinsic knowledge that the model possesses? Or is it something that needs to be discovered by you as a human, or by other models, in order to understand that, hey, this is actually the strength of this particular setup versus this other setup, which achieves maybe similar results at lower cost, or maybe this one achieves better results but at higher latency? So there are all these trade-offs, and I think it's going to be this beautiful world of collaboration between agents, but hopefully also much simplified for you as a user.
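One way to picture the quality, cost, and latency trade-off described here is a small selection routine that picks a setup for the user. The Python sketch below is purely illustrative: the `AgentSetup` record, the setup names, and every number in it are made-up assumptions, and nothing in the conversation says Codex selects models this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AgentSetup:
    name: str
    quality: float    # hypothetical score in [0, 1]
    cost_usd: float   # hypothetical cost per task
    latency_s: float  # hypothetical wall-clock time per task


def pick_setup(
    setups: List[AgentSetup], min_quality: float, max_latency_s: float
) -> Optional[AgentSetup]:
    """Return the cheapest setup that meets the quality floor and latency budget."""
    eligible = [
        s for s in setups
        if s.quality >= min_quality and s.latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda s: s.cost_usd) if eligible else None


# Made-up catalogue, for illustration only.
setups = [
    AgentSetup("small-fast", quality=0.70, cost_usd=0.02, latency_s=20),
    AgentSetup("balanced", quality=0.85, cost_usd=0.10, latency_s=60),
    AgentSetup("large-slow", quality=0.95, cost_usd=0.40, latency_s=300),
]

quick = pick_setup(setups, min_quality=0.8, max_latency_s=120)
thorough = pick_setup(setups, min_quality=0.9, max_latency_s=600)
impossible = pick_setup(setups, min_quality=0.99, max_latency_s=600)
```

The hard part raised in the conversation is where scores like `quality` would come from in the first place, since a model may not know its own strengths. The sketch simply assumes they are given.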
D
Yeah, the meme in the design world is that all designers are redesigning the composer, right? You try to work with this tension of how much you expose the capabilities, these different modes, these different amazing things that it can do, like image generation, for example. A model like 4o is natively multimodal, so you can just ask it stuff and it will do it. But how do you expose that in the UI? It's the same with the model picker, right? That's a meme as well, as we go back and forth on it. Not just us, everyone. Do you list out a thousand different options, and you yourself have tested them so you know which one is exactly right for your use case? Or do you simplify it? As Thibaut said, obviously we're aiming for the ideal, the single model. But yeah, how we get there is...
B
Now if I open up, I have Codex CLI running here, and I do /model, I see a list of five. So you're clearly not falling into the show-everything approach. One difference I'm going to ask about here: I see, for example, it's defaulting to GPT-5.1 Codex Max. There's also 5.1 Codex. There's also 5.1. If we were to peel back the covers, how would you describe the difference between any 5-generation model and the Codex version of that?
C
Yeah, so where we started, when we got significant traction with Codex and the Codex CLI, was roughly three months ago when GPT-5 came out. Just saying that, I have to do a double take. That was three months ago, three and a half months ago, GPT-5. And then we had been training another model on the side which was even more effective, specifically within the Codex harness. So this is how to think about it: you have GPT-5 and then you have GPT-5 Codex, and GPT-5 Codex is a version that will be more at ease within the harness that Codex provides and able to achieve better results. So this is always the model that we recommend. You have the same for 5.1 and 5.1 Codex. And then with 5.1 Codex Max, we had a few research breakthroughs that we packed into that model, which made it even more effective and able to work for longer. We published a benchmark, and there are better results across the different tiers. So it's able to achieve stronger results while also using fewer tokens and being cheaper on average, which allows us to pack a lot more into the same subscriptions. Whether you have a Plus or a Pro subscription, you just get more out of it. At the end of the day it's really about how much economic value you are able to achieve, either in a unit of time or in a unit of cost. This is really what we're striving to provide. We've restricted the model picker to the few models which we think work very well in Codex. And then there's a default as well, which is the one we recommend. If you don't really want to think about it, just use the default and you'll be well off.
B
And what goes into making the model... You said it works better with the Codex harness. And I will say, within the Codex CLI, I always use the recommended one and it seems to just work.
C
That's great.
B
When I'm using, for example, Cursor, I will often also use GPT-5.1 or whatever. And actually, in that context I've found that the bare 5.1 model often works better for me than the Codex model. So now I'm like, what is it you're doing that's connecting it to that harness?
C
Yes, it's really about thinking about that co-evolution of the harness and the model, and thinking about them as one entity and one agent. Fundamentally, what we're building as the Codex team is an agent, and then we figure out where to put it to work. And the agent isn't just the model itself. The agent is the model together with the set of tools and the way that it's going to handle its context and be able to think and reason through which actions it should take. And it's pretty clear that if you co-evolve and co-train these two things, you can achieve better results, which is what we're achieving.
D
I think one cool thing as well is that Codex, the CLI product, is completely open source. So to your question of what's going on under the hood, you know, the great thing is that we have a really vibrant open source community who contribute a lot of great ideas and issues. And you can just go and look at the system prompt. It was also a funny thing: when we released the new model, there was this tweet which was like, "system prompt leaked." It's like, yeah, it's in the open source repo.
C
It's just right there, like nothing to hack.
D
So yeah, so I think like in terms of new capabilities or tools, go and have a look. Which I think is super exciting.
C
Yeah. There's a lot of effort and research that goes into what the optimal tools are to get the results that you want. And oftentimes we're actually quite proud of how simple the harness is and how simple the set of tools is. That simplicity is something that we strive for: being able to have the harness scale with the continued jumps in capability that we expect to see over the coming months and years. If you don't optimize for that, it eventually sort of comes back to you, because you have hyper-optimized something in the short term that doesn't scale with continued capability improvements. And then, being so close together on Codex, we run it as one unit. We have product, we have engineering, we have research. We all sit together, ideating a lot, taking some techniques from research to put in the harness, and taking parts of the harness and using them in training. So there's this sharing of ideas and always zooming in on what will make the agent perform better as one unit. It's not about optimizing the model in isolation, and it's not about optimizing the harness in isolation. It's about finding the combination that works best together. And that's what the Codex series of models offers as well: that guarantee that we have considered how well it actually works in Codex, and that it's the best that we could do.
B
If it's not too much secret sauce. How do you consider that? Is that related to the reinforcement training that you're doing? Is it a different initial data set? What is actually causing it to behave differently there?
C
It's really about thinking about the model as not just needing to be intelligent, but needing to be an efficient agent. And if you think about what an agent is, it's a model that gathers its own context and acts in its environment in order to achieve a goal. So if you set out to train a model to be extremely good at that, to be an extremely good coding agent, you're making different trade-offs. You find that you are able to make different trade-offs at the research level, be it in the post-training or the RL or the specifics of the training, which we're not going to go into. But the trade-offs are there, and so you're able to achieve efficiency gains and move up the performance curve.
B
Now I mentioned some of the models I see. There's one other model that I see in this list which is GPT 5.2, which I think was not there when I looked at this a week ago. So what's that about?
C
We just released it yesterday. It's been very successful, more so than maybe we anticipated. Some of the team was actually up all night shuffling compute around and making sure that it kept working and that we were achieving the latency targets that we have for agents. The latency of the model and the reasoning, and exactly where the compute is, is more important than ever before. This is because you have that latency element between the GPU that we run somewhere and your computer, where the tool calls are run. So you always have this back and forth. And then obviously, if the model performs well and we're able to sample more tokens per second, that's going to translate into a shorter amount of time to get the result. 5.2 is a particularly exciting model launch. I would say it's a significantly bigger jump than one might expect compared to 5.1. GDPval captures this fairly well. I think a lot of the benchmarks these days are saturated, but a good way to think about it is the economic value that you're able to create in the world, and on GDPval I think we see more than a 20% jump. So I definitely recommend trying it out in Codex. It's quite exciting. Depending on when this podcast goes out, you might have something even more exciting to try out, but we'll see about that.
B
I think that is interesting, thinking about that interaction between local and data center. So how are you all thinking about that? I mean, some of that I'm sure is proprietary, but how are you thinking about locality in this? Are you pushing compute? What does that look like for somebody at the scale you're at?
C
The closer the compute is to your laptop, if that's where you run Codex CLI, the better, because you reduce that round trip. Another way of doing it is to bring the compute environment closer to the compute. So to bring, for example, virtual machines and have those effectively be as close to the GPUs as possible. That's the approach that we take with Codex Web. But then if you're running locally within your VS Code extension and the agent runner is effectively running on your machine, then you do want that GPU to be as close to you as possible. So there's an element of where in the world is that running for you? And sometimes, if you're somewhere in the middle of nowhere on an island and we're not running a data center there, you're going to feel that actual latency.
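A rough back-of-envelope sketch of the latency point above, written for illustration only. It assumes one network round trip per tool call and a fixed number of sampled tokens per call; the function name and every number below are hypothetical and are not measurements of Codex or any OpenAI model.

```python
def estimated_task_seconds(tool_calls, round_trip_ms, tokens_per_call, tokens_per_second):
    """Estimate wall-clock time for an agent task.

    Simplifying assumptions (not from the episode): each tool call costs one
    network round trip between the GPU and the machine running the tools,
    and the model samples the same number of tokens before every call.
    """
    network_seconds = tool_calls * round_trip_ms / 1000.0
    sampling_seconds = tool_calls * tokens_per_call / tokens_per_second
    return network_seconds + sampling_seconds


# Hypothetical scenario: 200 tool calls, 150 sampled tokens per call.
near = estimated_task_seconds(200, round_trip_ms=30, tokens_per_call=150, tokens_per_second=100)
far = estimated_task_seconds(200, round_trip_ms=250, tokens_per_call=150, tokens_per_second=100)
faster_model = estimated_task_seconds(200, round_trip_ms=30, tokens_per_call=150, tokens_per_second=200)

print(f"data center nearby:   {near:.0f} s")          # 306 s
print(f"data center far away: {far:.0f} s")           # 350 s
print(f"nearby, 2x sampling:  {faster_model:.0f} s")  # 156 s
```

With these made-up numbers, both effects described in the conversation show up: the round trip is paid once per tool call, so distance adds up over hundreds of calls, and sampling more tokens per second shortens the total directly.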
B
Let's actually dive in a little bit to the guts of the agent. Because I think one of the things that most of the software development world right now is trying to figure out is how to build effective agents. And I think coding agents are really at the frontier. They're pushing the edge of what that looks like. So can we kind of break down, just first, the very high level pieces that you think of that go into this agent? The software layer, not the model.
D
Yeah.
C
What are the higher level pieces? We touched a lot on it already. So you have the model and the inference; that's going to be the intelligence that's driving the rest of the software stack. And so you have this interesting combination of a piece that is non-deterministic and a piece that is deterministic, at least for now. A lot of the harness is considered to be deterministic, and it's quite simple. If you look at it under the hood, and it's all open source for Codex, there isn't that much magic. It's a for loop and then a bunch of tool calls and then tools that have been designed to work well for coding. And it's a pattern that you can apply essentially to any other discipline and any other agent. It's that control going from the model back to its environment, executing an action, and then taking what's been observed in the environment and pushing that back to the model in order to decide the next action, and then doing that over and over and over again, maybe hundreds of times, until the point where the model believes that the desired outcome has been achieved and decides to stop. So at the very beginning you have a prompt or an intent from a user, then you give control to the model, which decides on the next tool call, goes on and on and on, and at some point has achieved the goal, or is unable to achieve it, and decides to yield back control. And that's when the agent has finished its job. It's delightfully simple. It's just a couple of tools, a for loop, and then a model that's given control over that. But the really exciting thing is not that. I think it's really the products around it that allow you to control, steer and supervise those agents. And then as well, thinking about the agent as its own little system that will continue to evolve, becoming increasingly more complicated and able to perform increasingly complex work, it's maybe not a single agent that's going to be at work.
Maybe it's going to be like multiple agents that are going to be at work. But I think a really exciting thing is how do you interface with this ever more complex system that is doing work on your behalf?
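The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Codex implementation: `run_agent`, `scripted_model`, `TOOLS` and `FAKE_FILES` are hypothetical names, and the "model" is a hard-coded stand-in so the example runs without any API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A decision is (tool_name, argument) to run a tool,
# or (None, final_message) when the model yields control back.
Decision = Tuple[Optional[str], str]


def run_agent(
    prompt: str,
    model: Callable[[List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 100,
) -> Tuple[str, List[str]]:
    """Give control to the model, run its tool calls, feed observations back."""
    transcript = [f"user: {prompt}"]
    for _ in range(max_steps):
        tool_name, payload = model(transcript)
        if tool_name is None:
            # The model believes it is done (or cannot proceed) and stops.
            transcript.append(f"agent: {payload}")
            return payload, transcript
        if tool_name in tools:
            observation = tools[tool_name](payload)
        else:
            observation = f"error: unknown tool {tool_name!r}"
        # Push what was observed in the environment back to the model.
        transcript.append(f"tool {tool_name}({payload!r}) -> {observation}")
    return "step limit reached", transcript


# Hypothetical stand-in for a real model: picks the next action
# from how many tool results are already in the transcript.
def scripted_model(transcript: List[str]) -> Decision:
    tool_steps = sum(1 for entry in transcript if entry.startswith("tool "))
    if tool_steps == 0:
        return ("list_files", ".")
    if tool_steps == 1:
        return ("read_file", "main.py")
    return (None, "Inspected main.py; nothing to change.")


# Hypothetical in-memory environment instead of a real file system.
FAKE_FILES = {"main.py": "print('hello')"}
TOOLS = {
    "list_files": lambda path: ", ".join(sorted(FAKE_FILES)),
    "read_file": lambda name: FAKE_FILES.get(name, "<missing>"),
}

answer, transcript = run_agent("Check main.py", scripted_model, TOOLS)
for entry in transcript:
    print(entry)
```

The deterministic part is everything in `run_agent`; the non-deterministic part is whatever sits behind the `model` callable. A real harness would add sandboxing, approvals and context management around the same loop.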
D
Yeah, totally. And there are some parallels, I think, if you look at ChatGPT and when it was released, right? It's very, very simple. You go online, there's a text input, you type some intent, some message, and the model responds back. But as Thibault says, in this world where we have an agent loop and an agent is carrying out work for you, maybe it's delegating to other agents, it's collaborating with other agents, it's speaking to external agents even. I think the user experience changes. It goes from this back and forth to a little bit more like how we interact with other humans in the world today, right? Like if I ask Thibault to, you know, get me a glass of water, it's going to take him a little bit of time. He's got to go outside, he's got to do a bunch of stuff, right? Or if I'm collaborating with a colleague, I might ask them to do some significant task, so, you know, build some new infrastructure project. It'll take time, they'll have to go out, they'll have to coordinate. So I think we're moving to longer and longer tasks with more and more complexity and, you know, models that are increasing in capabilities. And I think the interesting question from a product perspective is then how do you design those interactions in a way that is simple, maintains simplicity, also fits into, you know, everyday workflows, and also exposes these incredible capabilities of the model in a very simple way.
B
So one of the things that's interesting to explore in this domain of like user experience of agents, especially as they're going on is like when I wrote code in the olden days, right. It was doing multiple things for me. It was creating this runnable artifact that somebody could interact with. Okay, that's great, that's useful. It was updating my mental model of the system that I have that I'm working with. And it was also doing some amount of like, problem solving and updating of my mental model, probably of the user's problem, or at least how to map that user problem into my system. And so there's these, like, cascade of mental models that I'm updating as well as this final artifact that's being generated. So now as we get into this world where we are delegating more and more of the work of generating the artifact, there is still this very real need for us to update our mental models across the board. So how do you think about or see that working in this agentic world? How does the product facilitate it? What does that look like?
C
Yeah, I think that's a super important point, and it ties a lot into, in my mind, what we've seen people use coding agents for primarily over the last, say, three to six months, which has been a lot to solve problems and write code on their behalf. I think there's a much deeper role that agents have to play in the future, which is to understand, hey, what do you care about? How can it help you understand the state of the world around you efficiently? Maybe it should send you something every day: here's how the code base changed, here's what users are thinking about the product, here's how to really explore this topic a little bit more. And so you go much further than just the code generation. You're helping with planning, you're helping with ideating, you're helping with understanding user feedback. You're bringing a lot more context into play than just code itself. And in a way, if you were just to focus on code generation, you would miss out a lot on the opportunity here. We're thinking about this broader set of things that we can help people with. I think it's going to be ever more important; maybe code generation will actually be a very small part of what agents end up doing for you. And, you know, we're definitely thinking about this at the product.
D
Yeah, yeah. I mean, it's interesting. A very small example on the team is, you know, when a new starter comes on board, right, it often takes a long time to get used to a code base. You have to really get to understand it. But yeah, as well as writing code, I've seen new engineers on the team just, you know, speaking to Codex and really deeply understanding the code base, going back and forth. And that means that they don't need to tap on their colleague's shoulder as much anymore. If they do, it's for some really high value touch point. But yeah, as you say, I've seen people use it for all sorts of things: writing notes, code understanding. So it's really beyond just code generation.
C
Yeah. And there's this awesome thing about giving Codex to someone who just started on the team and being like, hey, explore the code base with the help of Codex. And then we barely write documentation on how things work, because that's just in the code itself. What we tend to document more is why things exist. And so I think there is going to be an evolution there as well of how do we maintain the knowledge base, how much of it is redundant? Definitely when you have sort of intelligence and a little buddy that you can just send off in order to explain something for you. So you tend to find that maybe it also shifts what you want to write down.
B
Yeah, absolutely. Well, and there's kind of an interesting thing there. One of the techniques that we found works really well for us with agents is actually documentation that is maybe transient in some form. So it's like, here's this problem that I'm solving. Gather all the relevant pieces and documentation and link to the relevant files so that I have one short piece of context. Okay. Now use that to get me to a solution on this particular thing. And so it's like much more temporary documentation than permanent documentation, but giving the agent this map that it can work with.
D
Yeah.
C
Is this more like a design doc or how does it differ?
B
So we can dive into this in a couple of different ways. I'll use a very quick example of one of my common practices. And I've done this with Codex or other agents. So you have a problem to solve. And I know roughly the area of the code base involved, but maybe I don't know it that well. So I'll say to Codex, for example, hey, I'm going to be wanting to muck around with this subsystem. Please do an analysis of how that system works today. Write me a document that includes, you know, file links and symbols and all these other things. And conceptually, to me, what I'm doing is I'm creating a map of the territory. It's essentially a context condensation, right? It doesn't need to read all those files all the time. But it needs to know roughly where everything is. So when it needs something, it can pull it. Okay, now I have that subsystem and I say, okay, I'm looking for a solution that looks something like this. Can you map out three different variations of that, have an argument about which ones are better, whatever. Make some characters do this, kind of map out the solution space. Now I have these two very rich documents and I can say, okay, based on these, look at this, look at this, pick which is the best solution. Write me an implementation plan. Okay, pretty good. Break it down into a set of task lists, go. So I'm in some ways manually managing the process of this, but kind of guiding it towards here's all the relevant parts of the code base, with me in the loop. Often to be like, no, you missed something over here, you've got to go look at that again. Or something along those lines.
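To make the "map of the territory" idea concrete, here is a minimal sketch of what such a condensed context document could look like if it were produced by a script rather than by the agent. This is an illustration only: in the workflow described above the agent writes the document itself, and `build_context_map` is a hypothetical helper that only handles Python files.

```python
import ast
from pathlib import Path


def build_context_map(root: str) -> str:
    """List each Python file under `root` with its top-level classes and functions.

    The result is a short text document that points at where things live,
    so a reader (or an agent) can pull a file only when it is needed.
    """
    lines = ["# Context map"]
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError, OSError):
            # Skip files that cannot be read or parsed instead of failing the whole map.
            continue
        symbols = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        listing = ", ".join(symbols) if symbols else "(no top-level symbols)"
        lines.append(f"- {path.relative_to(root).as_posix()}: {listing}")
    return "\n".join(lines)


# Example usage (the directory name is a placeholder):
# print(build_context_map("path/to/subsystem"))
```

A map like this is deliberately lossy. It trades detail for size, which matches the point above that the agent does not need to read every file all the time, only to know roughly where everything is.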
C
The workflow that you're describing is extremely powerful, and it's all based on files and you deciding, hey, this workflow is actually very useful for myself. And you discovered it by yourself, maybe talking to other people, and there's this sort of sharing element right now of recipes of how to work with agents. It's not necessarily that the product is prescriptive about it. It's delightfully open ended, actually, right now, where you can ask Codex to do anything for you. And there's this creative aspect of what do you actually ask it? How can it help you? And maybe it's through this complex workflow of planning and then ideating on different options and then going and performing some implementation of it. We like that flexibility a lot. And we also try to be very mindful when we introduce more opinionated frameworks into the product that could restrict that flexibility. That's definitely not something that we want.
D
Another interesting angle as well is just seeing how some of the maybe more non-technical people across OpenAI use it, or people in different disciplines from traditional software engineering. So there's one person that I know who basically uses it for everything; he writes documents with it. You know, on the design team I know a lot of the designers and the product managers, they might do some coding, but they'll use it for lots of other things. Ideation, as you say, planning. Some very cool things as well on the data science and go-to-market side, a lot of just data analysis, right? Crunching through numbers, you know, looking through a CSV. So yeah, and I think from a product perspective, you know, we've been deliberately opinionated about keeping things simple. Just like ChatGPT, right? It's this general purpose interface that you can go to and you can ask it to do anything. With ChatGPT that might be, you know, generating images, answering a question, searching the Internet. And I think for us, you know, the amazing thing about a coding agent is it's extremely general purpose. So you really want to keep it as simple as possible and then let the user, you know, let that creativity run wild.
B
So on that note, and looking forward a little bit, one question I'd love to ask is, are there any plans for enabling something like an inside-Codex SDK, or hooks, or some other way to customize? Because, for example, as I mentioned, I have this workflow, and I'm figuring out, oh, I want to steer it in this way to fit in my workflow. It would be great if, for example, every time there was a context pull or something like that, it reinforced this, or other different ways to kind of nudge and tightly control the context that's going on, to fit esoteric workflows that you don't want to have in the core agent.
C
Yeah, what you're getting at is really what we're seeing with the power users. Even within OpenAI, some of our most prolific users maintain their own fork of Codex. That's one of the awesome parts of it just being code that's open source as well. If you want to change it, you can just fork the code. If you happen to be advanced, that shouldn't be too daunting either. Codex can help you change it in a way that's productive to you as well. It is written in Rust; sometimes we get some comments on that. But we want it to be very robust and performant as well. It's quite delightful when you just type codex and it opens instantly. That's what we get from putting a lot of effort into that. Hooks are something that we're debating. We'll get there eventually. What we're super excited about right now is building the right set of primitives for the agent to be able to perform increasingly complex work. And so you can think about what will happen if you're able to run an agent for an entire day or maybe an entire week and steer it as it goes. Is that a different thing? Does that require different product thinking? And then we touched upon multi-agent as well. And this is something that we think is extremely exciting and is going to emerge in 2026 for sure as something that is not at a prototype stage like you're seeing across the industry right now, where maybe folks are excited about their little sub-agent, but it's going to be these really robust networks of agents collaborating together in order to achieve something for you. That's the kind of stuff that we're really excited about right now. Hooks, maybe at some point.
D
Yeah. Nothing massive to add. Just to say we have a Codex SDK, so it's possible now. We have a documentation page up on it and you can start to play with it. But yeah, to Thibault's point, I think there's this tension between catering for very specific workflows and then thinking about what these primitives are, these building blocks, so that you can build on top of that and build some pretty incredible experiences.
B
What would you say the missing primitives right now are?
C
We have a long list of GitHub issues. Some of them are top voted. Those are actually the ones that we tend to prioritize. One of the things that's really been requested is sub-agents, and we're actively working on how to think about multi-agent networks. And then a lot of it is still product overhang, I think, where it's not about the agent itself, but how can we make the product more delightful and more interesting and better suited for managing, steering and supervising agents at scale.
B
Scale.
C
That's what's keeping us very busy right now.
D
Yeah. Again, without going too deep into the roadmap, I think one interesting provocation to think about is, as using agents, especially in a multi-agent world, becomes incredibly complex, how do you stay on top of that? How do you keep track of what different agents are doing, what actions they're taking, and whether you need to give any permissions and the like along the way? If there are any artifacts that they've created, whether that's code or something else, keeping track of that and staying on top of it. I think for me, as a designer on the team, it's a really interesting interaction design problem, right? Say we're moving from this world where you watch a rollout of a minute to, as I say, this 10 hour job. How do you stay on top of that? How do you keep it delightful? How do you meet users where they are, so you're not context switching all the time between all of these different things? Yeah, as well as the core primitives from an engineering perspective, those are the kinds of problems that we're thinking about on the product side.
B
One other thing you talked about there was all of the non-technical use cases. And I think one of the most amazing things I've seen as coding agents and LLMs, even just as coding assistants, have grown is the extent to which subject matter experts are now able to build at least their own prototypes, and often their own applications, to help them in their workflows. Are there any aspects from either a product or technology standpoint that you're thinking about, particularly for those non-technical users, looking forward?
C
Yeah, we're thinking about it, especially since it's been sort of a natural thing that's been happening, where we see increasing numbers of non-technical people inside OpenAI and outside of OpenAI use Codex in their terminal and get it to do cool things for them. It definitely got us thinking about how we can do this better. And also there is this sort of pull for generality, as ultimately the very best coding agent is a general agent that's able to reason across much more than just code. The Codex agent and models are extremely good at instruction following. People find it very useful for data analysis, for editing spreadsheets, for doing market research and these things. And it's definitely something that we want to lean into and cater to at some point. At the same time, right now we're also laser focused on making Codex the very best tool for professional software engineering. And there's this tension of, hey, if we want to be really good at this, should we also think a lot about these other things? But ultimately we see it combining very well. And so it also got us thinking. It's very satisfying to see Codex used for more and more things in addition to just being an extremely good tool for coding.
D
Yeah, and from a product perspective, there are some things that come with building an agent, for example a coding agent. So on the web, right, you have to set up an environment. There are some things that you just can't get around which are pretty technical. And you know, if you're a software developer, you need to kind of go through these processes. But I think from a core product experience perspective, there's also more that we can do. And this is what I'm focused on, which is just what's that first experience? Is it delightful? Is it simple? Can you as a non-programmer kind of, you know, rock up and just get involved? And how can that be an on-ramp for you to learn more about coding and to get deeper into it yourself? So this is something in the design team. For example, we had this offsite and we had a few people on the team who code going around and basically onboarding everyone into Codex, into the CLI product or into the extension, depending on where they worked. And you know, to be honest, for some of the non-coders it's a little intimidating to get into the terminal, and they were installing npm and these things that might be a little new for them. But once they got on board, and once they saw the work that the model was doing and started to learn a bit about it, you know, some of the people who just dipped their toes in, now I see them coding more and more. So I think it's also a really cool opportunity to expand the aperture of what a software developer is and create a really great on-ramp for people to learn more and go and dig deeper themselves.
B
Awesome. Well, we are getting close to the end of our time. Is there anything we have not talked about yet today that you think would be important to leave folks with?
C
One thing we haven't talked about is the mindset that's important to continue to adopt. And I feel like it's an amazing time to have problems, as solving them has never been easier. And then there's also this aspect of, hey, this really helps with answering questions, and there's this curiosity that gets super rewarded right now, and definitely being able to try things and get interested in changing your approach to how you're going about your day and thinking about solving the problems that you have. Maybe you had ways of doing that that were effective for you years ago and you've stuck to them. I definitely feel like it's the right time to question everything and try new things. Personally, I find this super exciting. Having always had many ideas and unsolved problems, I'm finding that the number of problems that are unsolved reduces with time. I hope someday agents will also be able to creatively come up with super interesting problems that I should be thinking about. We're not there yet, but, you know, what a time to just try these things and get a ton of new things done.
D
Yeah, plus one. I think also what a time to be a creative. As a designer on the team, when I'm speaking to young designers, or when I occasionally teach or mentor young folks, the main thing I say is just get involved and give things a try, because there's never been a time where curiosity has been better rewarded by really just getting your hands dirty, pushing yourself out of your comfort zone, and very quickly realizing that, you know, you're able to achieve way more than you might have thought before, even on a week by week basis. If you look over the past few weeks with all of these model releases, you know, it's just crazy, the acceleration that's happening. So, yeah, just stay curious and get involved.
C
Yeah. It's not long ago, six months ago, where you would show static Figmas or slides and just be like, hey, you know, this is an idea of mine. And now it's fully functional little products where I'm like, whoa, this is better than what we have shipped in production. It's like, we'd better get this out soon. And that's a step change in what you're able to achieve solo as a designer. I don't know if even referring to you as a designer does it justice anymore. There's this blurring of roles that's quite delightful.
D
Yeah, there's never been a better time, I think, to be a software engineer or a designer.
With Thibault Sottiaux and Ed Bayes
Date: January 29, 2026
Host: Kevin Ball (K. Ball)
Guests: Thibault Sottiaux (Codex Engineering Lead), Ed Bayes (Codex Product Designer)
This episode delves into OpenAI’s Codex project—an agentic AI coding system at the cutting-edge of software development. Thibault Sottiaux and Ed Bayes join host Kevin Ball to explore the technical underpinnings, product design, safety and sandboxing, code lifecycle changes, model evolution, usability for both developers and non-technical users, and the transformative potential of coding agents in the industry. The conversation is grounded, insightful, and candid about trade-offs, future directions, and the blurring boundaries between traditional roles in software.
“But if you think about the harness, it's really just your body, right? You have your brain, you have your body, like how you end up acting upon the world around you.”
(05:08, C: Thibault Sottiaux)
B (K. Ball): “Codex has never tried to delete my database, which is not true of every coding agent I've tried.”
(09:13)
“How much do you expose the capabilities, right, these different modes, these different amazing things that it can do… how do you expose that in the UI?”
(21:43, D: Ed Bayes)
Thibault and Ed emphasize a moment of profound acceleration—bottlenecks are moving, roles are blurring, and the software creation process is more accessible (and open-ended) than ever. Codex stands not just as a productivity boost for professionals, but as a gateway for the creatively curious to enter and shape the world of code.
“What a time to be a software engineer or a designer.” (51:47, D: Ed Bayes)