Summary6 min read

Latent Space: The AI Engineer Podcast

⚡️GPT5-Codex-Max: Training Agents with Personality, Tools & Trust

Guests: Brian Fioca & Bill Chen, OpenAI
Host: Latent.Space
Date: December 26, 2025

Episode Overview

This episode brings together Brian Fioca and Bill Chen from OpenAI for an in-depth conversation about the evolution and future of coding AI agents—specifically, the new GPT5-Codex-Max model. The discussion focuses on training agents with distinct personalities, advanced tool integrations, building trust with engineers, agentic architectures, and the new abstraction layers enabled by next-generation coding models. The guests share insights into agent design philosophy, real-world use cases, and the cultural and technical shifts underway in AI-powered software engineering.

Key Discussion Points & Insights

1. Naming and Design of Codex-Max (00:35–02:31)

Origins and Meaning: The name "Max" was chosen to distinguish from previous models and signal its ability to run for extended periods (24+ hours), favoring speed and maximalist operation over deliberate, slower alternatives ("Pro").
- “Max can run for a really long time. Like, 24 hours or more. It’s about speed and maximization, like maximalist.” — Brian Fioca (01:44)
'Max' Model Performance: Not just longevity, but improved speed and quality—"better and faster" coding outcomes for the same types of problems compared to previous models.

2. Training for Personality and Trust (03:02–04:11, 09:15–11:36)

Purposeful Personality: Training the model with a trustworthy “pair programmer” persona was central to increasing developer adoption. Characteristics like communication, planning, and verification were emphasized.
- “It’s really important to build trust with developers...if a model doesn’t act the way that you expect or doesn’t work alongside you as well, you’re not going to really trust it.” — Brian (03:15)
- The model communicates its thought process, announces tool usage, and keeps users informed to foster collaboration.
Customizing Personality:
- “I created a [more fun] personality for my coding agent…because I want my tools to be fun to work with.” — Brian (11:12)
- However, verbosity can be a downside in long-run agentic tasks; toggling “personality” is easier in general-purpose GPT-5 models than in the more opinionated Codex line.

3. Tool Use & Harness Design (04:36–08:36)

Integration and Adaptability:
- Codex’s training is deeply tied to terminal tool interactions. Partners discovered increased performance by mimicking terminal tool conventions even for non-terminals.
  - “If you call it ‘grep’, it does a little bit worse. If you call it ‘rg’, it actually does really well.” — Bill (08:24)
- Model “habits” emerge akin to human muscle memory, making harness and naming conventions critical to outcomes.
Model Generalization: The "5.5.1 non-codex" (mainline) models are more general and steerable for diverse tools, at some cost to specialized performance.

4. Shifting the Abstraction Layer Upwards (11:55–15:56)

Agent Layer Abstraction:
- More opinionated, agent-centric design—packaging agents (like Codex) as ready-to-plug-in units for platforms, decreasing the need for continual adaptation to model upgrades or API changes.
  - “Rather than focusing on optimizing with every single model release, you can just plug in an agent like Codex into your platform and use it as a box.” — Bill (12:45)
- Implications for Startups: Teams can now focus one layer above, integrating agentic behaviors without heavy lifting on harness updates or resetting for every new model.

5. Sub-Agents and Multi-Agent Architectures (14:04–15:56)

Emergence of Sub-Agents: Codex-Max’s self-managed context windows enable it to spawn and manage sub-agents, parallelizing work and opening up "multi-agent" workflows as a new design paradigm.
- “Codex Max manages its own context window...so it can run basically forever...and hand off its own context to sub-agents.” — Brian (14:46)
Next-Level Agentic Workflows: The team expects significant evolution in long-running, agent-coordinated tasks and modular integration—agents that use agents and create new abstractions on the fly.

6. Trust, Evaluation, and Applied Evals (16:10–20:23)

Organizational Uptake & Trust:
- At OpenAI, widespread adoption of Codex (“I haven’t written a single line of code by hand in months”—Brian, 15:56) hinges on rigorous evaluation pipelines, robust tracing, and meta-prompting for continuous improvement.
Evals as a Path to AGI:
- “The path to AGI goes through evals...There are a lot of academic evals...but a lack of evals of the real world on what people care about the most.” — Bill (17:43)
- Tight focus on “applied evals” to capture real user priorities, with platforms for agent traces and rollouts.
Multi-Turn Evals: Evaluating agents over extended interactions is becoming a core challenge. Ideas include using LLMs as judges for world trajectories, “job interview evals,” and building agentic harnesses that mimic real-world tasks.

7. Automation and Beyond Coding (22:28–24:51)

Personal & Professional Automation:
- Coding agents increasingly used for personal productivity—organizing desktops, automating mundane tasks, handling email, and beyond.
  - “I had Codex go through my messy directory...completely organize them...it was wonderful.” — Brian (23:51)
  - “I used it for something more boring—organizing my desktop.” — Bill (24:00)
- The lines between coding tools and general automation agents are disappearing.
  - “Coding tools are breaking out of coding and just...everything. They're personal automation.” — Bill (24:18)

8. Vision, UI, and 2026 Predictions (24:51–27:06)

Vision Native Agents:
- Current agents are not yet “vision native”—better integration of visual and UI-based tasks is expected by next year.
More ‘Computer Use’ Agents:
- Anticipation of agents building their own integrations even with legacy/no-API apps, suggesting a next leap in extensibility and general-purpose automation.
Democratization of Software Engineering:
- “I wish every company...could turn to a coding model and be like, hey, how do we do this crazy refactor...and have it be so trusted and so right and so smart that we can actually perform better than we could normally get access to.” — Brian (26:16)

Notable Quotes & Memorable Moments

On Model Habits:
- “Model training is literally like, they develop habits just like a person does.” — Brian (08:36)
On Model Steerability:
- “If you’re wanting to go bleeding edge coding focused, pay attention to the Codex line...people are having success bending it in ways that maybe we haven’t thought of.” — Brian (06:46)
On Agent Layer Abstraction:
- “Packaging that up more closely so we’re actually shipping the entire agent together. Then you can actually build on top of that agent.” — Bill (11:55)
On Model Trust and Testing:
- “At OpenAI, I get to work with some of the most amazing developers...I wish every company...could turn to a coding model and be like, hey, how do we do this crazy refactor...and have it be so trusted and so right and so smart that we can actually perform better than we could normally get access to.” — Brian (26:16)
On Automation Beyond Coding:
- “Coding tools are breaking out of coding and just like everything, they're personal automation agents.” — Bill (24:18)
On Evals and AGI:
- “The path to AGI goes through evals...applied evals is capturing all of those sorts of real-world use cases and things for us to hill climb together.” — Bill (17:43)

Timestamps for Important Segments

Naming & ‘Max’ Philosophy — 00:35–02:31
Training for Trust & Personality — 03:02–04:11, 09:15–11:36
Tool Use, Harness Design, and Emerging Model Habits — 04:36–08:36
Model Differentiation: Codex vs. Mainline Models — 06:40–07:38
Move toward Agent Layer Abstraction — 11:55–15:56
Sub-Agents & Parallel Workflows — 14:04–15:56
Evaluations: Applied Evals and Multi-Turn Challenges — 16:10–20:23
Personal Automation Use Cases — 22:28–24:51
2026 Predictions & Vision for AI Engineering — 24:51–27:06

Final Thoughts

Brian and Bill’s discussion paints a compelling vision for agentic AI in software engineering: faster, more reliable coders with rich, modular personalities and a growing capacity for trust. The abstraction is moving upward—soon, entire teams may simply integrate and build atop intelligent agents that handle everything from code integration to personal productivity. The guests foresee a future where agentic behaviors, enhanced by persistent context and evaluative feedback, radically democratize coding and automation—expanding the reach of “coding agents” far beyond code itself.

Contact & Feedback:
Brian and Bill invite listeners to share feature requests and feedback via email or social media, highlighting OpenAI’s open stance toward collaborative product development (27:11–27:29).

Loading summary

Transcript138 lines

[00:04]
A
Okay, we're here at AIE Code and we have two of our speakers, Bill and Brian. Welcome.
[00:10]
B
Hi.
[00:11]
A
Thank you for having us. Bill, Brian, I know you've been the listener for a little bit. Oh, yeah. What's your take on linspace? Like, how does it. What role does it perform in your functions at OpenAI?
[00:22]
B
Yeah, I mean, first of all, love the name. I'm a massive latent space context management person. Tell the story behind the name.
[00:29]
A
Give me the chance. Yeah.
[00:31]
C
So.
[00:31]
A
So we never had LSpace as a name. At the start it was called LSpace.
[00:37]
B
Interesting.
[00:37]
A
And one of my readers donated the domain name Leighton Space.
[00:42]
C
He's like, you aren't it?
[00:43]
B
I'm like, yeah, awesome. So Leighton just like, came accidentally.
[00:48]
A
She was in the ether. But, like, I didn't have the domain. Yeah. So I just, like, called it LSpace.
[00:53]
B
LSpace is like the Visqlaine.
[00:55]
A
Yeah.
[00:55]
B
No, it's amazing. I love it because you're, like, always on the cutting edge and it goes into a lot of detail about all the things that, like, I should be keeping keeping up with as part of my job. And there's so much to keep up with. Right. So there's only so many sources of really good, high quality information for what's, like, happening on a deep level.
[01:12]
A
Well, you guys have your own podcast now, so I'm like, you know, repetition.
[01:15]
C
Yeah. Well, I still listen to yours and I still think yours is really good.
[01:21]
A
So you guys, I guess, are representing like a startups team. Codex. Yeah, all the things you just launched. Codex? Yeah, the Clarice Max put Academy yesterday. Yep.
[01:33]
C
We're good on namings.
[01:35]
A
I do. People do make friends. I think Thibault was like, yeah, you know, we're good at a lot of things, but not naming. I was like, well, why call it Max? Was there any, like, internal discussion?
[01:45]
B
Yeah, I mean, it's complicated because it needs to be differentiated from the previous one. And the idea is like, Max can run for a really long time. We can go 24 hours or more. I've actually, like, sort of had it gone for more than that. And the name is, you know, it's.
[02:00]
A
Inside codecs on the whim. Is that. How do you. When you say really long time, 24 hours.
[02:05]
B
Oh, I. On my. On my. Oh, that's. I think that was on the web inside of Codex. I'm not sure, but I've actually done it on my local computer for quite a bit longer than 24 hours. Over the course of a couple days with closing my laptop and reopening it. But the name, you know, you could come up with something like pro, but Pro is sort of like slower, more thoughtful. Max is about sort of like speed and maximization, like maximalist.
[02:31]
C
For this model, it can run for a long time, but it can also actually for the same types of problems, it can actually get to the right answer it faster. So it's simply better and faster.
[02:46]
A
Yeah, yeah. So I think big part of what you guys are speaking about is the training that goes into something like that, right? Vaguely. People just kind of wave their hands, say rl. But like, what specifically have you learned about what's a good patty sauce on?
[03:02]
B
So I got to, I mean, this sounds weird to say, but I was lucky enough to be really close to the training team while GPT5 was training. And one of the big things that we focused on, Bill was there too. We focused on personality.
[03:15]
C
Right.
[03:16]
B
So it's really important to build trust with developers for like how a model works. And if a model doesn't act the way that you expect it to do, or if it doesn't work alongside of you as well, you're not going to really trust it. You're not going to get as much out of it. So for coding, we thought, okay, well, what is the best personality for a coder, for a pair programmer, for somebody you trust? And how do we like eval against that? How do we come up with behavioral characteristics? And we came up with things like communication. It needs to keep you impressed of what's going on while it's working. Planning, like come up with a strategy, do some searching around, like figure out context, gather, figure out what to do before you just dive in if it makes sense to, and then check your work. Right. And so these are just best software engineering practices that turn out to be behavioral characteristics. And we can measure the model's performance on those behaviors and grade it that way.
[04:12]
A
Yeah.
[04:12]
C
I will say that another key aspect to how we train to the model is you work really, really closely with some of our coding partners. And a lot of those folks lead on the bleeding edge. And so they have a lot of understanding of what particular particularities they need. And we really focused on sort of those areas and really draw. Dive deeply as into those.
[04:37]
B
Yeah, that's right. Especially tools. Right. So like different harnesses have different tools. Some people have context like semantic search, some people have different ways of doing code edits. And initially, you know, our models are trained the way they were trained to use tools. And that kind of bakes in a habit. And so we've been getting the models better at using different types of tools.
[05:00]
A
Yeah, a lot to follow up on but I'll go tools first and then I'll go back on the personality bits both the genius wise. I think the communication by the 5 codecs just came out was well this is the model trait for our Codex, not necessarily your choice. Right. Has that message changed for other startups using the 5 codecs model? Right. No.
[05:20]
B
So Codex is just to be clear, Codex is the frontier coding model that we have that is optimized for its harness. Yeah, the Codex team is very focused on creating a coding agent and they want it to work perfectly inside of the shape of the harness and API that we have. So they're completely unbounded. It's open source, so yes, that's open source and the model is available in the API. So that's what they focus on.
[05:44]
A
And then the conflict is well you just said other startups have other tools.
[05:47]
B
And obviously I know that it is possible.
[05:49]
C
Like one thing to mention here is I think we can probably disentangle a little bit on sort of the Codex apart from the sort of the mainline models. The Codex models are sort of focused on the agents itself. Right. Like the Codex agent itself, the model has been trained with the agent specifically in mind. That actually turns out to be somewhat even sometimes easier to integrate because we come into it with a firm opinion on what the sort of best way of using it looked like. And so some folks that we work with actually really appreciate that we come into it with an opinion. Well, for the other ones that has a more of a general or specific pools that they definitely need, the mainline model is the one that's more general in a sense and that's sort of what Brian was referring to when he talked about GPT5's tools getting.
[06:40]
B
Yeah, so the 5.5.1 non codecs is more general across the board. It can respond to things that are.
[06:46]
A
It's.
[06:47]
B
It's much broader than just codec. It has coding capabilities that are also mirrored in Codex and they work together to keep that trued up. But since it's more general, it does have more steerability to different types of tools. And when you're implementing tools the model can get bogged down if it hasn't seen a tool that it's used to and it might take more time thinking about how to use it or make more mistakes. So our recommendation is if you're wanting to go bleeding edge coding focused, pay attention to the codecs line and the Codex SDK and The Codex models, because that's the one that's, like, really aimed at that you'll have to do, you know, some work to, like, look at how we're implementing our tools inside of Codex to maximize its capability without bogging it down. But, like, people are having success, like, bending it in ways that maybe we haven't thought of.
[07:39]
A
If you're. I always want to pry. If.
[07:42]
B
Sure, yeah.
[07:43]
A
Do you have any examples you say bendy in ways you haven't thought of it?
[07:47]
C
Yeah, so I think so. Codex is trained with terminal tools in mind. And so what we've thought would be the case is you will essentially only have to, like, strip out. You have to strip out all of the tools except for the terminal tools. But we found some, like, partners of ours, like the discovery that what you can do is that you can actually still have a lot of the tools just named in the same way as the terminal tools, as well as having the same input and output. And all of a sudden, the tool called performance jumped up by a lot.
[08:19]
A
Yeah.
[08:20]
B
And Codix loves RIP grep, so if you make a rip grip tool and tell it to use it, it'll use it.
[08:24]
C
So if you call it grep, it actually does a little bit worse. But if you call it rg, actually does really well.
[08:32]
B
But, yeah. Yes.
[08:33]
C
This is something that. That we ourselves only discover.
[08:37]
B
This is one of the coolest things about, like, model training is literally, like, they develop habits just like a person does. Like, if. If you're, like, working on some podcasting tool. Right. You're really good at editing, and then somebody makes you use a different one, it's going to slow you down. You're going to get kind of bogged down and make mistakes.
[08:52]
A
Sure. But I would. I don't know if, like, yes, that's very humid. But I would. I don't. I don't know if I call it cool, because it's supposed to generalize well. Right.
[09:01]
B
That's the end. The end goal. Yes, of course. And so that's what we're doing with the 5, 6 series of models is they're. They're way more general. And Codex is focused on maximizing coding, and those are the sort of two horizons that we're working on. Yeah.
[09:15]
A
Awesome. I want to go back on personal personality.
[09:18]
C
I know you hate that word sometimes.
[09:22]
A
Means different things to do.
[09:23]
C
Yes.
[09:24]
A
And when it comes to people who are, like, very angry, keen on, like, model research, model personality is much more like, I think maybe what you're from, BIG would say, yeah, it's like warm spirit, friendliness. I agree about just having people's emotional state, whatever. And so this is really jarring when that is also applied to clothing agents where like, well, I want to talk to the sheet. Right. Silicon Valley. HBO is also saying antar, but it could be payload of the frick. I think the other thing is also. But what doesn't matter because you said a lot of things about like commenting is that you're going to user engage and all that doesn't matter if it's a chrono based anyway, right? Like you're going for 24 hours, you're closing your laptops, you have like the extra high parameter now. Doesn't matter.
[10:12]
B
Exactly. So here's we're in this world right now where we're in between a situation where people don't quite have like the models don't quite have the trust of senior engineers or engineers doing like very important work. And so we found, our customers have found that people really want to follow along with what it's doing so they can like interject or stop it or at least understand what it's thinking so they don't waste all the kinds of time like doing a rollout that they.
[10:37]
A
Have to throw away.
[10:38]
B
So with the 5 Series, because it's more general and it's just about as good as coding as codecs for a lot of things, we've taught it to be more communicative and so it has preambles before tool calls. It'll say things like, I'm about to go look for this. Yeah. And you can steer that really well. I actually really like it. I have, I've created like a personality. I tweeted about this. I created a personality for my coding agent because I really like my tools to be kind of like fun to work with if I'm in there with them. And so I have it. It's got this like it gets really excited if we do something together and like, because I want to wake up in the morning and be like, oh.
[11:13]
A
I'm going to go work on this.
[11:13]
B
Project with my, my buddy five one. Right. But some people don't like that. And also for like you said, long running agentic tasks that can get in the way, like you're burning tokens that don't really matter if it's running in the cloud. So 5.1, you can turn that off. You can prompt it not to do that, but the Codex model can't actually do that. And it relies on the reasoning summarizer to give you that update I guess.
[11:37]
A
More broadly, why should people know or think about in terms of what will be nice to if coding models in general, more broadly than just like the media but experts release just like what trends are you seeing, what discussions are active?
[11:56]
C
Our talk today is focused on talking a little bit about sort of the trend that we're sort of seeing is the abstraction layer really moving, starting to move upwards from the model layer, whereas the agent layer, as I said, we train our models starting to be a little bit more opinionated, especially with regard to building model like codecs and the models are really good at doing certain things when inside of a certain barnes, a certain type insert shape. And so we're packaging that up more closely so we're actually shipping this entirety entire agent altogether, then you can actually build on top of that agent. That's one of the patterns that we're seeing here is rather than focusing on optimizing with every single model release, you're actually just be able to plug in an agent like Codex into your platform and be able to use an app box.
[12:46]
A
Yeah.
[12:46]
B
And you're seeing ZED use this GitHub VS code lets you just like package a whole agent to work inside of it. That way, like if you're building a coding tool like Zed and you don't feel like having a whole team, keep up with all every single model release and every single API change and how to update the harness to do different cuts of sandboxing and all that kind of stuff and you can just build one layer above and that is actually super powerful because coding is just like one agentic behavior. It turns out it's a really nice one to start with because you can measure the performance, sometimes easier with a lot of other ones. But it also gives the model the capability. Right. So we started out with like chatbots. Like you're having a conversation. Let's give the chatbot a tool to use. Okay, so now you have an agent that can like run commands. Well, let's give the chatbot agent a Codex to use. So now if it doesn't have a tool, it can make a tool that it needs to solve a problem. Right. So that's like another layer of abstraction and it's not just coding. You can write software that has an agent that can spin up a Codex instance and write a custom plugin for your software for that customer's API. Right. And so now your software is self customizable because it has its own team of people inside that can do integrations at launch.
[14:05]
A
Yeah, solving integration engineering is A gi. Yeah. I think one theme I'm finding at this conference so far, even early, like the first heater ops, I think people are starting to really explore sub agents, agents more abstractly, agents that use agents and we used to call it multi agent. I don't know now I don't know if there's any thoughts on your end about this where like you can tool call. I guess like a very basic example is what you just said, which is the agents can create another instance of Codex that creates a tool and then straw just use the tool. Is there a case for scaling some agents, teacher?
[14:46]
B
Yeah, I think so. I mean Codex Max was designed for that. Right. So it has its own compaction and context management. Codex Max manages its own context window. And so it can run basically forever without you having to work worry about it while it's inside of the Codex harness. And that lets you do a lot of different things. You can essentially have it hand off its own context to other sub agents. Right. So letting it sort of like spawn different agents to do more of its work in parallel and all kinds of things like that. So it's built for that. I mean we're just sort of like starting to see the indications of like what that means. But that's I think the future and we're really excited about that.
[15:28]
A
Yeah, it's really, I think like as.
[15:31]
C
I said, the trend that we're sort of observing here really moving up the abstraction layer to the agent to the agent layer really allows you to do a lot of cool things like brand new sponsorship, spending a few agents creating new abstractions as things as the long running agent workflow continues. And right now we're building all the primitives as well autos, specifically with animics.
[15:56]
B
Yeah, and it's really about moving the threshold up further. Right. Like I was saying before, like I now trust like Codex to do some of my hardest work. I haven't written a single line of code by hand in months because I know what I can trust it to do.
[16:11]
A
You're the forums person that said that in the last 24s.
[16:14]
B
Yeah, no, it's real. I mean I've actually launched something. There's an open source project that I did. There was a Codex upgrade pack for migrating from completions to responses that was totally written by Codex and I didn't write a single line of that code. And now it's, it's out there, it's open source.
[16:28]
C
Most of the folks at OpenAI, well initially when Codex first launched it was around 50% of folks at OpenAI started using you, but now they go.
[16:37]
B
But those folks at OpenAI, that's very true.
[16:39]
C
Use it every day.
[16:40]
B
The way that we do it is we're really good at evals, right? Like in order to develop trust and like build a product that can do more than you design it for, which is really what we're talking about here. You're making an agent that can like solve its own problems. You have to get really good at figuring out how to build those guardrails and evals around, you know, what is it doing, what is it allowed to do and check it in production. So we have all of this platform tooling now around agent traces and rollout traces and coming up with evals for that and building, you know, graders and all of the things you need to sort of like maximize the pipeline so you can let it go and then like be like, okay, I don't really like the way it did that. Grade it, have it metaprompt itself so that next time it actually does a better best practices.
[17:21]
A
One of the biggest views in terms of organizational capabilities that OPI is investigated is Azure biodiversity. Can you say more about that? Like why is that suddenly a big priority now? Obviously I think there was OPI always did internal emails, but now it's like a team that is more of rear facing and maybe go this random error.
[17:44]
C
That the path to AGI goes through evals and. Well, I'm sorry, that was a little.
[17:50]
B
It's though, it's so true.
[17:52]
C
Repeated way too many times. But I think there are a lot of academic evals, right? There's like sweep bench, there's other like, you name it. But I think there's a slightly lack of evals of the real world on sort of what people care about the most. And we want to make sure that whatever we're developing model wise as well as product wise are aligned and are actually making the most amount of useful impact on this world. And applied evals is really in that direction, capturing all of those sorts of real world use cases and things for us to hill climb together.
[18:30]
B
I like to think of it as like we have, I mean people say us a PhD in an API, right? But if you hire a PhD student, they don't know how to do the job. You have to give them a job description, okay, That's a prompt, right? So now you have your policy and then you have them do the job and they're going to kind of like flail around, right? So they need mentorship, they need guardrails, they need Evals, performance reviews on how to do their job, the best practices. And so what we're doing is we're trying to put our models out there and see what they're good at, what they're not good at. Talking to our customers who are like, oh, we could really use your model for more things if it could do this one thing. Here's our eval for it. Or help us build those evals with you so that we can see where we're deficient and go back and train the model to be able to do that job in the way that we wouldn't normally get to see it. Form.
[19:18]
C
Yeah.
[19:19]
A
How do you do multi turn evals? So I think that's the really hard thing. That I mean sometimes you need multi turn. If it doesn't get around the first go, but it could just get around the first go, then it's no longer multi turn. Right.
[19:31]
B
So then what do you want to. I have, I have some ideas.
[19:35]
C
Oh yeah, you go.
[19:36]
B
I mean I've built a few myself. I don't. This is, this is sort of like my personal work. I think this is like an area that people are just now getting into.
[19:45]
C
Right.
[19:45]
B
We have LLM as a judge. You can use LLM as a judge to look at an entire world, the trajectory.
[19:51]
A
Yeah.
[19:52]
B
And see, okay, over the course of all of this, like how well did it perform, what did it do? And then you could maybe like walk it back a step to the part where you don't like. And then you could have the model run the next step with the instructions, grade it on that and then have it improve itself. Oh, I don't like the way that you, I. We do this all the time inside of harnesses. It's like that was a good answer, but I don't really like how long it took you to get there. So can you give yourself better instructions for doing that next time? And it'll write something and we'll add it in there and then suddenly it's better. Right. So there's. That's one way of doing it.
[20:24]
A
Yeah.
[20:24]
C
I think multi turn evals. Most of the companies or startups that we work with, like these days, the agent runs in a multi turn way. Right. And then so therefore if you can build an agentic harness that works in a multiple turn way, you can eval it. And then there are like also academic benchmarks. Already does this in some ways, like cowbench. And now we have like tile squarebench that does this like particularly well and would definitely certainly take inspirations from that.
[20:54]
B
I have this idea. I call it like a job interview eval. I haven't finished it, but really, like if you're evaluating a coding agent, what do you want it to be able to do? You want it to be able to take an underspecified. Imagine you are interviewing a developer. You give them a problem, hey, like go implement a string reverse or whatever. And then it's like up to them to like ask for, for, okay, well I need more information. What are the constraints here? And then you judge them on that and then they start implementing it. You give them some modifications, you grade them on that. You can imagine building with an LLM, a rollout that is promptable and the model responds and then I can kind of grade the whole thing. Yeah, yeah.
[21:35]
A
One thing I would love, and this is like the feature request part of the podcast is batch multi turn eval API, you know, so batch API is single turn, but you can't really batch multi turn requests. Is that already doable?
[21:52]
C
Batch multi turn requests? I don't believe it. You can't do it yet.
[21:56]
A
But yeah, I think that's like a challenge because you need evals to be cheap as possible.
[22:00]
C
Yes.
[22:00]
A
They're not that time sensitive and you.
[22:02]
C
Want to run it overnight when the things are cheap.
[22:05]
A
Yes.
[22:07]
C
Well, feedback taken feature, I think. But that's the thing, like every day we're trying to make the platform better. And right now evals is certainly part.
[22:14]
B
Of literally how we make product feature updates. As we talk to people like you, they're like, hey, can you do this?
[22:19]
A
I mean it's super like, yeah, if I'm going to throw thousands of runs at this thing, I should probably spend some time worrying about costs.
[22:26]
C
Speaking of which, what are you trying to eval?
[22:29]
A
I mean, Devin and Cascade. So I have a personal side project where I want to make Devin for non coding. Oh, I love Devin so much. Like Slack, my kind of semi hot take that I'm floating around because just to see how it feels is I think Slack is the ultimate user interface.
[22:49]
B
Yes.
[22:49]
A
For work, right. I don't want to read email. I just read Slack all day. I interact with my email agent through Slack. So basically I'm building a dev in for email.
[22:59]
B
Yeah, well that's the thing is like you can use, you could use Devin to do that. Right. Like a coding agent. Like Codex, a cli. It used to be back in the old days. Like I started out in the 90s working at IBM as a system administrator and I had to write my own custom software and bash scripts and whatever to like actually solve real world problems every day. And so I had this like, you know, toolkit of scripts that I made.
[23:22]
C
Right.
[23:23]
B
That were like organizing file directories or doing like other random things that weren't necessarily writing code.
[23:28]
A
Yeah, yeah.
[23:28]
B
And so you can get from.
[23:30]
A
But not, Cody, use cases to just.
[23:31]
B
Like sort through your email using like ELM or something right. In the terminal, or like have it generate like snippets of video clips from YouTube that you can watch later or things like that.
[23:44]
A
You know, I never thought about that, but I do that all the time as part of lanespace. Yeah, I should probably invest in that tooling.
[23:51]
B
I had Codex go through my really messy directory of all of these experiments that I was running and like completely organized them and like put them into shape. And it was so wonderful.
[24:00]
C
I used it for something that's more boring. Organizing my desktop.
[24:04]
A
Yeah.
[24:05]
C
You know, we have a lot of files on the desktop and Codex is really good.
[24:08]
A
Yeah.
[24:09]
B
People think they have only a codename.
[24:11]
A
Img0416.Jpg or that thing. Yeah.
[24:14]
C
Well, I'll just find all the images and put them in one folder. I think that even that that's, that's something Codex can do.
[24:19]
A
I think that's one of the big themes that we're also seeing. Like coding tools are breaking out of coding and just like everything, they're personal automation. Exactly.
[24:26]
C
Because the way if you can think about before graphic user interfaces and browsers, like what did we. How did we interact with the computer? We did so through a terminal and we did so by writing commands and writing code and stringing them together inside of the terminal. So what would you think about it is, are those coding agents are actually a computer use agent but for the terminal?
[24:49]
B
Yes.
[24:49]
A
Yeah.
[24:50]
C
They're actually incredibly general.
[24:51]
A
I would say that coding agents today are still not vision native enough. Like you have to try to get it to use vision and oftentimes it fails. Still, we should use vision a lot more. Yeah. I would say, you know, I was going to end the episode with asking for your 2026 predictions. Like, we sit down this time next year, what do you want to see? You know, what do you hope to see? I'll just kick it off with the easy one. Yeah, More computer use. And I think like when you say things like, oh, we'll have a coding agent build its own integration to your application. A lot of applications don't have APIs, don't have NCPs. The only thing you have is a UI, right?
[25:29]
C
Yeah.
[25:30]
A
Because they're legacy or because they don't want you to take the data, but while the data is yours, you just have to like in a non permission way, take it by the user.
[25:39]
C
Yeah, yeah. And I can continue just by sort of like saying that that's definitely going to be something I think is going to be something that we'll be capable of in 2026. But also the other thing that I am sort of really like looking forward to are codecs being able to do more. Right. We're already starting to talk about how codecs or like coding agents can sort of use computers in novel ways. We're going to be able to sort of see more general and general use cases like that coming along as well and more extensible ways for you to build with those sub agents as well.
[26:16]
B
I really want to see the trust level go up even further. Right. Like at OpenAI, I get to work with some of the most amazing developers I've ever worked with in my life. They're incredible. Like some crazy tech leads. I wish every company, no matter whether it was like a small dev shop in Alaska where I worked for a while or OpenAI, be able to have on their team capabilities that you would only be able to get at a top tier firm. Right. So, so all of my teammates at all of these places could turn to a coding model and be like, hey, how do we do this crazy awful refactor that we have to do to get to support this new customer that we have? Or like, wow, there's so much of a mess here. Or what's the best way to actually implement this new technology and have it be so trusted and so right and so smart that we can actually perform better than we could normally get access to.
[27:06]
A
Yeah, see I think that's gonna be any final construction.
[27:11]
C
Oh yeah, we're Brian and bill@OpenAI and yeah, feel free to find us on our Twitter, socials, whatever and let us know how you're building.
[27:20]
B
Yeah, and we love working with startups and anytime you have feedback about you really wish the model could do this or the product can do this and you could unlock some massive capabilities. Just let us know.
[27:29]
A
Yeah. Amazing audio. I said Aggie, nice.
[27:32]
C
Like, thank you, Sam.