
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we're going to bring you up to date on all the new coding model releases from OpenAI and Anthropic. In case you missed it, OpenAI released Codex last week, their desktop app for AI engineering, and the new model, GPT-5.3 Codex. Try saying that five times fast. And Anthropic released their response: Opus 4.6 and Opus 4.6 Fast. If you're new here, you may not know this, but when these new models come out, I put them through their paces. I test them side by side on the same task, and I'm going to give you my opinion about where they do well, where they fall apart, and which one goes where in my AI engineering stack. Spoiler alert: I've shipped more code in the last five days than I think I have in the last month. So I think these are pretty fabulous models. But they do have their quirks, they do have their strengths, and sometimes they go off the rails. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch: these tools only work well when they have deep access to company systems. Your copilot needs to see your entire code base. Your chatbot needs to search across internal docs. And for enterprise buyers, that raises serious security concerns. That's why these apps face intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch is a massive lift. That's where WorkOS comes in. WorkOS gives you drop-in APIs for enterprise features so your app can become enterprise ready and scale upmarket faster. Think of it like Stripe for enterprise features.
OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at workos.com. Start building today. Okay, to start: when I'm evaluating new models, I like to pick a task that's pretty ambitious, something I definitely wouldn't want to do by hand, and consistent enough that I can actually compare the pros and cons of each model side by side. And I picked a task that I choose often when comparing these models, which is: redesign my marketing site. I think all these models are pretty good at one-shotting a landing page or a marketing page, a simple app, so I don't feel like that's a practical evaluation criterion for these new models. I like to take a code base that's relatively complex, or at least established, and compare side by side how these models work inside it. So I took my Chat PRD homepage marketing site. It's got lots of pages, it's got a blog, it's got the How I AI workflows on there. It's not a simple app, even though it's just kind of a content front end. And I want to bring it up to my 2026 ambitions, which are all about the enterprise. So while this website looks great, it's cute, it's got nice colors, it's definitely more focused on the PLG individual-user workflow. And I want to uplevel it as we sell more to enterprise customers. So I'm going to have these models duke it out and see which one does the better job. And I'm going to test these in the order they came out. So the first thing that came out in our very busy week last week was Codex. Now, Codex, as I said, is OpenAI's desktop app for coding. And before we get into it, I want to show off some of the things that I think make Codex unique. First of all, Codex is organized around git primitives. Now, if you're not technical, or you're a new software engineer, you've probably run into some concepts of git as you've gotten started vibe coding.
But I just want to walk through a couple of things that might be useful for you to know. The first is the idea of a git repository: basically a whole code base that represents an app or a project. Git repositories are represented over here in Codex as projects. You can see I have different repositories here that I'm working on, including my Chat PRD website, the WWW site. Then, in your repo, you can start working on new code. And there are two main ways you can keep your changes contained so that editing them doesn't break your production website. The first way, which I use a lot, are branches. Branches are, as the name says, branches of your code that you can make changes on, commit, and then ultimately decide to merge to production. There's also the concept of worktrees. These are full copies of your code base that you, or an agent, would use to make changes. One of the benefits of worktrees versus branches is that you can have many of them going at the same time on the same machine. And so if you're working with a lot of agents, you can give each agent its own worktree to work in, and they can do a lot of work in parallel without running into each other or causing issues. If you want to learn more about worktrees, definitely watch our episode with Alex from OpenAI on Codex, the terminal app, where he goes through how he uses worktrees on a daily basis to kick off his agentic work. And then up in the top right you can see we have a diff panel. A diff is the difference between what you had and what you have now. You'll see red is code that was removed, green is code that was added, and up here, the count of lines changed, either added or removed. And then you can create pull requests from Codex. Pull requests are kind of a signal to your team that says: this code that I'm working on is ready to be part of the main production branch. Can you pull it in? I'm requesting it.
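For readers who want to try the branch-versus-worktree distinction described above outside of Codex, here is a minimal sketch using plain git commands in a throwaway repository (the repo and branch names are made up for illustration):

```shell
# Sketch of the git primitives above, in a throwaway repo.
# Repo and branch names are hypothetical.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# A branch: a movable pointer you can switch this checkout to,
# make changes on, and later merge back to production.
git switch -q -c redesign-homepage

# A worktree: a full, separate working directory for another branch,
# so multiple agents can edit code in parallel without colliding.
git worktree add -q ../agent-two -b enterprise-page

git worktree list   # lists both checkouts, one per line
```

A diff of your branch against production is then just `git diff main`, and opening a pull request from a branch is typically done with GitHub's CLI (`gh pr create`) or from the app's UI, as shown in Codex.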
And often that's where your CI/CD pipeline, your pre-deploy or pre-production checks, run, and where your team, with their human eyes, tends to look at your code. And you can see here, as I'm talking through this, that Codex has put these concepts front and center. And I think that's because they're trying to appeal to two audiences. One, they're appealing to the let-the-tokens-go, highly empowered, use-all-the-agents software engineers who are doing a lot of things at once on their local machine and need to benefit from these concepts of git, worktrees, local and cloud agents, all that kind of stuff. The second thing is, I think this is actually a really good framework for folks who are less technical to learn the concepts of git. I have always said you should invest in the GitHub Desktop experience. It is a version of this. It's what I use all the time to manage my work across branches and across files. I could work in the command-line tool for GitHub; I just think it's nice to be able to see your changes and really know what's going on. And so Codex has brought some of these visual, UI concepts of git into the Codex app. So it's nice if you're learning. The second thing you'll see in Codex that is a little new and unique compared to other apps is the concept of bringing skills up as a first-class citizen. If you're new to this, skills are sort of a packaged set of prompts, instructions, reference files, and code that can be called by an agent to consistently execute a task over time. If you want to be really reductive, it's like a bundled prompt. And you can see here that OpenAI and Codex have given skills a home, and they've given them icons, and they've given them buttons. And I have to say, I love this. If you watched my early episode when skills first came out, I was so exasperated that skills were like a zip file you had to upload somewhere or put in your repository.
This just makes it a much more visual experience to add skills to your code base or your system and refer to them over time. I also like that OpenAI shipped a bunch of recommended skills that a lot of people can benefit from, so you can get your mind wrapped around what kinds of skills would benefit your AI work. The final thing I think OpenAI put front and center in Codex that's interesting is this concept of automations. Automations are basically tasks that can run on a schedule. You can see here, when you create a new automation, you give it a name, you say what project it needs to run on, you basically give it a prompt (it's not that fancy), and then you give it a schedule. And again, like skills, OpenAI has shipped a bunch of out-of-the-box automations. Now, my reaction here was: I'm already doing a lot of this stuff. I'm a little ahead of the curve when it comes to some of the automations around my code base. So I've solved these problems, but I think everybody should solve these problems. So if you're looking for inspiration on what kinds of automations would benefit your code base, Codex's recommended automations are a really good place to start. But let's get to actually writing code. Now, I have to state one caveat, which is that I ran this process using GPT-5.2 Codex, which was the recommended model when this app came out. Very quickly they came out with 5.3, and we'll see that towards the end of the episode. But I do want to call out that this is a slightly older version of the model, though given my experience, I think the models in this family have very similar output. So I would probably get the same experience with 5.3. Now, what is the test case? As I like to do, we are going to redesign the Chat PRD site.
Last time, when some models came out, we redesigned a page. But we've been pitched that these new models are more independent, can do more long-running tasks, can handle more. And so I wanted them to take an existing code base and redesign the whole thing. And I'm going to trust these very smart models to do it without too much prompting. So that was my test case. I wanted to take this homepage and this website, which is lovely but very PLG focused, and make it more polished, more upleveled for an enterprise audience. So I started that in Codex, and I gave it a pretty high-level prompt that I thought it could run with. I said: optimize the marketing site in this repo for PLG plus enterprise; you can create new pages, redesign templates, etc., to make it the highest-quality marketing site I could have. And then I listed a bunch of sites that I really like. If you're on this list, I think you have a nice website. Now, here's where it immediately disappointed me, and I'm sad to say it, but it did. One of the things that I've noticed about the GPT-5.x Codex models is they are so literal. They are so literal, and so they follow instructions very well. And I know that is, in many instances, a feature, not a bug. You want your model to follow your instructions explicitly, but you don't want it to follow them blindly. And that's what I found. I found that the Codex app harness plus the Codex models were just too literal to do greenfield or creative, broad work on my behalf. It will do high-quality coding work; I will get to that soon. But as for your ability to tell these models, hey, go and do X: I often found that the combination of it being too literal and not pushing me to the next step, not actually saying, are you ready for me to build, meant that it was much more painful and slower to get work done with these models. And this is really ironic, because the 5.3 model is actually pretty fast, so it should feel faster to code with it.
But the actual back-and-forth conversational experience was really challenging, and you'll see some of that here. So I said, redesign the website. We went back and forth on how to use the Figma skill. It didn't actually pick it up well, so I just gave up on that, and then I asked it to redesign the page, and it did. Now, here's where example number one of being too literal came in. I had told it I wanted to redesign the marketing site for a combination of product-led growth and enterprise. Basically, I wanted a marketing site that would be friendly to users but would also help our sales team bring in leads. And it built it, and it literally had explicit references to PLG and enterprise in the copy. It was like: if you're here for product-led growth, click here and sign up; if you are here as an enterprise customer, click here and talk to sales. It was so explicit. And this was my perpetual cycle with Codex on this redesign. We went back and forth. I gave it some design help. I asked it to adjust a couple of things on styling. At some point I said: the design's okay, but it could be better. Take more inspiration from the sites I offered. Make the copywriting top tier, like I've spent $2 million on it. You can see some of my desperate prompting here, just trying to figure out what the unlock is. Is it a technical-spec unlock? Is it a find-reference-content unlock? Is it an identity unlock for these models? I couldn't figure it out, so I kept trying. And what was really funny is that every time I would say something, it would overfit to my prompt. When it gave me a website that I generally liked and I said, hey, can you add more about integrations, our enterprise customers really like integrations, it made the entire page about integrations. If I said, hey, I want to focus a little more on enterprise, it would make the entire page about enterprise. It really didn't have that nuance of what goes where and how to build a balanced experience.
It was really overfitting to my last prompt. And I was saying, we don't need to list exactly everything. I was trying to give it explicit examples, and then it put in a long list of all those examples. It was just having a really hard time editing itself. And then, my favorite example of Codex being way too literal: it had created something that I thought was fine, but it was a lot of images and not a lot of content. And I said, hey, I'd like a more content-dense site, like Hex. (Hex, you have a lovely site. I think you did a really nice job.) I just wanted more copy on there, because I want to be more technical, more detailed, more precise about what the value of my product is. And after two prompts, it literally made the headline "A dense product workflow for AI-powered teams." And I made the facepalm-emoji face. I was like, why in the world would you say that our product has a dense workflow? I asked for a content-dense site. I didn't say make our content all about how dense our product is. So I just had a really tough time with GPT-5.2 Codex on this particular task. We eventually got there, and I would say the output was okay. So this was the before, and this was the after from Codex. I really liked this headline that came up; it was one of the things, somewhere buried down the page, that I thought was great. It eventually got overwritten by my content-dense headline. I thought some of the headlines were kind of interesting. It looks pretty nice. It pulled some interesting graphics from our repo. It put placeholders in here. I think this is okay, but it didn't quite fit our design aesthetic. And what I was more frustrated by than the literal nature of the GPT model, which I had kind of gotten used to and is not something new to me, is that it really only redesigned this homepage and the enterprise page.
So I had asked it to redesign the whole site, and it really did not do that. And so again, this claim that it can do long-running tasks and take on ambitious things: it just took a lot more work for me to even get it to this two-page redesign, which I thought was okay, not great. Now, the code is fine, it's not terrible. It's certainly faster and better than what I would have done myself. That being said, I think we could do a lot better. So speaking of doing a lot better, let's go over to my friend Opus. Now, again, spoiler alert, y'all: I love Opus. And I will caveat that by saying I found a place where I really love Codex, so we're going to come back to it. But as soon as I started getting my hands on Opus, I was just really happy. It didn't start off perfect, though, so let's talk about where it went well and where it kind of went off the rails. Again, I started with the same prompt: optimize the Chat PRD marketing site in this repo for PLG and enterprise; you can create new pages, redesign things, et cetera. Again, I put this content-dense framing in here. I had just come off that bad experience, and I wanted to see what it did. And I will say, Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task. It did its exploration of our code base and the reference marketing sites, it used Cursor's plan mode to make a plan, and then it started building the components. Now, I have to give kudos to Cursor. I'm still a Cursor girl. Yes, I could have tested Opus 4.6 in Claude Code; I am sure there are optimizations there. I just hand-to-God think that Cursor does a good job of building harnesses for all of these models. With the combination of planning and to-dos and exploration and the question tool, I just tend to get good results. So there is this open question of whether it was the model or the Codex harness in the desktop app, which is not as mature as Cursor, that caused that bad experience.
Which one caused it? I'm not sure, but using Opus 4.6 in the Cursor desktop app was quite nice. Okay, so it's building, it's building, it's building. It runs a build. It gives me a summary. I am very pleased with the independent nature of this model. I'm about to hire her. She can go run my marketing site. You are now my marketing engineer. Except: the copy was great, and the design was terrible. Unfortunately, I didn't commit at this point, so I lost the design, but it just did not look good. It did not look sophisticated. I was like, I'm going back home to Codex. What are we doing here? It was terrible. So again, I did my desperate prompting here: I want it to look like I spent a million dollars on my design with the best agencies out there. Here are some colors. And then I said: I want you to develop a unique and modern front-end visual style. This is Tailwind indigo AI slop. If you know, you know. And it agreed with me. It was like, you're right, I gave you generic Tailwind slop, let me rebuild. And it rebuilt, and it was so lovely. And so we went back, and it integrated our design system. It gave me an outline of what it did in terms of design. We had to go back and forth on the build, but eventually I got something lovely. Here was the before, and the after was like this. I love this so much. We're probably going to ship this in the next day or two, hopefully live when the episode goes live. It still matches our brand aesthetic, but just looks so much nicer. It has our colors. She is pink. It uses some of our graphics instead of placeholders. It calls out some numbers, which is really great for selling the value proposition. It highlights the reviews. And then, instead of what Codex was doing, which was making very blunt statements about enterprise, like "100% security" and all this stuff,
it gives a really nice, value-proposition-oriented view of what would be nice for enterprise. And it redesigned our enterprise page as well. So once I got exactly what I liked, I asked it: okay, let's take these styles and go ahead and redesign the rest of the site to bring it up to match. And it did a really good job. It kept everything consistent. It redesigned our pricing page; it's working on our How I AI page to make sure we're matching some of the designs. I think this looks really nice, and I was super happy with the output. And this is going to be my meta-assessment of Opus 4.6 versus the GPT-5.x models: Opus 4.6 is really good at generative, broad, greenfield work. You want it to implement a new feature? It will go implement a new feature. You want it to completely redesign your site? It will completely redesign your site. I was really, really pleased with my experience on this model, and we're probably going to ship this live. Now, this is a much more front-end-focused, design-oriented task. I like this task because we can literally say: okay, what did I start with before? What did Opus come up with? And then even compare that directly to what Codex came up with, which I can refresh and show you here. I can do a side-by-side, and you can see with your eyes, read all the words, and really make a decision about where these models do well. But that is not enough to assess whether these are good models or bad models, whether I'm going to use them or not. And as I go into the next workflow, where I found both models to be super useful, I'm going to admit something that is a little scary and maybe impressive, which is: I asked Devin today, how much code have I merged into GitHub in the last five days? I need to fix my Devin workspace, but if you go into it: in the last five days, I have merged 44 PRs containing 98 commits across 1,088 files. I have added almost 93,000 lines of code.
I have removed 87,000 lines of code. We have added 5,000 net new lines. We have released one, two, three, four, five MCP integrations, we've completely overhauled one of our big components, we've completely refactored our components folder, and we've shipped features and fixed a bunch of bugs. We have done a lot. And none of this is in the web app; this is all in our core application, which is quite complicated, and much more complicated than our marketing site. And I did all of this with my two pals on my team, Opus 4.6 and GPT-5.3 Codex. So I did find a place where these two operate really well together, and I am going to talk you through it. As I mentioned earlier, one of the big features I released recently on Chat PRD was a bunch of MCP connectors. So now, from our chat, you can look at what's happening in GitHub, you can look at what's happening in Linear, you can look at what's happening in Granola, and you can bring all that into your product work. And this is one of probably two dozen tools that we now have available in the Chat PRD app. And we were displaying them all in different ways. All our tools were different, they were individual components, and our code was super, super messy. And so one of the things I kicked off in Opus was a refactor of a reused component that I wanted to be able to add to, remove from, and customize, but with some shared code. I just knew the way we were doing this wasn't great. So I started an Opus 4.6 task to refactor how we use our tool components. So let's talk about how I actually rebuilt these components and where I used these different models. First, I opened up Cursor, and honestly, this might be the secret sauce in some of these experiences. I opened up Cursor, I built a plan with Opus 4.6 using plan mode, I kicked it off, and I went back and forth with 4.6 on how to build this.
And you can see here, I got this lovely, extensible tool component where I could add different things in and give it different links or different copy and language. As it went through, it built a bunch of really nice front-end components for me, and honestly, I think they look lovely. So as we saw before, you get these lovely tool calls here. They look nice across all of our different kinds of tool calls, whether you're creating documents or anything else. I'm just really happy with this experience. Now I'm ready to push this code to production, and here is where our friend Codex comes back into play, and where I love to use Codex. I went back into Codex and I said: I've redesigned tool usage in this codebase. It's gone through several rounds of feedback. Can you review the architecture and performance and see if you have any feedback we should consider before shipping? We're looking for something scalable but customizable, and we don't want to overfit in any direction. And it went through, searched, and identified a couple of high-impact issues, prioritized those issues for me, and asked me questions. I said one is intentional, two is an edge case. And it asked me if I wanted it to implement any of the polish. I said yes, and it polished it. It passed our AI Bugbot code review, and we shipped this to production. And now this is my flow. So this was a very front-end-focused, component-focused workflow. But, for the technical folks out there, we also just completely replatformed our vector stores. It was a huge, huge thing. It touched 50 files. It was really hard to do without one huge PR. It required probably 30 rounds of feedback. And GPT-5.3 Codex was so lovely. Love it for code review, architectural review, and finding edge cases. And what I found is: you could ask Opus 4.6 to build something, and it would build something 80 to 90 percent done, or good.
You'd ask Codex to find everything wrong with it, it would find all the things that were wrong, and then you'd take it back to Opus, and Opus would be like, oh yeah, bro, you're right, I really missed that thing, I better fix it. So I do think I'm going to give Codex some love here. I think it's the better software engineer, technically. Opus is the software engineer that you want on your team, though; it actually builds stuff. And so what I've been saying to people about GPT-5.3 Codex is that it really replicates the principal-software-engineer experience, in that you will fight them tooth and nail to get them to build anything for you, but they are more than happy to tear apart someone else's code. So if you are looking for a principal engineer on your team to pair with your eager product engineer, Opus 4.6, definitely, definitely use Codex. And I kind of feel like I can't live without Codex reviewing my code now. So I'm quite happy with this experience. Again, Bugbot, which I use from Cursor, does a lot of review of our PRs. It's also run on the Codex model, so I think it's a really good eagle-eye reviewer; it's just too hard to get it out of the gate building new products. So I really like this flow, and I highly recommend that folks replicate it. To conclude our episode, I just want to give a quick nod to Opus 4.6 Fast. If you have not heard, Opus 4.6 Fast is Opus 4.6, but fast. You can select it here: "most powerful model, but fast." And it is expensive: six times the price. I think it's $150 per million output tokens, something like that. I actually used Opus 4.6 Fast a lot, and now I've got to go look at how much I'm spending. So what I will say is: while I have consumed the tokens, floating through an infinite ocean of tokens, I embrace a token-abundance mindset. I'm starting to spend a lot of money on models, which, at the end of the day, is super, super high ROI. Again, if we're looking at this: how expensive would it be for me to ship 44 PRs?
Really, really huge features. It would take months of time, tons of people. We probably also wouldn't get it to perfect quality. And so I am really bullish that this is a worthwhile investment for my team. But don't mess around with 4.6 Fast unless you're ready to pay the bill. And so I just think we're all going to start looking at where each model fits from a personality perspective, where it fits from a capability perspective, and then where it fits from a budget perspective. And as my friend Cody at Sentry said, if you're choosing between 4.6 and 4.6 Fast, don't pick the wrong task, or you're going to get a bill that you're not happy with. So that's today's model-focused episode of How I AI. I compared Opus 4.6, GPT-5.3 Codex, and Opus 4.6 Fast. What I found: you want to use Opus for your product and feature work, being creative and creating high-quality designs. You want Codex catching all your bugs, advising on your architecture, and writing exceptional, high-quality, hardened code. Both of these models have a place in your stack. I still love Cursor for using them. I'm still a multi-model girl, but I think they do well in either the Codex desktop app, Claude Code, or wherever you like to get your AI-generated code. That is today's episode of How I AI. I'm looking forward to hearing your feedback about what your favorite model is and where you're using it, and we will see you next week. Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiapod.com. See you next time.
Episode: Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days
Host: Claire Vo
Date: February 11, 2026
In this episode, Claire Vo compares the latest AI coding models—OpenAI’s GPT-5.3 Codex and Anthropic’s Opus 4.6 (including Opus 4.6 Fast)—through the lens of a practical, ambitious project: redesigning and upgrading an existing, complex marketing website for her Chat PRD app. Claire shares hands-on insights, her AI engineering workflow, and specific strengths, weaknesses, and quirks of each model. The episode spotlights not only the performance of the models but also real-world strategies for maximizing productivity as an AI engineer.
Literalism Issue:
Codex is “so literal"—it closely follows prompts, often to a fault, without abstracting nuance.
Prompt Overfitting:
Small prompt tweaks drastically shifted site content and direction (e.g., focus entirely on integrations or enterprise if mentioned even briefly).
Memorable Moment:
Claire asks for a more “content dense” site. Codex makes the headline:
“A dense product workflow for AI powered teams”—a comically literal misinterpretation.
Output Assessment:
Code quality is technically good, but creative and holistic redesign fell short. Only 2 pages were fully redesigned despite a broad prompt, and both were average.
Used in Cursor Desktop App:
Cursor’s plan and execution harness may have improved results versus Codex’s own desktop app.
Planning and Execution:
Opus 4.6 demonstrated planning capacity before implementation, resulting in a better-structured workflow and greater “independent” work.
Initial Results:
First draft’s design was poor despite excellent copywriting; required explicit prompting on visual style.
Responsiveness to Feedback:
Opus 4.6 took feedback, then produced a highly polished, visually appealing site aligned with branding.
Consistency:
Opus maintained consistency when applying design changes across multiple pages (pricing, features, etc.).
“Opus is kind of the software engineer that you want on your team…it actually builds stuff. What I’ve been saying…GPT-5.3 Codex…replicates the principal software engineer experience…they fight you tooth and nail to build anything, but are more than happy to tear apart someone else’s code.” (44:54)
On Codex’s Literalism:
“They are so literal…the Codex app harness plus the Codex models were just too literal to do greenfield or creative broad work on my behalf.” (14:50)
On Workflow Division:
“You could ask Opus 4.6 to build something, it would build something 80 to 90% done or good. You’d ask Codex to find everything wrong with it…and then you take it back to Opus and Opus would be like, oh yeah, I really missed that thing.” (44:04)
On ROI:
“If we’re looking at this, how expensive would it be for me to ship 44 PRs, really, really huge features. It would take months of time, tons of people. We probably also wouldn’t get it to perfect quality. So I am really bullish that this is a worthwhile investment for my team.” (47:27)
| Segment Description | Timestamp |
|---|---|
| Choosing the test task and setup | 03:02 |
| Codex model review and literalism issues | 09:50–21:00 |
| Codex's overfitting and memorable copy error | 16:50–18:38 |
| Opus 4.6 initial planning and first results | 25:45 |
| Opus 4.6's design improvements and strengths | 28:19–32:43 |
| Massive code output: 93,000 lines shipped | 34:51 |
| Pairing Opus (builder) with Codex (reviewer) | 37:21–45:40 |
| Opus 4.6 Fast and cost/benefits | 46:51 |
| Final product and model recommendations | 48:05 |
“Both of these models have a place in your stack…I still love Cursor for using them. I’m still a multimodel girl.” (48:00)