Podcast Summary: How I AI
Episode: Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days
Host: Claire Vo
Date: February 11, 2026
Main Theme / Purpose
In this episode, Claire Vo compares the latest AI coding models, OpenAI’s GPT-5.3 Codex and Anthropic’s Opus 4.6 (including Opus 4.6 Fast), through a practical, ambitious project: redesigning and upgrading the existing, complex marketing website for her ChatPRD app. Claire shares hands-on insights, her AI engineering workflow, and the specific strengths, weaknesses, and quirks of each model. The episode covers not only how the models perform but also real-world strategies for maximizing productivity as an AI engineer.
Key Discussion Points and Insights
1. Setting Up the Experiment (03:02–08:40)
- Task Chosen: Redesigning the ChatPRD marketing site, an established multi-page website, to target enterprise clients as well as product-led growth (PLG) users.
- Decision Rationale: Claire chooses realistic, high-impact tasks rather than simple landing pages, to test the models’ value on genuine engineering work.
- Claire emphasizes the need for complex evaluation:
“I like to take a code base that’s relatively complex or at least established, and compare side by side how these models work inside these codebases.” (04:14)
2. OpenAI GPT-5.2/5.3 Codex: Pros, Cons, and Workflow (08:41–24:30)
Product & UX Highlights
- Codex as a Desktop App:
Focus on Git primitives: repositories, branches, and worktrees. Includes a visual diff panel and first-class UX for skills and automations.
- Skills: Bundled prompts/instructions that can be reused, now with an improved UI.
- Automations: Built-in, scheduled, repeatable tasks, and out-of-the-box examples.
Coding Test Results
- Literalism Issue:
Codex is “so literal”: it follows prompts closely, often to a fault, without picking up on nuance.
- Quote: “They are so literal…you don’t want it to follow [instructions] blindly. And that’s what I found.” (14:30)
- Prompt Overfitting:
Small prompt tweaks drastically shifted the site’s content and direction (e.g., the site would pivot entirely to integrations or to enterprise if either was mentioned even briefly).
- Example Quote: “It really didn’t have that nuance…just overfitting to my last prompt.” (16:50)
- Memorable Moment:
Claire asks for a more “content dense” site. Codex writes the headline “A dense product workflow for AI powered teams,” a comically literal misinterpretation.
- “Why in the world would you say our product has a dense workflow? I asked for a content dense site…I didn’t say make our content all about how dense our product is!” (18:38)
- Output Assessment:
Code quality was technically good, but the creative, holistic redesign fell short. Despite a broad prompt, only 2 pages were fully redesigned, and both were average.
- “It was okay, not great.” (20:20)
3. Anthropic Opus 4.6 (with Cursor): Greenfield Strengths (24:31–34:50)
Workflow & Model Experience
- Used in the Cursor Desktop App:
Cursor’s plan-and-execute harness may have improved results compared with running Codex in its own desktop app.
- “Plan mode…exploration, the question tool, I just tend to get good results.” (26:17)
- Planning and Execution:
Opus 4.6 planned before implementing, which produced a better-structured workflow and more independent work.
Output & Iteration
- Initial Results:
The first draft had excellent copywriting but poor design; getting a good visual style required explicit prompting.
- Quote: “The design was terrible…I want it to look like I spent a million dollars…” (28:19)
- Responsiveness to Feedback:
Given that feedback, Opus 4.6 produced a highly polished, visually appealing site aligned with the brand.
- “…it rebuilt and it was so lovely…she is pink, uses our graphics…calls out numbers…highlights reviews.” (29:30)
- Consistency:
Opus maintained consistency when applying design changes across multiple pages (pricing, features, etc.).
Strengths Summary
- Big Picture:
- “Opus 4.6 is really good at generative, broad greenfield work. You want it to implement a new feature…it will go implement a new feature. You want to completely redesign your site—it will completely redesign your site.” (32:43)
4. Measuring Productivity: 93,000 Lines of Code in 5 Days (34:51–37:20)
- Quantifying Output:
Claire merged 44 pull requests comprising 98 commits across 1,088 files, adding ~93k lines and removing ~87k for a net of roughly 5k new lines, plus major new features and refactors.
- “I did all of this with now my two pals on my team, Opus 4.6 and Codex 5.3.” (35:46)
- Emphasis: Most of this volume was in core product code, not just front-end, and “would take months of time, tons of people” otherwise.
5. AI Pair Programming Flow: The “Opus Builds, Codex Reviews” Pattern (37:21–46:50)
- Workflow Example:
Opus 4.6 (in Cursor) builds and refactors components; Codex 5.3 reviews them for architecture, performance, and edge cases.
- “Can you review the architecture and performance and see if you have any feedback…looking for something scalable but customizable…” (41:05)
- Codex found high-impact issues, prioritized them, and even implemented the polish suggestions.
- Pattern Recommendation:
- “Opus would build something 80–90% done. Codex would find everything wrong with it, then Opus would fix it.” (44:05)
Notable Quote
“Opus is kind of the software engineer that you want on your team…it actually builds stuff. What I’ve been saying…GPT-5.3 Codex…replicates the principal software engineer experience…they fight you tooth and nail to build anything, but are more than happy to tear apart someone else’s code.” (44:54)
- Conclusion:
Codex is ideal for code and architectural review; Opus excels at building new features and designs.
- “I can’t live without Codex reviewing my code now.” (45:40)
6. Opus 4.6 Fast: Cost, Speed, Trade-Offs (46:51–48:05)
- Opus 4.6 Fast Overview:
Functionally identical to Opus 4.6 but significantly faster and ~6x more expensive ($150 per million output tokens).
- “Don’t mess around with 4.6 fast unless you’re ready to pay the bill.” (47:15)
- ROI Calculation:
Despite the higher spend, AI-driven productivity (“shipping 44 PRs with major features”) delivers “super, super high ROI.”
- Budgeting Advice:
- “Don’t pick the wrong task or you’re going to get a bill you’re not happy with.” (47:55)
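To make the budgeting advice concrete, here is a rough back-of-the-envelope sketch. It assumes only output tokens are billed (input-token pricing is ignored), uses the $150-per-million-output-token figure quoted for Opus 4.6 Fast, and derives a "standard" rate from the ~6x ratio mentioned in the episode rather than from a quoted price. The function names are hypothetical.

```python
# Hypothetical cost sketch for Opus 4.6 Fast vs. standard Opus 4.6.
# Assumptions: only output tokens are counted; the standard rate is
# inferred from the "~6x more expensive" remark, not a published price.

FAST_PRICE_PER_MILLION_OUTPUT = 150.0  # USD per million output tokens, as quoted
FAST_MULTIPLIER = 6.0                  # "~6x more expensive" than standard


def output_cost(output_tokens: int, price_per_million: float) -> float:
    """Return the USD cost of `output_tokens` at a per-million-token rate."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return output_tokens / 1_000_000 * price_per_million


def compare_fast_vs_standard(output_tokens: int) -> dict:
    """Compare Fast spend with the standard spend implied by the ~6x ratio."""
    fast = output_cost(output_tokens, FAST_PRICE_PER_MILLION_OUTPUT)
    implied_standard_rate = FAST_PRICE_PER_MILLION_OUTPUT / FAST_MULTIPLIER
    standard = output_cost(output_tokens, implied_standard_rate)
    return {
        "fast_usd": round(fast, 2),
        "standard_usd": round(standard, 2),
        "premium_usd": round(fast - standard, 2),
    }


if __name__ == "__main__":
    # Example: a task that generates 2 million output tokens.
    print(compare_fast_vs_standard(2_000_000))
    # prints {'fast_usd': 300.0, 'standard_usd': 50.0, 'premium_usd': 250.0}
```

Under these assumptions, the premium scales linearly with output volume, which is why picking the right task for Fast matters more than the per-token rate alone suggests.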
Notable Quotes & Memorable Moments
- On Codex’s Literalism:
“They are so literal…the Codex app harness plus the Codex models were just too literal to do greenfield or creative broad work on my behalf.” (14:50)
- On Workflow Division:
“You could ask Opus 4.6 to build something, it would build something 80 to 90% done or good. You’d ask Codex to find everything wrong with it…and then you take it back to Opus and Opus would be like, oh yeah, I really missed that thing.” (44:04)
- On ROI:
“If we’re looking at this, how expensive would it be for me to ship 44 PRs, really, really huge features. It would take months of time, tons of people. We probably also wouldn’t get it to perfect quality. So I am really bullish that this is a worthwhile investment for my team.” (47:27)
Timestamps for Important Segments
| Segment Description | Timestamp |
|----------------------------------------------|-------------|
| Choosing the test task and setup | 03:02 |
| Codex model review and literalism issues | 09:50–21:00 |
| Codex’s overfitting and memorable copy error | 16:50–18:38 |
| Opus 4.6 initial planning and first results | 25:45 |
| Opus 4.6’s design improvements and strengths | 28:19–32:43 |
| Massive code output: 93,000 lines shipped | 34:51 |
| Pairing Opus (builder) with Codex (reviewer) | 37:21–45:40 |
| Opus 4.6 Fast and cost/benefits | 46:51 |
| Final product and model recommendations | 48:05 |
Model Recommendations & Closing Thoughts
- Opus 4.6:
Use for new features and broad, creative, generative work (especially front-end and design-heavy tasks).
- GPT-5.3 Codex:
Use for code reviews, bug and edge-case detection, architectural analysis, and high-value polish.
- Opus 4.6 Fast:
Use only when speed is essential and the cost is acceptable.
- Workflow Tip:
Combine both (“Opus builds, Codex reviews”) for maximum speed and quality.
- Favorite Tooling:
Cursor works well as a harness for both models; a multimodel workflow is highly effective.
“Both of these models have a place in your stack…I still love Cursor for using them. I’m still a multimodel girl.” (48:00)
