
A
Today on the AI Daily Brief, a primer on the goal primitive in Codex and Claude Code and how to use it to level up your use of AI. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: Robots and Pencils, Section, Superintelligent, and Blitzy. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. And if you want to learn more about sponsoring the show, send us a note at sponsors@aidailybrief.ai. Today we're talking about something that a lot of power users of AI are incredibly excited about, which is /goal. So let's dive in.
Today we are doing another very operator-centric episode. Recently I did a show about Codex Maxing, effectively a set of tips and best practices on how to get the most out of OpenAI's Codex. Now, in many ways, while that episode was specific to Codex itself, a lot of the interaction patterns you could also follow in other harnesses like Claude Code. The Codex Maxing piece was built off of a blog post by OpenAI's Jason Liu. Jason wrote up about nine techniques or interaction patterns that he had discovered allowed him to get the most out of Codex, not just for coding but for other types of knowledge work as well, and some of those tips represented fairly different types of patterns.
One of them, for example, is the idea of durable threads or mono threads, where instead of using some sort of infrastructure like a project, where you have multiple threads all related to the same topic that share a memory base, you instead use a single thread, relying on the harness's compaction tool to make sure it always preserves the relevant context. You also saw in that Codex Maxing post a number of ideas about how to effectively reduce the latency between the human providing guidance to the model and the model getting things done.
I think in some ways, in fact, you could summarize the overarching direction of what Jason was exploring as a way to move past the turn-based paradigm of AI. In other words, the standard way of interacting with chatbots that we've all gotten used to over the last few years, where you give it a prompt, wait for it to do a thing, review the thing it did, develop and provide your feedback, and wait again for the next thing that it does.
Jason's alternative uses features of Codex like the side panel, where you can inspect artifacts as they're being built; voice input, to give more freeform feedback with a lot of additional context because you're talking through it; steering, to insert that feedback even as Codex is still working; and some other features like remote control and heartbeats, to make sure that this can happen even when you're not sitting at your desk. What it all amounts to is a new, more parallel way of working with agents through harnesses like Codex.
Now when it comes to Codex, however, there has been one feature that has been lurking in a lot of the conversation throughout the month. It is in fact one of those features that, once introduced, becomes normalized across the whole competitor set, with other companies adopting it even if they weren't the first to do it. I'm talking, of course, about /goal. Back at the beginning of May, the Codex team's Thiebaud wrote, /goal might be the most consequential thing we have shipped in Codex. The value of good instructions has never been higher. Pawel Huryn explained, you state the outcome, the model loops, self-evaluates, and stops when it's done. Now this idea of looping is a key part of this.
You might remember how we talked about the Ralph Wiggum loop, which is basically an early hack-it-yourself version of this. It figured out a way to get an agent that you start on a problem to keep working against that problem over and over without human steering, effectively extending how long it can work without your immediate interaction. Former OpenAI co-founder Andrej Karpathy, who is now at Anthropic, has also been spending a lot of time with looping, such as his Auto Research loop. At one point he said, LLMs are exceptionally good at looping until they meet specific goals. Don't tell it what to do, give it success criteria and watch it go. Pawel concluded his tweet: the skill that wins is engineering the intent, why it matters, strategic context, and how the success will be measured, so the agent can make better autonomous decisions.
Now over the next couple weeks people really started to click in on /goal. Gregor Zunich writes, /goal is one of the best things OpenAI ever shipped. Alex Finn wrote, /goal is the most underrated feature in AI right now. Ollie Lemon called it basically autopilot for complex AI tasks. Trying to describe how /goal works for non-technical folks, he wrote: 1. You type /goal and describe the end result you want. 2. The AI starts working. 3. After every step it checks itself: am I done yet? 4. If no, it keeps going. 5. If yes, it stops and tells you.
And honestly, people found so much utility so fast with this that just a couple weeks later Claude Code shipped the same feature. And in recognition that it was better to participate in a new primitive rather than trying to own it, they did the super smart and mature thing of just calling it /goal in Claude Code as well. As Microsoft's Nicholas Bustamante wrote, I'm glad to see /goal becoming the new primitive for long-running tasks.
The model does not naturally persist across turns, context windows, sandboxes, process crashes, or days of work, so it needs the help of the harness. He continued, I also love how simple it is. An initializer agent turns fuzzy user intent into durable workspace structure with a plan.md file. Then worker agents make bounded progress against that structure, and a judge agent decides whether the stated completion condition is actually met or it will keep running. Once again, the abstraction is moving up the stack. In 2024 you wrote your own while loop. In 2025 you wrote prompt files and hooks (that is, Ralph Wiggum). In 2026 the loop is becoming a product primitive. Shawn Wang, aka swyx, wrote that this represented an increased level of autonomy: from skills, which were preset prompts, to /plan, which was human-refined inputs, to /goal, which is AI-evaluated outputs.
Now, as this new primitive has taken hold, lots of people have started to write guides and tip documents. One of those came from the OpenAI developers themselves, and that formed the basis for the guide we'll be going through for the rest of the show: how to use /goal. This is not comprehensive, and it honestly still slants more technical than I was trying to get to. But hopefully, especially for those of you who are thinking about how to apply this to knowledge work as opposed to just software engineering tasks, you'll feel a little bit more like you have a handle on this once we're through. Now, as you might imagine, I did use Codex to build this presentation. So if you see any lingering meta text, that is, places where it converts instructions into marketing copy, or it's just in general a little overly verbose, I'm placing all of the credit for that squarely at the feet of 5.5 in Codex itself.
Now let's start by defining the difference between a prompt and a goal. And an important point here is that a goal is not a bigger prompt. It's a fundamentally different type of thing.
When I had Codex summarize the OpenAI guide and some other primers, the way it described a goal was as a finish-line contract: what should be true, how success should be checked, and what has to stay intact along the way. A prompt involves asking for a result, the harness-model combo doing the immediate work, reporting that work, and waiting for your feedback. Repeat. A goal is instead a continuous loop that, one, works towards the durable objective that you've given it; two, checks current evidence against the finish line as it's defined; and three, determines whether to continue, whether the task is complete, or whether to stop because it's honestly blocked.
And part of the recognition behind /goal is that there are lots of types of work that are sequential, where the work can't know its next step until the last step has taught it something. Now, because it's from their developers, the OpenAI document centers on tasks like profiling, patching, benchmarking, reproducing flaky tests, migrations, bug hunts, and research audits, with the common thread being that although each has a specific target, the path to get there changes as Codex gathers evidence. If you didn't have a system like /goal, you'd be sitting there waiting to see what it said after each intermediate step, only to say something like: keep going; now check this; now rerun that. /goal effectively pushes that keep-going button for you.
Now, despite all these examples being about coding, goals can apply to any objective that requires some sort of auditable persistence. So what does a type of work need to have to be a good candidate for goals? First, it has to have a durable objective. In other words, the target should remain true across each turn. The target itself is not going to change over time.
A second aspect of work that's good for goals is an uncertain path to success, one where Codex or Claude Code may need to inspect, compare, rerun, revise, or investigate before knowing what the next best move is. Finally, that objective needs to have really strong, clear finish-line evidence, where completion is not dependent on vibes but instead on tests, sources, artifacts, citations. Basically, some sort of proof that is inspectable by the AI, so it can successfully judge for itself whether it's actually done.
Simply put, a goal defines completion for a particular body of work. And by using /goal in Codex or Claude Code, you're engaging in a particular type of work where you've shifted from telling the AI what to do to instead telling the AI what you want to have done when you're through. And while goals are a way to increase autonomy, they're not about cutting out the user entirely. They are still highly user controlled. You define the outcome, and the goal can be paused, resumed, cleared, or completed. Basically, lifecycle authority stays bounded to the user, the system, and the evidence that the user has provided. There's a set of commands, including /goal pause, /goal resume, and /goal clear, so that if the path Codex is going down seems to be wrong, or the rubric for success needs to change, the user can intervene without having to throw away everything that's been done so far.
Now, one pattern we talked about in the Codex Maxing episode was the importance of durable threads, or what some people have called the mono thread pattern, where instead of a project with a shared set of memory, the unit of context is the thread itself. That's how goals work as well. The thread itself is where everything accumulates. This is not taking advantage of global memory or project instructions more broadly. The objective itself travels within that specific thread.
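To make the loop and the lifecycle controls described above concrete, here is a minimal Python sketch. It is not how Codex or Claude Code implement /goal; `GoalThread`, `run_goal`, `Status`, `Verdict`, and the toy worker and judge are all hypothetical names invented for illustration. It only shows the shape: work a step, check the accumulated evidence against the finish line, then continue, complete, or stop as blocked, with pause, resume, and clear left in the user's hands.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETE = "complete"
    BLOCKED = "blocked"
    CLEARED = "cleared"


class Verdict(Enum):
    CONTINUE = "continue"
    COMPLETE = "complete"
    BLOCKED = "blocked"


@dataclass
class GoalThread:
    """Hypothetical stand-in for a thread carrying one durable objective."""
    objective: str
    status: Status = Status.ACTIVE
    evidence: List[str] = field(default_factory=list)

    def pause(self) -> None:
        if self.status is Status.ACTIVE:
            self.status = Status.PAUSED

    def resume(self) -> None:
        if self.status is Status.PAUSED:
            self.status = Status.ACTIVE

    def clear(self) -> None:
        # Assumption: clearing drops the objective but keeps the evidence
        # that has already accumulated in the thread.
        self.objective = ""
        self.status = Status.CLEARED


def run_goal(
    thread: GoalThread,
    work_step: Callable[[GoalThread], str],
    judge: Callable[[GoalThread], Verdict],
    max_steps: int = 20,
) -> Status:
    """Loop: do one step, record evidence, let the judge decide what happens next."""
    for _ in range(max_steps):
        if thread.status is not Status.ACTIVE:
            break
        thread.evidence.append(work_step(thread))
        verdict = judge(thread)
        if verdict is Verdict.COMPLETE:
            thread.status = Status.COMPLETE
        elif verdict is Verdict.BLOCKED:
            thread.status = Status.BLOCKED
    return thread.status


# Toy worker and judge: "done" once three pieces of evidence exist.
def toy_work(thread: GoalThread) -> str:
    return f"attempt {len(thread.evidence) + 1}"


def toy_judge(thread: GoalThread) -> Verdict:
    return Verdict.COMPLETE if len(thread.evidence) >= 3 else Verdict.CONTINUE


demo = GoalThread("all tests pass")
print(run_goal(demo, toy_work, toy_judge), demo.evidence)
```

The design choice worth noticing is that the judge only ever sees the thread, which mirrors the point that the objective and its evidence travel within one thread rather than in shared memory.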
One thing I keep seeing in enterprise AI: companies hedging across every cloud, every model, every framework, or paying a GSI for a pilot that never ends. The teams actually shipping? They've picked a lane and they move fast. That's one of the reasons I like today's sponsor, Robots and Pencils. They've gone all in on AWS. They're an Advanced Tier AWS partner, and they ship production AI coworkers in 45 days. That's led to them doing some of the more interesting work I've seen on AI coworkers. And by that I'm not talking about chatbots. I'm talking about actual agentic systems that sit inside a business architecture and do real work. That kind of focus matters if you're an enterprise leader trying to get something real into production, or an AWS rep trying to move a customer from interested to deployed. Request an AI briefing at robotsandpencils.com. One conversation with Robots and Pencils and you'll know.
Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting notes. If you're the one responsible for AI adoption at your company, you need Section. Section is a platform that helps you manage AI transformation across your entire organization. It coaches employees on real use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result? You go from rolling out tools to driving measurable AI value. Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI.
Stop guessing if your AI investment is working. Check out Section at sectionai.com. That's S-E-C-T-I-O-N AI.com.
OpenAI and Anthropic are both launching enterprise AI consulting efforts because everyone is realizing that the challenge isn't the capabilities of AI; the challenge is getting individuals in the organization actually ready to use it. The truth, though, is that all the forward deployed engineers in the world aren't going to help you if you don't actually have a coherent strategy based on an understanding of your actual AI readiness. Superintelligent maturity maps give you a chance to see where you stand relative to the industry on deployment depth, systems integration, data access, outcomes, people, and governance. And from there, our customized AI planning assessments can help you figure out what you need to do to improve your readiness and how to sequence it. Go take your own maturity maps quiz at besuper.ai and send us a note if you want to go deeper.
Weekends are for vibe coding. It has never been easier to bring a passion project to life, so go ahead and fire up your favorite vibe coding tool. But Monday is coming, and before you know it, you'll be staring down a maze of microservices, a legacy COBOL system from the 1970s, and an engineering roadmap that will exist well past your retirement party. That's why you need Blitzy, the first autonomous software development platform designed for enterprise-scale code bases. Deploy Blitzy at the beginning of every sprint and tackle your roadmap 500% faster. Blitzy's agents ingest your entire code base, plan the work, and deliver over 80% autonomously: validated, end-to-end tested, premium quality code at the speed of compute. Months of engineering compressed into days. Vibe code your passion projects on the weekend. Bring Blitzy to work on Monday. See why Fortune 500s trust Blitzy for the code that matters at blitzy.com. That's blitzy.com.
Now, writing a good goal is more than just having an outcome, although that's part of it.
When it comes to the outcome itself, it's really important that evidence can decide success or completion. Evidence can be tests, citations, matrices, logs, rubrics, artifacts. But there's more to writing a good goal as well. A good goal prompt is going to provide boundaries, like which files, tools, or data can be used. And it's likely going to explain things like when the harness should actually stop and explain that no defensible path remains. OpenAI's tip document says that the strongest goals usually define six things: the outcome, or what should be true when the work is done; the verification surface, which is the test, benchmark, report, artifact, command output, or source material that proves it; the constraints, in other words, what must not regress while Codex works; the boundaries, meaning which files, tools, and resources Codex can use; the iteration policy, or how Codex should decide what to try next after each attempt; and the blocked stop condition, or when Codex should actually stop.
But what about scope? How broad or narrow should a goal be? Early experiments do suggest that there is sort of a Goldilocks zone, where you can be too narrow (fix this one line) or too broad (improve the whole system). The challenge of being too narrow is that even if that's the thing you actually want to change, it doesn't give the system enough flexibility to discover where the real issue is, especially if it's in some related dependency or upstream in some way. Whereas on the other end of the spectrum, if it's too broad, it's much harder to provide the kind of concrete evidence that's going to allow Codex or Claude Code to know if it has actually accomplished the task. Just right is obviously in between those two extremes.
Relatedly, defining the output artifact can be the difference between a successful goal run or not. A weak artifact, in the same way as prompting too loosely, can produce underwhelming results.
If your goal artifact is "write docs for this feature," the inspectable output of the work might not actually provide the best evidence surface, as opposed to a stronger artifact goal like "produce a docs page that explains the lifecycle command surface and two examples; verify that the page builds locally and all referenced commands match current CLI behavior."
Now you're probably noticing that a lot of this terminology is still really anchored in the realm of developers. So how do we start to figure out what other types of non-software-engineering knowledge work might be a good fit for the goal primitive? One way to think about it: when the output is not just an answer but an audit, that might be a good place for a goal. A good non-coding goal is going to produce a ledger of what was checked, what was supported, what was contradicted, what was weak, and what remains unknown. If that's the type of output that is valuable for your task, it might be a good fit for /goal even if it's not a coding task.
Now one of the interesting things as you branch from software engineering to knowledge work is how to think about where the definition of success comes from. Broadly speaking, there are two paths. In some cases there will be an externally definable rubric. That could be existing published criteria, official docs, a third-party data set, an existing set of logs or transcripts, or some project-specific document like RFP questions. In many cases, however, and this is where it starts to get really blurry, the rubric comes from you. As I was thinking about different projects that were going to be a good fit for /goal, I noticed that sometimes I as the user needed to provide the rubric, and I think this is going to be one of the most common patterns in those types of knowledge work use cases: the user supplies the criteria for success. Think about, for example, hiring criteria. It's not going to be some external source of what you should be looking for.
It's going to be you articulating, in ways that are knowable by the AI and can be tested against by the AI, what the hiring criteria are that matter to you. A similar example is vendor scorecards. You're not looking to some external standard for what the vendor should be, at least not entirely. You're probably looking for the AI to mirror what you specifically, or your company specifically, are looking for in the vendor. The same can be true for editorial standards, lead qualification rules, investment diligence priorities, et cetera. In fact, you can almost work backwards from here: when you have a knowledge work task that implicitly comes with some rubric or criteria of success, that might be a good place to look to see if it is a good fit for the goal primitive.
Now, for the sake of this particular episode, I'm not going all the way through an entire use case, but I did want to provide a set of examples that I think might be good fits, or good areas to look at as you think about how you can experiment. So 10 areas of knowledge work that I think might be a good fit for /goal include literature reviews, market landscapes, vendor evaluations, due diligence, claim audits, policy research, interview synthesis, timeline reconstruction, spreadsheet audits, and even strategy memos, if the goal of the work is to take a whole bunch of messy inputs and put them into a more structured format.
Double-clicking on three examples: claim audits strike me as a really clear fit that, even if it's not a use case for you, hopefully gives you some more insight into the type of structure you're looking for. So imagine a prompt: /goal Audit this memo claim by claim. Verify each claim against the provided sources and reputable external sources (which, by the way, you'd probably want to provide). End with a table labeling each claim as supported, contradicted, partially supported, or unverified, with citations and uncertainty notes.
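As a rough illustration of the audit-trail output that claim-audit prompt asks for, here is a small Python sketch of a claim ledger. Everything in it is an assumption made for illustration: the names `ClaimRecord`, `unaudited`, and `render_table`, the rule that any verdict other than "unverified" needs a citation, and the sample memo claims and file names. The four labels come from the prompt itself.

```python
from dataclasses import dataclass, field
from typing import List

# The four labels named in the claim-audit prompt.
LABELS = ("supported", "contradicted", "partially supported", "unverified")


@dataclass
class ClaimRecord:
    claim: str
    label: str
    citations: List[str] = field(default_factory=list)
    uncertainty: str = ""

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        # Assumed rule: every verdict except "unverified" must trace to a source.
        if self.label != "unverified" and not self.citations:
            raise ValueError(f"{self.label!r} needs at least one citation")


def unaudited(claims: List[str], ledger: List[ClaimRecord]) -> List[str]:
    """Return memo claims with no ledger row yet, in memo order."""
    covered = {record.claim for record in ledger}
    return [claim for claim in claims if claim not in covered]


def render_table(ledger: List[ClaimRecord]) -> str:
    """Render the ledger as a plain pipe-separated table."""
    lines = ["claim | label | citations | uncertainty"]
    for r in ledger:
        cells = [r.claim, r.label, "; ".join(r.citations) or "-", r.uncertainty or "-"]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# Hypothetical memo claims and sources, invented for this example.
memo_claims = ["Churn fell last quarter", "Competitor X raised prices"]
ledger = [
    ClaimRecord("Churn fell last quarter", "supported", ["metrics_export.csv"]),
    ClaimRecord("Competitor X raised prices", "unverified",
                uncertainty="no pricing page snapshot provided"),
]

print(render_table(ledger))
print("still to audit:", unaudited(memo_claims, ledger))
```

An empty result from `unaudited` is one possible finish-line check: the audit is not done until every claim in the memo has a row, even if that row says "unverified".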
So you're seeing here that output of an audit trail. You're in that Goldilocks zone where you're articulating well enough what you want as the output. And it works because every conclusion the AI makes can be traced back to evidence.
Now what about a market landscape? Isn't that just sort of a normal AI research question? Well, imagine that the goal is: create a market landscape for X market, verified by cited company pages, filings, analyst reports, pricing pages, and product docs, with a comparison table, confidence levels, and gaps where evidence was unavailable. What takes us out of the realm of a general research project and into the realm of a goal project is that idea of moving to an audit as the process and output. The artifact that you're going for is a comparison table that shows you what can be verified, what's inferred, and where the evidence runs out.
Similarly, a /goal-shaped literature review is one where you're living with complexity and diversity, highlighting rather than flattening conflicting evidence and disagreement. Provide an evidence-backed literature review on X topic. Build a source matrix covering methods, sample sizes, findings, limitations, and conflicts. End with confirmed themes, disputed findings, and open questions. Basically, this pattern is going to work wherever evidence can be inventoried and presented in complete form.
My suspicion, though, is that a lot more of the way that knowledge workers are going to use this, at least in the short term, is in this area where there are user-provided rubrics. Whereas a prompt can be good for a single pass, a goal can execute an entire review process. So something that might be well suited for a prompt as opposed to a goal would be: review these five applications against this rubric, cite evidence, and suggest interview questions.
It's a small set of inputs and straightforward criteria, with one comparative read. /goal would allow that to become the architecture for an entire process that involved extracting evidence, applying the rubric, checking consistency, revisiting borderline cases, flagging missing information, and producing a continuously updated document as more entries come in.
Still, it's really important to note that as you start to dig into this, not every task will end up making sense as a goal. There will be lots and lots of times, perhaps even the majority of times, when the traditional interaction pattern is completely sufficient for what you're trying to achieve. Sometimes that will be because the objective is small enough, but other times it will be because the criteria for success won't be as clean or definable as the goal primitive needs to do a good job. And this is why Jason's tips about Codex Maxing remain important even in the goal era. Because a lot of times you're not going to want to be as fully disconnected from the process as /goal allows you to be. Effectively, there is a spectrum of interaction autonomy between you and the harness, with different methods making sense for different types of things you're trying to achieve.
/goal is a really great tool to begin to play with, and I think it is worth spending some time experimenting, even if it's with something outside the mainstream of your work, just to get a sense and a feel for what it can achieve for you and what it requires of you. As we get a little bit deeper into this paradigm (remember, we're only a couple weeks after it's been fully introduced), I'm sure we're going to have a lot more examples of how and where it is both working and not working in and around non-coding use cases and knowledge work. And so at some point I'll come back and do an update based on all of that. For now though, that's going to do it for today's episode of the AI Daily Brief. Hope this one is helpful.
I'm excited to see where your goals lead you. Appreciate you listening or watching as always. And until next time, peace.
Host: Nathaniel Whittemore (NLW)
Date: May 31, 2026
Nathaniel Whittemore delivers a deep-dive into the emergent "goal" primitive in AI developer platforms like OpenAI Codex and Anthropic Claude Code. He unpacks what the /goal command is, why it matters, and how it is shifting the paradigm from traditional prompt-based to more autonomous, outcome-driven agentic work. Designed for both technical and knowledge workers, NLW explains not just how goals work for coding, but how they’re likely to revolutionize broader knowledge work. The episode draws heavily from recent OpenAI documentation and community experiments.
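Ideal Use Cases Have: a durable objective, an uncertain path to success, and clear finish-line evidence.
Types of Work Suited to /Goal: literature reviews, market landscapes, vendor evaluations, due diligence, claim audits, policy research, interview synthesis, timeline reconstruction, spreadsheet audits, and strategy memos.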
Tip: Define tangible artifacts and checkable outputs (e.g., a built documentation page verified against live CLI commands, rather than a vague “write docs” request).
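One way to act on this tip is to write the goal down as a structured spec before handing it to the harness. The Python sketch below is only an illustration: the six fields follow the elements the episode attributes to OpenAI's tip document (outcome, verification surface, constraints, boundaries, iteration policy, blocked stop condition), while `GoalSpec`, `missing_parts`, `render`, and every value in the "strong" example other than the outcome and verification text are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class GoalSpec:
    outcome: str                 # what should be true when the work is done
    verification: str = ""       # test, build, report, or source that proves it
    constraints: str = ""        # what must not regress along the way
    boundaries: str = ""         # files, tools, and resources that may be used
    iteration_policy: str = ""   # how to choose the next attempt
    blocked_stop: str = ""       # when to stop and report instead of continuing


def missing_parts(spec: GoalSpec) -> List[str]:
    """Names of the elements that are still blank, in declaration order."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def render(spec: GoalSpec) -> str:
    """Turn a complete spec into goal text; refuse specs with gaps."""
    gaps = missing_parts(spec)
    if gaps:
        raise ValueError("goal is missing: " + ", ".join(gaps))
    return "\n".join(
        f"{f.name.replace('_', ' ')}: {getattr(spec, f.name)}" for f in fields(spec)
    )


# The vague request from the tip: only an outcome, nothing checkable.
weak = GoalSpec(outcome="Write docs for this feature")

# A stronger version; constraints onward are invented for illustration.
strong = GoalSpec(
    outcome="A docs page explains the lifecycle command surface with two examples",
    verification="The page builds locally and referenced commands match current CLI behavior",
    constraints="Existing docs pages keep building",
    boundaries="Edit only files under the docs directory",
    iteration_policy="After a failed build, fix the first reported error and rebuild",
    blocked_stop="Stop and report if CLI behavior cannot be checked locally",
)

print(missing_parts(weak))
print(render(strong))
```

Treating a blank field as an error is a deliberate choice here: it forces the checkable parts of the goal to be written before the run starts rather than discovered midway.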
“I’m excited to see where your goals lead you.” – Nathaniel Whittemore (46:00)