Claire Vo
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today I'm going to walk through my favorite feature in my most recent favorite AI product: Goals in Codex. If you've been wondering how all these people on the timeline are getting their AI to run, quote unquote, overnight or handle very complex long-running tasks, I'm going to show you. Goals is the answer. We're going to walk through what it is, how I might use it, and a technical use case along with some non-technical examples of how goals can help you even if you're not coding. Let's get to it. This episode is brought to you by Mercury.
As an AI founder, I'm constantly tracking run rate, watching revenue growth, paying vendors and making sure I'm getting paid on time. Mercury makes all of it feel effortless. The app is genuinely beautiful. It actually looks and works like modern software, which sounds obvious but apparently isn't when it comes to banking. What I use it for the most: bill pay for my vendors is just clean and easy, and wires and transfers, getting paid from clients, moving money, Mercury makes it so simple. Everything you need is right there. No phone calls, no hunting through menus, no wondering if something went through. I think about how much I've optimized every other tool in my stack. Mercury is the one where I don't have to think about it at all. It just works. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC.
Before I go into how to use Goal, I want to talk about what Goal is, when it's appropriate, and when it's not the right tool for the job. So I'm looking at this blog post by the OpenAI developers team. It's called Using Goals in Codex, and the first thing that they have in this blog post is this awesome diagram that talks about the difference between a prompt and a goal-based loop. A prompt, you all are used to this. It's the turn-based request that we're all used to. You ask the LLM, the model, the harness to do something. It works. It returns to you its result and then it waits for you to prompt it again. If you're like me, the number one thing that you're saying in your coding tool is, okay, what's next? And then it tells you and you say, great, do it. If you find yourself in that process, using slash goal in Codex might be a tool that you want to add to your toolkit. So what's the difference between this turn-based, one-response-then-wait pattern and Goal? Well, with Goal, when you give Codex a goal, it actually has something that it can work towards and it will continue to loop to the next step and verify until it can measure that it has met that goal. And so if you look at this, the goal is the overarching description of the outcome that the model wants to get towards. It will work, it will check its work, it will decide the next step, and it will continue that three-step process until it can gather evidence that it has met the goal. Once it gathers evidence that it has met the goal, it will mark the goal as complete and then it will tell you it is done. Now if you've been watching people online talk about how they get these long-running autonomous tasks out of Codex, Claude Code, etc., you're really talking about people who are using some framework of this goal. There's also a version of this called a Ralph loop that people were talking about. But functionally the framework is the same. It's saying keep going until X behavior or Y outcome is validated.
Otherwise I want you to reprompt yourself and reprompt yourself until you're there. And what's really fascinating about goals: I've been using AI coding agents for many years now, and until Codex and Goal, I was not able to get these multi-hour, long-running autonomous tasks. Now, I don't have the most complex coding tasks in the world. I'm not building an operating system, I'm not doing complex mathematics. And so part of that was my problems were pretty well constrained. But I did have things that I thought a long-running harness could really help me with. But until slash goal was part of the Codex tool, I really just wasn't able to get my AI to self-manage enough to do that autonomously over time. But the first time I used Goal, I was actually able to get a coding task running for about five hours and 45 minutes, which is longer than I've ever had anything run before. Now, a quick introduction on how to use Goal. There are four ways to manage the lifecycle of a goal. The one that I use is slash goal, and then I walk away. So if you write slash goal and then prompt it with your goal, it will start working. You can use slash goal to see what the current goal is. You can pause the goal, you can resume the goal, and then you can remove the goal. So you don't have to let your AI run for 6, 12, 24 hours, whatever. If it gets on the wrong track, you can absolutely manage the lifecycle. But it's a really useful tool. And I love that they give the example here, because this is a hundred percent what I spend most of my time doing. They say you really want to use goals when you would otherwise find yourself saying the same thing after each turn, like: keep going, try the next thing, run it again, now run the test, continue until it's actually done. So if you're micromanaging your AI and having to tap it on the shoulder and say, can you pretty please go to the next step? Goal is for you. Now, how do you prompt and design a real goal?
This is where product managers tune in. Engineers that write success criteria, tune in. This is where those skills in setting really measurable, well-defined goals come into play. Because when you prompt something, you're really just saying, do this task, right? Like rewrite this code, redesign this page, etc. When you're talking about a goal, you want to talk about what the outcome is if that task was successful. And the technical example that they give here in this blog post is reducing P95 checkout latency. So if you know that a specific page is loading kind of slow and you want to reduce that below a threshold, and you know that can be measured because you can just load the checkout page over and over and over again, and then you create a guardrail on it, like keeping the correctness suite green, that is a really great goal. It's measurable, it's testable, it has a guardrail on it, and there's an executable surface area that you know an LLM can be successful on. Writing goals is its own skill set. But OpenAI has given a really great outline of what makes a strong goal. And again, product managers, let's pay attention. If you've written an OKR, or developers, if you've argued that an OKR was not well written, this is where those skills come into play. The strongest goals, I mean for anything, but in particular for Codex, have six things as part of them. Outcome: what should be true when the work is done. So once we're done, what is the outcome we're trying to deliver? Verification: how can you test it? Do you have a test suite? Do you need to pull up the browser? Is there a number or a measure that you're trying to get to? Constraints: what can't regress while Codex works. For example, on our P95 checkout latency, you could delete the page and the latency goes away, but that's not what you want.
So you want constraints: you want the features to stay the same, you want particular technologies to stay the same. Boundaries: what tools and files and things it's allowed to use in pursuit of this goal. The iteration policy: how it should decide what to try next. And then when it should stop and say, sorry, I just can't continue, I don't have a good next idea. And they give this great pattern here, which is slash goal, my end state, verified by specific evidence. I need you to preserve these constraints. Please use these tools. Between iterations, decide the next step by doing X, Y and Z. And if you're blocked or no valid paths remain, this is what you should do next: you should tell me, you should report, you should ask me for help. And so they give an example of how to make this P95 checkout latency goal a lot better. It's basically by saying bring it below a threshold, which was already in the original prompt, but you're going to verify it by the checkout benchmark, you're going to keep the correctness suite green, you're going to use only the checkout system, and between iterations you're going to tell me what changed, what the benchmark showed and the next experiment to try. And if you can't come up with something else, stop and give me the evidence, the blocker and what you need from me. This is a really great goal, and this is a technical goal, but you can also do this with non-technical projects. And I'm going to show you a little bit of how that works. So again, a goal is a new way to prompt an LLM, in this instance Codex, to work autonomously in a loop of work, verify, check until it hits a goal. Written goals are a lot different than prompts. Prompts are an instruction of what to do. A goal is a description of what a good outcome is and how to get to that outcome. And I've seen Codex be able to run these goals for a very long time.
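Editor's sketch: the six-part pattern described above can be captured as a small template, so that no part of a goal is left blank before it is handed to the agent. This is only an illustration. The `GoalSpec` class, its field names, and the exact sentence layout are assumptions made for this example, not the blog post's wording or any Codex API, and the threshold is left unspecified because the episode does not give one.

```python
from dataclasses import dataclass


@dataclass
class GoalSpec:
    """Six parts of a strong goal (hypothetical field names, for illustration only)."""
    outcome: str       # what should be true when the work is done
    verification: str  # the evidence that proves it
    constraints: str   # what must not regress
    boundaries: str    # tools and files the agent may use
    iteration: str     # how to decide what to try next
    blocked: str       # what to do when no valid path remains

    def missing(self) -> list[str]:
        """Return the names of any parts that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def to_prompt(self) -> str:
        """Render the spec as one goal prompt, refusing incomplete specs."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"goal is missing: {', '.join(gaps)}")
        return (
            f"/goal {self.outcome}, verified by {self.verification}. "
            f"Preserve {self.constraints}. "
            f"Use only {self.boundaries}. "
            f"Between iterations, {self.iteration}. "
            f"If blocked or no valid paths remain, {self.blocked}."
        )


# The checkout latency example from the episode, with the threshold left abstract.
checkout = GoalSpec(
    outcome="Reduce p95 checkout latency below the agreed threshold",
    verification="the checkout benchmark",
    constraints="a green correctness suite",
    boundaries="the checkout system",
    iteration="report what changed, what the benchmark showed, and the next experiment to try",
    blocked="stop and report the evidence, the blocker, and what you need from me",
)
print(checkout.to_prompt())
```

Raising on a blank field is a deliberate choice here: a goal without verification or a stop condition is exactly the kind of vague finish line the episode warns about later.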
So I'm going to give a couple examples of how to use goals, what I think they're most useful for, and some successes I've had with goals. And I'm going to show you behind the scenes. I have ChatPRD, and in ChatPRD we have a tool call in our main AI writing loop that edits specific parts of a PRD. It's this diff-based editor. It's very complicated: it looks for operation ranges inside a document and then tries to edit those operation ranges. And we were getting tons of errors, you can see here, tons of errors on applying specific edits because it couldn't find the right operation range. Again, tune out if this is boring to you, but because the documents we created were complex, they had tables in them, bullet points, bold, quotes, images, actually getting a precise range of nodes from the AI was really, really hard. And we were just seeing a bunch of these errors over and over again. We would find one example of why an error showed up in a very specific document and fix that, but then another one popped up. So it's like that cartoon where you plug your finger over here and another spot goes off. And it was driving us crazy. You can see the errors here, and then you can see that basically at the end of April, the beginning of May, they went away. Why did they go away? Well, we used Goal to knock this out. For the goal that I used to solve this particular problem, I gave Codex access to Sentry, I gave Codex access to these edit requests, and I said: goal, Codex, go through every example in Sentry, every trace in Sentry of an invalid operation on the edit tool, categorize that issue and fix it. Then replay all of the Sentry events that would have shared that same issue, until you have fixed every issue and every historical example of an edit invalid operation is solved. And it went to town. So what it would do is pluck one example. It would see what the root cause was.
It would implement a fix for that root cause. It would then run through all the other examples to see how many of those it burned down. It would have some remaining. It would pluck the next one, it would do the fix, it would run through all the remaining examples. Burn it down, burn it down, burn it down. And then look what we have. We have literally zero errors left. Now, this took several hours. And what was really nice is at the end of it, I didn't get these band-aid fixes all over our edit code. What I got was a systematic fix that integrated every example into a more intelligent framework for how edits should be applied. And ultimately we've had zero edit errors from the time that we used Goal here. And so I think this is a really great example. But let's do it live, because this is How I AI. I'm going to give another example of how I might use this, again for some of the more technical folks. So these are the Vercel errors. It looks scarier than it is. We have a lot of retries around this, but here are the errors from the last two weeks that happen behind the scenes, that we have to recover from in our main chat. And I want to do the same thing with these errors. I want to say: Codex, find these errors, classify them, ship a fix, validate against the existing data until basically there are none of these errors left. So I'm going to pull up Codex, I'm going to use GPT. This is not a complicated deep-thinking problem, so I'm going to use GPT 5.5 medium, and I'm going to say: goal, eliminate errors on the API chat v2 endpoint that are showing up in the Vercel logs by going through each category of error, identifying root cause, determining if this is a user-facing error. If it is, determine root cause and open a branch PR for fix. If it is not, reduce this error to a warning. Once all logs can be handled from the last two weeks, report to me all PRs to review and issues that could not be fixed or what you need from me.
This is a terrible prompt. This is fine. This is honestly a better goal prompt than I usually write. And I'll say the success state is we have no user-facing errors and no backend errors that should be warnings. Okay, I'm pressing enter. It's compressed my skill descriptions, but that's fine. Now, Codex is hooked up to my Vercel plugin, so it has access and can actually go access these logs. And so it's making this plan, and I just want to pause and tell you how Goal works with a plan. Once it has a goal, it makes a plan. I've seen these three-to-five-step plans. So it's going to inventory the current repo, it's going to pull the last two weeks of Vercel errors and group them by category, it's going to classify them as user-facing errors or not, it's going to implement and validate fixes or downgrade to warnings by category, and then it's going to publish the PRs and report to me. Again, this is very precise. It's measurable: it actually has a list of errors it's going to burn down. It's observable: it definitely can eliminate those errors, so it can ship a fix, or it can run the same code and show that the error wouldn't be hit. And then it has success criteria and an ending state for me, which is: I want a list of PRs and any blockers or things that I need to review. And so it's going to go ahead and go through and try to find the right logs. It's going to continue to work on this. Now, we are in a mini episode today. It's one minute into this goal. I suspect that this is going to take two to three hours to get through. I've run something very similar to this, and it took about two or three hours to get through, so I will have to put in the show notes or a follow-up whether or not this was super successful. But it's just an example for you. I love this idea of Sentry zero, error zero, where you can point Goal at any kind of lingering errors that have really haunted your team and developers out there.
You know that these exist, and you can actually say, just go get rid of these. And with Goal it really is possible, and I've seen very high-quality success using Goal to burn down errors. So that is a technical example of how to use Goal. But I want to make this more applicable to people who aren't developers, because I honestly think Goal for non-coding use cases is even more exciting.
Today's episode is brought to you by Mercury, the banking solution I use for ChatPRD. I build AI tools. I talk about AI every day. So when people ask what I use to run my business, Mercury is a genuinely easy answer, because an AI founder who still deals with clunky, outdated banking is kind of a walking contradiction. Mercury is how I track run rate and revenue growth, pay my vendors through bill pay and get paid by clients. Wires and transfers that used to feel like a whole thing: sending money, accepting payments, knowing it arrived. Mercury just makes it simple. The whole platform is clean, fast and modern in a way that most banking honestly isn't. I've banked with them for years. It's one of those tools where I don't think about switching because it's never given me a reason to. Visit mercury.com to apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC.
Claire Vo
For this next example I want to give you my favorite use case of slash goal. It has blown my mind, and if you leave this episode with nothing else, I hope you go do this, which is use Goal to clean up all your unread emails. So Codex has access to my Gmail plugin. That means it has MCP access. It means it can go through and read my email. I had yesterday, truly, 3,900 emails, something like this. I'm going to see if I can find and resume the saved chat. So I'm going to type in goal and see what my goal was that I did yesterday. It is a much worse-written prompt: categorize all bulk promotion spam emails, unsubscribe from unnecessary emails and clean up your inbox. Ask for help while needing judgment. It ran for 3 hours and 52 minutes and it used about 6 million tokens. So it was not token cheap. I'm going to just show you what it did, which is it literally read every email, categorized them, and put nice labels on them, including labels like needs judgment, so then I could go decide. It clicked unsubscribe links for me, gave me a list of unsubscribe links that I could use, and at the end of the day I went from about... let's actually ask: how many emails did I start with uncategorized and how many are now left to filter? So it's going to go ahead and check its own work, and you're going to hold me accountable to show that I did not make this up, and it's going to show how many emails I started with and how many I have left. I'm pretty sure it was about 4,000, and I think we got down to about sub-1,000 that needed to get done. Okay, it took a little prompting to remember what it did, but again, we started at about 3,900 emails. Now I'm down to 68 that I need to look at. So that's my project for today. So it categorized almost 4,000 emails for me and it put them in lovely folders. Again, it unsubscribed for me. It gave me nice categories of emails that I needed to respond to.
If you've been waiting on me for a couple weeks, you've now got a response, and now I have a much cleaner inbox that I can maintain over time. So again, slash goal. My prompt was very simple: just categorize all my emails, unsubscribe and clean up my inbox. It ran for four hours and now I have a much cleaner inbox to work with. Okay, I'm going to give one other example of a non-technical use case that I think is going to be really useful for the product managers out there, which is that I have let my Linear, my task management software, go completely off the rails. This is partly an OpenClaw problem, which is I gave my agents, my OpenClaws, YOLO access to Linear and they created a bunch of tasks, not all of which they have done. And so I want to clean up my Linear tasks and get them down to only the ones that I need to complete. And I want this in particular for our podcast Linear, because we had aspirations of all the things we would do with every episode, we usually do about 70% of those, and I just want to clean it up. So I'm going to say: slash goal, clean up the How I AI podcast team issues in Linear. Anything from a previously released episode that is not marked as done should be marked as will not do. Our goal is to have open only future tasks, this week and forward, for episodes, not old tasks we'll never get around to. So I'm going to let it do that. It should have access to the Linear plugin. It's going to go through and, again, I'm telling you, this is hundreds and hundreds and hundreds of tasks it's going to go through and make this judgment call of, can I close this? Can I update the data? If you want to have better task hygiene, where you want to make sure everything is tagged correctly and assigned correctly, this is a really good use case. And so it's found the Linear team, it's going to work at the team level, it's going to identify stale episode tasks, it's going to go through and clean them up. The task status we want is not won't do.
It's called canceled, and it's just going to process through and go ahead and do that. So I suspect that this one will go a little bit faster, but it will probably take 30 minutes to an hour to go through with really high-quality judgment. And at the end of it I'm going to have a much cleaner Linear workspace to work with. And again, it's saying a clear rule is emerging: keep current-week and future episode work, cancel non-done episode release work before Monday. It's going to scope the bulk update, it's going to validate that the outcome I wanted, which is a clean Linear, is done, and it will complete this over time. So those are my three examples of how to use Goal. One is a technical one. Again, it's continuing to run, so it's gotten through the first two steps here: the technical goal of looking at all my error logs and basically classifying them, fixing them, burning them down, with the goal of having no more errors ever. There is the second, very practical goal of cleaning up my email inbox so I can actually read my email. That one took about four hours, is, I think, useful for everyone, and I did not have to have a very good prompt there. And then my third one, for project management: make sure that my projects and my tasks and issues are clean, my backlog is clean, everything is labeled the way I want, and I only have to focus on the things that matter to me. These are three ways I think you can use goals in Codex. Before we end, I want to take a step back and talk about when you shouldn't use goals and then what I think is next. So goals are not the right tool for every job. And I'm pulling up this blog post again because I think they say it better than me. Do not use Goal for something that is a very simple one-line edit. It is just too big of a tool for the job. Your goal wouldn't be, like, make sure this line of code is removed. You really want an outcome, not an output, for it to be a good goal. Also, don't use a goal when the finish line is vague. So you can't do.
I mean, maybe you can, if you're like, slash goal, make my customers happy. I think that is just a very vague goal. It's very hard to measure and there's no reliable, definitive completion condition. And so that's not very good. The other example they give is, like, refactor this code. That's not a good example of when to use slash goal. And in fact I'm doing a refactor-this-code initiative with Codex, but I'm not using a goal. They say, and I just want to reiterate this for you, goals are strongest when it has three properties: a durable objective, an evidence-based finish line, and a path that may require several turns of investigation. So if you have an objective that stays steady over time, you know you want to hit that objective, it can be evidence based and you can measure it, and you think getting there is going to require a couple turns, goals are for you. So before we wrap, a couple thoughts on slash goal and why I'm just really excited about this framework of working with AI. One, as I said at the beginning, this has been the first time that I've been able to get these autonomous long-running tasks done. And so I really can set the AI up with a goal, step away and have it work over many hours on a problem that would be very annoying to babysit. So one, I think my babysitting days are largely over with AI. Not completely over, I'm still babysitting a branch right now, but largely over. I think the second thing is the impact that Goal has had on quality-of-life things in my code that have been very hard and annoying to chase down. Yes, I probably could have gone task by task and said, please fix issue A, then fix issue B, then fix issue C. And I could have set different coding tools off on those problems. But this idea of just saying, like, error zero, go through all our error logs and fix them until they exist no more, is incredibly powerful, in particular for quality.
So for engineering teams looking to burn down tech debt, fix flaky tests, look at really annoying client-side errors that are maybe annoying to reproduce, I feel like slash goal is really powerful. The third thing is I think that product managers are really going to love Goal. Again, we've had it drilled into us: outcomes, not outputs. You shouldn't be defining the work, you should be defining what success looks like. I think as more and more teams start to use slash goal as part of their coding workflow, product managers are going to have to get a lot better at prompting these AIs with good goals. And we have some of those skills already. But I think the technical level of validation that's required by slash goal requires you to uplevel these hard skills in writing what a good goal actually looks like. And then finally, I'd say with slash goal and these long-running tasks, and I felt this a little bit with OpenClaw, and I just see this becoming more and more true: working with AI just continues to feel more and more like working with a colleague, a human colleague, in that when you assign a human colleague a task, you don't, like, sit there over their shoulder and tap and say, okay, next step, okay, next step. What you really do is you give them a goal, they go away for the time required to hit that goal, and then they come back to you with the completed task and you give feedback. And so again, it's this form factor. Even though the AI is maybe faster than a human would be on some tasks, on others they may be slower than humans because they have the patience to go to the edge cases of things. But either way, they're using the time necessary for the task to get it done. And it really feels like I'm much more in manager mode than builder mode. And honestly, I'm not sure that I love that.
When slash goal came out, I found myself kind of twiddling my thumbs and looking for the job that I could do in the coding work, because so much of the job was now being handled by itself. So in conclusion, I really suggest you try slash goal. If not in Codex, try a similar loop in whatever your favorite AI tool is. Let it run and let it solve bigger, more complex problems for you and come back to you when it's time to review the work. This is How I AI. I'm so excited to see what you build, and I'm going to get back to my logs and see if we've actually eliminated all these errors. Thanks for joining.
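Editor's sketch: for anyone trying the "similar loop" suggested above outside Codex, the work, verify, decide cycle from the episode can be outlined in a few lines. This is a toy under stated assumptions, not how Codex implements goals. The `run_goal` function, its `max_steps` and `patience` parameters, and the made-up error categories are all hypothetical. In practice `work` would invoke an agent and `remaining` would replay real failing cases, as in the Sentry burn-down described earlier.

```python
from typing import Callable


def run_goal(
    work: Callable[[], None],
    remaining: Callable[[], int],
    max_steps: int = 50,
    patience: int = 3,
) -> str:
    """Repeat work -> verify until the evidence says done, or stop when stuck.

    work      performs one attempt, for example fixing one root cause
    remaining returns how many failing cases are left (the evidence)
    patience  how many consecutive steps without progress to tolerate
    """
    left = remaining()
    stalled = 0
    steps = 0
    while left > 0 and steps < max_steps:
        work()
        steps += 1
        now = remaining()  # verify against the same evidence every time
        stalled = 0 if now < left else stalled + 1
        left = now
        if stalled >= patience:
            # No valid path forward: report instead of looping forever.
            return f"blocked: {left} left after {steps} steps"
    if left == 0:
        return f"complete after {steps} steps"
    return f"stopped: {left} left after {steps} steps"


# Toy stand-in for an error log; the category names are invented.
errors = ["range", "table", "range", "quote", "table", "range"]


def fix_one_root_cause() -> None:
    """Pick one failing example and clear every error sharing its cause."""
    if errors:
        cause = errors[0]
        errors[:] = [e for e in errors if e != cause]


result = run_goal(fix_one_root_cause, lambda: len(errors))
print(result)  # complete after 3 steps
```

The `patience` limit plays the role of the "if blocked" clause in a written goal: when several attempts in a row fail to reduce the count, the loop stops and reports what is left rather than continuing to spend tokens.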
Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube or, even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
In this episode, Claire Vo dives deep into the "Goals" feature of OpenAI’s Codex, an advanced tool that enables AI models to autonomously work towards complex, multi-step objectives—potentially running overnight without human intervention. She explains the differences between traditional prompt-based interaction and goal-based loops, offers technical and non-technical use cases, and shares live screens and actionable tips for maximizing the power of AI in real work scenarios.
"With Goal, when you give Codex a goal, it actually has something that it can work towards and it will continue to loop to the next step and verify until it can measure that it has met that goal." (Claire Vo)
"If you're micromanaging your AI and having to tap it on the shoulder and say, can you pretty please go to the next step? Goal is for you." (Claire Vo)
"Goals are strongest when it has three properties: a durable objective, an evidence based finish line, and a path that may require several turns of investigation." (Claire Vo)
"Working with AI just continues to feel more and more like working with a colleague, a human colleague...I'm much more in manager mode than builder mode." (Claire Vo)
| Timestamp | Segment |
|-------------|-------------------------------------------------------------|
| 03:00 | Difference between prompt and goal, and how goal loops work |
| 08:50 | Components of a well-designed goal |
| 12:20–16:50 | Technical example: fixing all document-edit errors |
| 18:51–21:50 | Non-technical example: Cleaning up email inbox |
| 22:20–25:20 | Non-technical example: Tasks/project management in Linear |
| 26:30 | When not to use goals |
| 27:10 | Reflections on impact and shifting towards manager mode |
Memorable closing quote ([28:48]):
"Working with AI just continues to feel more and more like working with a colleague…I'm much more in manager mode than builder mode." (Claire Vo)