Summary

The Growth Podcast with Aakash Gupta

Episode: How to Run Evals in Claude Code with Aparna Dhinakaran, Founder and CPO of Arize
Date: May 22, 2026

Episode Overview

This episode brings together three of the hottest topics in AI + product management—Claude Code, agents, and evals—into a practical, hands-on walkthrough with Aparna Dhinakaran, CPO and co-founder of Arize. Aparna demonstrates, step-by-step, how to create an AI product feedback agent using Claude Code, trace and evaluate its results with Arize Phoenix, and set up automated improvement loops for your agents. The conversation is a must-hear for AI Product Managers, especially those wanting to master observability and evals as career-defining skills. Along the way, Aparna and Aakash explore how the PM role is evolving, the specifics of state-of-the-art self-improving agents, and concrete actions listeners can take to accelerate their learning this weekend.

Key Discussion Points & Insights

1. The Evolving Role of PMs in AI-Native Teams

Product "Taste" as the New Alpha:
- In the age of generative AI, code is cheap—it's product discernment and user empathy that set PMs apart.
  - "Today, people, especially product managers, there's all this hype around, you know, is it going to be the death of PMs? I'll tell you this, we're hiring more PMs than ever...The ones that stand out are those that actually have an opinion and a taste around what to go build." (Aparna, 03:20)
Lines Blur with Engineering:
- AI-native PMs are nearly indistinguishable from engineers. Comfort with code and tools like Claude Code is essential.
  - "The gap between a PM and an engineer is indistinguishable." (Aparna, 63:37)

2. Building a Product Feedback Agent in Claude Code: Live Walkthrough

Setting Up:
- Create a new directory, initialize Claude, and grant access via GitHub token and Anthropic API key.
  - "Go ahead and create a repo or create just a directory and you can go ahead and initialize Claude inside of that directory." (Aparna, 09:29)
Feedback Aggregation:
- Aggregate user feedback from multiple sources (GitHub issues, discussions, Slack, Twitter, analytics, etc.) to fuel your agent’s “taste.”
- Start simple with just GitHub data, scale up by integrating more sources.
Prompting and Reports:
- Give Claude clear instructions: prioritize issues by severity, feature requests, recency, reactions. Output as a markdown report, ordered by priority.
Looping and Scheduling:
- Use Claude's “loop” skill to keep the agent running on a schedule (e.g., hourly) and always up-to-date.
  - "Ideally you want this kind of running all the time, consistently. Every time someone adds a new bug report...it’s always doing this." (Aparna, 15:52)
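The prioritization step above (severity, reactions, comments, recency) can be pictured as a small scoring function. This is only a sketch: the episode does not show the agent's actual scoring code, so the `FeedbackItem` fields, the weights, the 90-day decay window, and the bucket thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FeedbackItem:
    title: str
    is_bug: bool          # bug report vs. feature request
    reactions: int        # e.g. thumbs-up count on the issue
    comments: int
    updated_at: datetime  # timezone-aware timestamp


def priority_score(item: FeedbackItem, now: datetime) -> float:
    """Blend the signals into one 0..1 score. Weights are illustrative only."""
    age_days = max((now - item.updated_at).days, 0)
    recency = max(0.0, 1.0 - age_days / 90)                    # decays to 0 after 90 days
    engagement = min(item.reactions + item.comments, 20) / 20  # capped at 1.0
    severity = 1.0 if item.is_bug else 0.5
    return round(0.4 * severity + 0.4 * engagement + 0.2 * recency, 3)


def priority_bucket(score: float) -> str:
    """Map a score to the P0..P3 labels used in the report (assumed thresholds)."""
    if score >= 0.75:
        return "P0"
    if score >= 0.5:
        return "P1"
    if score >= 0.25:
        return "P2"
    return "P3"


# Hypothetical items, not real Phoenix issues.
now = datetime(2026, 5, 22, tzinfo=timezone.utc)
hot_bug = FeedbackItem("Crash on trace export", True, 15, 5, now)
stale_ask = FeedbackItem("Dark mode", False, 0, 0, now - timedelta(days=200))
for item in (hot_bug, stale_ask):
    print(item.title, priority_bucket(priority_score(item, now)))
```

In the episode the scoring is done by an LLM call per issue rather than a fixed formula; a deterministic version like this is mainly useful as a baseline to compare the agent's scores against.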

⭐️ Notable Quote:

“I can guarantee you the level of control and iteration you’re going to get by just doing this in your terminal...is going to be worth a little bit of that learning pain in the beginning.” (Aparna, 07:54)

3. Observability and Tracing with Arize Phoenix

What is Tracing?
- A step-by-step playback of your agent’s decisions and actions—key for debugging and iterating.
  - "You can think about a trace really just as it is the step by step playback of what this agent actually did." (Aparna, 13:44)
  - “Span” = Individual step in a trace.
Instrumentation Made Easy:
- Instrumentation can be set up automatically using Arize's open-source “skills” and Claude Code’s commands, no heavy engineering required.
  - "You're not going to need to wait for your engineering partner to have to go and do all of this lift...just give it the skill and it does the rest." (Aparna, 16:44)
Phoenix (open source) vs. Arize Enterprise:
- Open-source Phoenix is ideal for teams needing self-hosting due to PII or compliance. Enterprise platform scales with volume and offers richer data capabilities.
  - (See [70:59] for a full breakdown.)
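To make the trace and span vocabulary concrete, here is a toy recorder. It is not the Phoenix or OpenTelemetry API, just an assumed minimal data model in which a trace is an ordered list of spans and each span is one timed step with attributes. The span names mirror the steps described in the demo.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


@dataclass
class Trace:
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **attributes):
        """Record one step of the agent run as a timed span."""
        s = Span(name=name, start=time.perf_counter(), attributes=dict(attributes))
        self.spans.append(s)  # spans are kept in the order they started
        try:
            yield s
        finally:
            s.end = time.perf_counter()

    def playback(self) -> list:
        """Return the step-by-step playback of what the agent did."""
        return [f"{i + 1}. {s.name}" for i, s in enumerate(self.spans)]


trace = Trace()
with trace.span("fetch_github_issues", source="github"):
    pass  # the real agent would call the GitHub API here
with trace.span("score_issue", issue_number=1) as s:
    s.attributes["priority"] = 3  # what the agent decided at this step
with trace.span("write_report"):
    pass

print("\n".join(trace.playback()))
```

A real instrumentation library also records parent/child relationships and ships spans to a backend; the point here is only that every agent step becomes an inspectable record.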

4. Hands-on: Setting Up and Running Evals

Baseline Evals:
- Have Claude suggest baseline evaluations—e.g., report groundedness, priority alignment, actionability—using your agent’s trace data.
  - "Once I have the traces...I can actually ask, can you suggest a good eval for my agent?" (Aparna, 24:06)
Granular Priority Accuracy:
- Move beyond just evaluating end reports; check if individual issues/features are scored according to your definitions of criticality.
  - "I want to get a little bit more granular in the beginning and start to understand for every single issue...did it actually give it a right score?" (Aparna, 27:36)
Iterative Refinement:
- Start with “vibe” evals (AI-suggested), then overlay human-in-the-loop rigorous annotation (“axial coding”) for reliability.
  - "Vibes are gonna fall short very, very quickly...not grounded on any actual human..." (Aparna, 43:51)
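The granular check described above, comparing each issue's agent-assigned priority with what a human would have assigned, can be sketched as a simple agreement metric. The function name, the dictionary shapes, and the sample issue numbers are assumptions for illustration, not part of any Arize tooling.

```python
def priority_accuracy(agent_scores: dict, human_labels: dict) -> dict:
    """Compare per-issue agent priorities with human annotations.

    Both arguments map an issue number to a priority value. Only issues
    that a human has labeled are evaluated.
    """
    shared = sorted(set(agent_scores) & set(human_labels))
    if not shared:
        raise ValueError("no issues have both an agent score and a human label")
    mismatches = [i for i in shared if agent_scores[i] != human_labels[i]]
    return {
        "evaluated": len(shared),
        "accuracy": (len(shared) - len(mismatches)) / len(shared),
        "mismatches": mismatches,  # the issues worth reading by hand
    }


# Hypothetical data: issue 104 has no human label yet, so it is skipped.
agent = {101: 3, 102: 0, 103: 2, 104: 1}
human = {101: 3, 102: 2, 103: 2}
print(priority_accuracy(agent, human))
```

The mismatch list is the useful output: it points at the specific traces to review, which is where the human annotation pass starts.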

⭐️ Notable Quotes:

"A good eval is like, you're getting some healthy percentage right, but also healthy wrong so that you can make progress, right?" (Aakash, 43:11)
"I get excited when I see that evals are wrong because then it gives me a chance to know that there's improvement that could be made." (Aparna, 43:17)

5. Automating and Looping Agent Improvement

Self-Improving Agents:
- Automate not just agent runs, but also loops that fetch failed evals, diagnose issues, and suggest improvements to both agent and eval logic.
  - "You actually loop the improvement too, not just the agent." (Aakash, 48:54)
Human in the Loop:
- Even for self-improving systems, key changes (agent code or evals) should be reviewed by humans for safety and alignment.
  - "There’s still code review, there’s still a human that looks at every PR that is actually being put up by this self improvement loop." (Aparna, 50:34)
Analogy:
- Agent observability = elite athlete watching footage to self-improve.
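One way to picture the improvement loop with a human gate is below. This is a sketch under assumptions: the eval-result shape, the `Proposal` class, and both function names are hypothetical, and in the episode the review step is an ordinary pull-request review rather than an `approved` flag.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    target: str            # "agent" or "eval"
    summary: str
    approved: bool = False  # flipped only by a human reviewer


def propose_fixes(eval_results: list) -> list:
    """Turn failed evals into change proposals. Nothing is applied here."""
    proposals = []
    for r in eval_results:
        if r["label"] == "incorrect":
            proposals.append(
                Proposal(
                    target="agent",
                    summary=f"Revisit scoring for issue {r['issue']}: {r['explanation']}",
                )
            )
    return proposals


def apply_approved(proposals: list) -> list:
    """Return only the changes a human has signed off on."""
    return [p.summary for p in proposals if p.approved]


# Hypothetical eval output for two issues.
results = [
    {"issue": 101, "label": "correct", "explanation": "matches guidelines"},
    {"issue": 102, "label": "incorrect", "explanation": "privacy question scored too low"},
]
proposals = propose_fixes(results)
print(apply_approved(proposals))  # nothing ships before review
```

Keeping proposal generation and application as separate steps is what leaves room for the review in between.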

6. PM Best Practices, Career Advice & Common Mistakes

Top Skills for AI PMs:
- Intellectual curiosity and relentless experimentation.
- Deep user empathy—know your customers’ pain points by heart.
- Comfort with rapid tool change: "Onus of keeping up has become on the individual now..." (Aparna, 60:35)
Advice for Enterprises:
- Start building small, internal agents—even if not production-facing.
- Use observability platforms to study and improve your agents daily.
Common Mistakes in Evals:
- Not grounding evals in real trace data.
  - "I think the biggest ones is first not starting with actual trace data. I think if you're just starting with kind of what you think are problems, that's really hard." (Aparna, 77:32)
Concrete Next Steps [76:18]:
- "If you have any two hours this weekend, I would say literally what we just did right now...build an agent for yourself, whatever would take away a couple hours of your week every week. Like just something repetitive that you do every single week...Try to build an agent to go do that."

Memorable Quotes & Timestamps

Product Taste as Alpha:
- “The alpha today is product taste. So the people that understand product taste, understand what customers want...are just going to have an insane, insane velocity.” — Aparna, 63:37
On Rapid Shipping at AI-Native Startups:
- "Sometimes an issue will come in, your PMs will identify, it's important enough, either they will prototype or an engineer will prototype and make ready for production a feature and you guys will ship it in the same day? Yes, that is actually what's happening, guys. She said it herself." — Aakash, 63:03
On Getting Started:
- "Any product person that has used observability and is looking at their traces and looking at their evals, you're probably already in the top 1% of PMs." — Aparna, [00:00] / [76:18]
Iterative Evals:
- “Always start Claude generating it, but then you just give it ruthless criticism...And that's where the taste alpha that you bring can actually come back in.” — Aakash, 47:50
Instrumenting Agents:
- "You're not going to need to wait for your engineering partner...just give it the skill and it does the rest." — Aparna, 16:44

Important Segment Timestamps

Intro to Evals Mindset: [00:00], [02:30]
What is Product Taste, Feedback Graphs: [03:20]
Live Agent Demo Begins: [09:29]
Automating with Loop Skills: [15:52]
Instrumenting Tracing in Arize: [15:52]–[19:30]
Alyx, the in-product agent, and trace-based eval design: [21:34]
Creating Baseline and Granular Evals: [24:06]–[34:39]
Self-Improvement Automation Loops: [48:07]–[52:17]
Human in the Loop for Safety: [52:29]
Enterprise vs. AI-Native: State of the Art & Advice: [56:32]–[67:32]
Phoenix vs. Paid Arize, How to Choose: [70:59]
Two Hours This Weekend: What to Build: [76:18]
Mistakes in Evals: [77:32]

Practical Takeaways

For AI PMs:
- Get comfortable with Claude Code and live in your command line—this is the new locus of product alpha.
- Embrace observability and eval traces as tools for ongoing improvement—not just as one-off artifacts.
- Continually iterate: start with vibe evals, critically refine, loop and automate, but never fully remove the human-in-the-loop reviews for core agent or eval changes.
- For enterprise PM leaders: start small, internal proof-of-concepts; focus on aggregating cross-team context for your agents; and prioritize closing feedback loops with real user data.

Final Thought

Whether you're at an AI-native disruptor or a Fortune 500 trying to keep pace, mastering evals, observability, and the “taste” that ties great products together is what separates the top 1% of PMs. Build, trace, eval, and continually improve—this is the product loop of the future.


Transcript

[00:00]
Aparna Dhinakaran
Any product person that has used observability and is looking at their traces and looking at their evals, you're probably already in the top 1% of PMs.
[00:08]
Aakash
What is the role then of the PM? Like, do PMs need to become engineers at this point?
[00:13]
Aparna Dhinakaran
At the AI native teams, I am seeing that the gap between a PM and an engineer is indistinguishable.
[00:20]
Aakash
Aparna Dhinakaran is the CPO and co-founder of Arize AI. $131 million raised, and most of the smartest AI teams I know are building their evals on top of it. I feel like a good eval is like, you're getting some healthy percentage right, but also healthy wrong so that you can make progress, right?
[00:38]
Aparna Dhinakaran
100%. Like I get excited when I see that evals are wrong because then it gives me a chance to know that there's improvement that could be made.
[00:45]
Aakash
What are the things if somebody has just two hours this weekend that they should concretely go do and take away besides just they've watched this episode, but now they're gonna actually make impact in their career.
[00:55]
Aparna Dhinakaran
If you have any two hours this weekend, I would say literally what we just did right now...
[01:04]
Aakash
Before we get into today's episode, I wanted to share that you can get a free year of my favorite AI tools, including Bolt.new, Mobbin, Arize, Relay.app, Dovetail, Linear, Magic Patterns, Reforge Build, Descript, and Speechify, if you join my bundle at bundle.aakashg.com. On top of that, I wanted to quickly ask you to please double check that you are subscribed on YouTube, Apple and Spotify podcasts. It's a free thing you can do that really helps support the show. And now into today's episode. So I've been doing a ton of episodes on Claude Code, a ton of episodes on AI agents, and separately episodes on evals. What we're doing today is we're bringing it all together for you in one iterative loop. It's kind of like the product development cycle for AI products in a single shot. So you're gonna get to see front to back how we do it. I think we have a tremendous opportunity to learn from Aparna, so I'm gonna try to ask her the tough questions for you guys where maybe, in what she's doing, she's skipping some steps, so that you guys can see it step by step, and she's volunteered to be our guinea pig on this. So Aparna, thank you so, so much for showing us the ropes of how to do Claude Code evals.
[02:26]
Aparna Dhinakaran
I'm super, super excited to be here. Thanks so much for having me, Aakash.
[02:30]
Aakash
So what are people getting wrong when you look at them building Claude Code agents and trying to do evals?
[02:37]
Aparna Dhinakaran
Yeah, I mean, I think the first question I get asked a lot is when should I even start doing evals? Like, why is that important? Do I need to think about it before I even build my agent? And I mean, if I'm honest with you, most teams are starting with just building. You gotta start by having a real product before you wanna run evals on it. And so today, what I'm gonna actually walk you through is the full end-to-end loop of getting started with building a product, and when does it make sense to actually, because of the data that you've collected, start to actually run evals and automate that.
[03:17]
Aakash
Awesome. Let's see it in action. Where should we start?
[03:20]
Aparna Dhinakaran
So it's a little bit of a vision. For anyone who's an AI PM today, code is so cheap to go create, which means that product taste is really the alpha. Today, people, especially product managers, there's all this hype around, you know, is it going to be the death of PMs? I'll tell you this, we're hiring more PMs than ever. We're hiring more engineers than ever. The ones that stand out are those that actually have an opinion and a taste around what to go build. And so today, a little cheeky, but can we try to create taste? Can we try to have the PMs that are watching this have an upper hand to actually create that product taste? Well, where does product taste actually come from? You look at kind of some of the best products out there and what they're doing is taking in a ton of feedback. I mean, the best PMs do this. I mean, YC says this to every single cohort, which is talk to users and go build. And I think what we see is that in order to actually create taste, you need to be getting feedback from a ton of different sources. It could be everything from where your team stores those issues. It could be from GitHub discussions or in-real-life discussions, from Slack and Discord, from your actual community talking to you. But also we see teams building out really a context graph with all of this feedback. Everything from Gong transcripts every time you talk to your customers, your product analytics tools from PostHog and Amplitude and Pendo and Fullstory, even down to Twitter, if you have a product that your users are tweeting about and sharing feedback on. These are all ways for you to actually create and cultivate that feedback source. And instead of having just a human consume it, you can actually have your agent consume that feedback. And so what we're going to do today is we're going to build a bit of a product feedback taste agent. This agent, you're a PM. Your job is to come in and kind of figure out what to go build. What are users asking for every day?
This product taste agent's going to tell you what your biggest pains are, what your biggest priority should be, and suggest where your product roadmap needs to go. The product I'm going to work off of today, and you can pick your own product that makes sense for you, but the product I'm going to pick is actually our own open source product, Arize Phoenix. Arize Phoenix is the leading open source observability and evals platform. You can actually get started and host everything entirely open source with Phoenix. But with Phoenix, and you're going to see what I do here, we have a ton of backlog of issues. We also have really vibrant GitHub discussions. We have our own Slack community. We have feedback from people who were tweeting at us. And so what I'm going to try to do is actually aggregate that feedback and use that to surface up: where should we go and where should we build next? So the steps we're going to do here is actually first create this PM agent. We're going to do this using Claude Code. The magic behind everything that we're going to use to improve is really tracing. We're gonna trace everything. Literally every step of what our agent does is gonna be visible to us. And then we're actually gonna run the evals, Aakash. And I think this is kind of the big, you know, when people ask, when do I do evals? I always kind of point towards: get the data, trace everything, get the observability. The evals can then help take you to the next level for your agent. So we're going to trace it, we're going to eval it, and then we're going to do this loop where we improve our agent and bring it right back. So pick your favorite product that you want to actually use. Pick a product that you have all the context of. You could start super simple.
What I'm going to start with today is literally just the GitHub issues and the GitHub discussions, and use that to actually inform what my product taste or PM agent is going to look like. Let's do this. We're going to go ahead and build a PM product taste agent just using Claude Code. So go ahead, kick up Claude Code in your terminal. For product folks, you know, this might feel intimidating in the beginning, but I can guarantee you the level of control and iteration you're going to get by just doing this in your terminal and getting comfortable, the unlock you're going to get is going to be worth a little bit of that learning kind of pain in the beginning.
[08:06]
Aakash
I'll be honest, I've not always been the best with my email inbox, and just thinking about it made me feel anxiety. But my anxiety has really never been lower since I started using Superhuman Mail, today's podcast sponsor. Their Ask AI feature is one thing that really stands out for me because I have so many contract details or deliverables buried 8 replies deep and I can just ask the AI. I also love the Auto Drafts feature so that I have a draft to react and respond to, and of course their follow-ups are a lifesaver. Now is the time to give it a try. Check it out at superhuman.com/aakash. Today's episode is brought to you by Vanta. As a founder, you're moving fast toward product market fit, your next round, or your first big enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations are higher earlier than ever. Getting security and compliance right can unlock growth, or stall it if you wait too long. With deep integrations and automated workflows built for fast moving teams, Vanta gets you audit ready fast and keeps you secure with continuous monitoring as your models, infra, and customers evolve. Fast growing startups like LangChain, Rytr and Cursor trusted Vanta to build a scalable foundation from the start. So go to vanta.com/aakash, that's V-A-N-T-A.com/A-A-K-A-S-H, to save $1,000 and join over 10,000 ambitious companies already scaling with Vanta.
[09:30]
Aparna Dhinakaran
So let's do this. Go ahead and create a repo or create just a directory and you can go ahead and initialize Claude inside of that directory. Let's just go ahead and first give it a starter prompt to actually build this agent. I'm going to ask it to build me a PM agent for the Arize AI Phoenix product, and I can go ahead and actually just link the URL to that entire repo directly in here so that it has exactly the context of what I'm asking it to build. Then I'm just going to go ahead and say what context I want it to have. Pull recent GitHub discussions, pull all the recent releases, and look at the GitHub issues. I'm going to start kind of piecemeal here first, just starting with context from one location, which is GitHub. As we scale this, you can add in context from, like I was saying, your Gong transcripts, your product analytics, you can add context from literally your Slack convos, your Discord channels. Anything can be brought into here. And what I first want it to do is just score the issues and the discussions based off of priority. First just figure out how important is the stuff that we want it to actually look at and build. So things to look at is like bugs versus features, reactions that people gave it, comments. I do want it to look at recency. So these are all things that I'm actually asking this product taste agent to take a look at and consider. Then call Claude or, you know, I can be specific here. I can say call Claude Opus, whatever model I want. So call Claude with, you know, I could even ask it to go ahead and do some kind of prompt caching so that it doesn't keep pulling down the issues every time that I run this loop. But just to keep it simple in the beginning, what I'm going to do is just call Claude and write down just a markdown PM report that has as the output the top pain points, feature asks, and themes. Order this by P0 to P3 priority.
So this is basically going to be like the initial starter prompt for me to actually build this product taste agent. Typically what I like to do is be really thoughtful about the plan that I'm giving my agent so that it's not just going off of nothing, but there's also times where you'll just have it go off, build something, and then you're iteratively giving it feedback. And that's totally also okay. And then I'll just say, here, use my GitHub token and my Anthropic API key. Let's see what it can come back with just with that, super simple. While this is going and kind of doing its thing in the background, as you can see, it's going to interrupt and ask a ton of questions as we go through this. But what I'm actually going to show you all is just a simple one I built right before this, and see if we can get the one we're building right now to just match up and see how close we can get in just an hour here. Okay, so this is basically a PM agent that is already built out and, you know, we've had tracing set up and sending to Arize already. And I'm just going to open one of these so I can show you all kind of what it looks like here. These are the traces of our actual PM agent. And for those of you who are like, what's a trace? Like, that's, you know, a new concept to understand. You can think about a trace really just as it is the step by step playback of what this agent actually did. In this scenario, this agent is first going ahead and pulling back GitHub discussions. It's pulling back the GitHub issues, it's figuring out what are all the releases that were recently released. Then it's going through and it's actually looking at every single issue that is inside of that project and it's actually consuming all these and coming up with a score of how important each of these issues that it's raised are.
As a product person, this is kind of the first thing you need to understand is like, how important are all of these asks that are coming from your users? What is the pain that it's solving? And, and so the first thing I'm just asking you to do is figure out, well, can you score? Basically, how important is each one of these asks that are coming back for this project and what I'll actually do as it scores. I want to actually have an eval that will evaluate how good was the score that my PM agent actually came up with, and is it accurate or inaccurate? Based off of the context that I have around how I want to prioritize bugs, how I've historically prioritized feature requests, I actually want to write an eval that will help teams evaluate the quality of this initial PM agent that we've built. Go back and check on our agent here and see how far we've gotten. So, still kind of thinking, so when
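The report format requested in the starter prompt (a markdown PM report ordered P0 to P3) could look roughly like the output of the sketch below. The item fields, the `render_report` name, and the heading layout are assumptions for illustration; in the demo the report is written by Claude, not by a template.

```python
def render_report(items: list) -> str:
    """Render scored feedback items as a markdown report, P0 first."""
    order = ["P0", "P1", "P2", "P3"]
    lines = ["# PM Report", ""]
    for bucket in order:
        matching = [i for i in items if i["priority"] == bucket]
        if not matching:
            continue  # skip empty priority levels
        lines.append(f"## {bucket}")
        for i in matching:
            lines.append(f"- [{i['kind']}] {i['title']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical scored items, deliberately out of order.
items = [
    {"title": "Add CSV export", "kind": "feature", "priority": "P2"},
    {"title": "Crash on login", "kind": "bug", "priority": "P0"},
]
report = render_report(items)
print(report)
```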
[15:34]
Aakash
somebody's setting up this repo correctly, like basically you created a new GitHub repo, you gave it your anthropic API key and you just. And I guess to create the repo, you have to log into GitHub. Those are the main steps people have to do before this Correct, correct.
[15:53]
Aparna Dhinakaran
And I'm happy to go ahead and send you guys the sample repo if you want to get started doing this yourself so that you can follow along with a project of your choice. But in this case you can see, great, okay, so it's gone ahead. It's actually built this agent. It just looks like it's updating. Okay, great. So it's actually just updated. It's using my GitHub token, it's using my Anthropic API key, and now it's actually going to go ahead. It's pulled 40 discussions, 60 issues, eight releases, and now it's going to go ahead and score each item. And then based off of the score that it gives every single one of these issues, it's going to go ahead and give me a report about the most important things: top pain points, feature requests, themes, what shipped, and give me a game plan that I can then use as a starting point when I come in. A really useful feature: you'll do this once today, but ideally you want this kind of running all the time, consistently. Every time someone adds a new bug report, adds a new issue, it's always doing this. So what you can do is actually just say, can you run this in a loop? And you can specifically say using the Claude loop kind of skill. This is really awesome because what Claude does is that it spins up essentially a cron job. Well, what's a cron job? It's basically you asking Claude to be able to run some type of workflow that you do every day in a loop. And so in my case, every day, every hour, you could set this to every five minutes if you wanted to. It'll go ahead and actually run this loop at whatever cadence you set so that it actually does your job. Every hour you have the latest report of what you should be prioritizing for your agent. So let's go ahead. Oh, it looks like I need to go ahead and set my GitHub token.
So give me one second and let me do that and then we can actually go ahead and run this agent and you can watch it live. So this is actually going ahead and running my Phoenix PM agent. I'm going to show you guys how to do this so that you can also do it. But I've also kind of already set up traces. What does that actually mean? Tracing is the way for teams to actually get visibility into everything these agents are doing. This is a really hard thing to debug because Claude is spinning off a bunch of different things and running this in a loop. And you might not always know if it comes back with slop or it comes back with something great. How do I go and improve it? How do I go and figure out how it did that? And so tracing is a really awesome way to understand what your agent's doing. Today, what I'm gonna actually show you is that, I'd say tracing used to be really hard. You had to kinda go call your engineering partner to have to go and set up tracing. I think with AI, it's probably never been easier to do this. So what we have is essentially skills. We've released kind of a series of, call it skills, that you can actually just give to your coding agent. This is kind of a set of Arize skills. You just go in and install, npx skills add, I'll show you, we'll go ahead and do this. But once you actually add this, you can just ask Claude Code to go ahead and instrument the entire agent that we asked it to go build. Right now you're looking here at a whole bunch of different skills. One of them is the Arize instrumentation skill. For those of you who are curious, it's literally just in English, telling Claude Code what it should do to actually send trace data over to Arize. It makes it super easy. I'm going to show you. It's going to feel super magical. And you're not going to need to wait for your engineering partner to have to go and do all of this lift to go get data from your agent to your observability platform.
So let's go do this. What we're going to do actually is from here I'm going to say, can you help me instrument this agent? So what this is actually going to do is call the Arize kind of instrumentation agent. So you can see here, sorry, the instrumentation skill that we just talked about. So it's going ahead, it's calling the skill. This instrumentation skill will actually first look at the code base and understand how is this agent built, what's actually making the LLM calls, what's actually making the tool calls. And it'll go ahead and it'll figure out, in this case, the language that it was written in was Python. The LLM provider was Anthropic. Here's the library to go use. Here's what it's actually going to go do to set up the different calls. And it says, cool, everything is already wired up, sending to Arize, and is there anything else specific you'd like to go change? So now let's go ahead and just run my agent, see if it sends recent traces, and I should be able to go pop over to my observability platform and go look at traces. We'll see if there's ones that are going to show up right now from my recent run. But it should go ahead and actually start streaming in traces from the last area. This is everything from the last 15 minutes that's just showing up here. You basically get a way to do all of this and it figures out everything from here's the individual LLM calls, here's the actual tool calls that were made. It had to go and fetch stuff from GitHub, it had to go score every single individual LLM call, and then it finally had to come back with that report that I asked for, which was what are my top pain points? What are my top feature requests? What was already kind of shipped?
And so you can see here, it's giving me an executive summary, my top pain points and kind of the things that it scored really, really highly for me to go and prioritize for my product. And so literally I didn't open any IDE, I didn't open anything. I literally just asked Claude Code to build me an agent, gave it a really good prompt, and then I asked it kind of what I was hoping for, and then I asked it, go instrument my agent with Arize using the skill, and boom, now I have visibility into my agent. Everything's probably not going to be perfect, and I can probably already guarantee you that it's not going to be perfect. But what we can do is actually start using this as a way to understand, well, how would I improve this agent? What I'm going to actually show you right now is actually an in-product agent that we've built called Alyx. Alyx is an agent that sits inside of our product and you can ask all sorts of questions like, help me figure out the common types of issues that are coming up. This will actually go through. It'll look across all of the data, like the inputs and the outputs, and it'll start to surface up common types of issues that users are asking from my traces. I can use this to actually first figure out what types of evals should I actually be running on top of my agent. And the reason why that's interesting is that you're starting your evals from a place of actually looking at your traces, looking at your errors, and trying to understand, well, did it actually score some things correctly? Did it not score some things the way that I would have prioritized things? How many times have you had someone on your team kind of say something was super important, super priority, but you wouldn't have given it that high of a ranking for yourself? The next thing that I really want to show, really for teams, is how can you use Claude Code to actually help you figure out a baseline eval for these agents that you're building?
You can have it start, just build a baseline eval, and use that to actually iteratively improve your eval so that you're not starting from complete scratch. So what we can do here, you can do this in our product, you can also do this using Claude Code. So kind of in the theme of today, I'll actually do this using Claude Code and show you how you can set up evals directly from your terminal. But you're going to see here, once I have the traces sent to Arize, I can actually ask, can you suggest a good eval for my agent? I can just start with that. Let's see what it comes back with. What this will actually do, it'll call the evaluator skill that actually looks across the traces and suggests kind of, there we go. Okay, so it looks across the traces and it suggests, okay, well, these are kind of three evals that you might want to do. There's report groundedness, which checks whether the quotes and the issues in the final PM report are grounded in the actual data fed in. It runs kind of across everything. So, you know, I think about this almost like an eval on the final report that was created. You could do an eval on priority alignment, which checks whether the P0, P1 kind of in the report matches the top scored issues from kind of what you're expecting, or something around report actionability. Okay, well, I could do these, but these are all things that are kind of looking across almost like the end product. What I actually want is something different as a PM. I want to get a little bit more granular in the beginning and start to understand, for every single one of these issues here, did it actually give it a right score? Like in this case it gave it a priority of a 3. In this case, I don't know, let's pick another set of them.
In this case it gave it a zero. It said this integration is not that important. It gave this privacy question a 3. So it's kind of making up these priorities. What I wanted to evaluate first is the score it's attaching to determine how important these issues are: is that actually something I would have set myself? So I wanted to run something like a priority eval: are the scores that say how important these GitHub issues are actually accurate, based on how I want to weight them? So let's go back to Claude Code. I can just ask it to help me come up with a way to eval this. And this is very normal, where you're doing this back and forth with Claude, repeating yourself and getting really specific about what you want. So in this case I can ask: can you help me build an evaluation to evaluate if each issue's priority is actually scored correctly?
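The issue-level check described here can be sketched in a few lines. This is a minimal illustration, not the evaluator the Arize skill generates: `IssueSpan`, `EvalResult`, `run_priority_eval`, and `keyword_judge` are hypothetical names, the 0 to 4 priority scale is an assumption, and the keyword rule is a deterministic stand-in for the LLM judge used in the episode.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IssueSpan:
    span_id: str
    title: str
    agent_priority: int  # score the agent assigned; 0 (low) to 4 (high) is an assumed scale


@dataclass
class EvalResult:
    span_id: str
    label: str        # "accurate" or "inaccurate"
    explanation: str


# A judge takes one scored issue and returns a verdict on that score.
Judge = Callable[[IssueSpan], EvalResult]

# Hypothetical rubric: issues mentioning these terms should score 3 or higher.
URGENT_TERMS = ("crash", "data loss", "security")


def keyword_judge(span: IssueSpan) -> EvalResult:
    """Toy stand-in for an LLM judge, so the sketch runs without any API."""
    urgent = any(term in span.title.lower() for term in URGENT_TERMS)
    if urgent and span.agent_priority < 3:
        return EvalResult(span.span_id, "inaccurate", "urgent issue scored below 3")
    if not urgent and span.agent_priority == 4:
        return EvalResult(span.span_id, "inaccurate", "non-urgent issue scored 4")
    return EvalResult(span.span_id, "accurate", "score is consistent with the rubric")


def run_priority_eval(spans: List[IssueSpan], judge: Judge) -> List[EvalResult]:
    """Row-level eval: one verdict per issue, not one verdict per report."""
    return [judge(span) for span in spans]


# Made-up issues for illustration only.
spans = [
    IssueSpan("s1", "App crash when exporting report", 0),
    IssueSpan("s2", "Question about privacy settings", 3),
    IssueSpan("s3", "Request: dark mode", 4),
]
results = run_priority_eval(spans, keyword_judge)
```

Because the judge is passed in as a function, the toy rule could be swapped for a model-backed judge without changing the loop that produces one label per issue.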
[29:31]
Aakash
I think that's option two, right? Priority alignment.
[29:34]
Aparna Dhinakaran
Yeah, yeah. Oh, well, this is slightly more about the end, because it looks like it's checking in the final report whether the top-scored issues are what I would have picked. But what I'm looking for is something slightly more nuanced: not just the top issues, but whether every single individual issue is given its appropriate weight. So it's giving me this priority accuracy evaluator. It'll go ahead and create a way to run this evaluator on top of the actual traces. In this case, it's picking one that I've already created, just to show you guys how this works. It'll suggest: hey, there's this eval you've already created, which is doing this row-level, issue-level priority check. And then it's going to use this and run it on top of those traces. In this case it's saying: hey, it's been run on older data, do you want to go ahead and run it on today's issues, the new issues that you just grabbed today? So it'll start running it on the newer spans. You can see here, for every single GitHub issue that has come in, the agent gives it a score of how important it is, and then the eval checks whether that score was actually appropriate or not.
[31:15]
Aakash
I'm keen to see what evals it creates. I guess the traditional sort of evals teaching literature is all about, like, you finding production traces where you feel like there was an error. So I guess that line of thinking would say, you'd go to the trace dashboard in Arize. You'd look at those priorities, you'd say, oh, this is a zero, but this really should have been a four. And then you'd pick up like 50 of those errors. Then you'd group them and say, like, okay, these are the 10 errors that it makes. So are we trying to replicate that process, but have Claude Code basically do it itself? Is that what we're doing here?
[34:39]
Aparna Dhinakaran
Exactly, exactly. So basically what Claude Code is doing is it has access to all of the traces in Arize, because the skills can go and call an API. I can share what it's doing under the hood so we can talk about it, because it does feel a little bit magical when we just talk through it. So give me one second, let me share the secret sauce of what's happening here under the hood. All of these skills are actually calling APIs, and specifically they tend to go through a CLI, because what we've realized is that these coding agents are really good with command-line interfaces. So what it's doing under the hood is basically calling and fetching all of the traces.
And you've heard Hamel and Shreya tell you: hey, go through line by line, look at where the individual traces failed. That is totally a great way to do this. You can of course go in, get started, and start doing annotations: did it actually answer the question? You can write freeform text about what was good and what was wrong. It's absolutely a great way to do that. I'm also someone who loves to see if Claude Code can help me cut some of that time and surface some insights for me. So what I'm doing here is trying to understand, just with Claude Code, if I give it access to my spans and my traces, what are some insights I should go and learn from this? Help me figure out what's wrong with my agent. And, just being super honest, sometimes it might not come back with something amazing as your first eval, but what I like about it is that it gives me a place to start thinking about problems and areas of improvement. So in this case, I've gone ahead and created this priority accuracy eval, and it's now run across all of my new spans.
And I can go in here and just say: show me everything where the label is inaccurate, where Claude Code thinks that the priority it came up with (you can see the scores here) is actually wrong. And why is it wrong? Something you're going to hear all the time from folks who do evals is: was my eval wrong, or was my agent wrong? You will definitely have both scenarios. And there's a whole process that Hamel and Shreya talk a lot about, which is aligning your evals so that they're grounded in human feedback. What I'm sharing right now is a way of asking: can you start with a vibe eval and then modify and improve it so that it becomes something you can trust? You can do either approach. You can go through the axial coding approach: surface all the issues, have the human in the loop, and identify categories of pain. But as a product person, you might already know what types of things you definitely want to catch. For me, what I want to catch is, for every single issue that this agent is prioritizing: is it giving it an accurate score or not? And I can start off by saying, well, let me see if I can just have it create an eval to suggest what that priority accuracy looks like. You can do it through a skill. And if you do have human annotations in here, the skill will look at those human annotations and use them to build you an eval as well. In this scenario I didn't have any, but if I had, it would go through the whole process that Hamel and Shreya walk through of aligning the evals. So it's gone ahead and run the priority accuracy evaluation.
It's looking at the score that was assigned to each of the issues, and it's surfacing: is this an accurate score or not? Again, this is just based on a simple first pass of this eval. I am going to refine this eval, because right now it's completely based on Claude looking at my traces and trying to identify problems. The whole point is: how do we get this loop kicked off? This loop is meant to give you a starting spot. It is not meant to be the end-all, be-all state for your evals or your agent. Your evals will adapt and get better, and your agent will get better. That's what we're showing in this workflow today: how do you get started, how do you get unblocked, and then how do you run that improvement loop so we can make this better.
So in this case, I have a very simple, small eval here, which is looking at the accuracy of the score. These are the ones that Claude thinks are not accurate. I can just directly ask here: when my priority accuracy is inaccurate, what are the common issues or reasons for that? So this will kick off and look at what types of things my PM agent is not prioritizing correctly. I have my agent kicking off, looking at the data. What we're trying to do here is really go from: you built an agent, you have traces set up automatically through Claude, you have Claude suggesting what an eval could look like, and now you have scenarios that Claude thinks are not accurately scored. This is a great starting ground for me to understand how to go improve this agent. And you barely had to write anything. You just had to ask Claude a couple of things. So let's go ahead. This is Alex giving me suggestions of what to go do here.
So in this case, there are whole categories of issues that it's looking at. There's feature request scoring, there's a legacy scoring system, there's bug priority scoring, there's low priority scoring, there's data fetch. So there are a lot of different categories where it's suggesting that my scoring might be off. And it's giving me a whole bunch of spans to go look at, to debug and understand what actual problems this PM, or taste, agent might have in prioritizing the issues that are coming in.
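Filtering to the inaccurate labels and grouping them by category, as described in this turn, amounts to a filter and a count. A minimal sketch, assuming each eval result has been exported as a dictionary with hypothetical `label` and `category` keys (the category names echo the ones read out here; the rows are made up for illustration):

```python
from collections import Counter
from typing import Dict, List, Tuple


def failure_categories(rows: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count inaccurate eval results per category, most common first."""
    inaccurate = [row for row in rows if row["label"] == "inaccurate"]
    return Counter(row["category"] for row in inaccurate).most_common()


# Made-up rows for illustration only.
rows = [
    {"span_id": "s1", "category": "bug", "label": "inaccurate"},
    {"span_id": "s2", "category": "feature_request", "label": "accurate"},
    {"span_id": "s3", "category": "bug", "label": "inaccurate"},
    {"span_id": "s4", "category": "feature_request", "label": "inaccurate"},
    {"span_id": "s5", "category": "data_fetch", "label": "accurate"},
]
summary = failure_categories(rows)
```

The largest bucket is a reasonable place to start reading individual spans.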
[42:27]
Aakash
And what's a span exactly? Is it a group of traces?
[42:30]
Aparna Dhinakaran
A span is really an individual step in a trace. So in this case, what you're looking at here is this entire interaction where it did this whole report is what you'd call a trace. A span is a single individual step or a single individual issue that I had to go look at.
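The relationship can be pictured as a simple data structure. This is a generic sketch of the trace and span idea, not the schema Arize or Phoenix uses; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    name: str                        # what this step did, e.g. "score_issue"
    parent_id: Optional[str] = None  # None marks the root step of the trace


@dataclass
class Trace:
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def children_of(self, span_id: str) -> List[Span]:
        """Steps that ran directly under the given step."""
        return [s for s in self.spans if s.parent_id == span_id]


# One run of the feedback agent is one trace; each scored issue is one span.
trace = Trace(
    trace_id="t1",
    spans=[
        Span("root", "generate_pm_report"),
        Span("a", "score_issue", parent_id="root"),
        Span("b", "score_issue", parent_id="root"),
        Span("c", "write_summary", parent_id="root"),
    ],
)
```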
[42:49]
Aakash
Got it.
[42:50]
Aparna Dhinakaran
Yeah.
[42:51]
Aakash
And isn't it a bit weird that Claude rated, like, everything it did as inaccurate?
[42:57]
Aparna Dhinakaran
Some of them are accurate, and some of them are inaccurate.
[43:00]
Aakash
Okay. Some of them are accurate.
[43:02]
Aparna Dhinakaran
Okay. If it did, then that would probably be a good spot for you to understand. Okay, well, maybe I shouldn't trust that eval from Claude.
[43:11]
Aakash
I feel like a good eval is one where you're getting some healthy percentage right, but also a healthy percentage wrong, so that you can make progress. Right?
[43:18]
Aparna Dhinakaran
100%. I get excited when I see that evals are wrong, because then it gives me a chance to know that there's improvement that could be made. But when everything's wrong, that's definitely a scenario where you need to start looking at your eval to understand what to go improve.
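That rule of thumb can be turned into a quick sanity check on an eval's output. A minimal sketch; the function name and the 5% and 95% thresholds are assumptions chosen for illustration, not figures from the episode.

```python
from typing import List


def eval_health(labels: List[str], low: float = 0.05, high: float = 0.95) -> str:
    """Flag an eval whose failure rate is suspiciously close to 0% or 100%."""
    if not labels:
        return "no data"
    failure_rate = labels.count("inaccurate") / len(labels)
    if failure_rate >= high:
        return "suspect the eval: almost everything fails"
    if failure_rate <= low:
        return "suspect the eval: almost nothing fails"
    return "usable signal"
```

A mixed result, some labels accurate and some inaccurate, is the case that gives room to make progress on the agent.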
[43:39]
Aakash
And when can we do the vibe evals? When do we have to do the axial coding? Or can you always start from vibe evals and then layer in axial coding, talking to the agent, later?
[43:52]
Aparna Dhinakaran
So my take is that vibes are going to fall short very, very quickly. And the reason is that a vibe eval is not grounded in any actual human who is involved in curating the taste of your agent. I think it would be hard to say, hey, you have to start by having a bunch of vibe evals and using them to evaluate your agent. The signal-to-noise ratio there is going to be really, really low. So you have maybe a simple thing that gets kicked off. But then what I'm going to do here is go through that process where I have a simple eval and I make sure: is this eval that I've created actually something I can trust? And it's not going to be. It was a one-shot eval, out of the box. I'm going to go through and figure out where I disagree with it and where I agree with it.
And you would do this process even if you did axial coding. Even if you individually human-annotated every single span and every single issue, and you were able to put together this amazing ground-truth dataset, your eval will get misaligned over time as you see more and more data. So it is super important that you regularly align those evals to the data you're actually seeing on the ground with your users.
What I'm going to do right now is walk through a process where I've created a very simple eval out of the box. Claude just one-shotted it for me. Now I'm going to start asking: is this an issue with the eval, or is this an issue with my agent? You have examples in this scenario. It looks like bug-category items using the new scoring system with category 4 are also commonly inaccurate. It feels like there are scenarios where bugs are maybe not getting the score that I want them to.
In my world, I want bugs to always be super high, because if a customer hits a bug, that's just a really bad experience with the product. I would prioritize bugs over even new feature work. This gives me a way to say, okay, let me go look at some examples where the bugs are being prioritized really low. It gives me a category of problems to start looking at, to start debugging and understanding how good this agent is. And for some teams, as these evals get really good, you can immediately ask Claude, going back to using Claude Code with evals: hey, go grab everything where this eval failed, suggest an improvement, and go improve that eval. For me, I think it's unfair to say people aren't using Claude to create evals. And I think that's maybe one of the pain points I see with always saying "start with axial coding": in reality you will always do it, but I think it's okay to start with Claude suggesting what a good eval could be. These models have gotten so good at going through your answers and suggesting, hey, that's probably something you should flag and look at. I would trust it as a first pass: go tell me what my evals should be.
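One common way to quantify the alignment step described in this turn is to compare the eval's labels with human labels on the same spans. A minimal sketch using raw agreement and Cohen's kappa; this is a standard statistic swapped in for illustration, not necessarily what the Arize skill computes, and the labels are made up.

```python
from collections import Counter
from typing import List


def agreement_rate(human: List[str], judge: List[str]) -> float:
    """Share of spans where the eval's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[str], judge: List[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:  # both sides used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four spans.
human_labels = ["accurate", "accurate", "inaccurate", "inaccurate"]
eval_labels = ["accurate", "accurate", "inaccurate", "accurate"]
```

Low agreement is a cue to revise the eval before trusting its verdicts on the agent, and rerunning the comparison as new data arrives helps catch the drift mentioned here.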
[47:51]
Aakash
Yeah, that's my favorite workflow. Always start with Claude generating it, but then you just give it ruthless criticism. I just turn on dictation mode and I'm like, well, you misjudged this for this reason, you misjudged that for that reason. And that's where the taste alpha that you bring can actually come back in.
[48:07]
Aparna Dhinakaran
Totally. And for me it's: how do I quickly get into that loop? Get data in, get an eval set up, give it criticism, and let it run on a loop. I showed earlier that there's the loop skill that Claude has. So now that you have this eval, you can create a whole other skill that just says: every day, go through, fetch everything that was inaccurately prioritized, and go fix and improve my agent. And you can create a skill that will then suggest improvements to your agent from the evals that you just ran on top of this.
[48:55]
Aakash
So you actually loop the improvement too, not just the agent.
[48:59]
Aparna Dhinakaran
Because then you get to a world of self-improvement, and to be honest, that's where I think we're all headed: the data we all collect, the evals and observability, is the foundation for self-improving agents. So you get your observability in, you build an initial eval. It's a first pass. You're going to make it better. You're going to have to give it ruthless criticism to make the agent better or make the eval better. What teams are doing right now is doing that iteratively. You can just create a loop that starts to look at the evals and identify problems. You know, I just asked right there: give me the common reasons why the priority accuracy is inaccurate. Oh, it's because the way I prioritize bugs doesn't look right. So what I can do is go back to my PM agent and just say: hey, go fix this issue. It fixes the issue and ships a new agent. Now go collect traces from the next rev of that agent. That improvement loop can actually run inside of Claude Code as a loop skill.
[50:08]
Aakash
So that is all fine and dandy for your internal agents that are assisting you in your work. How does this all change for the AI agents in your product? You just showed us Alex, so maybe you can go under the covers of how that works when it's actually a product. You're not going to be shipping self-improvement to Alex every day, because you don't know, it could just go off in some weird direction. So where do the human-in-the-loop parts come in there?
[50:34]
Aparna Dhinakaran
Totally. I mean, there's still code review. There's still a human who looks at every PR that is put up by this self-improvement loop. But maybe what I can ask you back is: isn't that the vision, isn't that the future we all want to get to? I should be able to see someone file a bug, or somebody give a thumbs down to a response that Alex gave, and Alex is able to act on it immediately. This is what we're already doing internally, and you're going to hear a lot more from us about it in the next couple of weeks. Alex is already taking that feedback, spinning up a whole debug workflow, and using the eval and the trace to debug what went wrong. In some scenarios, like we talked about, it's the eval that's wrong, and in that case it's a refinement of the evaluation. But that's great, right? That's basically a little bit of what you hear about with axial coding: figure out the reasons why that eval wasn't good and then use them to go improve that eval. In some cases the eval was right, and it really was the agent that needed to handle a specific scenario better. In that case, what we can do is very simply say, hey, go fix this, and actually go in and improve the agent. And so what I can do is
[52:17]
Aakash
We want it, ideally, happening in real time across millions of users, automatically. So I guess, how do you do that safely? Code review is one step. What else do you need? Where do you need to put the human in the loop?
[52:30]
Aparna Dhinakaran
So I think there are a couple of places where that needs to happen. One is as the eval changes. That's a really important step for having the human curate that taste of what is good and what is not good. So the human is typically involved in eval changes, and they're involved in agent changes. There's also a lot happening right now around making sure that the skill being used to do the improvement workflows is one that is designed by a human. What does that improvement skill need to look like? What is all of the context it needs access to in order to know what the improvement is? In this scenario it might not have all the context, because all I gave it was GitHub issues. But if I could layer in my product analytics metrics and my entire traces, it could use that information to build its own context of what went wrong and how to go fix it. Basically, context for the improvement loop.
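The loop and its human checkpoints can be sketched as a single iteration. This is an illustrative outline only: every name here is hypothetical, the steps are passed in as plain callables so nothing is assumed about Claude Code or Arize APIs, and the approval callback stands in for code review.

```python
from typing import Callable, Dict, List, Optional


def improvement_iteration(
    fetch_results: Callable[[], List[Dict[str, str]]],
    propose_fix: Callable[[List[Dict[str, str]]], str],
    human_approves: Callable[[str], bool],
    apply_fix: Callable[[str], None],
) -> Optional[str]:
    """Run one pass: gather failures, draft a change, apply it only if approved."""
    failures = [r for r in fetch_results() if r["label"] == "inaccurate"]
    if not failures:
        return None  # nothing to improve on this pass
    proposal = propose_fix(failures)
    if not human_approves(proposal):  # agent and eval changes both pass this gate
        return None
    apply_fix(proposal)
    return proposal


# Stubbed example run with made-up data.
applied: List[str] = []
outcome = improvement_iteration(
    fetch_results=lambda: [{"category": "bug", "label": "inaccurate"}],
    propose_fix=lambda failures: f"Raise the priority floor for {failures[0]['category']} issues",
    human_approves=lambda proposal: True,
    apply_fix=applied.append,
)
```

Running this on a schedule only automates the trigger; the approval step stays with a person.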
[53:54]
Aakash
Got it. So there's a human in the loop at any agent change and any eval change, but outside of that you can use loop commands within Claude Code or, if you're in more of a production environment, a real cron job, every day or at whatever cadence. And you get to work with all of the best companies: Uber, DoorDash, you name it. What is the state of the art looking like for this self-improvement? How fast are people moving, and how fast do they need to be moving to be competitive?
[54:23]
Aparna Dhinakaran
I mean, I think it's going to come very, very quickly, if I'm honest with you. I think the best teams are already doing this within, call it, a radius that they're comfortable with today. But that radius is going to get bigger and bigger. Maybe the initial improvements to the agent are simpler, more around the prompts and the tools. Does that radius then grow to giving the agent entire workflows it didn't have access to before? So the radius of those changes is going to become increasingly bigger, which we're excited about. But that self-improvement loop is not going to happen without really good data and really good evals. Just to take an analogy from something quite different: think about some of the best sports players. What do they do? I'm talking about the Nadals, the Federers, and, if you're a tennis fan, your Novaks. They're looking at their previous games. They're studying what they did and using that to understand what went well and what didn't go well, to go make improvements. Studying your plays is what self-improving agents, or self-improving harnesses, have to do. They have to study their own plays to understand what the human said was a good response and what the human said was not, and use that to figure out how to improve their own gameplay. That's why the evals and the observability are the foundational layer for teams to build that self-improving loop.
[56:33]
Aakash
So I personally have encountered PMs that I feel like are in one of three buckets, and I think you have customers in all three. There are the AI-native customers, like Handshake and the AI companies. Then there are the digital-first companies, like Uber, Reddit, and Roblox. And then there are the normal companies that have tech arms: Pepsi, Condé Nast, that type of company. You work with all three of those groups. Usually the AI-native groups are going to be doing things the quote-unquote best way, or the right way. So what are the AI-native groups doing? And specifically, not just how they're building their evals, but the role of the PM. What is the role of a PM in an AI-native company versus a company that hasn't gotten there yet, and how does that company bring their PMs along?
[57:27]
Aparna Dhinakaran
Yeah, I think the role of a PM has completely changed in the last year. The PM is almost the tastemaker for the product, and in order to become a really good tastemaker you really have to understand the outcomes of the agents, especially for AI PMs, where the product that's being built is the agent. The AI-native PMs are almost indistinguishable from engineers in some ways, because they're comfortable living in Claude Code, like this entire workflow I just showed, where they're able to build even just a simple internal agent to help them do their daily tasks. We say this internally and I think it's true: if you're doing things the same way you were doing them last year, then you haven't caught up yet. I deeply think that. If you're looking at your old board of priorities and manually scanning them, doing what you used to do, it's just different now. With the advent of Claude Code, you're not limited by how many individual meetings and Gong calls you can personally listen to. You can have Claude Code go through them. It has access to all of these customer calls that you might never have been able to get through all by yourself. Can it help surface the one or two that are super critical, the ones you need to put your eyes on because they're going to help you unlock your next 10 or 15 customers? So in these AI-native companies, what we're seeing is that the PMs are able to leverage Claude Code to do everything from understanding data and user feedback better, to surfacing that back into what a really good product experience looks like, to getting really close from idea to solution. So it's not like, hey, I'm handing it over to an engineer.
It's like they're able to effectively put together a plan for what that build needs to look like. Those are the PMs that I think are really going to be 10x, or whatever multiplier, PMs on any team.
[60:06]
Aakash
So we're talking about working with those AI-native companies. You are yourself one of the AI-native companies, and you mentioned that you are hiring more AI PMs than ever. So what does the new profile look like if I want to land an AI PM role at an AI-native company that has raised $131 million? What are the skills I should be developing? What is the depth of technical knowledge and topics I need to cover?
[60:35]
Aparna Dhinakaran
One, and I have always believed this, is curiosity. For me that's the number one most important signal. This person is trying all the new tools, exploring the boundaries of what they can and can't do. In the old way of doing things, there used to be trainings: you'd go to these trainings and someone would walk you through how to use a tool. What if the tool is Claude Code and it has shipped, you know, 90 features in like 30 days? There is no old way of doing things where you can have a daily training for a product that's moving that fast. So the onus of keeping up has shifted to the individual, to keep up with the tools and with what's changing. Not everything is going to be useful for what you do, but if something turns a task that used to take you an hour into one that takes 10 minutes, that is an advantage. Being able to identify those and use them to your advantage is, I think, built off of curiosity at this stage.
Two, the other big one, is that it's still really important to care about and understand the user. Customer empathy is something the best PMs and the best product tastemakers understand: you could ask them how that customer is using the product and what their biggest pain points are, and they would be able to rattle them off to you. And what's now changed is that you can get even deeper. A customer asks for something that could have taken a week or two to build in the past, and it could be delivered that day if you're able to ship at that velocity. So being able to get even closer and deliver to customers even faster is no longer a pipe dream.
It's actually how the best products at AI natives are shipping right now.
[63:04]
Aakash
So 99% of people aren't in an AI-native company, so they don't believe us. I need to just confirm this is true. What you're saying is that sometimes an issue will come in, your PMs will identify that it's important enough, either they or an engineer will prototype a feature and make it ready for production, and you guys will ship it in the same day? Yes, that is actually what's happening, guys. She said it herself. What is the role then of the PM? Do PMs need to become engineers at this point?
[63:38]
Aparna Dhinakaran
I think that at the AI-native teams, I am seeing that the gap between a PM and an engineer is indistinguishable, because code has become so much easier to produce. This goes back to where we started today's podcast: the alpha today is product taste. The people who understand product taste, understand what customers want, and understand how to deliver a really amazing experience are just going to have insane velocity. So PMs who can go from "here's the pain point" to "here's what I think is a really amazing experience", and who can also say "I could probably go build that today", talk to Claude Code and figure out what to go build, they're a triple threat in this environment right now.
[64:49]
Aakash
What are you seeing at the enterprise level? Because they're not even close to there. So if you're at a big enterprise, if you're at a Pepsi or something like that, you're still trying to take on the best practices. Realistically, what can they take on and how do they take them on?
[65:06]
Aparna Dhinakaran
Yeah, I mean, I think what I'm seeing in enterprises is that they're still innovating. I don't want to say that there's no innovation happening there at all. Right now all these teams are using the coding agents and, I think, feeling the unlock of those tools in their own day-to-day workflows. And so what I'm seeing coming out of the teams right now, even there, is, one, amazing products that use AI to make the experience of that product useful. Two, especially at larger companies, you have silos of data, and people who might have access to some information that other teams don't have access to. There's actually a really great piece that Jaya Gupta, somebody you should follow on Twitter, shared a couple weeks ago now, that's gone super viral, around context graphs. What a context graph is, essentially, is a way to give your agent access to context. The agents are only as good as how much context they actually have, and then of course the harness that's built on top of that has access to that context. And so instead of all that information and data being in completely different silos and people operating in these silos, one unlock for agents is: can you give them access to context from different environments? What that does is actually help people bridge the gaps across different teams in ways that probably weren't possible before. And so figuring out how agents consume the context within an organization is going to be probably one of the biggest challenges and unlocks that we're going to see this year.
[67:17]
Aakash
So if you're a product leader at one of the enterprise companies, you're seeing what you just demoed for us, and you're saying, okay, how can I bring my company towards that? What's the step-by-step roadmap I should be implementing over the next, say, 12 to 24 months?
[67:32]
Aparna Dhinakaran
Well, first, I think as an individual IC: build. You'll read a lot of stuff on AI Twitter, with everyone sharing every latest new model and every latest new tool out there. What I would highly recommend for any AI PM is to start by building. Start by building something very simple, like this example that we just did today. It doesn't even need to be an external-facing agent that you need to publish. Can it just be an internal tool that you use to help you make one big unlock today? That's huge. Because think about it: this tool that we just vibe coded in an hour, now I'm going to go use it to figure out, okay, what are my top pains? You can imagine the next step after that is, well, can I get an agent to actually go and put up a draft PR for one of these? Can I get an agent to then review that PR and do the code review on it? The process to go from identifying a pain point to solving it and then releasing could have taken months in the past. Now that entire thing can be shortened to a span of, like we were saying, a day. And if you just started with, what if that could be your day, and what does everything need to look like in order to deliver on that? I think it changes the game for individual ICs. So first I'd say start by building. It's the biggest unlock. Two, as you're building, it's important to figure out what systems you need in place. It's easy to build something and then say, oh, it doesn't work. How many times has that happened to you? Where you're like, it's not working, I'll just scratch the idea and let it sit.
For the most curious of the PMs, this is where having a data layer like Arize and the observability platforms is really helpful: you might not know why your agent gave you a bad response, or why the outcome wasn't that great, or what it was doing. And so you get observability to understand. It's like what we were talking about with the simple example of the tennis players: how do they look at their plays and figure out what went wrong, and how do they get that 1% better every single day? If you could get 1% better on your output every single day, I think the story is no longer about observability, about looking at your data. The story is about self-improvement: improvement of yourself as a PM, but also improvement of the products that you are building.
[70:44]
Aakash
So we used Arize's open source Phoenix platform, and then we used Arize, the paid platform, to do this. Those are two options. How does somebody make a decision, what does the overall ecosystem look like, and why would they choose Arize?
[71:00]
Aparna Dhinakaran
Yeah, great question. So Arize Phoenix, which was the open source one that we pulled all the GitHub issues from today, is an amazing option if you cannot send your data to an external platform. For most enterprises, most teams building any agents that have any PII data, the reality is that they want to self-host some initial observability so that they can get a feel, get started, and get an unlock. I think even Hamel has tweeted this before: his most favorite open source tool for observability is Arize Phoenix. It's got a super permissive license, and it's got almost everything that you just saw in the demo today out of the box for you. And all the skills that I shared using Claude Code, all those skills exist for Phoenix too. So you can just go build an agent and say, hey, help me instrument it, help me figure out insights from my traces, help me go write evals. Phoenix will actually go and do all of that for you today. Typically, where teams start to feel that the paid platform, the enterprise platform, makes sense is when data volume starts to scale. I think it's a good thing that these agents are starting to find product market fit in this environment. Right now the LLMs, the models, are getting better, and products are starting to find product market fit. And so we're starting to see almost terabytes of data. The volume and the scale is a big reason why, for teams that need that as their agents start to get material, it makes a ton of sense to have a more scaled-out platform for observability. This is where Arize AX is uniquely fit to solve that problem. We do this really well because we've actually had to invest in our own data store that we've been building for a while now, ADB, and it's a data store that's designed for AI workloads.
[73:25]
Aakash
So let's say I've figured out I need to pay for it, I have the huge amount of data. How do I decide who to work with?
[73:32]
Aparna Dhinakaran
The reason to pick Arize is really that we're the open and most independent platform out there. We are independent of framework. We don't actually care what framework you use. We have teams using everything from LangChain and the Claude Agent SDK to teams that are building without a framework. And so we're agnostic of whatever framework you use. The second thing is we deeply believe in the independence of your data. All of the trace data that we collect lives in open formats. Using our ADB data fabric, that data can be sent directly back to your data warehouse. And the reason that's really powerful is that you don't want your agent trace data, which is so valuable, to be locked inside of a proprietary platform. We make it accessible so that you can actually use the agent trace data as part of your context graph. We're also independent of instrumentation. If you don't know, we're actually the inventors of OpenInference. Our competitors, every single one of them that you mentioned, all use our instrumentation, and they've actually linked to it in their docs. And so we built probably the richest telemetry, and it shows in the fact that our instrumentation is widely adopted in the ecosystem. And then the last one, I think, is that we've been consistently one of the most innovative in the market. We were actually the first to ship LLM-as-a-judge. If you go back to 2023 and look at Phoenix and the repo, you'll see LLM-as-a-judge. We were the first to release OpenInference instrumentation. Alyx, which you saw in the product: we were the first to actually have an agent built into our product. The skills that we showed you how to use, we were actually the first to have and release them. Hamel actually did a talk about it with Mikyo, our open source lead.
And then, as I was mentioning, we have the first and only way right now in market to actually take all of these agent traces and have them in standard formats as part of your context graph. And I think it just shows we're probably the fastest innovator in the space right now.
[76:04]
Aakash
If somebody has just two hours this weekend, what are the things they should concretely go do and take away, so that they haven't just watched this episode, but are actually going to make an impact in their career?
[76:18]
Aparna Dhinakaran
If you have any two hours this weekend, I would say do literally what we just did right now, which is build an agent for yourself, for whatever would take away a couple hours of your week every week. Just something repetitive that you do every single week. And by the way, this isn't just for PMs. If you're someone who's in product marketing and you're writing release notes every week, ask: what is a workflow that I do every single week that takes a couple hours? Try to build an agent to go do that. I think what you'll learn out of that is, one, how insanely easy it is with Claude Code. And then you'll also, on the other hand, realize how much work it takes to actually make it really good. To make it better past that initial vibe code, the evals and the observability are so important. I said this in the beginning, but if you're a product person who has used observability and is looking at your traces and looking at your evals, you're probably already in the top 1% of PMs in the world right now.
[77:29]
Aakash
What are the biggest mistakes PMs are making when they do evals?
[77:33]
Aparna Dhinakaran
I think the biggest one is not starting with actual trace data. If you're just starting with what you think are problems, that's really hard. Even the skills, for example, that we used today, that Claude was using to build the evals: what's powerful about them is that we're trying to instill best practices. It's actually looking at all of the trace data to help suggest what the right evals could be. So I think PMs need to see that evals don't just come out of magic. They come out of your traces.
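The idea of starting from actual trace data can be sketched in a few lines of Python. This is a minimal illustration only, not the Arize or Phoenix API and not the skills used in the episode: the `traces` records, their field names, the hand-written `note` labels, and the helper functions `failure_modes` and `propose_evals` are all hypothetical. It assumes you have already reviewed some traces and tagged what went wrong in each.

```python
from collections import Counter

# Hypothetical, hand-reviewed trace records. In practice these would be
# exported from an observability tool; the field names are assumptions.
traces = [
    {"id": "t1", "input": "Summarize issue 12", "output": "", "note": "empty_output"},
    {"id": "t2", "input": "Top pain points?", "output": "Login is slow", "note": "ok"},
    {"id": "t3", "input": "Summarize issue 40", "output": "See issue 41", "note": "wrong_issue_cited"},
    {"id": "t4", "input": "Summarize issue 7", "output": "", "note": "empty_output"},
]


def failure_modes(records):
    """Count the failure notes observed in the traces, skipping traces marked ok."""
    return Counter(r["note"] for r in records if r["note"] != "ok")


def propose_evals(records, min_count=1):
    """Name one candidate eval per observed failure mode, most frequent first."""
    return [
        f"eval_{mode}"
        for mode, count in failure_modes(records).most_common()
        if count >= min_count
    ]


candidates = propose_evals(traces)
print(candidates)
```

The point of the sketch is the direction of the workflow: the list of candidate evals is derived from failures that actually appear in the traces, and the most common failure is listed first, instead of being guessed up front.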
[78:14]
Aakash
All right everybody, I'm going to put up Arize's pricing page for you. This is how much Arize costs. Now here's the cool thing. If you want to get AX Pro for 12 months for free for your team because you're convinced you want to create self-improving agents, you can do that with Aakash's bundle, or you can just use the free options that she's talked about, Phoenix and AX Free, to get started. It's that simple. I highly recommend every AI PM master the AI eval skill. Arize is one of the easiest ways to do it. Aparna, thank you so much for lending your expertise.
[78:52]
Aparna Dhinakaran
Awesome. Thank you so much, Aakash, it was awesome to be here.
[78:57]
Aakash
I hope you enjoyed that episode. A couple things you can do to support the show. One, comment. Two, review: those ratings and reviews really help other people understand the value and the production that we are putting into this. This wasn't an easy episode to produce. We put in a ton of pre-work. We edited it for you. We brought in the best guests. If you don't mind sharing a rating and review, sharing the episode with others, and making sure you are subscribed, that really helps the show do bigger and better productions. I'll see you in the next episode. Here's one of those that YouTube thinks would be a great fit for you.