Transcript
A (0:00)
Welcome to Lenny's Reads, where I bring you audio versions of my newsletter about building product, driving growth, and accelerating your career. In part two of our in-depth series on building AI product sense, Dr. Marily Nika, a longtime AI PM at Google and Meta and an OG AI educator, shares a simple weekly ritual that you can implement today to rapidly build your AI product sense. Everything from here was written by Marily and narrated by me. Let's get into it.

Meta recently added a new PM interview, the first major change to its PM loop in over five years. It's called Product Sense with AI, and candidates are asked to work through a product problem with the help of AI in real time. In this interview, candidates aren't judged on clever prompts, model trivia, or even flashy demos. They're evaluated on how they work with uncertainty: how they notice when the model is guessing, ask the right follow-up questions, and make clear product decisions despite imperfect information.

That shift reflects something bigger. AI product sense is becoming the new core skill of product management. It's the understanding of what a model can do and where it fails, and working within those constraints to build a product that people love.

Over the past year, I've watched the same pattern repeat across different teams at work and in my training: the AI works beautifully in a controlled flow, and then it breaks in production because of a handful of predictable failure modes. The uncomfortable truth is that the hardest part of AI product development comes when real users arrive with messy inputs, unclear intent, and zero patience. For example, a customer support agent can feel incredible in a demo, and then, after launch, quietly lose user trust by confidently answering ambiguous or underspecified questions instead of stopping to ask for clarification.

Through 10 years of shipping speech and identity features for conversational platforms and personalized experiences, like on-device assistants across diverse hardware portfolios, I started using a simple, repeatable workflow to uncover issues that would otherwise show up weeks later, building this AI product sense for myself first, and then with teams and students. It's not a theory or a framework, but rather a practice that gives you early feedback on model behavior, failure modes, and trade-offs, forcing you to see if an AI product can survive contact with reality before your users teach you the hard way.

When I run this process, two things happen quickly: I stop being surprised by model behavior, because I've already experienced the weird cases myself, and I get clarity on what's a product problem versus what's a model limitation.

In this episode, I'll walk through my three steps for building AI product sense: 1) map the failure modes and the intended behavior, 2) define the minimum viable quality, or MVQ, and 3) design guardrails where behavior breaks. Once that AI product sense muscle develops, you should be able to evaluate a product across a few concrete dimensions: how the model behaves under ambiguity, how users experience failures, where trust is earned or lost, and how costs change at scale. It's about understanding and predicting how the system will respond to different circumstances. In other words, the work expands from "Is this a good product idea?" to "How will this product behave in the real world?" Let's start building AI product sense.
First, map the failure modes and the intended behavior.

Every AI feature has a failure signature: the pattern of breakdowns it reliably falls into when the world gets messy. And the fastest way to build AI product sense is to deliberately push the model into those failure modes before your users ever do. I run the following rituals once a week, usually Wednesday mornings before my first meeting, on whatever AI workflow I'm currently building. Together, they take under 15 minutes and are worth every second. The results consistently surface issues for me that would otherwise show up much later in production.

Ritual 1 is asking a model to do something obviously wrong. This usually takes two minutes. The goal is to understand the model's tendency to force structure onto chaos. Take the kind of chaotic, half-formed, emotionally inconsistent data every PM deals with daily, think Slack threads, meeting notes, Jira comments, and ask the model to extract strategic decisions from it. That's because this is where generative models reveal their most dangerous pattern: when confronted with mess, they confidently invent structure.

Here's an example, a messy Slack thread. "Stripe failing for EU users again," Alice says. Ben replies, "No idea. Might be webhook." Then Sarah chips in, "Can we not rename the onboarding modal again?" Kyle responds, "Still haven't figured out what to do with dark mode." Alice re-enters the chat: "We need onboarding out by Thursday." Ben replies, "Wait, is the banner still broken on mobile?" Then Sarah says, "I can fix the copy later."

I asked the model to extract strategic product decisions from this thread, and it confidently hallucinated a roadmap, assigned the wrong owners, and turned offhand comments into commitments. This is the kind of failure signature every AI PM must design around. A visual summary of the chat is included in the written version of this episode. It looks authoritative, clean, structured, and it's completely wrong.

Now that you have the obviously wrong results, you'll need to generate the ideal response and compare the two responses to understand what signals the model needs to behave correctly. Here's exactly what to do.

First, rerun the same Slack thread through the model. Use the same messy context that caused the hallucination. One example you can paste in with the Slack thread is "Based on this Slack discussion, draft our Q4 roadmap." Let's say the model invents features you never discussed. Great, you found a failure mode.

Second, now tell the model what good looks like and run it again. Add one short line explaining the expected behavior, for example: "Try again, but only include items explicitly mentioned in the thread. If something is missing, say 'not enough information.'" Run that prompt against the exact same Slack thread. A correct, trustworthy behavior would be an answer like this one: "This conversation doesn't include any confirmed decisions. The participants surface problems, questions, and frustrations, but nothing in the thread reflects a finalized choice or assigned ownership. Here are the five themes discussed, without adding or inventing details: 1) possible Stripe failures for EU users, with uncertainty about whether the webhook is involved; 2) onboarding instability, including repeated renames and a push to ship by Thursday; 3) lack of direction for dark mode, raised again without resolution; 4) a question about whether the mobile banner is still broken; 5) a PM note about needing data on onboarding drop-offs. There are also uncertainties: no one confirms the cause of the Stripe issue; there's no agreement on onboarding scope or timing, only a mention of urgency; dark mode has no decision, timeline, or next step in the thread; and the mobile banner bug is raised as a question, not a known issue. Before I refine this further for the VP of Product, do you want me to identify open questions, blockers, or specific action items from these themes?"

This answer acknowledges the lack of clear decisions, asks clarifying questions, and surfaces useful structure without inventing facts. It avoids assigning owners unless explicitly stated, and highlights uncertainties instead of hiding them.

Next, compare the two outputs, and the inputs that led to them, side by side. This contrast of the two outputs above, confident hallucination versus humble clarity, is what teaches you how the model behaves today and what you need to design toward. And that contrast is where AI product sense sharpens fastest. You're looking for: What changed? What guardrail fixed the hallucination? What does the model need to behave reliably? Is it explicit constraints? Better context? Tighter scoping? Does the good version feel shippable or still brittle? What would the user experience in each version?

Finally, capture the gaps. This becomes a product requirement. When you see a failure mode repeat, it usually points to a specific kind of product gap and a specific kind of fix. A visual of common failure modes is included in the written version of this episode. Now you know where the product fails and its intended behavior. Later in this guide, I'll show concrete examples of what prompt and design guardrails and retrieval look like in practice, and how to decide when to add them.
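If you want to run Ritual 1 as a small script instead of retyping it in a chat window each week, a minimal sketch might look like this. It assumes the OpenAI Python SDK with an API key in your environment; the model name, thread text, and constraint line are placeholders taken from the example above, not a prescribed setup.

```python
# Ritual 1 sketch: run the same messy thread twice, once unconstrained and once with a
# guardrail line, then compare the two outputs by eye.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MESSY_THREAD = """
Alice: Stripe failing for EU users again
Ben: no idea, might be webhook
Sarah: can we not rename the onboarding modal again?
Kyle: still haven't figured out what to do with dark mode
Alice: we need onboarding out by Thursday
Ben: wait, is the banner still broken on mobile?
Sarah: I can fix the copy later
"""

BASE_PROMPT = "Based on this Slack discussion, draft our Q4 roadmap."
GUARDRAIL = (
    "Only include items explicitly mentioned in the thread. "
    "If something is missing, say 'not enough information'."
)

def run(prompt: str) -> str:
    # Keep the thread identical across runs so the instruction is the only variable.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model your product actually ships on
        messages=[{"role": "user", "content": f"{prompt}\n\nThread:\n{MESSY_THREAD}"}],
    )
    return response.choices[0].message.content

print("--- Unconstrained: expect a confident, invented roadmap ---")
print(run(BASE_PROMPT))
print("--- Constrained: expect humble clarity ---")
print(run(BASE_PROMPT + " " + GUARDRAIL))
```

The point isn't automation; it's that holding the thread and base prompt fixed makes the before-and-after contrast easy to see every week.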
Moving on to Ritual 2, which is asking a model to do something ambiguous. This takes roughly three minutes. The goal is to understand the model's semantic fragility. Ambiguity is kryptonite for probabilistic systems, because if a model doesn't fully understand the user's intent, it fills the gaps with its best guess, that is, hallucinations or bad ideas. That's when user trust starts to crack.

Try, for example, to input a PRD into NotebookLM and ask it to summarize the PRD for the VP of Product. Here's how to try this in two minutes: open NotebookLM, then click Create a New Notebook. From there, upload a PRD. Next, ask "Summarize this for execs and list the top five risks and open questions." Does it over-summarize? Latch onto one irrelevant detail? Ignore caveats? Assume the wrong audience? The model's failures reveal where its semantic fragility lies, and in what ways the model technically understands your words but completely misses your intent. Other examples could be if you ask for a summary for leaders and it gives you a bullet list of emojis and jokes from the thread, or you ask for UX problems and it confidently proposes a new pricing model.

What you're learning here is where the model gets confused, which is exactly where your product should step in and do the work to reduce ambiguity. That could mean asking the user to choose a goal, like summarize for whom, giving the model more context, or constraining the action so the model can't go off track. You're not trying to trick the model. You're trying to understand where communication breaks so you can prevent misunderstanding through design.

Here are a few ambiguous prompts to try, along with the different interpretations you should explicitly test.

Try "Rewrite this for your target users." The ambiguity to test: run it multiple ways, new users vs. power users vs. enterprise vs. accessibility. What you observe: it confidently writes for the wrong user. Intended behavior: the user selects the target persona, or the model asks for the audience.

Here's the next prompt to try: "Improve this onboarding flow." The ambiguity to test: run it multiple ways, shorter vs. safer vs. more guided vs. more magical. What you observe: it proposes random changes with no goal. Intended behavior: the user defines the success metric, or the model proposes options.

You can also try this one: "Sort these requests by importance." The ambiguity to test: run it multiple ways, revenue vs. retention vs. usability vs. delight. What you observe: it optimizes for an unstated metric. Intended behavior: the model requests a prioritization lens, pick a metric, audience, or time horizon, before sorting.

Or try "Group these support tickets into themes." The ambiguity to test: run it multiple ways, by intent vs. severity vs. product area vs. frequency. What you observe: it clusters inconsistently or mixes levels. Intended behavior: the model applies a consistent taxonomy, or the user provides a schema.

Another prompt to try is "Draft a roadmap for this feature." The ambiguity to test: run it multiple ways, vision vs. phasing vs. resourcing vs. dependencies. What you observe: it hallucinates ownership and timelines. Intended behavior: the model flags missing data and asks for specific clarifiers.

Now you have another batch of design work for the AI product, to help guide it toward predictable and trustworthy results.
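If you'd rather run those ambiguity variations in a loop than retype them, here's a rough sketch in the same spirit. It again assumes the OpenAI Python SDK; the PRD file, audience list, and model name are placeholders, and the loop simply swaps the framing so you can watch how far the outputs drift.

```python
# Ritual 2 sketch: send the same underlying ask with different audience framings and compare
# how much the output swings. Big swings mark where the product should collect intent up front.
from openai import OpenAI

client = OpenAI()

PRD_TEXT = open("prd.txt").read()  # placeholder: any PRD or long doc you have handy

AUDIENCES = [
    "new users",
    "power users",
    "enterprise admins",
    "users who rely on accessibility features",
]

for audience in AUDIENCES:
    prompt = (
        f"Summarize the following PRD for {audience} and list the top five risks "
        f"and open questions.\n\n{PRD_TEXT}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"=== {audience} ===")
    print(response.choices[0].message.content)
```

Where the answers diverge wildly is exactly where a persona picker, a goal field, or a clarifying question belongs in the product.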
Okay, moving on to Ritual 3: ask a model to do something unexpectedly difficult. This takes roughly three minutes. The goal is to understand the model's first point of failure. Pick one task that feels simple to a human PM but stresses a model's reasoning, context, or judgment. You're not trying to exhaustively test the model; you're trying to see where it breaks first, so you know where the product needs organizing structure. Where it starts to go wrong is exactly where you need to design guardrails, narrow inputs, or split the task into smaller steps. Note that this isn't the final solution yet; it's the intended behavior. In the guardrail section later, I'll show how to turn this into an explicit rule in the product prompt or UX fallback behavior.

Let's look at an example: "Group these 40 bugs into themes and propose a roadmap." The following framework examines model behavior across four dimensions: what you're asking the model to do, what you observe, what that tells you, and the intended behavior.

In scenario one, you're asking the model to cluster 40 bugs and propose a roadmap. What you observe: it drops earlier bugs as context grows. What that tells you: long-context limits. Intended behavior: the model processes inputs in digestible batches, or prompts for scoping.

In scenario two, you're asking the model to cluster and create consistent themes. You observe that it clusters inconsistently. That tells you the constraints are unclear. The intended behavior is that the model maps outputs to a fixed taxonomy or requests a category list.

In scenario three, you're asking the model to propose prioritization. What you observe: it invents priorities without evidence. What that tells you: importance is underspecified. Intended behavior: the model surfaces a prioritization lens or asks for specific weighting.

Let's look at another example: "Summarize this PRD and flag risks for leadership." In scenario one, you're asking the model to summarize for leadership. What you observe: it sounds confident but misses real risks. What that tells you: risk detection isn't automatic. Intended behavior: the model evaluates content against specific risk categories, privacy, cost, trust, ops.

In scenario two, you're asking the model to identify key risks. What you observe: it over-indexes on obvious sections. What that tells you: the model follows surface-level cues. Intended behavior: the model explicitly searches for unknowns and edge cases.

In scenario three, you're asking the model to communicate uncertainty. What you observe: it smooths ambiguity instead of surfacing it. What that tells you: uncertainty needs to be designed. Intended behavior: the model highlights confidence levels and requests missing context.

With the results from all three rituals, you now have a complete list of product design work that needs to happen to get results you and your users can use and trust. Over time, this kind of work also starts to surface second-order effects, moments where a small AI feature quietly reshapes workflows, defaults, or expectations. System-level insights come later, once the foundations are solid. The first goal is to understand behavior.
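As one rough illustration of the "digestible batches plus fixed taxonomy" intended behavior from the bug-clustering example, a sketch might look like the following. The taxonomy, file name, and batch size are placeholders, under the same SDK assumption as before; the idea is simply that nothing gets silently dropped or re-themed.

```python
# Ritual 3 sketch: instead of pasting all 40 bugs at once, feed them in small batches and force
# every item into a fixed category list, so long-context limits and invented themes show up less.
from openai import OpenAI

client = OpenAI()

TAXONOMY = ["payments", "onboarding", "dark mode", "mobile UI", "other / needs triage"]

def classify_batch(bugs: list[str]) -> str:
    prompt = (
        "Assign each bug below to exactly one of these categories: "
        + ", ".join(TAXONOMY)
        + ". If none fits, use 'other / needs triage'. Do not invent categories or priorities.\n\n"
        + "\n".join(f"- {bug}" for bug in bugs)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

bugs = [line.strip() for line in open("bugs.txt") if line.strip()]  # placeholder: one bug per line

BATCH_SIZE = 10  # small enough that early items don't fall out of the model's attention
for start in range(0, len(bugs), BATCH_SIZE):
    print(classify_batch(bugs[start : start + BATCH_SIZE]))
```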
Even when you understand a model's failure modes and have designed around them, it's nearly impossible to entirely predict how AI features will behave once they hit the real world. But performance almost always drops once they're out of the controlled development environment; you just don't know how it will drop or by how much. One of the best ways to keep the bar high from the start is to define a minimum viable quality, or MVQ, and check it against your product throughout development.

A strong MVQ explicitly defines three bars: 1) an acceptable bar, where it's good enough for real users; 2) a delight bar, where the feature feels magical; and 3) a do-not-ship bar, the unacceptable failure rates that will break trust. Also important in MVQ is the product's cost envelope, the rough range of what this feature will cost to run at scale for your users.

A concrete example of MVQ comes from my firsthand experience. I spent years working in speech recognition and speaker identification, a domain where the gap between lab accuracy and real-world accuracy is painfully visible. I still remember demos where the model hit over 90% accuracy in controlled tests and then completely fell apart the first time we tried it in a real home. A barking dog, a running dishwasher, someone speaking from across the room, and suddenly the great model felt broken. And from the user's perspective, it was broken.

With speaker identification for AI features on smart speakers, the MVQ for the ability to identify who is speaking would look like this. An acceptable bar: correctly identifies the speaker X percent of the time in typical home conditions, and recovers gracefully when unsure: "I'm not sure who's speaking. Should I use your profile or continue as a guest?" To hit a delight bar, you don't need a perfect percentage, but you look for behavioral signals, like users stop repeating themselves or rephrasing commands, or corrections drop sharply. Here's a good rule of thumb: if 8 or 9 out of 10 attempts work without a retry in realistic conditions, it feels magical. If one in five needs a retry, trust erodes fast. MVQ also depends on the phase you're in. In a closed beta, users often tolerate rough edges because they expect iteration; in a broad launch, the same failure modes feel broken.

For the speech recognition feature, here are the examples for assessing delight. First, the background chaos test: play a video in the background while two people talk over each other, and see if the assistant still responds correctly without asking, "Sorry, can you repeat that?" Second, the 6 p.m. kitchen test: dishwasher running, kids talking, dog barking, and the smart speaker still recognizes you and gives a personalized response without an "I couldn't recognize your voice" interruption. Third, the mid-command correction test: you say, "Set a timer for 10 minutes, actually make it five," and it updates correctly instead of sticking to the original instruction.

A do-not-ship bar: misidentifies the speaker more than Y percent of the time in critical flows, or forces users to repeat themselves multiple times just to be recognized.
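One way to keep those bars honest during development is to write them down as numbers and check every eval run against them. Here's a minimal sketch; the thresholds are made-up placeholders rather than recommendations, since, as the next section covers, the right values depend entirely on your strategic context.

```python
# MVQ sketch: hypothetical bars for a speaker-identification feature, checked against one eval run.
ACCEPTABLE_BAR = 0.85   # placeholder: minimum share of correct identifications in typical home noise
DELIGHT_BAR = 0.95      # placeholder: roughly "9 out of 10 attempts work without a retry"
DO_NOT_SHIP_BAR = 0.70  # placeholder: below this in critical flows, trust breaks

def mvq_verdict(correct: int, total: int) -> str:
    rate = correct / total
    if rate < DO_NOT_SHIP_BAR:
        return f"{rate:.0%}: below the do-not-ship bar"
    if rate >= DELIGHT_BAR:
        return f"{rate:.0%}: at or above the delight bar"
    if rate >= ACCEPTABLE_BAR:
        return f"{rate:.0%}: acceptable; keep tightening"
    return f"{rate:.0%}: above do-not-ship but below acceptable; fix before a broad launch"

# Example: 41 correct identifications out of 50 attempts on a noisy-kitchen eval set (made-up numbers).
print(mvq_verdict(41, 50))
```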
You may have noticed I didn't actually assign values to each bar. That's because the specific thresholds for MVQ, your acceptable, delight, and do-not-ship bars, aren't fixed. They depend heavily on your strategic context. So let's look at five strategic context factors that raise or lower your MVQ bar. Here are the five factors that most often determine where that bar should be set, and how they change your product decision.

In scenario one, you're the first to ship this capability. What it means for your MVQ: you have more room for error. Users have no comparison point, so an imperfect feature can still feel magical and valuable. How you adjust your MVQ: ship earlier with clear beta framing and obvious undo/edit.

In scenario two, you're the second or third entrant. What it means for your MVQ: your MVQ must meet or exceed the existing baseline. Shipping "almost as good" usually feels worse than not shipping at all. How you adjust your MVQ: require higher accuracy and consistency before launch.

In scenario three, your strategy is launch early and learn. What it means for your MVQ: you may tolerate a lower MVQ to validate demand, as long as failures are visible, recoverable, and don't break trust. How you adjust your MVQ: limit scope to one workflow and add a tight feedback loop.

In scenario four, your use case has high risk exposure. What it means for your MVQ: your MVQ must be significantly higher. Small errors in finance, health, or family contexts compound quickly into lost trust. How you adjust your MVQ: add human review, confirmations, or a read-only mode.

In scenario five, your brand promises "it just works." What it means for your MVQ: your MVQ needs to be conservative. Users expect reliability, not experimentation, even if the feature is novel. How you adjust your MVQ: prefer a conservative rollout and stricter failure handling.

One of the most common mistakes new AI PMs make is falling in love with a magical AI demo without checking whether it's financially viable. That's why it's important to estimate the AI product or feature's cost envelope early. The cost envelope is the rough range of what this feature will cost to run at scale for your users. You don't need perfect numbers, but you need a ballpark. Start with five questions: 1) What's the model cost per call, roughly? 2) How often will users trigger it per day or month? 3) What's the worst-case scenario, power users and edge cases? 4) Can caching, smaller models, or distillation bring this down? And 5) if usage grows 10x, does the math still work?

One example is AI meeting notes. The per-call cost is roughly $0.02 to process a 30-minute transcript. Average usage is 20 meetings per user per month, which costs around $0.40 per month per user. But heavy users are at 100 meetings per month, which costs $2 per month per user. With caching and a smaller model for low-stakes meetings, maybe you bring this down to 25 to 30 cents per month per user. Now you can have a real conversation. A feature that effectively costs $0.30 per month per user and drives retention is a no-brainer. A feature that ends up at $5 per month per user with unclear impact is a business problem. This is a core part of AI product sense: does what you're proposing actually make sense for the business?
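If it helps, that back-of-the-envelope math fits in a few lines you can rerun with your own numbers. The figures below are just the illustrative ones from the meeting-notes example, and the savings factor is an assumption, not a measured result.

```python
# Cost envelope sketch: rough monthly cost per user for an AI meeting-notes feature.
COST_PER_CALL = 0.02             # dollars to process one 30-minute transcript
AVG_MEETINGS_PER_MONTH = 20
HEAVY_MEETINGS_PER_MONTH = 100
ASSUMED_SAVINGS = 0.30           # assumption: caching + a smaller model for low-stakes meetings

avg_cost = COST_PER_CALL * AVG_MEETINGS_PER_MONTH        # $0.40 per user per month
heavy_cost = COST_PER_CALL * HEAVY_MEETINGS_PER_MONTH    # $2.00 per user per month
optimized_avg = avg_cost * (1 - ASSUMED_SAVINGS)         # about $0.28 per user per month

print(f"Average user:   ${avg_cost:.2f}/month")
print(f"Heavy user:     ${heavy_cost:.2f}/month")
print(f"Optimized avg:  ${optimized_avg:.2f}/month")
print(f"If average usage grows 10x: ${avg_cost * 10:.2f}/month")
```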
Now that you better understand where a model's behavior breaks and what you're looking for to greenlight a launch, it's time to codify some guardrails and design them into the product. A good guardrail determines what the product should do when the model hits its limit, so that users don't get confused, misled, or lose trust. In practice, guardrails protect users from experiencing a model's failure modes.

At a startup I've been collaborating with, we built an AI feature to increase the team's productivity that summarized long Slack threads into decisions and action items. In testing, it worked well until it started assigning owners for action items when no one had actually agreed to anything yet. Sometimes it even picked the wrong person. Because my team had developed our AI product sense, we figured out that the fix was a new guardrail in the product, not a different underlying model. So we added one simple rule to the system prompt, in this case just a line of additional instruction: "Only assign an owner if someone explicitly volunteers or is directly asked and confirms. Otherwise, surface themes and ask the user what to do next." That single constraint eliminated the biggest trust issue almost immediately. Here's what good guardrails look like in practice.

This is the end of your free preview. To hear the full episode, become a paid subscriber at lennysnewsletter.com/subscribe. If you're already a premium member, you can add the private feed to your podcast app by going to add.lennysreads.com. Thanks for listening, and see you on the next one.
