Podcast Summary: Lenny's Reads
Episode: How to do AI analysis you can actually trust
Host: Lenny Rachitsky
Guest Author: Caitlin Sullivan (content written by Caitlin, narrated by Lenny)
Date: February 17, 2026
Episode Overview
In this episode, Lenny Rachitsky presents an audio edition of Lenny's Newsletter, featuring a guest post by Caitlin Sullivan, a leader in the use of AI for customer research. The main objective is to address a critical challenge: Why AI-powered analysis of user research data is often unreliable, and how to extract genuinely trustworthy, actionable insights from tools like ChatGPT, Claude, and Gemini. Caitlin shares her most effective techniques and prompting frameworks to avoid misleading AI-generated conclusions—skills honed from advising companies like Canva and YouTube and training hundreds of product professionals.
Key Discussion Points & Insights
The Core Problem of AI in User Research
- Confident but Wrong Outputs: AI models present results with high confidence, even when their analysis contains invented facts, hallucinated quotes, false insights, or biased recommendations.
- "The problem with AI is that the output always looks confident, even when it's full of lies, made up quotes, false insights, and completely wrong conclusions." (00:20, Lenny reading Caitlin)
- Confirmation Bias in AI: AI may cherry-pick certain quotes or feedback, giving an inaccurate representation of the data and leading to poor decisions.
- Example: Two LLMs analyzing the same data can output wildly different narratives, each delivered as if it were unquestionable truth.
Why Is AI Analysis So Difficult?
- Interviews:
- Unstructured and messy; AI models impose tidy summaries too quickly, missing contradictions and nuance.
- Real analysis requires sitting with the “mess,” catching shifts, tangents, and contradictions.
- "LLMs handle this by imposing structure and jumping to conclusions a bit too fast." (07:02)
- Surveys:
- Not as clean as they appear; metadata, internal codes, and sparse responses can confuse AI.
- Without clear instructions, AI may misinterpret internal tags or generic answers, yielding weak signal.
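An illustrative instruction for this (paraphrased, not a verbatim prompt from the episode): "The columns Status and Segment are internal tags, not participant answers; exclude them from theme extraction. Treat one-line responses such as 'It's not for me' as low-signal, and do not build a theme on them alone."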
Four Major Failure Modes in AI Analysis (and Solutions)
1. Invented Evidence
- Types of Hallucination:
- Completely fictionalized quotes
- "Frankenstein" quotes (merged snippets from various speakers)
- Quote Verification is Crucial:
- LLMs generate text based on probability, not retrieval—so “verbatim” quotes may be made up.
- Even participant IDs and timestamps can be fabricated by AI.
- Solution:
- Define explicit quote rules in your prompts (what counts as a valid quote, use of qualifiers, etc.)
- Always verify quotes in the output—manually or by using another LLM check.
“For each quote in the analysis above: Confirm the quote exists verbatim in the source transcript. If the quote is a close paraphrase but not exact, flag it and provide the actual wording. If the quote cannot be located, mark as not found.” (32:12)
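A minimal sketch of the deterministic half of this check, in Python, assuming transcripts are available as plain text. The function names, normalization rules, and similarity threshold are illustrative choices, not from the episode; the sketch mirrors the three outcomes in the verification prompt above (verbatim, close paraphrase, not found).

```python
import difflib
import re

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences don't produce false "not found" results."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def check_quote(quote: str, transcript: str, threshold: float = 0.85) -> str:
    """Classify an AI-extracted quote against its source transcript:
    'verbatim', 'close paraphrase', or 'not found'."""
    q, t = _normalize(quote), _normalize(transcript)
    if q in t:
        return "verbatim"
    # Slide a quote-sized window across the transcript, keeping the best
    # similarity ratio; above the threshold we call it a paraphrase.
    window, step = len(q), max(1, len(q) // 4)
    best = 0.0
    for start in range(0, max(1, len(t) - window + 1), step):
        ratio = difflib.SequenceMatcher(None, q, t[start:start + window]).ratio()
        best = max(best, ratio)
    return "close paraphrase" if best >= threshold else "not found"
```

A string check like this only catches surface-level fabrication: a "Frankenstein" quote stitched from real fragments can still score as a close paraphrase even though no participant said it, which is where the second-LLM check mentioned above earns its keep.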
2. False or Generic Insights
- Issue:
- AI outputs are often "true but useless" (e.g., “price matters,” “users want reliability”), lacking depth or specificity.
- "The AI analysis just told me what I already know." (39:41)
- Why It Happens:
- LLMs are pattern-finding machines, biased toward consensus and training set priors.
- Sparse survey responses (e.g., “It’s not for me”) get lumped into overly broad themes.
- Solution:
- Provide thorough and specific context in prompts, covering:
- Project context (the scope—what decision are you making?)
- Business goal (what are you trying to achieve? e.g., attract new users without alienating existing ones)
- Product context (domain knowledge, e.g., a screenless wearable competing with the Apple Watch)
- Participant overview (who is making the statement?)
"Effective context loading has at least four components that shape how AI interprets everything that follows." (52:01)
3. Insights That Don’t Guide Decisions
- Symptoms:
- Output themes are so broad or irrelevant they cannot inform a real business decision.
- Clusters found in survey analyses may aggregate information in ways that don’t correspond to actionable next steps.
- Solution:
- Use context and pointed objectives in your prompt so LLMs are constrained to analyzing information relevant to your current decision.
4. Contradictory Insights Not Surfaced
- Issue:
- LLMs often flatten nuances and contradictions in user responses, losing the valuable tension that informs good product decisions.
- Solution:
- Explicitly instruct the model to highlight contradictions in data, preserve original nuanced language (hedges, qualifiers), and extract both quotes and participant context.
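An illustrative instruction along these lines (paraphrased, not a verbatim prompt from the episode): "Where participants disagree, or a participant contradicts themselves, surface the contradiction explicitly rather than averaging it away. Preserve hedges and qualifiers ('maybe,' 'I guess,' 'only if') in extracted quotes, and cite the participant and surrounding context for each side of the tension."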
Choosing and Using AI Models for Analysis
- Model Differences:
- Claude: Best for in-depth, nuanced analysis; covers more ground, but themes need validation afterwards.
- Gemini / NotebookLM: Strongest at generating highly evidenced themes and analyzing video (unique ability), but less complete—needs multiple prompts.
- ChatGPT: Most creative for formatting and communication, but least reliable for evidence—often combines or summarizes rather than quoting verbatim.
- Model Recommendation:
- For analysis work, Claude is preferred for its thoroughness and coverage.
- ChatGPT is commonly used but is the most prone to the failure modes discussed; the prompting fixes shared here improve results across any LLM.
Memorable Quotes
- “AI finds themes that are too broad and generic to act on, or biased by what you accidentally primed it with.” (41:27)
- “Confirmation bias is not a human-only thing; AI is easily led.” (23:11)
- “LLMs don’t retrieve quotes; they generate text that looks like what a quote should be.” (30:45)
- “This often takes just an extra five minutes... but it catches errors that would otherwise undermine the evidence behind your product decisions.” (37:41)
Actionable Prompting Frameworks
Quote Rules to Add to Your Analysis Prompts
"Start where the thought begins and continue until fully expressed. Include reasoning, not just conclusions. Keep hedges and qualifiers—they signal uncertainty. Include emotional language when present. Cite with participant ID and approximate timestamp. Do not combine statements from different parts of the interview. If a quote would exceed three sentences, break it into separate quotes." (35:10)
Verification Prompt Example
“For each quote in the analysis above: Confirm the quote exists verbatim in the transcript. If the quote is a close paraphrase, flag and provide the actual wording. If the quote cannot be located, mark as not found...” (36:05)
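A sketch of how this verification prompt might be wired into a second-model pass (the "another LLM check" mentioned under failure mode 1). `call_llm` here is a hypothetical stand-in for whatever model client you use, not a real API:

```python
# `call_llm` is a placeholder: any function that takes a prompt string
# and returns the model's reply (OpenAI, Anthropic, or Gemini clients fit).
VERIFY_PROMPT = """For each quote in the analysis below:
1. Confirm the quote exists verbatim in the source transcript.
2. If it is a close paraphrase but not exact, flag it and provide the actual wording.
3. If it cannot be located, mark it as NOT FOUND.

Source transcript:
{transcript}

Analysis to verify:
{analysis}"""

def verify_with_llm(call_llm, transcript: str, analysis: str) -> str:
    """Run the verification pass, ideally with a different model than the
    one that produced the analysis, so it isn't grading its own work."""
    return call_llm(VERIFY_PROMPT.format(transcript=transcript, analysis=analysis))
```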
Context Loading Checklist (for analysis prompts)
- Project context: What is the specific decision or feature being explored?
- Business goal: What are we trying to achieve or decide?
- Product context: What domain or product-specific details are relevant?
- Participant overview: Who are the users/customers generating this feedback?
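A minimal sketch of how the checklist and the quote rules might be assembled into one analysis prompt, assuming plain-text transcripts. The `ResearchContext` fields mirror the four checklist items; all names are illustrative, not from the episode.

```python
from dataclasses import dataclass

# Condensed from the quote rules framework above.
QUOTE_RULES = """Start where the thought begins and continue until fully expressed.
Include reasoning, not just conclusions. Keep hedges and qualifiers.
Include emotional language when present.
Cite with participant ID and approximate timestamp.
Do not combine statements from different parts of the interview."""

@dataclass
class ResearchContext:
    # The four context-loading components from the checklist above.
    project: str        # the specific decision or feature being explored
    business_goal: str  # what the team is trying to achieve or decide
    product: str        # domain- or product-specific details
    participants: str   # who generated this feedback

def build_analysis_prompt(ctx: ResearchContext, transcript: str) -> str:
    """Assemble context, quote rules, and source data into one prompt,
    so the model analyzes against the actual decision at hand."""
    return "\n\n".join([
        f"Project context: {ctx.project}",
        f"Business goal: {ctx.business_goal}",
        f"Product context: {ctx.product}",
        f"Participant overview: {ctx.participants}",
        f"Quote rules:\n{QUOTE_RULES}",
        "Analyze the transcript below. Surface contradictions explicitly "
        "and flag anything that bears directly on the decision above.",
        f"Transcript:\n{transcript}",
    ])
```

Front-loading the decision this way keeps the model's themes tied to the choice you actually need to make, rather than to whatever generic patterns dominate its training priors.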
Timestamps of Important Segments
- 00:00–02:00 — Introduction and episode premise
- 02:00–09:00 — Why AI analysis fails with interviews vs. surveys
- 09:00–12:30 — Illustrative examples of misleading vs. trustworthy AI output
- 14:40–17:00 — Four common failure modes in AI analysis
- 18:00–32:00 — Failure Mode 1: Invented evidence and how to fix it
- 32:10–41:00 — Failure Mode 2: False or generic insights and prompting solutions
- 41:30–52:00 — Prompt structure and context loading for actionable AI insights
- 55:00–End — Closing thoughts (preview cuts off)
Overall Takeaways
- Verification is non-negotiable when using AI for research analysis—never trust AI outputs at face value, especially for quotes or nuanced insights.
- Prompting quality makes or breaks results—clear rules, context, and verification steps significantly increase the trustworthiness and utility of AI-driven analysis.
- Model choice matters, but process matters more—any major LLM benefits from these prompting strategies.
For visuals or detailed walkthroughs, check the written version linked in the show notes.
