Summary6 min read

Just Now Possible with Teresa Torres

Episode: Building Earmark: How a Two-Person Team Turned Meetings into Finished Work
Date: February 5, 2026
Host: Teresa Torres
Guests: Mark Barbier (Co-founder & CEO, Earmark) and Sanon (Co-founder & CTO, Earmark)

Episode Overview

This episode explores the creation and evolution of Earmark, a productivity AI tool developed by a two-person founding team to transform meetings into actionable deliverables for product builders. Teresa Torres hosts a deep-dive conversation with Mark Barbier (CEO) and Sanon (CTO), covering the impetus behind Earmark, the specific pain points it aims to solve, its unique technical architecture, real-world usage, and the lessons learned as they iterated on both product and workflow.

Key Themes and Insights

1. Defining the Problem: Meetings That Create More Work

The "Infinite Workday"
Mark references a Microsoft Research paper defining the "infinite workday": back-to-back meetings, endless context-switching, and little time for focused work. Earmark addresses the resulting "fight or flight" response, where practitioners struggle to do their actual jobs (02:17).
Core Customer
Earmark is tailor-made for product builders—product managers, engineers, designers—especially teams overwhelmed with the administrivia of tracking and following up on meeting outcomes (01:07, 03:02).

2. What Makes Earmark Unique vs. Other Meeting Tools

Beyond Note-Taking: Finished Work, Not Just Summaries
Earmark distinguishes itself by tracking not just notes but generating finished artifacts (product specs, Jira/Linear tickets, live prototypes, slides) in real time, reducing manual follow-up (03:51).

"At the end of the meeting...it really attempts to give you finished work before your meeting ends." — Sanon (03:51)
Real-Time, Live Functionality
Multiple AI agents operate in parallel, providing live artifact generation and support, including on-the-fly translations of engineering jargon, custom compliments, and more (04:30–05:39).

"As you are speaking, you actually have essentially multiple agents that are running in parallel." — Sanon (04:30)
Templates and Personas
A suite of live templates includes acronym explanations, "make me look smart" prompts, actionable minutes, and agent personas (e.g., Security Architect, Accessibility, Legal) that ask contextually relevant questions as if key stakeholders were present (06:01–08:12).

3. Product Evolution: From Apple Vision Pro to AI Meeting Agent

Early Pivot
The initial idea was an immersive AR/VR tool for communication skill-building on Apple Vision Pro—a pivot was necessary due to limited market size and actual user behaviors (17:21–19:02).
Validation Through Usage, Not Data Storage
The first web versions didn’t store data; everything was ephemeral. This “no storage” approach surprised them by helping win over enterprise customers concerned about privacy and served as early validation. Enterprises liked evaluation-friendly, non-persistent data (16:10–16:41).

"You're telling me you're not going to store or train [on] data. We couldn't even do it if we wanted to. That actually has helped us get in the door..." — Sanon (16:10)
Iterative Prototyping Based on Customer Conversation
Five major product iterations were made with continuous customer feedback loops, closely monitoring synchronous Slack channels and proactively adopting customer language (19:02).

4. Technical Architecture and AI Strategies

Live Meeting Processing with Multiple Agents

Speech-to-text via Assembly AI streams every ~30 seconds.
Delta-based transcript batching minimizes LLM costs.
Uses OpenAI (primarily GPT-4.1 for “prose quality,” but tests various models) (23:13–25:07).

Prompt Caching for Cost Efficiency

"Prior to prompt caching, the economics of this tool were actually completely untenable—at one point it was around $70 for an hour meeting... now, it's sub a dollar." — Sanon (23:13–24:58)

Prompt caching is essential for live, affordable operations. OpenAI now offers prompt caching similar to Anthropic APIs (28:26).
Only the transcript is passed as conversation history; cards/templates are top-level, fresh prompts—preventing prompt pollution and optimizing cache utilization (32:11–34:30).

No Speaker Diarization

Chose not to do live speaker attribution due to technical reliability issues—wrong names are more harmful than none (21:57).

Data and Privacy

Ephemeral by design; data is not written to a persistent store by default, favoring enterprise privacy. Can be set globally for organizations (52:30).

5. Information Retrieval & RAG Challenges

Beyond Simple Vector Search
Earmark is pushing past basic vector search or keyword RAG for multi-meeting analytics and synthesis.

"A lot of our users are actually wanting to do more analysis-based questions...the answer doesn’t live in the transcript—RAG with cosine similarity won’t help." — Sanon (35:47)
Hierarchy and Pyramids of Data Borrowing ideas like the "data pyramid" (raw transcripts at the base, increasingly abstracted insights above)—to narrow search spaces and precompose likely-needed summaries (42:30).

"Ideally we can find the answers at the top of the pyramid, but if we need to, we can walk all the way down to the transcript." — Sanon (44:14)
Agent-Native Workflows
Implements Dan Schipper’s "agent-native architecture": not just RAG but full agents that tool around databases, keywords, metadata, and pre-computed summaries for nuanced queries (35:50–39:56).

6. Artifacts, Templates, and Customization

From Talking to Working Artifacts Any meeting conversation can become:
- Product specs (pushed to Jira/Linear)
- Prototypes via V0 or Cursor
- Presentation decks (“generate presentation from these three conversations, sprinkle in some emojis”)
- Detailed requirements, custom prompt-based outputs
"...what used to be a week in duration or two weeks...now it's just immediate...all the artifacts required from the kickoff ready by the end." — Mark (48:24)
Iterative & Inclusive Use
- Templates offer a “magic moment” MVP; power users graduate to custom prompts that better fit exact use cases (46:22–47:36).

7. Measuring and Maintaining Quality

Hallucination Control Explicit escape hatches in prompts (“if you don’t know, say so”) are crucial for LLMs to avoid inventing facts (49:32).
Trust Through Provenance Teresa’s suggestion of requiring LLMs to provide timestamped or line-referenced proof of any fact—improves both hallucination prevention and user trust (54:28).
Human-Centric Evals Evaluations are currently conducted by hand—looking for usage signals like artifact copying and manual review in the absence of stored data. Tech debt includes formalizing evals after establishing user habits (49:32–51:47, 56:39).

Notable Quotes & Moments

Mission-Driven Product Building

"We can't imagine not building for people like us." — Mark (00:37)
On Data Ephemerality’s Surprising Value

"Enterprise prospects saw that as almost a feature...‘Oh, like you're telling me you're not going to store, not going to train…’" — Sanon (16:10)
On Their Philosophy of Innovation

“The best pivots take you back home.” — quoting Dalton Caldwell (19:02)
Prompt Engineering Wisdom

"It's a lot better to give the model the specific content rather than just stuff it with as much context...which yields worse results.” — Sanon (35:08)
AI as Chief of Staff Vision

“Nothing falls through the cracks. Deliverable quality is there...to provide comfort and folks feel truly supported in these servant leadership roles.” — Mark (58:29)
UI/UX Learning

"We noticed...our users fixated on the transcript. If it misspelled a name, they'd want to edit. What if we solve this through UX: minimize it to subtitles?" — Sanon (20:33)

Important Timestamps

[01:16] Mark describes Earmark and problem space (“productivity suite where work completes itself”)
[03:51] Sanon explains real-time, finished work creation—difference from Granola
[05:39] Teresa probes live agent functionality
[13:58] Vision for multiplayer/team features
[17:21] The original Apple Vision Pro idea and user research/pivot
[23:13] Deep technical details—speech-to-text, transcripts, prompt caching
[32:11] Challenges in agent prompt caching, preventing context pollution
[35:47] Why basic RAG (vector/keyword search) isn’t enough
[42:30] Teresa’s “data pyramid” analogy for scalable synthesis
[44:33] Concrete example: meetings to engineering specs/prototypes
[49:32] Evaluation, hallucinations, and privacy
[58:29] Vision: the future AI chief of staff for product teams

Concluding Summary

Earmark stands as an example of high-leverage, customer-obsessed AI product development: A tiny team is shaping the future of work for product builders by moving from capturing meeting notes to creating real, actionable, finished deliverables in real time. Their journey—from early AR/VR experiments to a privacy-forward, AI-driven productivity suite—shows both the promise and challenge of building thoughtful human-centered tools in the age of agents, LLMs, and workflow automation.

For those building with AI:
This conversation is packed with concrete technical strategies (prompt caching, agent architectures, RAG/retrieval synthesis), product philosophy (build for “people like us”), and practical lessons (embrace ephemerality, privacy as an architecture, experiment relentlessly), making it an essential listen/read for anyone designing the next wave of AI-augmented collaborative software.

Loading summary

Transcript101 lines

[00:04]
A
Welcome to Just Now Possible with Teresa Torres.
[00:09]
B
Hi, my name is Mark Barbier. I'm the co founder and CEO of Earmark. I'm responsible for sales, customer success, discovery, content or go to market.
[00:17]
C
Hi, I'm Sanon. I am the CTO and co founder responsible for products and engineering.
[00:24]
A
Exciting. If I remember right, you're a two person team. This is it.
[00:28]
B
Yeah, we actually have a little bit of help like on the go to market side and we have a number of friends that are just assisting in all sorts of ways. Mr. Bigger is included.
[00:35]
A
Yeah, big LinkedIn advocate for you.
[00:37]
B
Yeah, exactly. Yeah. But our backgrounds are. Both of us came from a company called Product Plan, which is a SaaS platform for product managers. So we've been building in the product management space for quite some time. And before Product Plan we were both at Mindbody, which is a large SaaS wellness platform. And we've just been really fortunate to, to have the careers we've had now, like the ability to build solutions for product teams and people like us these days I think we can't imagine not building for people like us. So it's been really gratifying.
[01:04]
A
And you mean people like you being product teams that are building products?
[01:08]
B
Yes, builders, you know, product management, engineers, engineering, leadership, designers, disciplines.
[01:14]
A
Yeah, perfect. Okay, and give me the high level. What does Earmark do?
[01:17]
B
Earmark is a productivity suite where the work completes itself. So what Earmark does is it listens to your meetings and in real time turns what's said into finished docs, finished work tickets, updates and next steps. Unlike generic meeting tools. So product teams can move forward without all the manual follow up.
[01:33]
A
This is actually really good timing because right before we started recording I was reading a new book that's going to come out by Barry O'Reilly. Do you guys know Barry?
[01:41]
B
Yep.
[01:42]
A
Yeah, so he wrote Unlearn. He was one of the authors of Lean Enterprise. He has a new book out for leaders and it's really about, I think it's called Artificial Organization and it's for leaders about how they should be leveraging AI. And one of the big themes he talks about is we spend so much time capturing and following up and writing notes and that this is definitely a task that we should be letting AI do for us. And especially like what you talk about to free up the. Okay, if we could just get all the inputs organized and sorted, maybe we'd have time to think.
[02:18]
B
Yeah, that's exactly right. The way that we think about it, there's actually this Microsoft Research piece that was really influential to us called where they basically framed this term called the infinite workday, where all day, every day is back to backs. Every 30 minutes is different, contextually, every audience you're speaking to, every 30 minutes is different. And the artifacts and the deliverables per audience is different. So that's like sort of State of the Union, right? You're in meetings most of the time, you're doing this high context shift thing. Real work is elusive, deep work is elusive. And then even when you get home, right, people are slacking you at 10pm and what ends up happening is folks are just in this fight or flight response or almost go malaise around, can I just do the thing that I actually need to do? And that seems really elusive. And that's something we're really passionate about as a service in terms of solving.
[03:02]
A
So it sounds like clear problem space. We all just have so much work that isn't really our work, the core of our work, clear target, customer of product builders. And really just out of this desire to build for people like you. People hate it when I do this and I'm still going to do it anyway. The product that comes to mind when I went to your website was Granola. It looks like you have a lot of overlap with what Granola does. I know you have a more targeted audience, which I like, and it looks like you do much more than just take notes in meetings. But I'm going to use Granola as my foundation just because I know a lot of listeners are familiar with it. So let's talk a little bit about what does your product do? And like, maybe start with is it the same as Granola? And then more is it different? Let's just maybe use that as a hook that folks are probably familiar with.
[03:51]
C
So Earmark basically is listening in to your meetings, but at the end of the meeting, besides just generating almost like your summary that you would normally expect, it really attempts to give you finished work before your meeting ends. So what I mean by that is think of product specs or think of tickets for Jira, or for linear, even things such as we recently just added a feature where, say if you're in a brainstorming meeting and you're talking about maybe, hey, could I add this thing to my mobile app? Like, we have just quick buttons like build with cursor or built with V0 so that you can actually prompt those and push that out into these tools. So, like during the meeting, actually capturing that context and being able to see these prototypes live, being able to see things like slides live I think is a really big differentiator. And I mentioned live, and that is another piece that we do have. This is a very live tool. So as you are speaking, you actually have essentially multiple agents that are running in parallel. So I can see things like, hey, I'm having a topic discussion with my engineers. I feel like this conversation is going a little bit over my head. Here's almost like this engineering translator for me, So I can almost have this thing spell it out for me in a way that I can understand where maybe I have imposter syndrome, and I might feel maybe a little bit embarrassed to stop interrupting the meeting and ask, hey, could you clarify this for me or not? Likewise, we have a particular user who actually really loves to use our product to give pointed compliments to his teammates. And as he's facilitating the meeting, it's actually difficult for him because it's like split brains trying to keep everyone on focus. But he wants to make sure to give, like, this perfect, like, bespoke compliments. And as he's going, it almost gives him in real time these things that he can read and make sure that he doesn't, doesn't forget.
[05:40]
A
So there's a first, a few things, like, if I understand right, your tool is running live. So if I'm having a conversation with the engineers, is it like, is it guessing what I might not understand? Am I asking it things about what I'm hearing? Is it like proactively trying to give you context for what's being discussed? Or do you have to ask it tell me a little bit about what that piece looks like?
[06:02]
C
Yeah, absolutely. So we have a number of templates that come with the product. So for example, the one that I mentioned, the engineering translator, it tries its best to almost preemptively pick out all of the engineering topics and explain them, but you can also guide. So we have this kind of concept of what we call Vibe docking. So as you can see these things coming in live, you can also steer it in a direction like, hey, maybe it's like this particular topic I don't understand, and then it'll refocus onto that.
[06:33]
B
Okay, when we think about our solution, this idea of, like, most AI meeting tools that are out there generates these generic summaries that nobody really reads. Right. That tend to be pretty low value in most circumstances. And the other piece for AI note takers, which is a big space to your point, they actually don't produce real work. Right. So those are two contrasting attributes. There's a big part of just equipping people to be like, knowledgeable and capable, like in the moment.
[07:00]
A
So there's this live component of you're in a meeting, you're having a conversation with someone. There's these templates that we can layer on. And it sounds like there's a diversity of templates. Do you want to give me a sense of what else is in your template base?
[07:13]
C
Yeah, for sure. So we have one of our favorite ones that we like to start off with is the acronym explanation. So for folks joining organizations who might have just a slew of acronyms, imagine if you were kind of a new hire and you're hearing these. The. This is something that you can input in an organizational setting. Be like, hey, here, here are certain acronyms. These are our own company's definitions of those. The make me look smart one is basically just tease up a question for you to ask in the meeting. Maybe you've. In a, like a really realistic sense, maybe you're in a meeting and someone pings you in Slack and you look away for a second, someone asks you a question, it's, whoa, hold on, wait a minute. So in a way, it's trying to make sure that you're always prepared and like always ready. We do have real time meeting, real time meeting minutes, action items, of course, but we're really trying to create a focus around the product manager role and also adjacent roles. For example, like the engineering slash, engineering manager or even project managers as well.
[08:13]
B
Yeah, Teresa, one other piece that's really important is the. We also have this concept of Personas which essentially enables you to essentially like introduce, let's say maybe your security architect is the most like in demand person, like in the company. Right. And they can only attend one out of every five meetings. What it does is it enables you to have an agent there that would ask questions that person would likely ask if they were there. So you can actually have a little bit more rigor around conversations and decision making. We have one for accessibility, we have one for legal right. So just this idea of, so if you're a PM and if you just wanted a little bit more scrutiny in some area, could you have a Persona basically help you as you go along and poke. Poke the right or ask the right questions to the audience you're interacting with.
[08:58]
A
Yeah, I think this really resonates too, because I put a lot of energy into developing what I now refer to as my personal operating system. I'm a big cloud code nerd. I've got all sorts of agents for anything and everything. And it sounds like A lot of your product development is we're going to do this for product teams so they don't have to figure it out themselves. And we'll just look at the thing that I'm hearing that is really impressive is just the focus on very specific use Cases of you're in a meeting and this thing is happening, you're distracted and you get asked a question, a technical term comes up that you're not familiar with. Like I love, I think in opportunities. And it feels to me like you are solving for very specific customer needs, which is really nice.
[09:43]
B
Yeah, yeah, exactly right. That's exactly right. Yeah, yeah. The other piece that that's really interesting is like we talk about it almost like an industrial design problem where you kind of start with the extremes when you're developing a product. An example might be if I'm developing, say a potato peeler, I'm going to develop it for arthritic hands and I'm going to develop it for children's hands, the extreme users. And if I get to that, those boundaries, it'll solve for the middle. So for us, as we're going through our market validation and addressing the ICP that we're solving for, we believe that if we solve it for that group, we'll be able to solve it for everybody.
[10:16]
A
Well, I'm trying to understand like the full footprint of your product. So we've got, I'm in a meeting, I've got these templates I can apply. I'm assuming it's like granola and then it's running in the background, it's hearing all the audio, it's transcribing it. Is there more?
[10:28]
C
We have these unlimited agents that are running with these tasks. One thing that that is currently in active development, which I'm actually just super excited about, is this what we're calling projects. Imagine being able to group like related meetings into a project, say if I'm working on a mobile app. But the real power of this comes when your team is also using Earmark as well. Because over projects we're able to have these agents essentially run asynchronously for you. And what I mean by that is, say, for example, maybe you're not in a meeting, maybe you have a team that is in a different time zone from you, the engineers. I mentioned a blocker. Maybe the project is going to get delayed up front. What if you could get a heads up or some sort of notification like, hey, like my project was on track. But suddenly like almost seeing this in real time as like the Project owner so and so mentioned that there was a blocker here. This is something that you should probably be aware of and address. So one of our big ways that we are trying to essentially market ourselves, we're not quite there yet is almost like this chief of staff for your delivery teams like your personal chief of staff there. So imagine almost like having these observers like when you come into work for the first thing in the morning, hey, this is what changed. Someone so and so on your team mentioned this. You should be aware of this. Or we're working on integrations right now with your slack or email. So maybe like hey, a vendor came in with a different negotiation rate than you thought was possible, so maybe you should focus on that. Or a conversation on slack bubbles up. That's where we're going. We're starting with meetings as our primary input source. But if you can imagine things like your email, your slack through docs and all of that.
[12:15]
B
Yeah. So Tree is like a true AI chief of staff that goes beyond automating deliverables. Right. So to San's point, like proactive task identification. Right. Being able to spawn tasks like as conversations happen on its own and then things like strategic thinking support. So we're talking about a lot of tactical execution things now but with deep research and more advanced models there's probably ways for us to basically queue up tasks that might be more maybe like competitive analyses, right. Based that come up in conversation and maybe you need it just in time, just in need output right around a competitive analysis for this other product you're talking about. That's one example. And you kind of go a little further. There's there when you have all conversational context across all meetings and have this multiplayer mode you can do things like portfolio reporting. Right. We think about just the answer of where are we today. That today is usually like you have a product manager, they're going to speak to all their delivery leads. Right. They're going to have a once a week or bi weekly or monthly in terms of are the trains running on time? Right. And the fact that like those things are just promptable now, right. This like heavyweight high activity thing that we just do and assume that's no longer something you have to really facilitate facilitate for like in the enterprise or in the future work. Yeah.
[13:27]
A
So this big piece around team collaboration seems like a really big one. I hear from people every day that's great that you can do all this stuff locally, but what happens when you share with teams and it sounds like especially with your projects that are upcoming There's a lot around. The more people that use this in your company, maybe the more value you can get out of this. Tell me a little bit about what exists today, what's your vision around this? And then I want to dig in, I want to dig into kind of some of the technical details under the hood.
[13:59]
B
Yeah. So the vision today is our approach for the product has been can we create a daily and essential tool for product folks where this idea of if can you not imagine not having this co presence like automating this work in the background. We're big Slack fans and like Stuart Butterfield had that essay, we don't sell saddles here. And this idea of like true behavioral change, can we achieve that state? And what's really interesting is for a lot of our early customers, they like, they can't. I think I mentioned this before, but they just can't imagine like not having like unlimited task agents in the background doing the work as conversations take place. So that's been really great. There's actually a couple other things too that are really interesting with the product in its current form, which is just doing tactical documentation and insights. Real time is early on we actually had a product that was web based that didn't support headphones and we had folks basically change all their normal patterns right. Of headphone use and they'd basically drop their headphones and still continue to use the product even with that limitation. So that's one really interesting data point. The other thing that came up in our earlier versions of the product was there was just no database. We actually had a completely ephemeral solution which basically provided it was just all in session and you would go and use the tool day in, day out and then you would basically export like all of the artifacts and transcripts, whatever, if you wanted to leverage it later on. And that's a huge limitation in the offering. Right. And the fact that people still used it. And then we run this, the superhuman like product manager or product market fit survey where you know, when you, when we ask people, you know, how, like how would you feel if it went away tomorrow? 78% of our respondents on our PMF surveys say they'd be super bummed if it went away tomorrow. So we have like good signal, like I know we're nowhere, we're not confident by default, which probably makes us good at what we do. But that's so some good evidence for us to build upon.
[15:48]
A
Yeah. What's really clear to me is that I love how specific your target market is. I love that You've identified really clear use cases. This idea of no storage and people still used it is a pretty good indicator of value. And I also love that. Tells me you focused on the value moment before you focused on is it what we would consider like a full product? Which is entertaining as well.
[16:10]
C
What's really interesting about the no storage aspect is that a lot of our enterprise prospects actually saw that as almost a feature in a way to easily get in the door to actually start an evaluation. Oh, like you're telling me that you're not going to store, you're not going to train. We couldn't even do it if we wanted to. So that actually has helped us get in the door versus we thought it was initially like, that was like, okay, let's focus on the product side. Like we could do the database first, but that's going to be a big lift. It was almost a very nice surprise to us as we went on.
[16:42]
A
What's cool about that too is so many products end up being limited by their data model because they get designed before you know what it's going to actually end up being. And what a great way to just. It makes me think of Magic Patterns, which is the AI prototyping tool that literally just builds a front end and therefore it works way better than a lot of the other tools because it's not making the wrong data model. And then you ask for something different. That's amazing. Okay. From what you've described, your product has a pretty big footprint and I can imagine it just getting bigger and bigger over time. Tell me a little bit about, like, how did you decide where to start? What's the first thing that you did?
[17:22]
B
Yeah, this is a crazy story. So we actually started the initial product version of the product was an immersive AR VR experience for the Apple Vision Pro. The original concept earmark was basically improving communication skills for product and engineering leaders. You think servant leaders need to influence others above all else. Like how do we enable them to rehearse and just get better at communicating concepts? So you like the experience is fully built out. Right. Modeled conference rooms, auditoriums. You could take photos of the conference rooms at your own work if you wanted and have that mapped out as an environment. And then you could cycle through Google Slides real time and then it would provide real time feedback and prepared you for like when you think about prompting about breathing, because I'm just rambling and I haven't breathed in two minutes, they would do that. It would prompt you to speak up if you know how to enunciate better. So there's all these things that we were working through. And then we went through a bunch of discovery with some early stage products, our product, and conducted like 60 interviews with product managers, communication coaches and all these adjacent Personas. And the key learning we got from that, like all those conversations was that few people actually prepare for presentations. So we created a preparatory tool for people unwilling to prepare. And then the other piece that was really interesting is obviously this is all, it's all in the rearview mirror now, but the Vision Pro is total addressable. Market obviously was pretty terrible. And we joke that if we had done everything perfectly with that product, we would have made about $500.
[18:46]
C
Yeah,
[18:49]
A
the Apple Vision Pro, maybe it will turn into something compelling, but yeah, tough market space. Okay, so you pivot from Apple Vision Pro to where you are today.
[19:02]
B
Yeah. So we took the basic concept of real time feedback and insights and then we ported it to a web based solution because we figured web as a channel, like broadest possible reach. And then the idea there was okay, if we can deliver real time insights to make product people more informed, like in the moment, that was the thread that we were pulling as a service that we're trying to validate out in the market. And then we've gone through five major product iterations, like in the idea maze. And now we just have a much more refined experience based off of daily customer conversations. Like we have everybody in slack threads. We're talking to prospects all day, every day. Right. So we're just kind of inundated with conversational reps to understand like where what we need to do from a roadmap perspective. Oftentimes customers will reflect back to us language that is actually better than our own positioning. Right. When they describe the product experience. So that's been really great. And then what's really wild is today we basically have the product we always wish we had back in when we were engineering and delivery leads and product leads. And there's this Dalton Caldwell quote from Y Combinator where he always says that the best pivots take you back home. So we started with the Vision Pro product doing rehearsals, right. And then now we have a product that automates the deliverables that makes up the majority of the drudgery of the roles we have.
[20:13]
A
Yeah, I really like that. Okay, let's get into some of the technical details. Okay, so we start I. We started with Apple Vision Pro. When you moved to the web, what was like, what did your first version? What's hard about version is I'm Sure. Your version is continuous, but give me a sense, like in your earliest days on the web, what did this product look like?
[20:34]
C
We actually had the transcript take up around 50% of the screen. And it was just like an assumption on our part. It's okay, people are going to see what they say. And we noticed that because it was forefront and for center that a lot of our users fixated on the transcript. For example, if the transcript misspelled someone's name, they'd be like, okay, like I want to go back and be able to edit and change that name, or I want to clip these things out. And we realized, what if we can actually just solve this like through ux? Like, what if we just minimize like that transcript into almost like a subtitles type field? So you can see that it's working, you can see that it's going. But transcript tools, like, even today, as powerful as they are, they're not a hundred percent accurate. And we didn't want our users to just focus on that because the LLMs are actually really good at inferring what it is that you should or think it is saying. And it'll actually do that correction for you. And just switching that to that UX a bit has completely removed all of that feedback. That was just one thing that I thought was interesting.
[21:39]
B
Yeah. And even things like speaker attribution, right. Shouldn't you label everything? And since we have control audio, right, that we're collecting audio source data from, we can't actually reliably do speaker attribution because there's no bot or agent joining the call. But it's just, it's just really funny when you expose something that you open it up for scrutiny when it actually doesn't solve any problem really for the customer.
[21:58]
C
Which is really interesting on a technical note, because we're not a bot and we're not connecting into any APIs, we would have to rely on speaker diarization to get those names. And that stuff, even today is still not accurate. And what we found is that if we were to enable that, passing in a wrong name is actually worse than passing in no name at all. Because the LLM can infer almost like who was saying what, or different topics or different threads. But if you seed it with the wrong information, it will go off of that and then that will get compounded. So today that's. We actually don't have anyone's names in here. Now you can see them from the context where you can get them from the calendar, in which case those are correct, because that's kind of your source of truth. But we don't do any sort of speaker diarization because it actually makes it
[22:47]
A
worse in terms of live, in terms
[22:50]
C
of the quality of the transcript.
[22:52]
A
Afterwards, you can do some more sophisticated matching. Who is in the meeting, get the name right, infer from.
[22:58]
C
Correct.
[22:58]
A
Who said what kind of stuff.
[23:00]
C
Yes.
[23:00]
A
Yeah, fascinating. Okay, so let's get into like what's under the hood here. You mentioned there's a number of agents. It looks like one of your. Each template is tied to an agent. Give me a sense for just from an architecture standpoint, how does this work?
[23:14]
C
So we're using, currently we're using an OpenAI right now for a main LLM source. It's plumb to use just a variety of different tools. So I actually bounce around between different models. When new models release, test those out. But right now, what's actually occurring is we're actually making heavy use of prompt caching because it is live and we're having to take these kind of raw, unstructured transcripts and turn them into structured artifacts. Is that as you can imagine, if every 30 seconds I'm taking the aggregated transcript and sending it to an LLM. That's a lot of tokens. So for comparison, an hour meeting transcript is about 16,000 tokens compounded. That could be a lot within the hour. So what we're doing is so we're sending these transcripts, they're actually powering the multiple artifacts. It's only kind of one history in there. But because prompt caching is really supporting us, we're actually only having to pay like that. Delta. And prior to prompt caching, the economics of this tool were actually completely untenable. There was like one point in a very early version where we were like, it was around $70 for an hour meeting. In terms of API cost, we were able to get that down obviously to like sub like a dollar. But what's really interesting is like right now, like our highest costs are the actual transcription costs over the AI costs. So that has been really fascinating as we've built this out. Almost like building into the future. It looks almost not possible. And these models and the concept is going down. The intelligence has been going up. So that's been really interesting to see.
[24:58]
A
Yeah, let's dig into this a little bit. So I. First of all, you're using OpenAI models. OpenAI models now support prompt caching.
[25:08]
C
They do. Some of the models do support prompt caching. We are. I'm currently using GPT 4.1 as our default, which sounds crazy because that's actually a legacy model today. What's really interesting about that model is we have found, and our customers have found is that model actually writes with pros that our customers expect in GPT 5.1, which is no longer 5.2. Now it still loves to give bullet points, almost like an outline format or bullet and bullet nested points. So when folks are reading these summaries or the reading, reading these things in real time, what actually ends up happening is you actually get these really long lists. And we've tried a lot of different techniques for prompting, being able to reduce that, but I still find that GPT 4.1 is the best overall. It's just writing really nice sounding words. It's not good at the nuance that 5.2 can actually extract from a meeting. But using the GPT 4.1 model as at least the writing model to deliver the final output has been good for us. In terms of what our customers have gravitated to
[26:24]
B
is that since there is a multimodal architecture that sort of future state, the ideal solution for us would be basically mapping the right model to the type of artifact that we want to generate and then having enough just, and just having the flexibility in the platform to map those things. And then when there's like foundational model changes, it's easy for us to move wherever we want to CNN's point, we actually went live with 5.2 right. With like the, the multi bulleted embedded like long form format that it would generate. And first like in our own internal evaluations we're like eh, that's pretty good.
[26:56]
C
Right.
[26:56]
B
And then we got it out there and yeah, it just lots of learning. Right.
[27:00]
C
For model release we have to pull that going back.
[27:03]
B
Yep.
[27:04]
A
Okay, so let me make sure I understand. So obviously there's a speech to text piece and that's what you're talking about, Sanon with transcription.
[27:12]
C
Yes.
[27:12]
A
Much more expensive than just a normal LLM call that's happening real time in the course of the conversation.
[27:20]
C
Yes, that is there is a speech to text that's occurring real time. We are using Assembly AI to run those and every 30 seconds or so sometimes we have a variable slider on this. Well actually we'll just batch up the delta in the transcripts since the last interval and send that on to the thread which that thread has been prompt cached. So in terms of the only additional cost is that new, that new transcript delta and then that will power each of the individual artifacts which you could have multiple going at the same time, because the prompt for each artifact is different and that also gets appended to each transcript thread. If I'm kind of mirroring that correctly with my fingers here.
[28:04]
A
Okay. And I don't think I've had an episode where we've gotten into prompt caching. I do have some experience with this because I use heavy prompt caching in my interview, Coach. I use the Anthropic API, and at the time that I bought it or the time that I built it, I don't think the OpenAI models had any prompt caching. Is it similar to Anthropics?
[28:27]
C
It is very similar to Anthropics. You are correct. Anthropic did come out with prompt caching first. And it was, I want to say it was a few months or so before OpenAI came out with their version for some time. Yes. With Anthropics, the API, it was a little painful because you had to specify exactly where it looked with the breakpoints, exactly where and what you wanted cached. The OpenAI one, actually, you don't have to specify breakpoints at all. It'll automatically do that on its own on their servers. The only thing that you have to make sure is that the message history that you send has to be identical to what you've already set previously. Otherwise that cache is then cleared and then you're paying for the full history.
[29:06]
A
So you've got this one thread of speech to text writing a transcript. Every chunk of new transcript, you're sending it to OpenAI, it's getting cached so that you're only hitting cache to cache reads on your subsequent calls. I imagine the thing that's in my head, you just said you have to send the exact same thing to start. I know on the Anthropic API, you can say it's cached up into a certain point and then everything after that is the new stuff. If you have different tools and different tools are using the same cache, what happens when there's a new chunk of transcript? Are you able to fork the conversation, add that part, and then all the tools now start to use that part. Walk me through a little bit, the mechanics of how that works.
[29:57]
C
Yeah, that's a great question. So right now I only have essentially one. Basically one tool is responsible for essentially sending that history and getting that aggregated, basically whatever artifact that you want back. And the reason why is because exactly how you say this, that we wouldn't want multiple tools sending the full transcript back, otherwise we're going to have to be paying for all of these different threads. So ideally, if we have one tool or one agent responsible for doing that, then that can then go ahead and pass along almost like a summary or a much lighter version of whatever that is to, to get to the close respective answers. But yeah, that, that really allows us to actually cut down on the cost. Like I mentioned like earlier was like the 70 plus dollars. That's when there was no prompt caching and there was like a tool per card, almost like you saw. So each one was just. OpenAI had no knowledge of that. There were like 10 cards but it was still consuming all of that data. So that's where it got really expensive. So now they're all shared via this one and that's really helped.
[31:09]
A
So it sounds like we've got a transcript. This is the bulk of the tokens. And I can imagine over the course of an hour meeting, two hour meeting, this is just going to grow and grow. I'm happy to hear OpenAI has prompt caching. I was shocked that they didn't when I was looking into it. It saves me a ton of money. I'm still a little bit confused. And maybe this is maybe. Does OpenAI allow you to have a part of the prompt that's cached and then a part of the prompt that's not cached? And you can append to both because I guess here's what I'm trying to work through in my brain. You've got a meeting going on. Somebody decides, I want to run the engineering translator. It has its own prompt that has to use that transcript. And then maybe at a different point in the meeting, somebody says make me look smart and they use that tool. Like you said, there's one agent that's managing both of those. But it seems like you would have like transcript chunks interleaved with tool instructions or are all the tool instructions just part of your original system?
[32:11]
C
Oh, I understand what you're saying. I get it now. So I think one thing to. I think the biggest point here is what I'm actually doing and what I discovered is that giving the LLMs too much context actually produced worse results. So I think what you're saying is if I set the engineering card and then that engineering card would be passed into the history and someone else was like, maybe it looks smart and then that's passed into the history. What I'm doing is I'm actually not passing those into the history at all. So the history is just comprised of the transcript, which is the source of truth. So what OpenAI is doing is it's looking at this history, it's determining, okay, the next call, is there a match. I can match the starting point all the way up to when it was last called and then the user, and then I basically add on, okay, what is the particular card that we're asking for? So that part's not cached, but that's not sent into the history at all. And the reason why it's not set into the history, I'm glad you made me clarify this is because if we're in a meeting that's live and so this was a good learning for me. By the way, if we're in a meeting that's live and say I just, okay, give me a summary. But we're at the maybe 15 minute mark of a 60 minute long meeting. It's only going to give me the summary of the first quarter of that meeting. But because it's in the history, everything else is going to get biased off of that early response. So at the end of the meeting when I asked for a summary, it actually gave me a really condensed summary because it didn't really think much happened. Even though the entire transcript was there. It was being like, but you have summary over here. And I'm really focusing on that because of all of these keywords that are in here. And because of that I just decided maybe for every 30 seconds, for every card there's it's just going to be fresh content. I don't send that into the history. As far as OpenAI sees, it's just sees the transcript and it sees like that top level card and it gives you a fresh result every time. I found that that actually gives you better results not only at the end but also mid meeting. And then you're right, that enables me to do prompt caching. If we were to add that into the history, you're 100% right. Like it would totally blow up the prompt caching bit.
[34:31]
A
I do this a lot in my interview coach, because I do, I take one transcript and then I run a whole bunch of LLM calls against it and I don't want to pay for all those tokens. Multi a transcript could be 30,000 tokens. But this is the interesting stuff that I know when I was learning this I. You don't think about these things. If you just use ChatGPT in the window, you never have to think about, oh, this is going to muddy the conversation. This is going to be, it's going to be skewed by this earlier summary. But what I'm learning is that Just by using these tools, a ton of you actually start to see these problems in your usage. And then when you go to build your products, you have this intuition of I can see why it's getting it wrong.
[35:08]
C
Exactly. Yeah. It's a lot better to give the model the specific content rather than just stuff it with as much content context as possible, which yields worse results. Which actually is our biggest technical challenge in our upcoming feature, which is just search and retrieval.
[35:29]
A
Yeah, do you want to talk through that a little bit?
[35:32]
C
So, as I mentioned, like the biggest problem that we have is finding the right context. And what we learned is that RAG is just not enough for what PMs or anyone on the delivery team might ask of us. So for example, okay, let's get specific
[35:47]
A
because everybody uses RAG differently.
[35:51]
C
So imagine that I am wanting to query, maybe ask a question over a series of multiple meetings that I've reported in yearmark. Now let's say I ask what day do we agree to ship the mobile app using rag? We'll say vector database. Maybe I've chunked maybe 500 or so tokens. Just super naive. That would actually do a pretty good job at finding that answer because it is a really specific answer. It exists in the transcripts. It's looking for words that are there, like mobile app, like version number. What gets really interesting is a lot of our users are actually wanting to do more analysis based questions. So for example, how can I improve my discovery calls over the last month? And what's really interesting about that is that the answer to that actually does not live in any of those transcripts. And so if you were to use RAG and you were to do like the cosine similarity, it would actually come back with nothing really, unless that was specifically mentioned somewhere. And so what we're finding with that is that what really excels, there are reasoning models. So if you could put these transcripts in the reasoning models, that can give you that answer. And that's great until you realize, oh, I've suddenly exceeded the context window because these transcripts are so large. So if you have 10 meetings, 20 meeting, cool, great, that's possible. But if your team is using it or you've been using this for a year, suddenly your answers are going to degrade. And I think this is like the biggest problem is like how do you find the right transcripts to pull in so the model could actually run its analysis on it? And Dan Schipper came out with this great article the other week, which was agent native architecture and just a quick tldr of that is basically agents using tools in a loop where essentially your features are your prompts instead of actually hard coding workflows into your products. And in a way this is what Claude code does. This is what Codex does, where instead of using rag, and they do use rag, by the way, they like that is a particular source, but they primarily use search. So they like search through your code. The agent uses a variety of different tools to find relevant files, find relevant functions, which is great essentially for if you use like a clock code on it, like GitHub, repository of relevant documents, great at search. So what we're looking at is what if we can essentially replicate almost like that flow like those agents using these different tools in a way to search over transcripts. So now imagine maybe we do use rag. That's for very specific cases where that keyword or those phrases are used in, in the, in the actual transcript. BM25 that actually does a really good job too at finding keywords using metadata, like for meetings over the last month. Great. We can use, have an agent actually create a query to query the database and pull those in and even leveraging like bespoke summaries at ingest time. So depending on who our user is. So if our user is a product manager, maybe we can anticipate that they might ask, okay, but the status on like this type of thing and actually having summaries built in that almost anticipate what someone might ask and actually having that as part of the search. And now you can imagine, okay, the agent is using all of these different tools and then it can find relevant transcripts and then pull that and then do an analysis over that. That's currently what we're working on. But that definitely has been our hardest problem is okay, great, if you have a few meetings, this is easy, we'll just put them in the context. But once you exceed that context wind, it becomes a really challenging problem. It's almost like agents are moving to. I don't want to phrase this. It's. It's as a customer, like you are the, you are the user of the APIs, but the agent is also very much a first party user of the APIs. And designing and architecting your product in that way is the way to go, I think.
[39:57]
A
Okay, we are getting into my favorite part of building AI products. This has come up a ton on this podcast. I think there's two pieces to this. I'm going to be the language police a little bit because I see a lot of like Twitter wars about this rag is retrieval augmented generation, which I know, but people equate it with vector search. Right. And so I think what you meant initially is vector search didn't work for you.
[40:23]
C
Yes, I am sorry.
[40:25]
A
I would argue that like Claude code grepping and using awk and so that's all retrieval augmentation. There's just huge Twitter wars about this. So I wanted to be the language police a little bit. Lots of people have shared, I think because vector search became like a hot topic in AI products. Lots of teams have come on this podcast and shared. They started with vector search. It did not meet their needs. Turns out keyword search is really great for a lot of specific things. It turns out agentic searching through bash commands is really powerful. The piece that you ended on is my favorite part of AI products and this idea of we have a large volume of data, we need the LLM to be able to synthesize across that data, act on that data, find the right data. And many of the teams that I've talked to, there's two episodes I'm going to call out for both of you and also for other folks that are interested, the Zen City episode and the Incidentio episode, both of them get into this idea of like layers of data. So in Zen City, they take in millions of data points from city residents about what they think about their city and how their city is run. And then they have a chat window where a city council person can type in anything. Okay, you can't just have an LLM search millions of data points. And they basically built this data pyramid where the base is raw data points. And then on top of that they start to generate insights. And then on top of that they start to generate like a theory. And then on top of that, and it's just this, like, pyramid that helps the. Gives a smaller search space for the LLM and then lets them walk it all the way down to the underlying data, which I just. I don't know, maybe I'm just like an information structure nerd, but I love this part of it. And I could see with your transcripts, there's this. The piece that like lit me up about what you said was this idea of we can anticipate the types of summaries they might need, we can pre create those and throw it back in the knowledge base for the LLM to work with.
[42:30]
C
Yes, I love that you mentioned the pyramid. So, like in how we would relate, that is, our transcripts would be at the bottom. Like, we view the transcripts as the source of Truth for the conversations and as you just mentioned, like at ingest time. So imagine, okay, my meeting is over. Great. Could we have some models running for you in the background that are adding to that context, like just that information that you would care about. Maybe it learns over time or maybe we just take a first pass guess based off of your role of what you might ask. So it could look there first, so it could pull in like a smaller document versus having to pull in from the transcript. Like ideally we can find the answers at the top of the pyramid, but if we do need to, we can walk all the way down to the transcript. And what's really interesting about that is as you can imagine, what we're building is not just like a chat bot because these answers are going to take a considerable amount of time to come back. And what we're finding is because our offering is like, hey, we're going to try and give you either completed work or more realistically like you're 80% there drafted, you're going to get this back. We want you to approve it before, before it's sent out. We think that's really important that people are actually okay, waiting for that and almost like spinning up these asynchronous tasks and then going on their normal job or normal day to get back these better high quality answers. So I'm actually glad that things have shifted that way because if speed of response was super important, like people were expecting it like under a second or a few seconds, then a lot of that would not be possible.
[44:14]
A
Oh, give me an example of you mentioned this early on and I didn't follow up on it, this idea of doing the work for them. So like I know on your website you talk about you turn meetings into actionable work. Maybe give me, walk me through a scenario of what's the type of artifact you create. And is this different from the templates in your meetings?
[44:34]
B
They're both Right. So the templates, it solves for basically the clean sheet problem. Where do I start? And what it does is it's basically an interface prompt to get people to a magic moment in short order. So that's what the templates are for. Now the archetypes artifacts that it produces are going to be, I think Sienna mentioned this earlier, there's going to be use cases that are organized by different jobs. One might be run a meeting, one might be artifacts for any SDLC or pdlc. Right. Maybe it's strategic documentation, maybe it's something like incident response. And there's incident response templates as well. But just the idea of you have multiple templates that support those jobs and you can select whatever you want. Now, that's one portion of it. But what's really interesting is once people get into it, once they get a taste of the templates themselves, they. They migrate to prompting exactly what they want as a base behavior. Right. So we see like first time use maybe first two weeks, month template, template. Or they'll do pedestrian tasks like give me a summary. But what ends up happening is people get more and more experimental with the solution. They'll do things like generate presentation from these three conversations, sprinkle in some emojis. Right. And we'll create slide content for them, which they'll take and put it into. Into gamma or whatever their presentation tool is. So that's a really. That's a workflow that we see emerging, which is really fun to watch. The other thing is that since you have unlimited task agents that you can spin up at any time, people like, there's no pressure to get anything like 100% accurate. Right. When you think about the experimentation. But there's. You can vibe with the doc or an artifact at any point, but I can just keep re prompting to get the artifact that I want. So the system almost like welcomes high, like high repetition or iteration rate, like as people create artifacts themselves.
[46:23]
A
I see. Okay. So I can start with a template, but I don't have to start with a template. I can also just create my own custom prompts, my own custom agents to do the types of things that I need. So if I'm going to have a conversation with my engineers about what the underlying data model for a feature might be, we can just talk through it. Maybe there's a whiteboard. Maybe we're just describing it. Maybe we take some pictures. I'm not sure if you do pictures. And it'll document our decisions and what we decided and maybe even turn it into requirements for me.
[46:54]
B
Correct? Yeah. That's like the idea is unstructured conversations to workable artifacts that enable product development, as an example. Right.
[47:04]
A
Very cool.
[47:05]
B
We think about the jobs that we're trying to solve for too. You think about creative writing where you might have a very specific style. Right. And there's no LLM that will basically replicate you and the essence of you. Right. I think for technical artifacts, as an example, that's a little bit less of a requirement for us, even though we're working through some things that basically enable context that will create like something that's more stylized, but that's Just an example of kind of like where our sweet spot is. Like, these artifacts are actually maybe the most LLM compatible, like, out of any artifact you can generate.
[47:37]
C
My favorite way to use the product is when myself and Mark are just having a brainstorming call and we might be talking about a new feature, and earmark is transcribing that. And in the middle of that call, just pulling up, okay, give me the engineering specs. And that artifact is unique because in there is a button to build in cursor or build in v0. So what's really interesting about that is, while we're still meeting, we actually have a V0 prototype of what we were talking about that we can. That I can screen share and we can riff on. Which prior to that, we would have to be like, okay, like after the meeting, all right, I'll wipe up something, then we'll send it over. We might continue that conversation on Slack. Now we can actually do that all in the meeting and just get a lot more done and go on.
[48:25]
B
The best way to think about it is just cycle time, right? You know, what used to be a week in duration or two weeks in duration, now it's just immediate. So if I get to have a kickoff, can I actually have all the artifacts that were required from the kickoff ready by the end? Right. And that's like the great unlock for our customers.
[48:40]
A
So let's get into the. I think this is a good point to talk about just quality and how you're measuring quality, because it seems like a meeting could be about anything and everything. It also seems like accuracy is pretty important. We talked already about speaker identification, but I imagine hallucinations would be a pretty significant problem in this context. And then the other thing that stands out to me about this is, like, the precision in terms or the precision in requirements. There's things in this product space in particular, it can't just get the field name when you're talking about a data model. And I feel like this is where LLMs get a little bit lazy and they assume they know what you're talking about and that go off in the wrong direction. So tell me a little bit about what you do to just manage, like, the quality of the agents in the system.
[49:32]
C
Yeah, that's a great question. Hallucinations. The biggest thing that we found with hallucinations is giving the LLM an escape hatch, like, specifically in the prompt, like, if you don't know the answer to this, I know this sounds super, super basic. Or if this. The particular answer doesn't exist. In the transcripts say that be like, hey, I don't know what the answer is. What's really interesting about that is that if you've tried to force an answer, like with the prompt, it's going to give you an answer, and that answer probably is not going to be the correct one. So giving an escape hatch is important at essentially every layer that we go. That's part of kind of the prompts that we do in terms of evals might be surprised by this, but a lot of our evals are actually, it's human driven right now. So I have a series of meetings that, that I have that I run as a test on, on every update that I know. And I'll basically just look through a series of artifacts and read, make sure that those are okay, make sure the style and the prose is okay. But we do rely a lot on customer feedback. When we do releases. We have analytics around usage of our product. So for example, hey, are people actually taking the artifact and copying out the response? So today we still can't see the response. Because actually, even today, like, even though I talked about all of the search and retrieval and stuff that we're still working towards, the database piece should be out in two weeks. They missed less words. But, but because of that, because we can actually see that data, we're having to rely on, okay, like, how are people using it? Are they copying these artifacts out? And if they are copying these artifacts out, that's actually a pretty strong signal to us that they find that artifacts valuable. So if that number does regress, be like, okay, like something must have happened. But yeah, we don't do any evals currently with any production data. One, because we can't. But two is, it's also kind of like a nice stance that we like to take around privacy, especially because our customers are enterprise. We talked about that, that ephemera mode. But when we do build out the more robust like agentic search things, we definitely will have egos, but those will probably just exist on kind of my own environments.
[51:48]
B
Yeah. One example, Teresa, is we would never do a wrapped campaign, right? We wouldn't like re reflect people's usage or analytics or patterns or habits and create a marketing campaign from it. Never.
[52:00]
A
I actually think there's something really nice about. I know the ephemeral thing started and it's just a function of how you. Your order of development. But in some ways I feel like our tools should give us more autonomy on what gets stored and doesn't get stored. So today they have to copy it somewhere. I can imagine it'd be nice if they have to choose to save it or choose if they want it autosaved. So many business models are based on collecting as much data as possible that they just don't do that. But there is something quaint about that, which I appreciate.
[52:31]
C
I'm glad you brought that up because it really is. Privacy is really like our architecture in a way. So even though I do mention our next big push is obviously storing data, we give users the option to essentially enable temporary mode for every meeting that they have. And unlike how you might OpenAI does, where they still write that data is open and they hold it for 30 days, it actually completely bypasses our servers. Like we don't even have a record of that. That meeting occurred on our own servers. And we think that part is really important. And then likewise, we do have an organization setting too. If a particular org came in, they can actually enable that for everyone and then no one can actually disable the temporary meeting. So it would work like how it works today, where nothing gets written to the database. They could use it, but it stays on the machine that required to export it into another tool to save that data.
[53:26]
B
Yeah, and a big part of that, Teresa, is we've sold enterprise contracts with basically security considerations already incorporated as part of that language. So we have to keep that commitment.
[53:36]
A
On the hallucinations front, what's interesting to me is it does seem like as models have gotten better, this problem is fewer and fewer. And Sanon, your description of just giving it an out like an escape hash in the prompt, I think is spot on. I'll share a second tip that's worked almost 100% for me, which is to force the model to create proof of work. So if it's pulling from a transcript to return the line number or to return a timestamp, if the transcript has timestamps, that does two things. One, you can have a separate agent, then check to make sure that at that timestamp is that thing that the. And you can throw that at 4 oh mini or it can be a tiny model, but I don't even think you need to check it because I. I don't know that I've ever found mistakes like telling the model to provide evidence is enough for it to then stay grounded in the actual documents.
[54:29]
C
That is a great suggestion. And it's not only I feel like not only is that a great suggestion for hallucination, but that also is really good UX wise, because as a user, if you can almost see attribution you suddenly trust the products a lot more. And that's actually what I love about clockto. Like as it's running you can see, okay, like what files is it pulling from? So actually mirroring some of that behavior like in the pro in the product where it's not like all behind the scenes. Oh, here's your answer finally after 20 minutes. Okay, where'd you get that from? Like actually showing as it's occurring. Okay, like I found this meeting and this is relevant why to your point, like, okay, here's like the snippet of the transcript and essentially being able to show your work so it can gives the user that extra sense of validation. Okay. I can gauge how well it did or didn't do.
[55:17]
A
It's trust.
[55:17]
C
Great suggestion.
[55:19]
A
Yeah. And then on the evals piece I actually evals now are like rag. There's a war over it, which is silly. I think customer based evals are the best. Did it work for your customers? If your customers are happy. That's awesome. For me, what I found for evals is sometimes I get these like really stubborn behaviors that they're not simple as just make a prompt change. That's when I take the time to can I measure this in code so that I get a better sense of observing it? Obviously until you have storage, it's hard to do that. But I do work in an environment like I do my interview coach is integrated into a third party tool and I'm not allowed to store any data from my processing. I can store like metadata on the API call, like how much it costs to manage expenses, but I can't store the transcript or the LLM response. And it's a really tough environment to operate in. But what I do to work around that is when I identify like a persistent bug that I just need a good measure of. I can run a production evaluation and store the results of the eval as long as I don't store any of the content of the transcript or the LLM response. And so that's one way I like will roll those in and out depending on what I'm trying to solve for.
[56:39]
C
Yeah, that's a great tip. Almost. I love that. I love that idea. Not being able to see the actual data, but at least having the results of the LLM for you. The LLM is a judge approach. Yeah, that's a great tip. And then honestly a big part of it is that it's also come down to like our bandwidth at the moment. Like we would love more granular evals. We would Love. More confidence in being able to experiment, like, with model changes. Like, my fear around changing the system prompt is real. That would require a lot of testing on my end. But I think, like you mentioned, like, a big part of our focus has been trying to nail the user experience habit down first and then like backfilling those. Almost like finding out, okay, like, what works? And then, okay, like, now that we know that works now let's make sure that's robust and full of evals and then we can continue to iterate on it versus starting with evals has been more challenging. Especially because if we start with evals and then push it out and realize, okay, no one's used that, then we're like, oh, no.
[57:41]
A
Yeah. I almost think about evals as firefighting. Like, I don't write evals for everything. I wait for there to be a problem and then if I can't easily make the problem go away, then I go, okay, I have to get a better measure of this. I'll put in some time. I think when I was learning about evals, I had a very naive approach that, oh, this is like unit testing. I need it across everything. And then I quickly learned, no, that's expensive. It's a lot of effort. Don't do that. It really is, like, for the most important things that are stubbornly persisting. How do you measure it so that you can run lots of experiments?
[58:13]
C
Yeah, it's a great way.
[58:15]
A
Let me ask you this. I know we're coming right up on time. What's next for Earmark? I know you already previewed your projects and your search. Is there anything else that's coming up that you want to share?
[58:29]
B
Your near term vision is basically continuing building this AI chief of staff that goes beyond automating deliverables. This vision of second brain for product teams that kind of goes back to the proactive tasks, right? Spawning tasks automatically, adjacent team conversations, all that kind of stuff. We talked about the reporting pieces, but I think the broader vision for us is this idea of can we create this prolific, like, incredibly capable chief of staff experience, right, where nothing falls through the cracks, that the confidence and deliverable quality is there because we have the right scaffolding in place to understand, like, whether or not we're delivering things that people actually want and we want to make our customers look amazing and credible. To Sienna's point earlier around, like imposter syndrome and like what product people especially have to deal with with all the different audiences they're interacting with, that's a huge thing that I think we can solve for. And then just the idea of can we provide the feeling of being of comfort, right? And just folks feeling truly supported in these servant leadership roles. Right? Or just R and D roles in general where it's a zoo, right? R and D is hard and I think if we could just help again going back to the beginning, that audience, people like us with the tool that we've always wanted, that's where we want to be.
[59:41]
A
It's so clear to me that you're very mission driven and one of the things I love about what AI is enabling is that a two plus person team can have such a big impact. So thank you so much for taking the time to share your story. I really appreciate it. If you enjoyed this conversation, please subscribe in your favorite podcast app and give us a rating as it helps others find the show. Thanks. I appreciate it.