
Simon
Broadly speaking, I'm really bullish on CLIs. I'm still bullish on MCPs in a certain environment. I think MCP is really great for when you want like a narrow, lightweight agent. I think there's definitely a lot of use cases where you don't want like a full coding agent with the compute runtime, and also you want it to be like more tightly permissioned. MCP inherently has a really strong permission model. Like, all you can do is call the tools. MCP is just like the dumb simple thing that works, and it's pretty good.
Sarah
Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP insofar as other people are using MCPs, regardless of our perspective. We've put a lot of effort into our MCP and we have a fantastic team that we're building.
Alessio Ponder
Hey, everyone, welcome to the Latent Space podcast. This is Alessio, and I'm joined by swyx, editor of Latent Space.
Sarah
Hello.
Simon
Hello.
Host/Interviewer
We're back in the beautiful studio that Alessio has set up for us with Simon and Sarah from Notion. Welcome.
Sarah
Thanks for having us.
Simon
Thanks for having us.
Host/Interviewer
Yeah, congrats on the launch recently. Custom agents. Finally it's here. How's it feel?
Sarah
We ship things slowly. So it had been in alpha for a little bit, and at the point at which it's in alpha, there's a group of people making sure it's ready for prod, and then there's a group of people working on the next thing. So sometimes some of these launches are a bit of delayed gratification. It's quite nice to remind yourself of all the work you did, because we do have a habit of being two or three milestones ahead, just because you have to be. You can't get complacent. But it's been great that people understood how this is helpful. And I think that's just easier in general building AI tools today than it was two, three years ago. People get it, so that user education is easier. It was our most successful launch in terms of free trials and converting people and things like that. It was really successful. But there's a lot to build.
Host/Interviewer
Making it free for three months helps.
Simon
Yeah, it was definitely super exciting for me because it's probably the fourth or fifth time that we rebuilt that.
Host/Interviewer
Yes. And you've been building this since 2022.
Alessio Ponder
Yeah.
Simon
Even right when we got access to GPT-4 in late 2022, it was: okay, let's make an agent. We used the word assistant at the time; there wasn't really the word agent yet. We'll give it access to all the tools that Notion can do, and then it will run in the background and do work for us. And then we just tried that many times, and it was just too early.
Host/Interviewer
I need to force you to like double click on that. What is too early? What didn't work?
Sarah
We were fine-tuning: before function calling came out, we were trying, with the frontier labs and with Fireworks, to fine-tune a function-calling model on Notion functions. This is right when I joined. I joined because we needed a manager; Simon needed to be able to go on vacation. That's around when I joined. So you can speak much more to it.
Simon
Yeah, we did partnerships with both Anthropic and OpenAI at different times to try it. When we first tried, there wasn't even a concept of tools yet. We designed our own tool-calling framework, and then we tried to fine-tune the models to use it over multiple turns, because it didn't work well out of the box. The models were just too dumb, and the context length was also way too short. We just banged our head against it for a long time, unfortunately. There were always glimmers that it was working, but it never felt quite robust enough to be a useful, delightful thing. I would say the big unlock was probably Sonnet 3.6 or 3.7 early last year. That's when we started working on our agent, which we shipped last year. And then custom agents is kind of a similar capability, and that one just took longer because we wanted to get the reliability up a lot higher, because it's actually running in the background, and
Sarah
the product interface of, like, permissions and understanding. This custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people, and the intersection of X and Y might not be complete. So how do you build the product around making sure administrators understand that permissioning? It took multiple swings.
Alessio Ponder
Everything is RBAC at the end of the day. Yeah. I'm curious, when the models are not working, how do you inform the product roadmap? Okay, we should probably build expecting the models to get better at some reasonable pace, but at the same time we need to ship. You had a lot of customers in 2022; it's not like you were a new company with no user base.
Simon
Yeah, I mean, I think there's always the balance of, like, you want to be AGI-pilled and thinking ahead and building for where things are going, but also you want to be shipping useful things. So we always try to keep a balance there. We try to take a portfolio approach. We're always working on multiple projects: we're always maintaining things we've already shipped, shipping new things that are imminently workable and making them really good, and then we always want to have a few projects that are a little bit crazy.
Alessio Ponder
Yeah. What are the AGI-pilled projects that you have today? You don't have to share exactly what you're working on, but I'm curious: what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work?
Sarah
18 months.
Alessio Ponder
Yeah, 18 months is.
Simon
It's a long time. And yeah, there's a number of things happening. I think one thing that's becoming more clear is that coding agents are the kernel of AGI. Everything is a coding agent. I think that's one direction. And the exciting thing about that is your agent can bootstrap its own software and capabilities, and actually debug and maintain them. So we're thinking a lot about that. Another category of things that I'm really excited about is what we call the software factory. Lots of people are using this sort of word. Basically, it just means: can you create an as-automated-as-possible workflow for developing, debugging, merging, reviewing, and maintaining a codebase and a service, where there's a bunch of agents working together inside? And how does that work?
Sarah
If you think back to your initial question, like, why did this take so long? I think something Notion's...
Host/Interviewer
I didn't say that, but yes. Okay, go ahead.
Sarah
Why? What changed over the three and a half years of trying?
Host/Interviewer
Because most people always say: it didn't work yet, then reasoning models came, then it worked. I was like, okay, let's go a little bit deeper.
Sarah
I mean, that's part of it, but I think the other part, which I actually think is what will set Notion apart for every new capability, is we have two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream: quickly realizing whether you're just pressing against model capabilities versus not exposing the model to the right information or not having the right infrastructure set up. That in and of itself is a skill of intuition. The second is, okay, you're not swimming upstream, but which direction is the river flowing, and how do we think ahead about the product and start building it even if it's not great yet, so that when it is there, we're ready for it? And those can sometimes feel like counterintuitive things. We can be trying to fine-tune a tool-calling model when they don't exist yet, and the trick is to not do that for too long, but to realize that there was something there. And we've had a lot of things where we were just not swimming in the right direction with the stream. I think we had multiple versions of transcription before we got meeting notes, right?
Host/Interviewer
Oh, I gotta talk about that. Yeah, yeah.
Sarah
I think that like we really closely partner with the Frontier Labs on capabilities and we also have to have strong conviction on as those capabilities move. Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes?
Simon
Yeah, yeah.
Host/Interviewer
You told me you were a fan of the Agent Lab thesis and this is.
Sarah
I show that thesis to so many candidates, I have it as my Chrome autofill at this point. It's one of my most visited pages.
Host/Interviewer
Is this the: here's why you should work at Notion and not OpenAI?
Sarah
Here's what's different about it, and here's why it's not just a wrapper. I actually think more and more people understand it's not just a wrapper. And by the way, in the beginning, parts of what we build are wrappers on functionality that works well. But I would say that's not the product that drives revenue, and that's not necessarily always what users need.
Host/Interviewer
Notion is the AWS wrapper. But like the wrapper is very beautiful and like very well polished.
Sarah
So the analogy that I've been coming back to is Datadog on AWS.
Host/Interviewer
Yeah.
Sarah
Datadog could not exist without the cloud; it's fundamental that it works. And AWS has a CloudWatch product. But Datadog is an expert on understanding how people want observability on the products they launch. And we're experts in understanding how people want to collaborate. That's really where our expertise lies, totally regardless of the tools that we use.
Alessio Ponder
I'm curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit: they understand, across markets and industries, what engineering teams usually look for. With Notion, it's almost like more of the expertise is at the edge, because as a platform you're so horizontal. With Datadog, the end user is always an engineering lead, a kind of SRE-related person; with Notion, it can be anything. So I'm curious how you put that expertise into a product, versus, obviously, AWS could try to rebuild Notion; that doesn't quite work in this case,
Simon
but it's a little bit differently shaped. I think a classic vertical SaaS like Datadog is kind of like that: they understand their individual customer very deeply. It's kind of a narrow slice. Notion has always been super horizontal, and our task has always been to balance these two somewhat opposing forces. We're listening to our customers and what they want us to build; it's a broad slice. And then we're also thinking about, okay, how do we decompose what they want into nice primitives that are really nice to use and will get us as much bang for the buck as possible, and then maintain the whole system and make it all super clean and nice to use.
Sarah
We still have user journeys. We still focus on core. I actually think the failure mode of our team is when we focus too much on what cool tools we could build. That's actually when we have the least velocity, because you still need some sort of focus on a user journey. So for instance, we'll all sit down every Friday and look at the P99 most token-exhaustive custom agent transcripts, look at why it didn't do well, and cut a bunch of tasks. We still focus on: this should work. Email triaging should work, right? And similarly, when we were chatting before we started filming about, okay, how can I do PDF export? That's functionality that then merits: maybe we should build a tool that has access to a computer sandbox and a file system and the ability to write code. But it's because we're thinking about the fact that our users, to do their daily work, need to export PDFs, not because we're like, I think a computer tool could be cool, let's just see what happens. We have to focus on some user journeys, otherwise we just don't have enough strategy to prioritize.
Host/Interviewer
I think there's a lot of really strong opinions that you've had. Do you have a Tao of Sarah Sachs? How do you run your team? I feel like you've just accumulated all these strong opinions. Obviously part of this is your token town thing.
Sarah
I think the Tao of working with Sarah Sachs is: it depends who you ask. I think it depends if you're on my team or a partner or a vendor.
Host/Interviewer
Yeah, there are other people who want to run their teams the way that you're running yours. But then also, similarly, Simon, when you did the custom agents demo, you had: we've been using custom agents, and here's the story, a super long list of everything that we do, and no humans ever read it. That's what you said.
Sarah
I was like, yeah. So I think for me, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert. My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that's true of all leadership, but especially on the AI team: almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem. And it's a huge disservice if all of those ideas have to pass the sniff test of what me and a product partner, or Simon and Ivan, decided was the direction. Right? Because a lot of what we're doing is leaning into capabilities. So the first thing is I don't really view the role of engineering leadership as hierarchical, nor has it ever been, but especially now: be very willing to change direction, because the proof is in the pudding. And I think we have rebuilt our harness three or four times. When you do that, the second rule of engineering leadership is you need to build a team that's comfortable deleting their own code, is very low ego, is driven by what's best for the company, and doesn't write design docs because they think it's their promotion packet. That's a culture that Notion had long before I joined. Our willingness to just swarm on different problems, and redo things that we've built before because something has changed: there's a lot of friction that can happen at companies when you do that, and it doesn't happen at Notion. And because it doesn't happen, when new people join, they don't want to be the ones saying, we shouldn't do this, I wrote that code. So you create a culture that everyone adopts. And that culture comes directly from Simon and Ivan, I think, because they're very open minded.
Host/Interviewer
Anything you'd add?
Simon
I'm not a manager like Sarah is. A lot of my role is really to try to think a little bit ahead, make sure that we're building on the right capabilities, and then the prototyping stuff. And it's really critical to always just be starting again: okay, there's a new thing, what does this mean? What if we just rethought everything, rewrote everything? I'm basically just doing that in a loop every six months. Yeah.
Host/Interviewer
Do you believe in internal hackathons for this stuff?
Sarah
I think there's two different versions. So one is we just have a solid bunch of senior engineers that come and go on what we call the Simon Vortex, productionizing what we built. Because when you're in the Simon Vortex, the velocity is super high, the direction changes daily, and it's meant to be the equivalent of a skunkworks lab. We don't need to do hackathons for that; we need to have senior engineers that we trust to come in and out of those projects. For instance, management boundaries are really loose: you report to him, but you work for her right now. That's something that, when we hire managers, it's important they don't care about, because we tend to form org structures.
Host/Interviewer
Yeah, don't be too territorial.
Sarah
We form org structures after we ship things, not before, just historically. The second thing is we do have company-wide hackathons. We actually just had the demo day this morning for the hackathon we had last week. That's more for people who aren't directly working on the project, so they feel like they have the time to pause and learn how to make themselves more productive, or how they would use Notion custom agents to build something. Part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch, following, like, every blog post on how to do it. I think because we want...
Host/Interviewer
Is it the compound engineering one?
Sarah
Yeah. We want everyone in the company to use Claude Code, or whatever coding agent they please, and understand those fundamentals. So we set aside a day and a half where all leadership encouraged everyone on their teams, across the company, to do it. So we have hackathons like that. I would say, facetiously, everything we build is a little bit like a hackathon until it graduates and puts on big-boy pants and has a product ops rollout and assigned data scientists and stuff like that.
Host/Interviewer
Security review, enterprise stuff.
Sarah
Actually, security review is one of the things that we bring in first, because otherwise it just slows us down way more and causes a lot of tension, and they build better product if they're involved early. That is probably the first function to get involved in something.
Host/Interviewer
That's the right PR-approved answer.
Sarah
No, but it's not just PR approved.
Host/Interviewer
It's like, it's actually real. I'm just saying.
Sarah
Scar tissue.
Host/Interviewer
Yeah.
Sarah
Because of my background, also. I worked at Robinhood for a number of years, so compliance and things like that are a little more ingrained. You learn the hard way if it doesn't come naturally.
Alessio Ponder
Yeah.
Simon
I think the hackathon is really important for uplifting the general population, but if that's the only way you can build new things, you're toast. It has to be the daily process, building these new things. And I think in the AI era, a lot more leverage accrues to the most curious and excited people. So we're all about activating that energy. If someone's prototyping something on the weekend that they're excited about and it's important, that should be the main thing that we're doing.
Alessio Ponder
Yeah.
Simon
It's not a hackathon that we schedule once a quarter. It's just a daily process that's part of the culture.
Sarah
That's how we shipped image generation in Notion. It was always this thing that would be nice to have, but it wasn't really clear where it was aligned in product priorities, and it'd be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really want to do image generation for cover photos and inside Notion. And we're like, if you want to build it, do it, please. We encourage you. We gave him all the resources: working directly with Gemini, being able to track the token usage, working through our endpoints. We gave him eval support, everything, and then it became a full project. Yeah, that's why you can't have ego as a leader. That's how we work.
Alessio Ponder
What's the size of the team today? Both engineering and overall?
Sarah
I manage the team that we call core AI capabilities and infrastructure; that's about 50 people. But then we have partner teams that do packaging, so how it shows up in the corner chat versus custom agents versus meeting notes; that's another 30 to 40 people. And then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with. The editor team, the team that did CRDTs for offline mode, is the same team that handles how two agents edit competing blocks; it's the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query and does it performantly. So in that regard, anyone working on product engineering is tasked with making things work for customers that are humans and agents, because over time a majority of our traffic will be coming from agents using our interface, not humans. Our objective is to make it so that the whole product org is building for agents.
Simon
Yeah.
Alessio Ponder
How has it changed internally? The activation bar is lowered a lot; anybody can create a prototype somewhat easily, especially in an existing code base. Have you raised the bar on what type of prototype people need to bring forward to be taken... not, like, seriously, but...
Simon
I think the bar is lowered in many ways. One thing about how our team builds that is really cool is our design team made a whole separate GitHub repo called the Design Playground. It's basically a bunch of helper components they created for quickly throwing together UIs, and it's become actually quite sophisticated; it has an agent in there, and that's pretty fun. So they pretty much don't do mocks; they just make full prototypes.
Host/Interviewer
Here it is, it works.
Simon
They give you a URL and they're like, okay, now we have to make the real production version of that. And for engineers, a prototype looks like a feature flag that actually works. That's the bar.
Sarah
Something to understand that's really unique about Notion, and one of the reasons I joined (we're super lucky), is that no one uses Notion in their job as much as people that work at Notion, of course. There are very few companies like that; maybe if you worked on Chrome, I guess. Everything we ship, we ship internally first and get a lot of really quick feedback. And sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done. But everyone, people that do IT ticketing, people that do supply chain procurement, recruiting, everyone is using the same instance of Notion with a lot of flags on for these prototypes people build. Brian Levin, one of the designers on our team, evangelized this concept of demos over memos, which has been very good for building demos. And I think it's put a big pressure on us to have really strong product conviction, because if anything can be demoed, you really need a strong filter to make sure that if you're doing X amount of work, you're focusing on building one tall tower, not a really flat hill. That's actually where I think there has to be more conviction from our PMs and our designers, the company really, about what journey we're going on.
Simon
But overall I feel like it works pretty well. Almost all the engineers have good enough taste to realize whether a prototype actually makes sense in the product or not. So it's not that common that I see a prototype and think, oh, this makes no sense. People are doing reasonable things, and then it's just a matter of which things we build first, and often just figuring out how to turn it on and off. In our experimental chat UI, there are probably 100 checkboxes of different things you can turn on and off.
Sarah
Okay, so that is true, Simon. But being the person that manages the evals team, there is a level of intensity that it adds to the platform team. If we're going to do image generation in Notion, all of a sudden the way that we do attachments, the way that our LLM completion layer, Cortex, talks and expects tokens back, and now it's getting images back: there's a lot of platform work that we do need to solidify a little bit. So sometimes it'll be in dev for a couple weeks before it makes it to prod, just because we still have to make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is. And we need to eval it, because we want the team to still maintain what they build. That's the one thing: if we have a bunch of prototypes, it can't just be a small group of people that then maintains whatever's in prototypes. So we have invested a lot of people in evals and model behavior understanding, teams we group under what we call Agent Dev Velocity: your dev velocity building agents can be faster if we invest in that platform. So we have a whole org dedicated to agent platform velocity, so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out...
Host/Interviewer
And every team maintains their own eval?
Sarah
We maintain the eval framework; every team owns their own evals. A lot of them we've integrated to opt into CI, or we run them nightly, and we have a custom agent that triggers a team to look at the major failures. That's really critical, because we have all these different services now. A lot of it's on the same agent harness, so it's easier to maintain; it's just different packaging of the agent harness but new functionality of the agent. Let's say we want to update because they deprecated Sonnet 4 or whatever it is, and we need to auto-update already.
Host/Interviewer
That's so.
Simon
Okay.
Host/Interviewer
Yeah, it wasn't that long ago they
Alessio Ponder
were just at 3.5, 3.5...
Sarah
3.7 just got deprecated.
Host/Interviewer
3.7, 5.2 or.
Sarah
Yeah, no, it's not 5.2. 5.1, 5.2... 5.4 is 40% more expensive than 5.2. So if they deprecated 5.2, you would hear from me about that one. But that's another conversation to have.
Host/Interviewer
I have a cheeky evals question for you. Have you noticed any secret degradation from any of the major model providers? Secret degradation during the workday, when it's high traffic, and it suddenly gets...
Sarah
Yeah. We definitely notice flakiness, and we've definitely noticed, particularly for some providers, that things are slower during working hours.
Host/Interviewer
And that's a latency argument, not a quality argument.
Sarah
No, I think the quality difference that's interesting really comes down to quantization. Companies that say they're selling the same model through different vendors, whether it be first party or Bedrock, Azure, et cetera: we do see different quality sometimes, and that's not necessarily what's advertised.
Host/Interviewer
Yeah, Kimi went to the point of shipping an eval across all the providers, and it was very obvious who was secretly quantizing, and it
Sarah
was very, very embarrassing. We hire subprocessors to figure that out for us. We just want to understand where it's regressing or where it's optimized. And sometimes we're okay with regressions that optimize latency, if they're the appropriate regressions; our job is to make sure we have the evals to understand the changes that are important to us. And even when we're partnering with labs on pre-releases of models, they'll send us multiple snapshots. This is less about quantization and more just regressions: they have shipped models that were not the snapshots that we wanted, and they have changed the snapshots that they shipped based on the feedback that we give, because our feedback tends to be more enterprise-work focused and not coding-agent focused. And definitely those can be bummers. We know this wasn't the version you wanted, but we'll help you make it work. We always make it work. But that definitely happens.
Simon
Yeah.
Alessio Ponder
Do you have failing evals that you're just hoping that will have success eventually when a good model comes out?
Sarah
Yeah. I could talk about this for 60 minutes, so I will limit myself. I think it's a real issue when people say evals and it's just, that's quality. It's like saying testing: it's not just unit tests. So we have the equivalent of unit tests, regression tests; those live in CI and have to pass a certain percent, within some stochastic error rate. Then we have, as you're building a product, evals where these aren't passing right now, and this is launch quality: we have a report card, and on these categories we need to be at 80 or 90% across all of these user journeys to launch. And then we have what we call frontier or headroom evals, where we actively want to be at a 30% pass rate. That's actually an effort we took on in partnership with Anthropic and OpenAI in the past maybe two or three months, because we hit a point where our evals were saturated and we weren't able to give insightful feedback other than "it wasn't worse." Not only is that not helpful for our partners, it's not helpful for us to understand where the stream is going, going back to that analogy. So we spent a lot of time thinking about what Notion's Last Exam looks like. Not just Humanity's Last Exam: Notion's Last Exam. There are a lot of dreams about what that would look like. I know we've talked a lot about benchmarking, swyx, but yeah, Notion's Last Exam is a big thing inside the company, and we have people staffed full time to it exclusively: a data scientist, a model behavior engineer, and a full-time evals engineer just dedicated to the evals that we pass 30% of the time.
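To make the three tiers concrete, here is a minimal TypeScript sketch of how such gating might look; the names and thresholds are illustrative, drawn from the percentages Sarah mentions, not Notion's actual framework.

```typescript
// Hypothetical sketch of the three eval tiers described above; tier names and
// thresholds are illustrative, not Notion's actual framework.
type EvalTier = "regression" | "launch" | "headroom";

interface EvalResult {
  name: string;
  tier: EvalTier;
  passRate: number; // fraction of cases passed, 0..1
}

// Regression evals gate CI, launch evals gate shipping, and headroom evals are
// expected to fail most of the time: they measure frontier progress.
function gate(results: EvalResult[]) {
  const failing = (tier: EvalTier, threshold: number) =>
    results.filter((r) => r.tier === tier && r.passRate < threshold);
  return {
    blockCI: failing("regression", 0.95).length > 0, // allow stochastic error
    blockLaunch: failing("launch", 0.8).length > 0,  // the "80 or 90%" report card
    headroomSignal: failing("headroom", 0.3),        // tracked, never blocking
  };
}
```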
Host/Interviewer
So you're hiring for MBEs?
Sarah
I am hiring.
Host/Interviewer
What is an MBE?
Sarah
A model behavior engineer. Model behavior engineers started with the title "data specialist" before I joined, when they were working with Simon on Google Sheets. Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad, this looks good. So we hired people with diverse linguistics backgrounds; we had a linguistics PhD dropout and a Stanford comp-lit new grad, and they're amazing. They basically formed a new function, and over time we've built a whole team with a manager who's now reinventing what that role is with coding agents. They used to be manually inspecting output; now they're primarily building agents that can write evals for themselves, or LLM judges. There was a really funny day, I can send you the picture, where Simon, about a year and a half ago, was teaching them how to use GitHub on the whiteboard, because we thought it would be so much faster if our data specialists learned how to use GitHub and commit these things into code. That was then; now coding has become a lot more accessible. Moving forward, it's this mix of data scientist, PM, and prompt engineer, because there's craft in understanding what models can and can't do. How do we define that headroom? How do we define what a good journey is? Is this model better or not? Why is this failing? There's some qualitative work, but there's also a lot of instinct and taste to it, and that's not necessarily software engineering. So we have very firm conviction, and have had for a number of years now, that this is its own career path. And we have always welcomed the misfits, so to speak. We really firmly believe that you don't need an engineering background to be the best at this job, and that's what's quite unique about this particular role.
Simon
This is something that I've been pretty excited about recently: we made an effort to treat the eval system as an agent harness. If you think about it, you should be able to have an agent, end to end, download a data set, run an eval, iterate on a failure, debug, and then implement a fix. Ultimately you should be able to drive the full end-to-end process with a human observing the outer system. So yeah, we went pretty hard on that, and it's worked extremely well so far. It's basically turning it into a coding agent problem.
Host/Interviewer
Your coding agent or just whatever, any coding agent.
Simon
It should be totally general. Yeah, I think it would be a mistake to fix it on any particular coding agent. At the end of the day, it's just like CLI tools.
Sarah
It's the same way that you would have a coding agent write the unit tests: you should have a coding agent write the evaluation. But there's still a lot of supervision in that. We just don't believe that supervision has to come from software engineers, because a lot of it is UXR and whatever. And these are the people that also triage failures and tell us where we should be investing next.
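As a rough illustration of the outer loop Simon describes, here is a minimal TypeScript sketch; the `evals` and `coding-agent` commands are stand-ins invented for this example, not real tools.

```typescript
// Minimal sketch of driving evals with a coding agent: download the dataset,
// run the eval, hand failures to any coding agent, repeat. The `evals` and
// `coding-agent` commands are hypothetical stand-ins.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

const sh = (cmd: string) => execSync(cmd, { encoding: "utf8" });

for (let attempt = 0; attempt < 5; attempt++) {
  sh("evals download --suite search-quality");               // hypothetical CLI
  const { passRate, failures } = JSON.parse(
    sh("evals run --suite search-quality --json")
  );
  if (passRate >= 0.9) break; // good enough; stop iterating

  // Write failures to a file so the (agent-agnostic) coding agent can read them.
  writeFileSync("failures.json", JSON.stringify(failures, null, 2));
  sh('coding-agent "Read failures.json, debug the regressions, implement a fix."');
}
```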
Host/Interviewer
Yeah, I'm going to go ahead and ask a spicy question. Is there a day there are no software engineers at Notion?
Sarah
What does it mean to be a software engineer?
Host/Interviewer
Exactly?
Simon
I think the way things are going is we're on some continuum. If you look back three years ago, humans were typing all the code; then we had autocomplete while you're typing; then we had agents filling in lines; and now we're getting into agents doing longer-range tasks, where they can debug and implement a fix, verify it works, and even get the PR merged and deployed. I think we're just moving up the abstraction ladder, and the human role becomes more about observing and maintaining the outer system. There's a string of agents flowing through, merging PRs: what's going off the rails? What do I need to approve? Is there a learning or memory mechanism that works? It's a hard engineering problem; there's a lot to do there. I think we're just moving up the stack.
Sarah
The same transition machine learning engineers have made. Like, I haven't looked at a PR curve in a while.
Host/Interviewer
Yeah, you used to do this stuff and now auto research can do it.
Sarah
Right. Like, I think it depends on what you define as a software engineer.
Host/Interviewer
Yes, that's changing for sure.
Sarah
I think every software engineer at Notion this summer went through this shift, as one of our engineering leads at the company called it: every software engineer is going through the identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context-switch. And I think that is a transition out of being a software engineer.
Simon
But there's a critical difference from being a manager, which is that the problem is actually very deeply technical. Humans are very fuzzy; you can't treat a team of humans like a rigorous system where PRs flow through and can be in a blocked status, and where you define what happens when they're blocked. With a set of agents, you actually can do that. There's a lot of interesting technical rigor that goes into it. It's a technical design problem.
Alessio Ponder
Ultimately, what is the design of the software factory that you're building?
Simon
Yeah, we're trying a lot of different things. Ultimately, you want to design a system that requires as little human intervention as possible while still maintaining the invariants that you care about. So we're exploring a lot of different ideas there, and I can talk about a few things I think are important. One thing that's really important is having some kind of specification layer. You can just commit markdown files; that works pretty well.
Host/Interviewer
But it's nice to be Notion, man. I'm just saying, the spec... the natural home for specs is Notion.
Simon
Yeah. It can be a database of pages. It needs to be something that is human readable and editable, and I think that's pretty key. Another really key component is the self-verification loop. You need really good testing layers, basically, and getting that right is a really deep problem. And then there's the workflow of what happens when there's a bug: how does it flow into the system? Is it a sub-agent working on it? How does it make a PR, and how does that get reviewed and merged? So there's the flow of the process.
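A hedged sketch of that bug flow, under stated assumptions: specs are plain committed markdown, the test suite is the verification layer, and `sub-agent` is an invented stand-in for whatever coding agent opens the PR.

```typescript
// Sketch of the software-factory loop: a spec layer, a self-verification step,
// and a sub-agent that turns a failure into a PR for human review. The
// `sub-agent` command is a hypothetical stand-in, not a real tool.
import { execSync } from "node:child_process";

const run = (cmd: string) => execSync(cmd, { encoding: "utf8" });

// Specification layer: plain, human-editable markdown committed with the code.
const SPEC_PATH = "specs/search-agent.md";

// Self-verification loop: run the testing layer; on failure, route the bug to
// a sub-agent with the spec attached, and let a human review the resulting PR.
try {
  run("npm test");
} catch (failure) {
  run(
    `sub-agent --task "Tests failed: ${String(failure)}. ` +
      `Read the spec at ${SPEC_PATH} and open a PR with a fix."`
  );
}
```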
Host/Interviewer
Yeah. Cool. One thing we did work up before you guys came in was this demo. Or, just agents: an agent demo.
Alessio Ponder
So every time we do an episode, we try the product. Right. I don't think there's ever been an episode that I haven't tried.
Host/Interviewer
Try is a big word. Since day one, Latent Space has been on Notion. But this is the net new thing. Yes.
Alessio Ponder
So this is for Kernel Labs, which is the space we're in. Next week we're opening applications for tenants, so there's a web form.
Simon
Let me.
Alessio Ponder
We have this form here. Before, the workflow would be: I get an email, then I look at the person and think, should I spend time talking to this person? Then I respond, they respond back. So I built this. And the name, it came up with on its own.
Simon
Can you maybe. How do.
Alessio Ponder
How does it come up with its own name?
Simon
Yeah, that's a pretty apt name. It's just a random. It's a random name generator.
Alessio Ponder
Oh, okay. That's funny. It just came.
Simon
The fact that it picked that is hilarious.
Sarah
I'm pretty sure it's just "Resilient Collector," I think. I've never looked at the code for that; I've never second-guessed it. I think it's like a Mad Libs.
Host/Interviewer
Yeah, I think.
Simon
Yeah, it's totally a different thing.
Alessio Ponder
I thought it was great.
Simon
Although, if you use the AI to set itself up, it can update its own name.
Alessio Ponder
Okay.
Sarah
How did you create it? Did you just do class?
Alessio Ponder
I did, yeah. I said: just check my inbox for applications for
Simon
a co-working space.
Alessio Ponder
Keep it with people. So it created the database for me, which I have here. And I guess a database is like a Notion table, because everything is Notion. Then whenever an email comes in, like here, it just creates a new row for the person. And then it uses web search to enrich the profile: it searches the web, finds who this person is and when they say they want to move in, and updates everything else.
Simon
This is.
Alessio Ponder
It's not AGI, but I don't want to do this work. It took me maybe 15 minutes to set up the whole thing. And I really like that most of the information lives here; it's not like some other tool asking me
Sarah
Yeah.
Alessio Ponder
to bring my stuff. Otherwise, I would have probably already created a Notion thing.
Sarah
So most of our biggest use cases and gains are from removing that extra layer of human involvement in the process. One of our biggest use cases is bug triaging: someone posts something in Slack, and you just have a custom agent that lives there, has its own routing constitution of what team this belongs to, creates a task in your task database, and then posts in that Slack channel. That's one of the first things that we built internally, I think, and it's completely changed the way that Notion functions as a company. Most things don't fall through the cracks anymore; we don't know what we don't know. But it's not replacing people, it's replacing processes.
Alessio Ponder
Yeah. And I'm curious how you think about composability of these things. The other one I was working on is a lease filler: whenever somebody signs up as a tenant, it builds out the lease for them. There should probably be some office manager agent that can handle the requests, make the lease, and then give them Verkada access to the office and all of that. How do you think about that feature?
Simon
Yeah. So there's two ways you can compose. One way is by using the data primitives: you can have one agent writing to a database and another agent watching that database. That's one way they can coordinate; it's a little more decoupled and works really well. Or you can couple them. I think it's actually not released yet, it's releasing next week: in the settings for an agent, you can give it access to invoke any other agent. So you can have them just talk directly.
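A minimal sketch of the decoupled pattern, using the public Notion SDK; the database ID and the "Status" property are placeholders invented for this example.

```typescript
// Decoupled composition: one agent writes rows to a shared database, another
// agent watches it. Uses the public Notion SDK (@notionhq/client); the database
// ID and "Status" property are placeholders.
import { Client } from "@notionhq/client";

const notion = new Client({ auth: process.env.NOTION_TOKEN });
const APPLICATIONS_DB = "REPLACE_WITH_DATABASE_ID";

// The watching agent just polls for rows left in the state it cares about.
async function pendingLeases() {
  const { results } = await notion.databases.query({
    database_id: APPLICATIONS_DB,
    filter: { property: "Status", status: { equals: "Ready for lease" } },
  });
  return results;
}
```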
Host/Interviewer
Is there a limit on the number of recursions, or can you just get an infinite loop?
Simon
There's some kind of.
Sarah
Yeah, I think it's. There is actually a number somewhere, I believe.
Host/Interviewer
I'm just like someone's going to screw it up.
Simon
You should try it and see.
Host/Interviewer
Yeah, everything's going to be paperclips.
Simon
Oh yeah, yeah.
Alessio Ponder
But.
Simon
But that's really useful. Yeah. So I helped someone internally the other day. They had built over 30 custom agents for our go-to-market team doing all kinds of different things: for example, researching and filling in information about a customer, or triaging customer feedback. Literally over 30 of them. And then he even made a database of all the agents, and then it's, okay, now I'm getting over 70 notifications per day of just the agents being blocked on various things. And I was like, oh, okay, cool. The obvious thing to do there is to make a manager agent.
Alessio Ponder
Right.
Simon
That's another abstraction layer in between you and your 30 agents. So we set them up with a manager agent that has access to invoke all the other agents, and it's watching and observing them. It just creates a layer of abstraction: instead of 70 notifications per day, it's like five, and the manager agent can help debug and fix any problems with the...
Host/Interviewer
Because there's a concept of an inbox or something? You're basically saying that they can message each other.
Sarah
Well, they use the system of record, which is notion.
Alessio Ponder
Yeah. So we actually.
Simon
Yeah, we didn't make any special concepts at all.
Host/Interviewer
They're intercepting the notifications that I would have gotten.
Sarah
They can just write a task to a database that the other agent is tasked with listening to, or they can actually call a webhook to the other agent.
Simon
Okay, yeah, this is something that we're still working on. Generally the way we do these things is you first make it possible, maybe in a sort of janky way. So the way I set them up is we created a new database of issues that the custom agents were experiencing, then gave them all access to file an issue, and the manager has access to read the issues. That works pretty well: essentially you give it its own internal issue tracker just for the agents. And then if that becomes a concept that seems generally useful, maybe we'll think about how to package it in. But generally we try to just keep composing the primitives if we can. Another example of this is we have no built-in memory concept. Memory is just pages and databases. If you want to give it memory, just give it a page and give it edit access to that page.
Host/Interviewer
And a human can edit it, Agent can edit it.
Sarah
Yeah.
Simon
And so that works; that pattern works extremely well. And depending on the use case, you can have it be just a page, or it could be an entire database, or it can have sub-pages. It's pretty amazing what you can do with it.
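Both patterns compose straightforwardly through the public API. A hedged sketch with the Notion SDK, where the database and page IDs and property names are placeholders:

```typescript
// Two primitive-composition patterns from above, sketched with the public
// Notion SDK (@notionhq/client). IDs and property names are placeholders.
import { Client } from "@notionhq/client";

const notion = new Client({ auth: process.env.NOTION_TOKEN });

// Pattern 1: an issue tracker that is just a database; any agent files a row,
// and the manager agent reads the same database.
async function fileIssue(agentName: string, problem: string) {
  await notion.pages.create({
    parent: { database_id: "AGENT_ISSUES_DB_ID" }, // placeholder
    properties: {
      Name: { title: [{ text: { content: `[${agentName}] ${problem}` } }] },
    },
  });
}

// Pattern 2: "memory" that is just a page the agent has edit access to.
async function remember(note: string) {
  await notion.blocks.children.append({
    block_id: "MEMORY_PAGE_ID", // placeholder: the memory page's ID
    children: [
      {
        object: "block",
        type: "paragraph",
        paragraph: {
          rich_text: [
            { type: "text", text: { content: `${new Date().toISOString()}: ${note}` } },
          ],
        },
      },
    ],
  });
}
```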
Alessio Ponder
So when I was setting this up, I connected my inbox, and it was like, do you want to use Gmail or Notion Mail? And I'm like, I don't want to use either, I just want you to do it. I'm curious how you think about Notion Mail, Notion Calendar, all of this full-stack Notion UI/UX, when at the same time you have the agents abstracting them away from you, in a way. How do you spend the product calories, so to speak?
Simon
Yeah, I think it's pretty important that you don't have to use Notion Mail to connect to the mail capability; you can just connect Gmail or whatever you want to use. And we're thinking of the mail service as being really great to the extent that it's really agent-built. So maybe the mail app is just a prepackaged agent that helps you automate your inbox.
Alessio Ponder
Yeah, the auto-labeling is great. I think the...
Sarah
When we integrate with Gmail, for instance, we have a series of tools available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team build us the exact right tools that optimize latency, performance, and quality. They own that quality; there are product leads there who are directly thinking about the user problems that happen in mail. So when we build integrations and connections, we tend to build natively first and then think about extending them, generally just because it's also easier to build natively first. That tends to be how we phase things out.
Host/Interviewer
Talking about integrations, you prompted me, so I have to ask: MCP, CLI, what's going on?
Simon
What's the opinion? I'm definitely bullish and excited about CLIs. I think there are a few really cool things about CLIs. One really cool thing is that it's in the terminal environment, so it gets a bunch of extra power; for example, it can paginate and cursor through long outputs. And it has progressive disclosure inherently: you don't see all the tools at once, you just see the CLI wrapper, and you can use the help commands and read files. And then I think the most important thing that's super cool is that it's also inherently bootstrappable. If there's an issue, the agent can debug and fix itself within the same environment where it uses the tool. I saw a tweet this morning where someone said, my agent didn't have a browser, so I asked it to make itself a browser tool, and within 100 lines of code it gave itself a little browser wrapping the Chromium API. That's pretty incredible. And if there was a bug, it would just immediately try to fix it. On the other hand, if you use the Chrome DevTools MCP, I've had this issue where sometimes the transport gets messed up. If it gets messed up, the agent has no way to fix itself; it no longer has a browser, it's now broken. I think that's pretty fundamental. But I would say a lot of the bad things about MCP can be fixed. The progressive disclosure can be fixed with a great harness: it obviously doesn't make sense to show it all the tools all the time, and that's not really inherent to the MCP protocol, it's just how you wrap it and use it.
Host/Interviewer
There are many poorly implemented MCPs, because we didn't know better.
Simon
Yeah, I mean, it was just early. The obvious thing to start with is to just show it all the tools, and that's okay. Now we have 100 tools and the tool calling actually works, so let's give it a way to filter, to search the tools. I would say, broadly speaking, I'm really bullish on CLIs. I'm still bullish on MCPs in a certain environment. In particular, MCP is really great for when you want a narrow, lightweight agent. There are definitely a lot of use cases where you don't want a full coding agent with the compute runtime, and where you also want it to be more tightly permissioned. MCP inherently has a really strong permission model: all you can do is call the tools. A CLI is a little bit murkier. Can it access the API token? Are you properly re-encrypting the token so it can't exfiltrate it? It introduces a lot of new issues which are real and hard to solve. And MCP is just the dumb, simple thing that works, and it's pretty good.
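As a toy illustration of why a CLI gives progressive disclosure for free: the model sees one entry point, and `--help` reveals subcommands only when needed. This sketch is invented for illustration, not Notion's CLI.

```typescript
#!/usr/bin/env node
// Toy CLI wrapper: the agent holds one command in context instead of dozens of
// tool schemas, and pages through --help on demand. Entirely illustrative.
const [cmd, ...args] = process.argv.slice(2);

const commands: Record<string, (a: string[]) => void> = {
  search: (a) => console.log(`searching for: ${a.join(" ")}`),
  "read-page": (a) => console.log(`reading page ${a[0]}`),
};

if (!cmd || cmd === "--help") {
  // Progressive disclosure: list names only; details live behind each command.
  console.log("usage: notion-cli <command> [args]\ncommands:");
  for (const name of Object.keys(commands)) console.log(`  ${name}`);
} else if (commands[cmd]) {
  commands[cmd](args);
} else {
  // The agent can read this error and recover in the same environment.
  console.error(`unknown command: ${cmd}; run --help`);
}
```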
Sarah
I'll add two more perspectives, not on what works well for Notion, but on how Notion commits to both platforms. Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP insofar as other people are using MCPs, regardless of our perspective. We've put a lot of effort into our MCP, and we have a fantastic team that we're building to do more there. The second thing, and I think we all think about this, but lately I've been thinking a lot about it, is making sure there's value alignment between pricing and capability, literally in the expression. Needing language to execute deterministic tasks feels wasteful, and relying on a language model to interface with third-party providers seems wasteful for tasks that don't require it, particularly because our custom agents use usage-based pricing. We think of pricing as the barrier to entry for use of our product, and we're quite committed to making sure it's not wasteful, not just because it's a bad deal for our customers, but because it's also bad business. We want as many buyers as possible; there's an elasticity of demand. So if we can have our agents properly execute code that calls a CLI deterministically, it's a one-time cost, versus constantly having a language model integrate with an MCP over and over, paying those repeated token fees. And if it's happening outside the cache window, you're paying for it over and over. It's just unnecessary, and less deterministic than it needs to be.
Alessio Ponder
Yeah, the open-endedness. I think the main thing is: if I could write code to just call an API, I would never use an MCP. But then you need an MCP sometimes, when you know what to call but you don't want it to restart. And I think "it built a browser from scratch" is great when you're doing it on your own, but if your customers were having your AI write a browser from scratch every time and you had to pay the token cost of that, you'd be like, no, the Chrome DevTools MCP is actually pretty great, just use that. I'm curious, how do you make that decision? Should it be a straight API call, very narrow? Should it be an MCP? Should it be super open-ended?
Sarah
Do you mean for when we ship notion capabilities or when we add capabilities,
Alessio Ponder
You might have a capability where the only way to do it is an open-ended agent, like an agent with a coding sandbox.
Sarah
Yeah. In Notion AI they're not explicit. We also ship an MCP. Yeah, yeah, yeah.
Alessio Ponder
Internally, is there ever a discussion of, we're not going to ship it because we're not able to tighten it down? Or are you happy to just...
Sarah
No. There are a lot of things where we choose not to use MCP because we want to add more high-touch quality. I think search inside the agent is the largest instance of that, where we have Slack and Linear and Jira search in Notion that isn't necessarily using the search MCP functionality provided by those companies. And that's because we think it's quite critical to how our agent trajectories work for us to have a little more control over the functionality of the search journey. So it usually comes from quality. And there's a long tail of things, and that's why we built an MCP client, or an MCP server, excuse me, so that people can connect whatever they want; there was that long tail. But for search particularly, I would say that's the primary answer. There are other connections as well where it's a little bit of secret sauce about when we're okay with MCP functionality and user-driven auth, and when we actually want to carry a lot more ourselves.
Simon
I think there's not really a conflict here; there are just different layers of the stack and different abstractions. If I were to map it out: MCP is a protocol for gaining access to tools, and it's an open protocol, so you can easily get a long tail of many things. So if you open up our tool settings... oh, that's not the trigger; that's something MCP can't do. If you scroll down to the tools and access, you see the connections. MCP is a really great way to gain access to tools, and it works really well. But you just looked at the trigger UI, for example: there's no trigger protocol, so those are things we had to build ourselves. And then there are some integrations where we used MCP. I think Linear and GitHub use MCP, but Slack, Mail, and Calendar are ones we built in house, and we spent a lot of time really fine-tuning all the tools to make the mixture really good, and also building out the triggers. So it's just different layers of the stack; some things make sense sometimes, and we just have to harness the right tool at the right time. I don't think there's an inherent strong conflict between these things.
Alessio Ponder
Do you have a canonical representation of these tools internally, where you wrap these things together? The MCP plus the custom-built ones.
Simon
Yeah, we have internal abstractions for what is a tool, what is an agent, what is a completion call. Yeah.
Sarah
We even have internal abstractions for what is a chat archetype, whether it be from Teams or Slack.
Host/Interviewer
Yeah, it's like the only way to build with AI because everything's moving so quickly. You would have to abstract it so that you can swap things out.
Simon
Yeah, there's always a dance. We've probably rebuilt our framework, like I said, five different times. It's always a dance of: okay, how does this new thing work? What should the abstraction be? What is OpenAI giving us? What is Anthropic giving us? How do we wrap over it? I think we've been pretty successful with that. It's just a matter of staying nimble and making sure that you always have the simplest, dumbest abstraction you can that maps to all the different things. So we have a tool integration abstraction, for example, and then MCP is a type of integration. That's one of them.
Host/Interviewer
This might be a big ask, but I'm going to try. You've said multiple times that you rebuilt a few times, like five times; I don't know what the right number is. Is there a brief history of what each rebuild was doing?
Simon
I can try to do that.
Host/Interviewer
Yeah, you need to RAG over the archaeology.
Simon
The first version that we started building in late 2022... oh my gosh, there have been many versions, actually. Okay.
Host/Interviewer
The highlights, wow.
Simon
The first version we built was actually a coding agent. We were like, oh, instead of building tools, let's make everything be JavaScript. We'll just give it JavaScript APIs, it will write code, and that's how it speaks to the tools. But at the time it just sucked at writing code.
Alessio Ponder
It wasn't that good.
Simon
So then we moved to more of a tool-calling abstraction. Tool calling didn't exist yet, so we created this whole XML representation. And a big learning in that version was that we were catering way too much to what made sense for Notion and Notion's data model versus what the model wants. As an example, we created this whole XML format that can losslessly map to Notion blocks, where the transformation between them is super easy to do, and then we created these mutation operations to edit pages. But it sucked, because the model didn't know the XML format. And also...
Host/Interviewer
And you had to prompt it in. Yeah.
Simon
And the tool was just more inconvenient. So we were like, okay, it has to be markdown; the models know markdown. We did a whole project around creating a Notion-flavored markdown, where the whole goal was that it has to be just simple markdown at the core, and then we can add some enhancements; it doesn't have to be a fully lossless conversion. That was a big one. And then we had a whole similar learning at the database layer. The way you query a database in the Notion API is this crazy JSON format, and it's limiting, but it maps nicely to how we represent things internally. We scrapped all that and said, okay, let's just make it SQLite. Everything's a SQLite database; you can query it with just a SQLite query. And the models are super good at that.
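The contrast, sketched under assumptions (better-sqlite3 as the SQLite binding; table and column names invented for illustration):

```typescript
// The shift Simon describes: instead of teaching the model a bespoke JSON
// filter format, let it emit plain SQL against a SQLite view of the database.
// Uses better-sqlite3; table and column names are illustrative.
import Database from "better-sqlite3";

const db = new Database("tasks.sqlite");

// Before (roughly): a custom filter language the model had to be taught, e.g.
// { "filter": { "property": "Status", "select": { "equals": "Open" } } }

// After: the model writes SQL it already knows deeply.
const modelGeneratedQuery =
  "SELECT title, assignee FROM tasks WHERE status = 'Open' ORDER BY due_date LIMIT 10";
console.log(db.prepare(modelGeneratedQuery).all());
```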
Host/Interviewer
So give the models what they want.
Simon
That was another one. I would say the big learning is to be really savvy and careful in thinking about what the model wants in terms of its environment, cater to that, and try really hard not to expose it to any unnecessary complexity from your system.
Host/Interviewer
Notion's underlying database is Postgres now. So I don't know if there's any mismatch there.
Simon
That one was fortuitous, because we actually already had a big project going: when you query a Notion database, it's actually querying this cluster of SQLite databases. That's something we'd already been working on even before the agents came around.
Host/Interviewer
Yeah, you guys had a fantastic blog post about it. And it's actually really good database engineering knowledge to have that from you guys. Because where else will we get it?
Simon
Yeah, yeah. It's a crazy engineering problem when you want millions and billions of databases, where some of them are tiny but some are very large, and you want them all to be very fast.
Host/Interviewer
Yeah. And also not that hierarchical sometimes. So somewhat of a graph. I do like that history because I think that shows the evolution that you guys went through and the work that went into it.
Sarah
He just got you to a year and a half ago.
Host/Interviewer
Oh, okay.
Simon
Okay.
Host/Interviewer
I need to hit continue.
Sarah
If you're curious, we can keep going. I'm just saying, that's only how far we got.
Simon
That's another one. Yeah.
Sarah
I mean, no, because there was tool calling, and then there was research mode, which wasn't fully agentic tool calling. Then we moved away from few-shot prompting entirely to tool definitions, and now we're thinking about Agent 2.0.
Host/Interviewer
So no few shot prompts ever. Okay. No, maybe not.
Sarah
I don't know if never.
Simon
But yeah, that kind of went away. It's an interesting thing, right?
Host/Interviewer
Yeah, they just instruction-follow really well.
Simon
I would say there's been a general arc where you gradually strip away everything and it looks more like AGI. It started out as one-shot: one prompt with a few-shot examples. Then it became, okay, let's give it tools, but it'll still have few-shot examples. And then it became, actually, let's just give it a whole bunch of tools. One big shift I've been working on recently, that's about to ship, is what happens when you have a lot of tools. That's where progressive disclosure becomes really important. Our agent worked really well, but we hit a bottleneck where it became pretty hard to add new tools, and we became worried about breaking the model.
Sarah
No, I just heard that saying hello was, like, thousands of tokens. Yeah, it was really slow.
Host/Interviewer
I can see you're the efficiency person here.
Simon
It was too many tokens. But it's also a quality issue, because it meant that any engineer could introduce a new tool for some niche feature and it would nerf the overall model, by causing it to call that tool too much, stuff like that. So we had an effort to make our harness implement progressive disclosure in a nice way. That's a big shift.
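Here is one way progressive disclosure can be implemented in a harness; a hedged TypeScript sketch with invented names, not Notion's actual design. Only a couple of meta-tools are always in context, and full tool definitions are loaded on demand:

```typescript
// Hypothetical sketch of progressive tool disclosure. Instead of putting
// 100+ full tool definitions in every request, the harness exposes two
// small meta-tools and loads the rest on demand.

interface ToolDefinition { name: string; description: string; inputSchema: object; }

const registry = new Map<string, ToolDefinition>(); // all ~100 tools live here
const active = new Set<string>();                   // tools currently visible to the model

const metaTools: ToolDefinition[] = [
  {
    name: "search_tools",
    description: "Search the tool registry by keyword; returns matching names and summaries.",
    inputSchema: { type: "object", properties: { query: { type: "string" } } },
  },
  {
    name: "enable_tool",
    description: "Load a tool's full definition into context so it can be called.",
    inputSchema: { type: "object", properties: { name: { type: "string" } } },
  },
];

function searchTools(query: string): string {
  const hits = [...registry.values()]
    .filter(t => t.description.toLowerCase().includes(query.toLowerCase()))
    .slice(0, 5);
  return hits.map(t => `${t.name}: ${t.description}`).join("\n");
}

function enableTool(name: string): void {
  if (registry.has(name)) active.add(name);
}

// On each model call, only meta-tools plus explicitly enabled tools are sent,
// keeping the prompt short no matter how large the registry grows.
function toolsForNextTurn(): ToolDefinition[] {
  return [...metaTools, ...[...active].map(n => registry.get(n)!)];
}
```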
Sarah
You said earlier that everyone says reasoning models were the big shift. What mattered more for us was when we went away from few-shots to describing the goal of each tool, goal-driven, basically moving from a DAG to a true system with feedback. That's when we could distribute tool ownership to the teams much better. Because when it was all few-shots, everyone was truly editing one string, and things would compete on the order. There were all these papers about how not all context is created equal: the higher up something is in your examples, the more the model listens. We were trying really hard to fight against the order and the selection of the few-shots, and that really had to be a center of excellence. It didn't scale with the number of people for the need the company had; it was really just five or six people that were even allowed to touch that, or rather had to approve changes to it, in our code base. Now, with the right eval setup, we can distribute it so that everyone owns their tool and their tool definition. And sometimes we have crazy things, where we write two tools that have the same title and the agent crashes, stuff like that. So there are issues. Believe it or not, Anthropic couldn't take it: Sonnet couldn't handle two tools with the same name, whereas OpenAI's GPT 5.2 was like, I can figure this out. That was an interesting one we learned by accident through a sev.
Host/Interviewer
I mean, the underlying representation is a dict, right? Really, it's the same key name.
Sarah
Exactly, exactly. So that was a big shift for the company in velocity. Not immediate, because the AI team, the center-of-excellence team that owned that one file of few-shot prompts, had to become a platform team overnight, and that wasn't natural. But in terms of the velocity of how we contribute to the agent, beyond coding tools obviously being a big velocity lever, being able to distribute tools and not have to all collaborate on one very carefully curated system prompt string is truly, I would say, the biggest lever on how we've scaled.
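A guard like the one below is a minimal sketch of catching that failure class (duplicate tool names) in CI instead of at the provider's API; all names are hypothetical:

```typescript
// Hypothetical guard against the failure mode described above: reject a
// tool registry that contains duplicate names before any model call,
// instead of letting the request crash downstream.

interface ToolDefinition { name: string; description: string; }

function assertUniqueToolNames(tools: ToolDefinition[]): void {
  const seen = new Set<string>();
  for (const tool of tools) {
    if (seen.has(tool.name)) {
      throw new Error(
        `Duplicate tool name "${tool.name}": tool names are dictionary keys ` +
        `in the request payload, so each must be unique.`,
      );
    }
    seen.add(tool.name);
  }
}

// Example: two teams independently register a tool called "create_task".
try {
  assertUniqueToolNames([
    { name: "create_task", description: "Create a task in the Tasks database" },
    { name: "create_task", description: "Create a task on the Sprint board" },
  ]);
} catch (err) {
  console.error((err as Error).message); // surfaces the collision in CI, not prod
}
```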
Simon
We're just fighting to keep the prompt as short as possible now. In the latest version of the agent (it's not in custom agents yet, but it will be next week or the week after), there are now over a hundred tools just for all the crazy Notion stuff. So we're able to really go deep.
Host/Interviewer
Would you list those tools publicly? Is this like IP or.
Simon
No, it's totally public. You can ask. You can just ask the agent and it will tell you.
Sarah
And we're going to post a benchmark
Host/Interviewer
Wait, you're going to post a benchmark?
Sarah
We don't think our system prompt is our secret sauce.
Host/Interviewer
Yeah, great.
Simon
We don't try to hide the tools at all. I think it's kind of important, actually, as an operator, as...
Host/Interviewer
A power user. I want to be like, oh, I can do this. This is good.
Simon
Yeah, yeah. One phrase we say internally a lot is "teach to the top of the class." Customization is a power tool. We try to make it as easy as possible to set up, but we want it to be pretty deep and sophisticated. A huge part of that is the operator needs to be able to interrogate the way the system works. And a big part of that is: what are the tools? How do they work? How should I prompt it to use the tools in the right way?
Sarah
I'd actually say we don't try to make it as easy as possible to use, because the more we do that, the more we abstract away the interpretability Simon is talking about, and that basically nerfs the agent from being super capable. A huge turning point, I can think of the week and a half we all came together on this as we were building custom agents, was the alignment that we're not trying to build for everyone here. We're not trying to build the user experience that anyone can figure out how to use, because the more we do that, the more we diminish its capabilities. That alignment, reached in a couple of Slack messages, actually made us all work faster again, because we were more centralized on who we were building for.
Alessio Ponder
What does the meta-prompt generator look like? I looked at the system prompt it generates. For example, it uses emojis. That's not an obvious thing to be doing.
Host/Interviewer
Wait, did you just ask it, what's your system prompt? Oh, this is how to generate prompts. The prompts to generate prompts.
Alessio Ponder
Yeah.
Simon
So this is actually just the agent. One thing we did that I really like with custom agents is that it can set itself up. We not only gave it access to the tools it has access to, to send your emails or whatever, but it has more tools to set itself up and to debug itself. So when you ask it to write a system prompt, it's your agent itself doing that.
Alessio Ponder
So this is just the model preference. You're not really injecting into the model
Sarah
too much. We give it guidance on what a good custom agent looks like, and things like that. And then it's really nice too, because if it fails, you can ask it why it failed, and then say, okay, update your instructions so it doesn't fail again. Obviously we should build a product around self-healing; that's next on our roadmap. But it already creates a nice system.
Simon
Yeah. We do essentially give it a development guide: here's how to make a custom agent, here's how to help the user test it end to end, so they gain confidence that it works. Stuff like that.
Alessio Ponder
Yeah, the fixing thing worked. It wasn't automatic, but I mis-set something up, and then there was a fix button, and it just worked.
Simon
Yeah, yeah. There's actually an interesting permission problem there. The thing about custom agents is that by default they have no permission to do anything, and then you have to explicitly grant all of their permissions. That's what lets you trust that it can work in the background. You can know, oh, it can read my email but not send email, okay. But if you let it fix itself, you're breaking that permission barrier. It's not allowed to edit its own permissions, but in the current product you can click a button to fix it. Now you're entering sort of an admin mode, where you're in a synchronous chat and you can see what it's
Sarah
doing and it confirms before it changes.
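The permission model described here, default deny, human-only grants, and no self-modification of permissions, can be sketched roughly as follows. All names are hypothetical; this is the shape of the idea, not Notion's implementation:

```typescript
// Hypothetical sketch of a default-deny agent permission model.
// An agent starts with no grants; every capability must be explicitly
// added by a human, and it can never edit its own permission set.

type Capability = "read_email" | "send_email" | "edit_pages" | "edit_own_permissions";

class AgentPermissions {
  private grants = new Set<Capability>(); // default deny: empty set

  grant(cap: Capability, grantedByHuman: boolean): void {
    // Permissions are only ever changed by the human operator,
    // never by the agent acting on its own behalf.
    if (!grantedByHuman) throw new Error("Agents cannot grant themselves permissions");
    if (cap === "edit_own_permissions") throw new Error("This capability is never grantable");
    this.grants.add(cap);
  }

  can(cap: Capability): boolean {
    return this.grants.has(cap);
  }
}

// Usage: the operator lets the agent read mail but not send it,
// so it can safely triage in the background.
const perms = new AgentPermissions();
perms.grant("read_email", true);
console.log(perms.can("read_email")); // true
console.log(perms.can("send_email")); // false
```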
Alessio Ponder
The thing that I really like, that most people don't do, is that the editing chat is the same thing as the using chat. You can message the agent to both edit it and use it, whereas a lot of other products are like...
Simon
I think that's really key.
Sarah
I think a lot of our designers will feel so happy you said that, because we spent a lot of time on it. We called this Flippy.
Host/Interviewer
What is this? What do you mean?
Sarah
This view of.
Simon
Yeah, so if you close that and open settings, you can see it. We called it Flippy because we started with the settings as the main page, and then you could test the agent. The AGI-pilled way to think about it is: oh, it's just the agent. Everything is the agent.
Alessio Ponder
Right.
Simon
It can set itself up, it can test itself, and it can run the workflow you want to run. So we flipped it: the main view is the chat, and the settings is more just a side panel previewing the changes it's making, so you can introspect on them, or make changes manually if you'd like. But we want to design the experience from the get-go so you never have to touch any of the settings manually; you can just talk to it.
Sarah
And the inside baseball is that how this works was probably the launch-blocking part, especially because we had a lot of early adopters that were used to the old way. That's the flip side of building in public: changing how people think about setting up custom agents when they already had a flow was difficult in and of itself.
Simon
That's really funny, because we ended up painfully delaying the launch by a few weeks, definitely like a month or so. But the whole team was super enthusiastic about it, because it was just so much better. It was like, oh yeah, obviously you chat with it to set it up. Everyone was super bullish on that. So it was painful for a second, but then everyone was glad.
Sarah
And back to organization design, which I probably care about more than Simon: the people that built this are three engineers from three different teams, because we were like, we need to launch this and we need to fix this. We've built a company where we can just put people on a problem and no one complains, the manager doesn't complain, and we were able to unblock and just ship it.
Alessio Ponder
Yeah. But being in a failure chat and asking it to just fix itself is amazing, versus, I've got to copy this and put it in the settings chat and do it there.
Simon
Yeah. There's an interesting trade-off in there that we're trying to explore, which is: we want to be a business, enterprise-safe agent, where you can delegate something and trust that it's going to work. But we also want some of that bootstrapping power you feel when you're coding and it's making a browser for itself, right? There's something there that I think is really important. So we're trying to navigate that trade-off and get you both.
Alessio Ponder
Now it's free, it's amazing, and I'm worried about when I have to start paying. How do you think about it? You have Notion credits as payment for this, which is separate from the usual tokens the model generates. How do you design pricing? Value-based pricing based on the task, things like that?
Sarah
So the credits and payment structures are associated with token usage. The reason we had to make it not just token throughput is that it's not always priced that way. Our fine-tuned and open source models are served on GPUs. Web search is priced differently. If we were to host sandboxes, those are priced differently. We had to think of an abstraction above tokens, and it's also not just tokens; it's the token, model, and serving-tier trade-off, because we can have priority-tier processing, we can have asynchronous processing, and the cache rate can be different depending on who uses it and when. So from the get-go we wanted to commit to making sure customers were getting a fair deal: not necessarily that we were making a ton of money off it, but that customers were paying for what was reasonable. That's the fundamental of where we started. Also, we're selling enterprise SaaS, so if we sell credit packs, you get discounts if you're an enterprise and you buy a certain amount of credit packs, and things like that. It just helped the sales motion work a little easier. So that's the answer on the abstraction of credits to dollars. Now, was the question how we decide how to price it, or...?
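Here is a hedged sketch of what a credit abstraction over heterogeneous costs might look like. Every rate below is made up; this is not Notion's pricing logic, just the shape of the idea that credits unify token costs (which vary by model and serving tier) with non-token costs like web search and GPU-served open source models:

```typescript
// Hypothetical credits abstraction over heterogeneous cost drivers.

type ServingTier = "priority" | "standard" | "batch";

interface UsageEvent {
  model: string;
  tier: ServingTier;
  inputTokens: number;
  cachedInputTokens: number; // cached input is billed cheaper
  outputTokens: number;
  webSearchCalls: number;
  gpuSeconds: number;        // for self-hosted open source models
}

// Credits per token, per model and tier (all numbers invented).
const tokenRates: Record<string, Record<ServingTier, number>> = {
  "frontier-large": { priority: 0.004, standard: 0.002, batch: 0.001 },
  "open-source-small": { priority: 0.0008, standard: 0.0004, batch: 0.0002 },
};

function creditsFor(e: UsageEvent): number {
  const rate = tokenRates[e.model][e.tier];
  const tokenCredits =
    (e.inputTokens - e.cachedInputTokens) * rate +
    e.cachedInputTokens * rate * 0.1 + // cache hits at 10% of the rate
    e.outputTokens * rate * 4;         // output tokens priced higher
  const searchCredits = e.webSearchCalls * 2;  // flat per-call rate
  const gpuCredits = e.gpuSeconds * 0.5;       // flat per-second rate
  return tokenCredits + searchCredits + gpuCredits;
}
```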
Alessio Ponder
Yeah, I think all tokens are not made equal, but we obviously get charged mostly equally. You can ask Codex to create a dumb tool, I created one for our StarCraft 2 LAN for people to find a game, but then people use it to build features in billion-dollar companies, and the token price is the same.
Sarah
Yeah.
Alessio Ponder
Like, for you: I can ask this to update my favorite recipes doc and it'll do it, or I could ask it to respond to an email from an investor. The value is very different, and you could charge more, but you're not necessarily doing it. So I'm curious if there was any discussion.
Sarah
I think that's not where the market is right now, number one. The second reason we're not doing that is that it ended up being complicated to figure out what counts as complicated. At first we were like, let's just charge on agent runs, and we went through all the different versions, and they all ultimately brought you back to a lot of complexity that mapped directly to token throughput. So it's also just simpler; it's quite difficult to build those pricing systems. And I actually think one of the biggest reasons we had usage-based pricing for this capability is that we've had our core agent for a while with a model picker, and there were certain models and certain functionality where we had margins to maintain, and if we wanted to ship this functionality, we couldn't afford it. It would bankrupt the company. For instance, autofill, the database autofill feature, will soon be agentic, and that will be associated with usage-based pricing, because if every single autofill action were an agent running on Opus on every single database cell, it would be billions of dollars. So we had to find a way for the customers that wanted to do more, and wanted to give us their money and pay more, to have an outlet to do that, without applying it to the lower end of the curve. Also, not all knowledge work is equal. A lot of the agent workflows here really saturate model capabilities; you don't need a complicated model for them. But we couldn't just decide for you that you wanted your email client to be dumb or not. We want you to decide if you want Opus to auto-triage all of your emails. We will actually give you nudges in the product to rethink whether that's the right choice, because not every user understands; you'd be surprised, in user interviews people will say, oh, I didn't know that. So now we have a little hover that tells you whether something is expensive or not. It's also slower. The interesting thing is that people don't care about speed in custom agents, so the incentive of Haiku being faster doesn't matter when it's asynchronous. We only want to provide the extra benefit that people actually want, and the best way to do that is to incentivize them, because it's their own money.
Alessio Ponder
Must be confusing for people that are not familiar. It's like, why is there no 5.3? You open this thing and it's like, is there something missing in my menu? Not their fault.
Host/Interviewer
Not their fault.
Simon
Yeah. That's just the world we live in now.
Host/Interviewer
It just randomly jumps by point two.
Sarah
It's like Claude had that. I think what's actually been hard for us is to convince people that Auto is not just our cheapest, dumbest model, but actually the model that's best for the task you want to do.
Alessio Ponder
I mean.
Host/Interviewer
Exactly.
Sarah
Nice. And a lot of our job is actually figuring out Auto.
Host/Interviewer
This is the agent lab. Every agent lab has an Auto.
Sarah
Yeah.
Host/Interviewer
And that's the job.
Sarah
Exactly. Because if you think about it, like I said, I come from Robinhood: you could spend a lot of time keeping up with the markets, or you could have auto investing, right? You can have an index fund,
Host/Interviewer
or you can have Robo Advisors.
Sarah
A robo advisor. So at a certain point we can also be robo advisors, and we have a lot of people figuring out what model is best for the right task. Right now we're not using Auto as a margin maker; we're just using it to reduce stress. It's not Opus, that's for sure, because the majority of the tasks people are doing aren't Opus-level intelligence.
Simon
The other thing I would say is that, unlike a lab, we aren't incentivized just for you to use as many tokens as possible. We're actually really interested in giving you the right tool for the job. A lot of the time the right tool for the job is actually just writing code and not even using an agent at all. So that's something we're investing in a lot: imagine your agent can actually automate itself out of a job. We would love if that were true.
Sarah
I feel very strongly about this, because I don't necessarily feel like those are the SKUs the frontier labs give you. I feel like they are just getting more and more capable and more and more expensive, which is fantastic for the use cases where people want to do really complicated things in Notion. What's difficult is the market that right now is a no man's land: where reasoning models were six months ago, which the Nanos and Haikus haven't caught up to. We're just paying more for extra capability that we didn't necessarily need, and so are our customers. And with how few players there are, labs aren't incentivized to meet the market everywhere. They just need to be the cheapest; they don't need to be at the value the customer wants. If no one's cheaper than them, then they're the cheapest, and that's good enough. So we're doing a lot to make sure we have the right optionality to switch between models, and also investing in open source, because the open source models are actually getting to where reasoning models were three or four months ago, and that's what's filling that gap right now. You'll see we offer MiniMax, and we're collaborating a lot with different open source labs, thinking about Notion's Last Exam and how they can do better on these types of tasks, so that we can offer them at the right intelligence-to-price-to-latency trade-off. In that triangle of intelligence, price, and latency, users get to choose where they are. But right now the whole triangle isn't filled with models. Everyone's clustered in capability, and Haiku's not that much cheaper. No one's really in the middle; models tend to cluster around "this is really capable and really fast, but really expensive," or whatever. So we just want to make sure that triangle is filled, we want to offer the models that fill it, and we want to guide users to understand when they need it.
Alessio Ponder
Yeah.
Sarah
Which one?
Host/Interviewer
All I'm hearing is that someday you're going to train your own model. You have lots of tokens.
Sarah
I don't know. What do you mean by train our model? Money to train a foundation model?
Alessio Ponder
You go raise it.
Host/Interviewer
You can raise it.
Sarah
That's your job, Simon. No, I don't think that needs to be our core competency.
Host/Interviewer
This is usually the thought process that leads to "no one else is doing it, we'll take a crack."
Simon
Yeah. To the extent that we do anything like training, the area I'm actually most excited about is less one big model for all users and more, as it becomes more possible, specific fine-tuning that really knows the context of your company, the people who work at your company, what's going on. I think that's pretty interesting, because if you had a model that really knows your company, that would be a huge quality uplift.
Sarah
We actually have some enterprise vendors that ask about this, along with bring-your-own-key: if I have a model that really understands my enterprise, that we're training for all these reasons. These tend to be quite large institutions thinking about how to let people bring their own models. But those models have to function with an understanding of how to call our tools, and that's where, again, having a more public system prompt is beneficial to Notion. We want all models to plug into Notion as well as they can. That being said, of course there are certain aspects of Notion where we do fine-tune and do reinforcement fine-tuning on our own capabilities, but that's not necessarily trained on user data. You don't need that much data in the first place. And that's where, when we have a data scientist and a model behavior engineer who really understand where the capability gap is, that's when we invest.
Simon
I personally burned a lot of time trying to train models, and it's tempting, right? It's so tempting. I was retraining every day, doing a crazy amount. I was doing a lot of different things.
Sarah
I was the budget person; I showed up when I heard that was happening.
Simon
A funny thing, sort of an arc that looped back on itself: back when I was doing tons of training stuff, any kind of training run takes a long time, and you end up operating 24/7, around the clock. It becomes very important that before you go to sleep the experiments are started; it's watch-intensive work. Then, as I stopped training, that kind of went away. But now the coding agents have totally brought it back. So now, every night before I go to bed, I'm like, okay, did I start enough agents? Did I give them enough to do?
Host/Interviewer
So you have to try polyphasic sleep, so you can wake up every two or three hours.
Simon
Yeah, I have not gone there yet. But my goal these days is just: before I go to bed, the agents are running, and I'm confident that they won't be done by the time I wake up.
Host/Interviewer
Really? Eight hours?
Sarah
I won't say which frontier coding lab, but there was a point where he had outlived the thread length and context length that that coding agent provided. He DM'd them saying, hey, I need more. And our account rep DM'd me directly, like, is Simon trying to prove string theory? What is he doing?
Simon
I had a single coding agent thread going for, I think it was like 17 days pretty much continuously.
Host/Interviewer
Don't they just compress?
Simon
Yeah. It was actually just a bug, a harness bug. It had done compaction like a hundred times, probably.
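For readers unfamiliar with compaction: when a thread outgrows the context window, coding-agent harnesses typically summarize older turns and keep recent ones verbatim. A minimal sketch, with made-up numbers and a stubbed summarizer (this is the general technique, not any particular lab's implementation):

```typescript
// Hypothetical sketch of context compaction in an agent harness.

interface Turn { role: "user" | "assistant" | "tool"; content: string; }

const CONTEXT_BUDGET = 100_000; // rough token budget (invented number)
const KEEP_RECENT = 20;         // always keep the most recent turns verbatim

function roughTokens(turns: Turn[]): number {
  // Crude heuristic: about 4 characters per token.
  return turns.reduce((n, t) => n + Math.ceil(t.content.length / 4), 0);
}

async function summarize(turns: Turn[]): Promise<string> {
  // Stand-in for an LLM call that compresses old turns into a summary.
  return `Summary of ${turns.length} earlier turns.`;
}

async function maybeCompact(turns: Turn[]): Promise<Turn[]> {
  if (roughTokens(turns) <= CONTEXT_BUDGET) return turns;
  const old = turns.slice(0, -KEEP_RECENT);
  const recent = turns.slice(-KEEP_RECENT);
  const summary = await summarize(old);
  // A 17-day thread would pass through this path many, many times.
  return [{ role: "assistant", content: summary }, ...recent];
}
```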
Sarah
The other thing that reminded me about fine tuning, which I think you and I have aligned on, is that our tools change really frequently, and right now we spend a lot of time rethinking and building tools for capability. If we were to fine-tune a model, it would either be expertise about the enterprise (we don't have legal expertise or coding expertise), and we have ZDR, zero data retention, offerings for those enterprises, so we'd have to really rethink how we structure things if an enterprise wanted to opt in; or it would be fine-tuning for better capability at navigating our tools, and that doesn't match the velocity with which we create new tools. It would actually really slow us down to have a model fine-tuned on our tools, because we'd have to retrain and cut a new model every time we changed them, and that's not how we're set up right now. I guess we could fine-tune a model to search for tools; it's just the amount of time it takes to do that, ship it, and have the right system. You're basically betting that a frontier capability won't serve the need in the time it takes you to build it, and that time lag hasn't worked out for us yet.
Simon
Yeah, it's just the wrong trade-off. We literally change our tools every single day, and if we notice an issue, we'll fix the problem. A way to think about it that I find pretty fruitful: don't focus too much on training. That's an implementation detail. What's the outer loop? The outer loop is that you have a model, and then some harness or system where it's interacting with your system, and that needs to work. If there's a problem, the way to solve it isn't necessarily to train a model; maybe there's just a bug in one of the tools. Actually, 99% of the time it's a bug in one of the tools, so just fix the bug. The fruitful outer-loop question is: how can you improve your velocity and robustness in making really good tools, making a good harness, and verifying it works?
Sarah
The one place we do invest more in model training now, though, is actually retrieval, because we're at a point in our business where, on enterprise and AI-enabled plans, the majority of the search load and search traffic is coming from agents, not humans. The queries hitting our Elasticsearch or our vector indices aren't coming from humans: they're structured differently, and what's returned has different requirements. Positional ranking matters less, but top-K retrieval matters more.
Host/Interviewer
Isn't top K a form of position?
Sarah
Of course it is, but when you're training on something like click-through rate, getting positions one through six right is very different from something just needing to be in the top 100.
Host/Interviewer
Like the slope is just higher.
Sarah
It's a different optimization for a retrieval model. Similarly, which snippet you include matters more or less. So we're rethinking a lot of that functionality to work with how the agents like to write queries and how they want to receive information. We're doing another reinvestment into rethinking not only search, how agents search versus how humans search, but also indexing different things now. How do you index the setup generator for the Notion agent? It breaks our block model entirely, where all blocks are nested in each other. Same with meeting notes. So we're hiring ranking engineers and model training engineers, but it's primarily on ranking.
Host/Interviewer
Yeah. Does ranking map to RecSys for you? Recommendation systems?
Sarah
Yeah. Yes.
Host/Interviewer
Okay. I've been saying this a bit: I'm trying to promote RecSys more in general, because it's weirdly unpopular.
Sarah
I don't know why. The other thing, and I was just talking about this with a peer, is how much ranking matters versus being able to do parallel, exhaustive queries. They're both important, but they're two tools toward the same user outcome, or the same agent outcome, right? That's something we're also rethinking a lot. We just did an experiment on Notion ranking; at this point, for Notion retrieval, vector embeddings matter less and less.
Host/Interviewer
Did you see that?
Simon
Yeah.
Alessio Ponder
We've had Notion open for so long, it went dark mode.
Sarah
We're working the night shift. Right.
Simon
Looks pretty. Do you want to see any bugs?
Host/Interviewer
I worked on this parallel search thing where you fan out to eight different queries, right? And so you actually need the model to work on query diversity, so that you get maximum coverage of the search space.
Sarah
And so the people working on ranking and retrieval are the same people working on query generation. It's all one journey; we call it agentic find. And we're actually realizing, for instance, that it's less about selection. We don't spend a lot of time trying to optimize which vector embedding we use anymore. There was a period of that, but it's just not the right level for optimization.
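Here is a minimal sketch of that fan-out pattern: generate diverse queries, search them in parallel, then merge and dedupe. `generateDiverseQueries` and `searchIndex` are stubs standing in for a model call and a search backend; neither is a real Notion API:

```typescript
// Hypothetical sketch of parallel search fan-out with query diversity.

interface Hit { id: string; score: number; snippet: string; }

async function generateDiverseQueries(question: string, n: number): Promise<string[]> {
  // In practice this would prompt a model to produce n queries covering
  // different phrasings, synonyms, and sub-questions.
  return Array.from({ length: n }, (_, i) => `${question} (variant ${i + 1})`);
}

async function searchIndex(query: string, topK: number): Promise<Hit[]> {
  // Stub: a real implementation would hit Elasticsearch or a vector index.
  return [];
}

async function fanOutSearch(question: string): Promise<Hit[]> {
  const queries = await generateDiverseQueries(question, 8);

  // Run all eight searches concurrently.
  const resultLists = await Promise.all(queries.map(q => searchIndex(q, 20)));

  // Merge and dedupe by document id, keeping each doc's best score.
  const best = new Map<string, Hit>();
  for (const hit of resultLists.flat()) {
    const prev = best.get(hit.id);
    if (!prev || hit.score > prev.score) best.set(hit.id, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```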
Host/Interviewer
Okay, we've gone long. I have to talk about Notion meeting notes, and then we can call it there. You just have a lot of comments. I don't know where you want to start. Is it the audio side? Is it the summarization?
Simon
What makes it work?
Host/Interviewer
No, just anything interesting, technically. I think you had some bookmarked points. I always call these check marks along the way: when the guest says something they want to return to later, I just check-mark it, and I'm like, okay, we'll go back to it.
Sarah
Meeting notes was one of those things where at first we were nervous that we'd have to teach people a different way to work, and that there'd be a lot of user friction. But they're one of our biggest growth levers; in terms of virality, adoption, and retention they're quite strong, and so we've invested more and more as we saw that. What's really powerful about it is, again, that Notion is the system of record for where and how you work. The way I use meeting notes: every one-on-one meeting I have is a meeting note. When I do my performance self-review, I say, primarily look at all my conversations with my manager and write up what I did this year. Because if I didn't talk about it in my one-on-one with my manager, it probably wasn't relevant for my performance review. So it adds a ton of signal on prioritization, which is really helpful for a good system of record, and really helpful for our agent. It's also caused a lot of scaling work for search and for the agent; it's just an explosion of content when you have transcripts like that. A lot of how we do compaction was triggered by meeting notes being passed into context, things like that. So it's been a good impetus for us to think about longer-form content as a first-class primitive. And it's been one of the most powerful signals for our agent, which is unsurprising, right?
Host/Interviewer
You're capturing a whole new thing.
Sarah
So it's their own data. We want users creating their own data flywheel, right?
Host/Interviewer
Like it serves me to prefer notion to put all my stuff because it has my other stuff.
Sarah
Totally. The way our teams run right now: there's a custom agent that does a pre-read before standup. It looks through all of Slack and GitHub, creates a summary, creates a meeting note, and says, everyone, do this pre-read. Then we just press play, have the meeting, talk through the pre-read and what needs to happen next. Then a custom agent integrated with our calendar and triggers files tasks for tomorrow or today based on what we spoke about, and sends off the Slack messages we decided in the meeting needed to be follow-ups. Our meetings are hands off keyboard, and we're focused on the root of the problem, not the bookkeeping around the problem.
Simon
One thing the meetings team added recently that's been blowing my mind: they made it so that when it makes a summary, it will actually mention the people referenced in it. So I now get notifications whenever someone
Sarah
talks about me in a meeting. Yeah, I feel like that one was a good one.
Simon
It's like, oh, Simon is working on this.
Host/Interviewer
Okay, I'm gonna.
Simon
It's actually amazing how this is. Then I'm like, oh, okay, cool. I'm gonna talk to them about that.
Host/Interviewer
What if they're two Simons?
Sarah
No, wait. It's powered by the agent, so it's doing agentic disambiguation. I don't know if this is shipped yet; it will be. If you look at it thinking when it's doing the summarization, it's figuring
Host/Interviewer
out who Simon is. The most probable Simon.
Sarah
And we also have people-to-people similarity caches and things like that on the attendee list.
Simon
We also generate a profile for each person and use that. Of course it can get it wrong, but the goal is for it not to get it wrong.
Sarah
Meeting Notes is just the agent primitive packaged on top of a transcription primitive, plus a vertical team. It's probably one of the only teams at Notion that's a completely vertical team around quality and product, including UX design, because it's still a tiger team, with a fantastic manager who joined recently from Embra: Zach Tratar.
Host/Interviewer
Yeah. Yeah, I chatted with him when he was talking about his working on Embra.
Sarah
Yeah. So he's managing that team now and thinking about it as data capture. That's what meeting notes is: data capture. The reframing is that meeting notes are valuable as a data capture problem, and then you work inward. The summarization used to not be agentic; now it is, because it does things like figure out who the right Simon is. And one day you could have a custom agent directly integrated into it that knows which task database the meeting refers to and, as you're having the meeting, updates the tasks, things like that. There's a lot of that experience of doing our work in meetings that we want to invest in making more seamless.
Simon
Yeah.
Host/Interviewer
OpenAI is doing hardware. Would you ever ship one of these? Yeah, probably not, but this is meeting notes in person.
Simon
Yeah, I'd be excited. I'm excited about the product category in general, for sure.
Sarah
I think it's a mechanism, and one of those needs to work really well with Notion. We would partner with whoever is building one of those.
Host/Interviewer
They would get bought by Amazon. I don't know, I can't refer you.
Sarah
And there are some wild companies doing really cool things that come to our partnerships team. I always like to sit in on the wearables demos, because I think they're pretty cool. All of them want to make sure, not just with Notion, but you can imagine the ones that talk to you being able to do search and build context. So if you're entering a conference, being able to look at your CRM and do things like that, utilizing the Notion agent. We're at the very beginnings of those partnerships. What's unique about that particular technology is it goes against what I said about custom agents: the simpler it is, the harder it is to have advanced controls over its capabilities. So that would be a great investment for data capture, but not necessarily for our agent, which is about workflows.
Simon
It's a little bit of a different slice of the problem. That's going to be deeply personal. Your company's not going to force you to wear a wristband.
Sarah
It's good for me to hear that from you.
Simon
The CEO is going to force everyone to wear a wristband. The slice of the problem that we care about is can the company have all the context of what everyone said at every single meeting and then use that to drive value for themselves.
Sarah
That kind of reminds me: I remember you once very strongly reminded me that our job is not to make the best harness for agentic work; our job is to be the best place where people collaborate. In the same way, our job isn't to build the best wearable to capture meeting notes; our job is to build the best place where meeting notes live.
Simon
Yeah.
Host/Interviewer
So basically you're saying everyone else can just pipe into you and it's fine, right? Yeah, that's a reasonable thing. All I will say is that there are people walking around with Notion tattoos. They'll wear Notion anything. So, I don't know, do a limited run.
Sarah
We have such understated swag; our swag has so few Notion logos on it. The idea that people have Notion tattoos is pretty antithetical to our design principles, so that's pretty funny. Do you have one?
Simon
No, definitely not. I do not have a Notion tattoo. I've seen them, though.
Host/Interviewer
Yeah. Cool. Thank you so much. This was such a great deep dive, actually. The chemistry between you two is amazing. I can't believe... well, you work together a lot.
Sarah
Different jobs, but we work closely.
Simon
Yeah.
Alessio Ponder
That's it. Yeah. Thank you, thank you, thank you.
Guests: Simon Last & Sarah Sachs, Notion
Host: Latent.Space (with Alessio Ponder and Twix)
Date: April 15, 2026
In this vibrant, deeply technical conversation, Simon Last and Sarah Sachs from Notion join Latent Space to walk through the past, present, and future of AI agent infrastructure at Notion. They lay out the company's multi-year journey building agentic workflows, what it means to be a "software factory" company, the culture and organizational principles behind rapid iteration and frequent rebuilds, and the technical tradeoffs between MCP (the Model Context Protocol) and CLIs for agent orchestration. The episode provides an honest, behind-the-scenes chronicle of building Notion's AI capabilities, with foundational insights, hilarious stories, and real-world lessons included.
Multiple Rebuilds, Shifting Paradigms:
Notion's AI capabilities have been rebuilt "five times," tracking industry changes—beginning with their own function-calling framework, through custom tool calling, to embracing agentic workflows as models matured.
"It's probably the fourth or fifth time that we rebuilt that." — Simon [01:46]
Early Struggles with Models:
Early attempts with tool calling failed due to insufficient model capabilities and context length. Big unlocks came only with the release of newer models like "Sonnet 3.6 or 7."
"At the time ... we designed our own tool calling framework and ... tried to fine tune the models ... but it never felt quite robust enough to be like a useful, delightful thing until ... Sonnet 3.6 or 7." — Simon [02:35]
Staying Two Steps Ahead:
Product launches, especially agents, are the culmination of cycles spent iterating and learning—often being "two or three milestones ahead."
"You have to be. You can’t get complacent ... But there’s a lot to build." — Sarah [00:59]
Engineering Leadership Culture:
Leadership at Notion prioritizes low-ego, fast-changing, team-oriented approaches. The willingness to throw away code and iterate is a cultural constant.
"You need to build a team that’s comfortable deleting their own code and is very low ego ... That culture comes directly, I think from Simon and Ivan." — Sarah [11:17]
Prototypes over Design Docs:
Designers and engineers operate through live prototypes, not specs—their “Design Playground” repo means new features are demoed in full, not just mocked.
"We don’t do mocks, they just make like full prototypes ... and it’s become actually quite sophisticated.” — Simon [17:01]
Swarm Team Model:
Org structures form after products ship, not before: "we form org structures after we ship things, not before." — Sarah [13:13]
Coding Agents as AGI’s Kernel:
Notion sees the coding agent loop—where agents debug, merge PRs, and maintain services—as the core of future software engineering and AGI development.
"I think coding agents are the kernel of AGI. Everything is a coding agent." — Simon [04:45]
Software Factory Automation:
The long-term vision is workflow automation where agents coordinate, debug, maintain, and improve codebases with minimal human input.
"Can you create ... an as automated as possible workflow for developing, debugging, merging, reviewing and maintaining a code base and a service where there's a bunch of agents working together inside?" — Simon [04:45]
Human Supervision & Reframing the Role of Engineers:
The panel discusses the shifting definition of software engineers, with humans moving up the abstraction ladder to supervise agent flows, not write every line.
"We're just moving up the abstraction ladder and then the human role becomes more about observing and maintaining the outer system." — Simon [27:04]
CLIs for Power and Self-Repair:
CLIs are favored for debugging, bootstrapping, and progressive disclosure of tools within a rich terminal context.
“If there's an issue, the agent can debug and fix itself within the same environment.” — Simon [37:51]
MCP for Safety and Simplicity:
MCP provides strong permission boundaries, deterministic task execution, and often lower per-task cost, useful for lightweight, highly controllable agents.
“MCP inherently has a really strong permission model. Like, all you can do is call the tools a CLI is a little bit murkier ... MCP is just like the dumb simple thing that works and it's pretty good.” — Simon [38:44]
Strategic Commitment to Both:
Notion will continue investing in both paradigms to meet customer needs in different environments.
“We will always support our MCP insofar as other people are using mcps ... we've put a lot of effort into our MCP and we have a fantastic team that we're building.” — Sarah [38:56]
Composability as a Feature:
Agents coordinate via databases, pages, and direct invocation—e.g., a “manager agent” overseeing 30+ subordinate agents, aggregating notifications.
"We set them up with a manager agent ... creates a layer of abstraction. So instead of 70 notifications per day, it's like five." — Simon [33:55]
Memory as a Native Notion Primitive:
Agents use Notion pages/databases for memory and state — no special constructs.
"Memory is just pages and databases ... That pattern works extremely well." — Simon [35:16]
Eval Infrastructure as an Agent Harness:
A dedicated team builds out eval systems where agents write their own evals, iterate, and debug test failures.
"We made an effort ... to treat the eval system as like an agent harness." — Simon [25:58]
Frontier Evals:
Evals are classified as regression/unit (in CI), launch quality (report card), and "frontier/headroom" (where Notion actively wants 30% pass rates to probe model limits).
"We have ... frontier or headroom evals where we actively want to be at 30% pass rate." — Sarah [23:11]
Model Behavior Engineers (MBE):
MBEs are hybrid evaluators (data specialist, prompt engineer, PM) who create evals, perform qualitative assessment, and help steer development.
“Model Behavior engineers started with ... data specialist ... now they're primarily building agents that can write evals for themselves or LLM judges.” — Sarah [24:25]
Usage-Based Pricing & Abstracting Over Tokens:
Credits abstract over underlying token, GPU, and service costs. The aim is to match customer needs, not purely maximize usage.
"We had to think of an abstraction above tokens and it's also not just tokens, it's the token model and serving tier trade off." — Sarah [56:32]
Empowering Power Users over Simplicity for All:
Notion made a conscious decision not to minimize complexity at the cost of power—system prompts, tool lists, and agent settings are transparent to favor power users.
"We're not trying to build for everyone here. ... Because the more we do that, the more we just diminish its capabilities." — Sarah [51:31]
Explosion of Agentic Content:
Meeting Notes drove an explosion of long-form data—boosting the value of capture, enabling agent workflows for team standups, summarization, and follow-up task generation.
"Meeting notes was one of those things where at first we were nervous ... they’re one of our biggest growth levers." — Sarah [71:08]
Agents in Internal Workflow:
Internal agents now read Slack and GitHub before meetings, auto-generate pre-reads, and create actionable items from live conversations, making meetings nearly “hands off keyboard.”
“There’s a custom agent that does a pre read before standup. ... Then we just press play, ... and then we have a custom agent integrated with our calendar and triggers that then files tasks for tomorrow or today based on what we spoke about.” — Sarah [72:35]
On Engineering Culture:
"My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on ... And it's a huge disservice if all of those ideas have to pass like the sniff test of what me and a product partner or Simon and Ivan decided." — Sarah [10:33]
On Rebuilding:
“I basically just [rethink everything, rewrite everything] in a loop every six months.” — Simon [12:13]
Past Solutions and Model Evolution:
"We created this whole XML format ... but the model didn’t know the XML format … we're like, okay, it has to be markdown. The models know markdown." — Simon [44:43]
Progressive Disclosure and Agent Harness:
"We hit a bottleneck where our agent worked really well. ... [But] it became pretty hard to add new tools ... So we had an effort basically to make our harness implement progressive disclosure in a nice way. That's a big shift." — Simon [47:36]
On Power vs Simplicity:
“We don’t think our system prompt is our secret sauce.” — Sarah [50:56]
Meeting Notes Adoption:
"It's been one of the most powerful signals for our agent because ... you’re capturing a whole new thing." — Sarah [72:23]
On Evaluating Models:
“Our job is to make sure we have the evals to understand the changes that are important to us ... Not only is that not helpful for our partners, it’s not helpful for us to understand where the stream is going.” — Sarah [22:04]
| Timestamp | Topic/Quote |
|-----------|-------------|
| 00:00 | Simon bullish on MCP for lightweight, permissioned agents |
| 01:46 | Five rebuilds of core agent infrastructure |
| 02:35 | Early tool calling attempts; barriers in model capability |
| 04:45 | Software factory & coding agents as AGI kernel |
| 10:33 | Leadership philosophy: supporting prototype-driven ideas |
| 12:13 | Rapid rebuild cadence, "every six months" |
| 13:13 | Organization forms after shipping, not before |
| 17:01 | Designers building full prototypes / "demos over memos" |
| 25:58 | Evals as agent harness |
| 27:04 | Software engineer's evolving role (up the abstraction ladder) |
| 33:55 | Manager agent oversees 30+ agents, aggregates notifications |
| 35:16 | Memory as pages/databases; no special memory concept |
| 38:44 | MCP vs CLI: tradeoffs, permissioning, cost |
| 44:43 | Abstractions moved from XML to markdown for model alignment |
| 50:56 | Transparency: tools & prompts not secret sauce |
| 51:31 | Designing for power users rather than universal simplicity |
| 56:32 | Credits-based usage pricing as an abstraction above tokens, for fairness |
| 71:08 | Meeting Notes as viral growth lever and agentic data foundation |
| 72:35 | Internal workflow: agents generate pre-reads, create follow-ups from meetings |
Key themes:
- Frequent, fast rebuilds
- Empowering prototyping
- Modular, composable agents
- Eval-driven quality
- Transparent, power-user oriented product
- Pricing fairness
The episode reveals Notion’s conviction that the future of software engineering will be agentic, collaborative, and highly abstracted—with human engineers acting as orchestrators and supervisors of increasingly autonomous, modular systems. Their approach—iterative, composable, and culture-driven—offers a critical window into the mindset and machinery required to thrive as an "AI-native" product company. Explicitly not chasing “the lowest friction for everyone,” Notion targets power users, enterprise buyers, and advanced workflows—an unapologetic stance that reflects their confidence in where AI meets software collaboration.
For detailed show notes, visit: https://latent.space