Podcast Summary: Latent Space – "Why the Frontier Ecosystem must be Open"

Guests: Matei Zaharia & Reynold Xin (Databricks)
Date: June 24, 2026
Host: Latent Space
Duration: ~1:09:00

Episode Overview

This episode explores the philosophy and technical rationale behind keeping the AI ecosystem open, as discussed with Databricks co-founders Matei Zaharia and Reynold Xin. The hosts dive deep into recent launches around Databricks’ AI Summit—highlighting new open-source frameworks for agentic workflows, security and governance in AI engineering, and innovations in unified database engines. The conversation unpacks Databricks' evolution, their open-source-first approach, competition in the cloud data platform space, and why future AI and data infrastructure must remain open and interoperable.

Key Discussion Points & Insights

1. Origins and Culture of Databricks

From Meetups to Community: Databricks’ first summit was a 50-person meetup at Berkeley and has grown to 100,000+ global participants.
- Quote: “Now it’s like…100,000 people around the world, 30,000 in person. It’s a crazy community.” (Host, 00:23)
Leadership Culture: Co-founder and CEO Ali Ghodsi is lauded for his growth, humor, and high EQ/IQ mix.
- Quote: “Ali today is quite different from Ali 10 years ago. There’s a lot of work put in to get to this point.” (Reynold Xin, 01:09)

2. Omnigen: Open Agent Infrastructure

Genesis and Motivation: Internal pain points led to building Omnigen, a unified architecture for collaborative, agent-driven coding and AI workflows.
- Multiple agentic frameworks sprang up internally; the need for standardization and collaboration became clear.
- Quote: “All the customer ones were running into this thing of like, ‘I need to switch model and harness every few months…’ Plus, the agent is useless if you can’t share sessions…” (Matei Zaharia, 02:22)
Design Philosophy:
- Orchestrates both coding and custom agents over a uniform, composable API.
- Emphasis on collaboration, session persistence, security, and plug-and-play extensibility.
- Quote: “If you think it’s a layer that will actually…benefit from many people collaborating on it, that’s a reason to open source.” (Matei Zaharia, 10:05)
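As a rough illustration of the uniform API described above, the sketch below follows the shape Matei outlines in the transcript: a session accepts a message, streams back text and tool-call events, and can be cancelled. It is a minimal sketch under stated assumptions, not Omnigen's actual interface; the names Event, Harness, EchoHarness, and Session are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class Event:
    kind: str      # hypothetical event kinds: "text" or "tool_call"
    payload: str


class Harness(Protocol):
    """Anything that can turn one user message into a stream of events."""

    def run_turn(self, message: str) -> Iterator[Event]: ...


class EchoHarness:
    """Toy stand-in for a real harness adapter (hypothetical)."""

    def run_turn(self, message: str) -> Iterator[Event]:
        yield Event("tool_call", "search_files")
        for word in message.split():
            yield Event("text", word)


@dataclass
class Session:
    """One agent session: send a message in, stream events out, cancel a turn."""

    harness: Harness
    history: list = field(default_factory=list)
    _cancelled: bool = False

    def send(self, message: str) -> Iterator[Event]:
        self._cancelled = False
        for event in self.harness.run_turn(message):
            if self._cancelled:
                break  # stop streaming once the caller cancels the turn
            self.history.append(event)  # persisted so the session can be shared
            yield event

    def cancel(self) -> None:
        self._cancelled = True
```

Because callers only see Session, swapping EchoHarness for an adapter around a different model or harness would not change any calling code, which is the portability argument made in the episode.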
Community Engagement: Immediate adoption after open-sourcing: “Already got 400 merged [pull requests]—about half not from our team.” (Matei Zaharia, 12:00)
- Community contributions for Kubernetes support, cloud sandboxes, new agent harnesses, and UIs.
Security & Contextual Policies:
- Omnigen integrates session-based, contextual policy enforcement—security decisions driven by stateful agent behavior, not one-off yes/no permissions.
- Example: “If it installed a one-day-old package or read a thousand confidential docs, then don’t let it push to the website.”
- Tracks and limits session spend (‘token maxing’), allowing guardrails on agent costs.
- Quote: “I can literally say, okay, launch a sub-agent to do this and cap it to spending five dollars…because we’re counting that within that session.” (Matei Zaharia, 21:26)
Batteries-Included and Composability: Out-of-the-box usability remains critical, but designed for deep extensibility—anyone can build libraries of policies or custom integrations.
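The contextual-policy idea above can be sketched in a few lines: decisions depend on what the session has already done, not on a fixed allow/deny list. This is an illustrative sketch only; the event shapes, field names, action names, and thresholds are assumptions chosen to mirror the examples in the episode, not Omnigen's real policy API.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """State accumulated over one agent session (hypothetical fields)."""

    installed_fresh_package: bool = False
    confidential_docs_read: int = 0
    spend_usd: float = 0.0


def observe(state: SessionState, event: dict) -> None:
    """Fold one event into the session state."""
    kind = event["kind"]
    if kind == "install_package" and event["package_age_days"] <= 1:
        state.installed_fresh_package = True
    elif kind == "read_doc" and event.get("confidential", False):
        state.confidential_docs_read += 1
    elif kind == "model_call":
        state.spend_usd += event["cost_usd"]


def decide(state: SessionState, action: str,
           spend_cap_usd: float = 5.0, doc_limit: int = 1000) -> str:
    """Return "allow", "deny", or "ask" based on session history."""
    if state.spend_usd >= spend_cap_usd:
        return "ask"  # budget reached: ask the user before continuing
    risky = (state.installed_fresh_package
             or state.confidential_docs_read >= doc_limit)
    if action == "publish_website" and risky:
        return "deny"  # risky history plus an exfiltration-capable action
    return "allow"
```

The same action, publishing to the website, is allowed in a clean session and denied once the session has done something risky. That is the usability and security trade-off the guests describe moving.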

3. LTAP & The Dream Database Engine

The Evolution of Data Stacks:
- History of splitting transactional (OLTP) and analytical (OLAP) systems, with brittle integrations between them (e.g., CDC pipelines).
- The rising need for unification, but prior attempts (HTAP) often forced bad compromises.
LTAP: A Pragmatic Approach:
- Unifies storage (row+columnar) while keeping query engines (OLTP & OLAP) separate, maximizing compatibility and avoiding major ecosystem disruption.
- Uses object storage in columnar formats (e.g., Parquet) for real-time analytics and transactional workloads.
- Quote: “We’re not collapsing databases at the query layer, just collapsing the indexing layer. That’s an important part.” (Reynold Xin, 51:31)
Technical Innovations:
- Transcoding between row and column formats at ingestion, leveraging idle CPU in storage fleets for zero overhead.
- Builds on open standards—no proprietary formats—enabling easier incremental adoption and broad compatibility.
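To make "transcoding between row and column formats" concrete, here is a generic sketch of the idea in plain Python. It does not reflect how LTAP actually implements ingestion or which on-disk layout it uses; the function names are hypothetical, and it assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def columns_to_rows(columns):
    """Inverse transform: rebuild row dicts from column lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


orders = [
    {"id": 1, "store": "A", "amount": 9.5},
    {"id": 2, "store": "B", "amount": 3.0},
]
by_column = rows_to_columns(orders)
```

The row layout suits point lookups and updates of a single record, while the column layout lets an analytical scan read only the fields it needs (for example, summing by_column["amount"] without touching the other columns).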
Product Development Culture:
- Rapid prototyping, incremental deployment, and an internal mantra: “Who’s the target customer, and are you texting with them?”
- Culture enables major architectural shifts—“someone just tried it, and it worked”—by empowering senior engineers to experiment.

4. Competition and The Open Future

Open Ecosystem vs. Proprietary Platforms:
- Databricks prioritizes open formats (Parquet, Delta, Iceberg)—in contrast to rivals (e.g., Snowflake’s initial proprietary approach).
- AI and data are closely linked—platforms that integrate both (openly) will win future workloads.
- Quote: “Start open and start large… easier to go from broad to fast and specialized than the other way around.” (Matei Zaharia, 55:04)
- Increasing enterprise preference for open formats due to past nightmares with lock-in.
AI Model Strategy:
- With the acquisition of MosaicML, Databricks can both run/fine-tune custom models and orchestrate first-party agents.
- Focus moves toward utility (“Genie” as virtual data scientist agent) and specialized sub-agents (e.g., document parsing) rather than competing in the raw “frontier LLMs” arms race.
- Customizing models (via RL, synthetic data generation, etc.) will become progressively easier.
- Quote: “The ease of training algorithms is only going up…there’s a question of when it crosses into mainstream.” (Matei Zaharia, 61:50)
Data as Strategic Asset:
- Reinforces Satya Nadella’s “context is the new oil” thesis. Data, and the ability to take action on it rapidly and securely, is what unlocks AI’s promise.
- Security, governance, and spend management (not just model choice) are now first-class engineering problems.

5. Culture, Leadership, and Strategy

Startup Wisdom:
- Build for actual customer needs, not technical ambition—overfit to real use cases first, pivot as needed.
- “The downside of overfitting is much smaller than the upside. If you try to boil the ocean, you might have no customer at all.” (Reynold Xin, 41:00)
Enterprise Mindset vs. Tech Startups:
- Enterprises require deep governance, security, and non-DIY solutions; tech startups are more prone to “I’ll build it myself” mentality.
- “There are people who want to bridge technology—they don’t want to learn databases...as interesting as we think it is.” (Matei Zaharia, 43:10)
Internal Decision-Making:
- Empower teams, tolerate experimentation, drive incremental releases, maintain cohesion, and avoid unnecessary product sprawl.
- Be open to bottom-up innovation with an eye on real customer impact.

Notable Quotes & Moments

On Open Sourcing Agent Infra:
“Imagine our thing wasn’t open. We had some kind of agent hosting thing, but it’s not open. And then there is an open one—which one’s going to win in the long run?” (Matei Zaharia, 10:05)
On Agent Security Features:
“The thing we decided we need is stateful, or what we call contextual, policies where you keep track of the state of that session.” (Matei Zaharia, 18:10)
On Evolving the Data Stack:
“The thesis of LTAP is…we’re not collapsing the databases at the actual query layer, we’re just collapsing the indexing layer.” (Reynold Xin, 51:31)
On Company Culture and Innovation:
“Instead of building ‘boil the ocean’ for everything, let’s figure out how to do it incrementally, how do we do it very quickly?” (Reynold Xin, 38:33)
On Why Databricks Outpaced Snowflake:
“Probably the biggest fundamental difference…is open. Like Databricks has never had the proprietary format.” (Reynold Xin, 53:02)
On Founders and Leadership:
“Ali’s the perfect combination of IQ, EQ, technology obsession, execution, business acumen…” (Reynold Xin, 56:52)

Segment Timestamps for Key Topics

00:00–02:22 — Databricks’ founding, growth, and early leadership style
02:22–15:20 — The architecture, motivations, and open sourcing of Omnigen
15:20–21:26 — Cloud sandboxes, session persistence, and security policy innovation
21:26–25:23 — Policy APIs, spend tracking, batteries-included philosophy
25:23–27:30 — Building for developers, feedback loops, startup/collaboration tips
27:30–39:34 — Datastore architecture: OLTP/OLAP, CDC pain, the modern data stack, LTAP explained
39:34–43:41 — Product culture, customer-driven innovation, differences between tech and enterprise customers
43:49–49:47 — The “dream engine”, second-system effect, and building a next-generation database
49:47–56:38 — Incremental upgrades, competition with Snowflake, the importance of open formats
56:38–61:50 — Leadership, strategy, and MosaicML’s integration/future of models at Databricks
61:50–67:09 — Custom model training, agent and data synergy, enterprise AI adoption
67:09–End — Reflections on Databricks’ journey, conference dynamics, and VC advice

Where to Learn More / Get Involved

Omnigen Repository & Docs: Reference GitHub and documentation for contextual policies and contributing guidelines.
Discord & Community: Join Databricks’ Discord, engage on GitHub, and provide feedback on agent workflows.
Databricks Data + AI Summit: Review keynotes for product deep dives, especially on security, governance, and new engine architecture.

Concluding Insights

This episode delivers a behind-the-scenes look at how Databricks is striving to build the AI and data platform of the future: open, composable, collaborative, and powered by real-world data and engineering pain points. Their commitment to openness, efficient incremental iteration, and real customer needs emerges as the thread that ties together both their technical and strategic decisions—and serves as a call for the broader AI community to keep the frontier ecosystem open.

Transcript

[00:04]
Host
Matei and Reynold from Databricks, welcome to Latent Space.
[00:07]
Reynold Xin
Thanks for having us.
[00:08]
Matei Zaharia
Yeah, thanks so much.
[00:09]
Host
Thanks for taking time out. You have your Databricks Data + AI Summit going on. You were just telling me how the first summit that you guys ran was just 50 people.
[00:17]
Reynold Xin
Yeah.
[00:17]
Host
In Berkeley.
[00:18]
Reynold Xin
Little meetup at Berkeley, I think.
[00:20]
Matei Zaharia
Put together these tutorials and yeah, just teach people Spark.
[00:24]
Host
Yeah. Obviously now it's like, I think the headline number is like 100,000 people around the world, 30,000 in person. It's a crazy community. Well, I mean I just saw the keynote. Ali is just. Did you know, was it obvious back when that Ali would be like such a great CEO, like such a great person?
[00:43]
Reynold Xin
What do you think?
[00:44]
Matei Zaharia
I mean, I think among our group of founders it was clear that I think he'd be the best at this and yeah, it turned out great. I mean he's ramped up on so many topics, growing a company. He would just go in and study it and you know, talk to all the experts. Like even if you can't hire the person, you know, learn enough about like finance and sales and whatever it was and you know, and go from there. Yeah, yeah.
[01:10]
Reynold Xin
I mean he's obviously very high IQ and very high EQ, but Ali today is quite different from Ali from like 10 years ago. I think there's a lot of work that he put in to get to this point.
[01:21]
Host
Yeah, I mean, no, I mean to me the most appealing thing about him is that he's funny and it's true. It's hard to make jokes about data security and what have you.
[01:33]
Matei Zaharia
Oh yeah, that's for sure.
[01:34]
Host
Yeah. So you guys launched a whole bunch of things. I'll just sort of name check briefly the stuff because we're not going to cover everything. Omnigen, your baby; LTAP, your baby, your dream engine. We're also going to cover Genie Cover Customer Lake, you acquired Panther Open Sharing and there's Unity AI Gateway. A lot of these I think are things that you would expect a Databricks to do. It's like part of the roadmap. Everyone in your category has similar things, but I think probably the two of you are leading the two most unique and differentiated initiatives in the landscape. Maybe we'll start with Omnigen and we'll go into it. I do think that a lot of people are exploring this sort of meta-harness concept. What led you to it?
[02:23]
Matei Zaharia
Yeah, there were actually a couple of converging lines, which I think is a good sign that you need something new. So on the one hand there's all the coding agent infra. Internally we have a really great dev infra team. They built something called Isaac that's basically like a wrapper on Claude Code and Codex and lets you use them either on the web in sandboxes or just on your dev machine or on your laptop or whatever. And then they were adding all kinds of stuff there, and we saw all the sort of more advanced engineers were building their own workflows with tons of agents, and they were building their own UIs and stuff even on top of that. And then the other one was us building agents. We ship this data science agent called Genie. On the research team, which I co-lead, we also built a lot of internal ones for various things, and then we have all the customer ones, and all of them were running into this thing of like, oh, I need to switch model and harness and so on every few months. Plus the agent is completely useless if you can't share sessions with someone and have history and have search and all this layer on top of it for collaboration. I thought a bit about it from both contexts, and at first people thought it was weird. Why are you doing coding agents and custom agents in the same thing? But I said it's basically the same problems, and you just want to build the stuff that lets you deliver the agent, maybe control it if you care about security, and make it portable across things. And then we prototyped some things as experiments. We said, yeah, actually we can make it work. And then we sort of built it for real.
[04:07]
Host
I'm wondering if this kind of, let's call it architecture maps to anything in your careers in the past. I always think about how a lot of things actually just tie back to operating systems. A lot of operating systems tie back to databases or the other way around.
[04:22]
Matei Zaharia
So the thing I, I do think it ties a lot to network protocols, Internet protocol, communication between entities. Yeah, we did stuff with data sharing also, which is probably most viewers probably won't know unless they're.
[04:36]
Host
Yeah, open protocol is the term open sharing.
[04:39]
Matei Zaharia
Open sharing, yeah. So it's like you have a company, you maintain some kind of table, like let's say like Walmart or something. They have like inventory and what's been sold in each store and then you also have suppliers and they would love to produce more things and ship them like exactly the moment you need them. So they would love real time access to your table. So instead of sending emails around or Excel sheets or phone calls, why can't you share a view of that table in real time with them. Then they query, they join it with their data, and they decide what to send. So it's one of these things where you might ask, today, since we can vibe code anything so fast, why do we even need to design protocols or APIs or software? Can't you just vibe code things on demand? But actually, for this type of interoperability, where multiple parties that are moving at different speeds are building stuff and you still want some layer on top to coordinate, you do want to design it and build it. So it reminds me of that. Like agents talking to each other and users talking to agents and tools.
[05:42]
Host
Reynold, any other comments or alternative viewpoints?
[05:46]
Reynold Xin
I think, by the way, we had a debate on exactly which set of benefits would matter a lot. And I think around the time we decided to do this thing, I was telling Matei, hey, it just happened that there's a particular week that I was coding nonstop from the moment I woke up to, like, the moment I went to bed. I was, like, looking at my Claude sessions, my Codex sessions. And one of the things particularly annoying was having to keep my laptop open. I was actually driving to a doctor's appointment. And I remember because I wanted to make sure the whole thing continues working.
[06:19]
Host
By the way, it's so comforting to hear you say that, because I'm like, I don't know if I'm a clown and I'm doing this or.
[06:25]
Reynold Xin
Yeah, honestly, I was driving and I had tethered my laptop to my phone, keeping it on the side. Whenever I hit a red light, I started looking at what's going on on my laptop. And I just felt that was ridiculous. It felt like we went back to the dark ages of programming. I mean, the productivity you gain from all these coding agents is amazing. But have you heard of the cloud? Yeah, it was crazy to me.
[06:50]
Matei Zaharia
The thing we were working on was the sandboxes. Or was this before that?
[06:53]
Reynold Xin
It was a sandbox.
[06:54]
Matei Zaharia
Okay, so you were in.
[06:55]
Reynold Xin
So I was approaching from a very different angle. I wanted to, hey, we gotta have cloud sandboxes that actually don't shut down. You can get one very quickly. But not just for running agentic sessions, it's actually also for running development. So I was actually personally building that that week, and through building that, I ran into all these issues. And then I actually wrote a document for Matei, like, here's my wish list of what the actual environment should do. And I think he actually ended up implementing almost every single one of them.
[07:24]
Matei Zaharia
Yeah, I remember Reynold saying, because my first prototype of this had just chats with your agent and he said, I have to be able to open a shell, like my own shell, and list files and tail them and stuff.
[07:36]
Host
You actually ssh into a mainframe?
[07:38]
Matei Zaharia
Yeah, actually it has that tailing my log.
[07:42]
Reynold Xin
And also another thing I think I asked was I still use Cursor for the sole purpose of rendering markdown files. So I said, just give me a way to see my markdown files and render them properly. I don't need a separate tool anymore. I think you also built that.
[07:57]
Matei Zaharia
Yeah, we did that. Yeah. Yeah, we had a lot of engineers building their own vibe coding setup. But then the other thing they all said is like, hey, I built something that's amazing for me, but no one else on the team can use it because I don't have a server to collaborate. And this is why we tried to set up Omnigen, so you can have a server and have the security set up in there, like log in with Google or whatever and actually securely share stuff. And that's where we've seen a lot of other agents hit things. Like people think they prototyped an awesome agent, but, you know, it's not allowed to connect to like some really important data or whatever because of the security team.
[08:39]
Host
Yeah. At this point, for those watching along on YouTube, we're going to bring up a image of the structure here and we can talk through a little bit of the architecture. I think I just. I just want to have people understand because, like, when we're talking about software, it can be very abstract and like. Like here is actually what we're talking about. You've worked out in open source this entire platform, basically. And there's a runner component and server component with a sort of uniform API that you figured out any other sort of element. And obviously you can plug in all this persistence layers and compute layers. This is a whole cloud. It's an agent cloud.
[09:12]
Matei Zaharia
Yeah, it's got these components to work with it. A lot of the action happens on the machine where you deploy your agent too, so whatever you've got on there, you can run. But yeah, I think it's sort of the minimal thing you want in order to have hosted collaborative agents and to have that server. And one of the reasons we open sourced it is that it gives anyone building agents an app they can start with and customize, which we were seeing in Databricks too. Like, someone would make a nice agent app and then other teams would ask, oh, can I just use yours for my agent?
[09:45]
Reynold Xin
Yeah, I think we had like five or six different agentic frameworks built by different teams. They all do more or less the same thing.
[09:51]
Matei Zaharia
Yeah.
[09:52]
Host
Basically people want to take something that works and fork it, and you might as well have something open source. Yeah, which also was another question which is interesting for a Databricks: what do you choose to open source? What do you choose to make proprietary? This goes back to Spark, right?
[10:06]
Matei Zaharia
Yeah. One of the reasons to open source something is if you think it's a layer where there'll actually be some network effect, it will benefit from many people collaborating on it. So for example with Spark, I don't know if you know, when Spark came out we also focused a lot on letting you have libraries on top. There used to be different distributed computing engines for machine learning and graph computation. We said they should all be libraries that you can compose, and we made it super easy to add connectors to data sources too. And then we benefit because we don't have the time to write connectors to a thousand different databases and file formats but we can just use the ones people make. And of course they benefit from joining kind of this thing. So that's one of these. Another way to think about it is imagine our thing wasn't open. We had some kind of agent hosting thing, but it's not open. And then there is an open one. Which one's going to win in the long run? So here, because there is this benefit from people writing integrations, it'll be that. And then there are other things that you just can't even deliver as open source, that are things the company does. For example, how do you make sure your streaming jobs or your Lakebase database doesn't lose all your data at night? Well, that requires an operational team that's going to sit there. There's no way around it, it has to be a service. So we want to make sure as a company we're really good at those infra services, and then we're as open as we can be in terms of what you build on top.
[11:42]
Reynold Xin
I mean, speaking of benefits, I think we're already seeing pull requests on all kinds of ecosystem integrations even though it was only released on Saturday.
[11:50]
Matei Zaharia
Yeah, Saturday.
[11:52]
Host
Let's see what's going on.
[11:53]
Matei Zaharia
Yeah, you can look at the merged ones. I actually asked Omnigen this morning about the 400 merge.
[12:00]
Reynold Xin
Already?
[12:01]
Matei Zaharia
Yeah, I think quite. I would guess around half are not from our team. But for example someone added support for running it on Kubernetes. People added many cloud sandboxes. So this can launch a cloud sandbox and run your agent in there, which is great for sharing too, because it's not like on your laptop and someone's running scary code on there. So, yeah, many startups have put those in and we expect to see more of them. We also have more agent harnesses already. Cursor CLI and Antigravity also.
[12:34]
Host
Yeah, that's all beautiful. I feel like the last time this happened there was the rise of the modern data stack. I don't know if it was that useful. I'm actually kind of curious about your postmortem. I think most people will agree that it is finally dead, but maybe this gives rise to a new modern AI stack that does the same thing. I don't know.
[12:54]
Reynold Xin
I mean, I think the modern data stack was a pretty useful thing, probably even up until this day. Maybe for the audience who don't actually know the history, I think the modern data stack is effectively decomposed into: you need a layer to ingest the data, you need a layer to transform your data, and then you need a layer to maybe visualize your data. And all of this runs on some sort of data warehouse or, later on, a lakehouse. I think those concepts are all very powerful and very useful. They sort of enable a lot of workloads. What people eventually run into is kind of a question of unification and consolidation: hey, do you really need to chop all this into different pieces and work with so many different vendors and platforms in order to get a very simple visualization done? So I think over time everybody started realizing that. Customers were pushing us, we started realizing it, so we started building more and more capabilities and trying to consolidate. And at the end of the day now, customers don't have to worry about having to hook up five different systems in order to produce a chart. But I think honestly, something like this is probably happening with how many different frameworks you want to hook up together in order to produce a very simple agent.
[14:07]
Matei Zaharia
Just to be clear, I would say the core of this is this common API on top of all the harnesses. So the API is basically like you've got an agent session and you can send in a message or like a file. Basically that's what you can send in and then you get out, you know, these streams as it's streaming text or as it's doing tool calls. Or the other thing you can send in is you can like tell it to cancel a turn. So that's the API. Now the thing we did is we could get you that on top of like Claude Code running in a terminal, Codex, you know, Pi, OpenAI SDK, all that stuff. We map them all to that same interface. So that is something that you'd have to maintain yourself if you build your own like agent orchestrator. And then whenever Claude changes its API you gotta, you know, tweak your thing or it's going to lose some messages. So that's the thing that's valuable to maintain. Then on top of that we built a few apps. I think we built a pretty cool UI and stuff and we built the security and control piece, which I'm excited about. But it's that common interface, so that doesn't try to be a stack. And in fact you could plug in your own UI on top of this server. That's one of the use cases we care a lot about because we want to use this in our own products.
[15:21]
Host
Yeah, it should be everywhere. Yeah, I think one of those things that is really interesting to me is, first of all, I'll endeavor to do everything and not call it the modern AI stack because. Yes. So one of the first people that told me about compute sandboxing was Nikita from Neon, because a lot of people think about Neon as like, well, it's serverless Postgres with the separation of compute and storage and instant branching and all those things. But actually every database company is also a compute company. And so he was actually showing me his sandboxing solution. I don't think he ever launched it.
[15:57]
Reynold Xin
So our sandbox solution, the reason we could build it so quickly was because we realized if you just take the actual Lakebase architecture and remove the database from it. By the way, coming from.
[16:09]
Host
Exactly right.
[16:09]
Matei Zaharia
You have databases already. Yeah.
[16:11]
Reynold Xin
Now there are some differences. For example, in order to support this particular workload, it's important to have local persistence because you want your state to persist. Your libraries, you don't have to install your library every time. Whereas in the Neon architecture, because of the separation of storage from compute, you don't need persistent local disk. So there's some differences, but at the end of the day. Yeah, so this is when you run
[16:37]
Matei Zaharia
like a coding sandbox, like if I use it, with the dev infra we had internally at Databricks, there's many tens of gigabytes of data just for all the source code and artifacts and stuff that I built and I want that to come back next time. So.
[16:52]
Host
But yeah, before the show we were talking about some statistics on the adoption that might be surprising. It could be internal, it could be external, whatever comes to mind. Just to impress on people the scale at which this is happening.
[17:02]
Reynold Xin
So on the analytics side, I think we launch maybe 50 or 60 million virtual machines a day across all three clouds. So we're probably one of the biggest compute orchestrators out there, for sure for CPU compute. And all of this processes, I think, exabytes of data. I joke that, depending on which time zone you are in, typically before you have breakfast, Databricks would have processed exabytes of data already on that day. And on Neon, it's actually pretty interesting too. It's launching, I think, 13 million databases a day now.
[17:35]
Host
Yeah, to me that was like a big.
[17:37]
Reynold Xin
And that's just like.
[17:38]
Host
What do you mean?
[17:39]
Matei Zaharia
Yeah.
[17:40]
Reynold Xin
And a lot of those were thanks to agents and branching experimentation, because we made it so easy and so quick to launch databases. And thanks a lot to Nikita's team. So it's changing the way people use databases. Yeah.
[17:55]
Host
Okay, we're going to go into more database talk in a bit, but I want to make sure we close up anything on Omnigen. You mentioned you were excited about the security and control side. A lot of companies are figuring that out right now, as well as the spend side. What have you found there?
[18:11]
Matei Zaharia
Yeah, so I spent quite a bit of time talking to internal users, developers, security team managers, and also lots of customers. And there's a few things. First of all, one thing that immediately became obvious is for security, there's this tension between usability and security in the way people do it. A lot of coding agents today have very basic things like you can tell me which tool patterns I'll allow or disallow or whatever. It's like yes or no, but that puts you in a very tough spot. So just as an example, should my agent be able to read some confidential documents? Or let's say, should it be able to install new packages from npm, which may be compromised? Yes or no? Maybe I want to allow it. Should my agent be able to publish stuff to the company website? Well, if I'm using it to code on the website, yes. But should it be able to do both, so it can grab a confidential document and be prompt-injected and leak it? Probably not. So the thing we decided we need is stateful, or what we call contextual, policies where you keep track of the state of that session. It's not like is it allowed to push to the marketing site or not, but like, hey, if it did a risky thing, like it installed a one-day-old package from npm or it read like a thousand confidential docs, then no, then don't do it. Otherwise maybe it's okay. That's one example of like moving that trade-off so it's both more secure and more useful by having a more powerful engine. Essentially this requires tracking sessions. The other piece that was interesting there is like there are these very low-level events it's doing and you want some libraries on top that parse them. Like for example, we have an MCP server on Google Drive internally. It's got 60 API calls. How do I know which of those will share a document with stuff on the Internet and which ones won't? It's annoying. 
So we designed the policy layer in Omnigen so that it's functions, and you can have libraries: someone can make something that maps the low-level events to high-level ones, and then you write a policy about the high-level things that came out. Related to Panther... Yeah, Panther will help with that. Panther is kind of a similar idea on the event processing side, and it's Python-based versus a weird custom language. This is sort of more in real time, as things are happening. Yeah, but these are the cool things, I think: the contextual or stateful part, and then the way it can be libraries. And that was another reason to make it open source, because others will write libraries, and we and our customers can use them. And the final thing: because it's stateful, one of the states we track is how much you spent in that session. So I've had, like, I ask an agent to debug something and it spent $500 because it decided to read a lot of log files and burn a lot of tokens. But I can literally say, okay, launch a sub-agent to do this and cap it to spending $5. Like, ask me for permission if it needs more. And because we're counting that within that session, it'll pop up and tell me, okay, you spent $5.
[21:26]
Reynold Xin
One important piece of context here: Matei spent a lot of his time over the last five years architecting Unity Catalog at Databricks, which is the governance layer for data.
[21:35]
Matei Zaharia
That's right. Yeah.
[21:36]
Reynold Xin
And he's sort of combining expertise at that layer together with all the AI governance.
[21:41]
Matei Zaharia
Yeah. But I also spent a lot of time being annoyed by coding agents and getting prompts. And also, as the CTO, I don't want to end up on the front page as, like, I installed some weird npm package and leaked all the code. So I'm especially bad. But also I have very little time. So I don't want to sit there approving, like, do you want to run a 20-line bash script? Yes or no? So that's why I spent a lot of time figuring out how I can make it as safe as possible and not annoying.
[22:11]
Host
Yeah. Is safety and let's call it security a bigger concern than token maxing or token budgets? Which one is like.
[22:19]
Matei Zaharia
Oh yeah, they're both there. I don't know, I guess it depends on the type of company you are. So I think some companies, the budget is limited and they really care about that.
[22:34]
Host
I mean, you can be Uber and still be concerned, you know?
[22:37]
Matei Zaharia
Yeah. Oh yeah, totally. Yeah. Yeah.
[22:39]
Reynold Xin
I mean, for us, security is...
[22:40]
Matei Zaharia
Yeah. For us, security is absolutely critical. As a cloud provider, it's the most important thing. And token maxing, we're not so worried about it yet. But I've seen it. For example, I talked to some consulting companies. They have like 100,000 employees who are all coding for customers. If those each spend an extra thousand dollars a month, that's not fun. We have only a few thousand engineers.
[23:06]
Host
What's the policy at Databricks? Is it just unlimited or...
[23:09]
Matei Zaharia
It's unlimited, but we use our own product to analyze the traces and stuff and we have a team that's looking to optimize and to see if anyone's doing something weird. And we actually had some really cool insights just from analyzing current traces. Which models are better at say R versus like TypeScript or whatever. So yeah. At least in our code base.
[23:31]
Host
Yeah. Amazing. Obviously I have to ask the token maxing question. Obviously I think it's a key thing. But yes, security and control above that and figuring out a sane layer there you can have some autonomy but not too much.
[23:43]
Matei Zaharia
Yeah, yeah. And we want to make it super easy. As an engineer you should be able to set the thing, so in Omnigen you can ask your agent, set a policy on yourself to do this.
[23:52]
Host
If there's something I should be showing, I don't see it on the GitHub
[23:55]
Matei Zaharia
But in the docs, you can look at it later. Just look in the docs on contextual policies if you want to see.
[24:04]
Host
I just like to point out.
[24:05]
Matei Zaharia
Look at the built in policies.
[24:06]
Host
Yeah. If you want to follow up on this, this is exactly where to look. Right?
[24:10]
Matei Zaharia
Yeah. And the story of these is, I wrote a doc with like 10 ideas for things before, as they were working on them. That was my wish list of things people asked for, and I told the team, hey, can you do at least five of these for the launch? And then they just got back with all of them.
[24:29]
Host
Oh wow.
[24:30]
Matei Zaharia
So you can come up with more. But some of them are just meant to be examples really. You can intercept any event the agent is making and you can then either block or force it to ask the user or allow and you can update state to track stuff.
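The three outcomes mentioned here (allow, ask, block) together with state updates can be illustrated with the per-session spending cap described earlier in the conversation. This is a minimal sketch under assumptions: the class `SpendPolicy`, its field names, and the idea of a separate hard limit are hypothetical and are not taken from the Omnigen docs.

```python
from dataclasses import dataclass


@dataclass
class SpendPolicy:
    """Tracks spend for one session and decides on each new charge."""
    soft_cap_usd: float      # above this, ask the user before continuing
    hard_limit_usd: float    # above this, never proceed (assumed extra safeguard)
    spent_usd: float = 0.0

    def on_charge(self, cost_usd: float) -> str:
        """Return 'allow', 'ask', or 'block'; only allowed charges update state."""
        total = self.spent_usd + cost_usd
        if total > self.hard_limit_usd:
            return "block"
        if total > self.soft_cap_usd:
            return "ask"
        self.spent_usd = total
        return "allow"

    def approve_more(self, extra_usd: float) -> None:
        """Called when the user grants permission to spend more."""
        self.soft_cap_usd += extra_usd
```

A sub-agent capped at $5 would be created as `SpendPolicy(soft_cap_usd=5.0, hard_limit_usd=50.0)`; once a charge would cross the cap, the policy returns `"ask"` instead of silently continuing.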
[24:47]
Host
Yeah. Because ultimately I think of you as like a systems designer. You let people plug in.
[24:51]
Reynold Xin
Right.
[24:51]
Host
That's the whole modus operandi of what you do.
[24:54]
Matei Zaharia
Yeah. And we also care a lot about composability. Like, can someone else write a library that others use? Which this is meant to.
[25:01]
Reynold Xin
There's also a batteries included philosophy here. Probably very similar to how you did Spark, which is you could just start using.
[25:07]
Matei Zaharia
Yeah, that's right. It has to be good out of the box at certain things, and then you can build your own things on top that, you know, we don't want to do. But in Spark, if you just want to, I don't know, read a table or do an aggregation, it should be awesome at that out of the box.
[25:23]
Host
Yeah. People who want to catch up on Omnigen, they should watch your keynote. They should go through the GitHub and the docs. If they wanted to contribute or they want to build on this ecosystem, where would you call out as the most high-leverage places to get involved?
[25:36]
Matei Zaharia
Yeah, do get involved in the Discord and on GitHub. Our team is there monitoring, and some of the things people ask for we just built ourselves; some of them we're collaborating with them to build. And also tell us how you would like to use it, because I think especially for developers, everyone wants it to work their own way, and for a really good developer tool you have to hear the feedback on all the ways and figure out the abstractions and how to let people customize. So we'd love to hear it if you think, hey, I don't want it to work this way. Tell us. We really just want to get that compatibility layer across agents and then let you do stuff on top. Yeah.
[26:15]
Host
Is there any, you know, in terms of like the startup side. I'm a founder, I want to, I see an opportunity, I want to get in front of you. What's your request for like a startup that's like, you know, I wish someone was working on this.
[26:26]
Matei Zaharia
Oh, for a startup.
[26:27]
Host
Yeah, like, you know, you got your own startup, it's doing well. Yeah, but like, you know, if you weren't working on your own startup, what is like obvious that you should you advise many startups too? Obviously.
[26:37]
Matei Zaharia
I mean I do think just as a company with a lot of engineers, like anything that helps me make sense of how people are using coding agents and. But also quality or like you should write, you should add this skill or you should write this thing or your agents are really horrible at tasks involving this service. So go spend time. That would be nice. Yeah.
[27:01]
Host
The closest I found is this team git AI.
[27:04]
Matei Zaharia
Oh cool.
[27:05]
Host
Yeah, they started with, like, we will just do code and human attribution, but they're basically building the analytics layer on top of that. I do think there are a bunch. Artificial Analysis is obviously doing super well with their stuff. So there will be people. I think this is the domain of consultants first, but then people actually build software that has the management plane for coding agents.
[27:31]
Matei Zaharia
Yeah, I think there'll be a lot of insights there and you have it in other areas.
[27:35]
Host
Okay, well, and then the other big thing is your dream engine, if you want to tell the story of LTAP and the background. I'm going to make people listen to our Ankur Goyal episode, where we talk about SingleStore, HTAP, and all that history.
[27:53]
Reynold Xin
Yeah, yeah. The LTAP idea is actually pretty simple. So if people have heard Ankur's talk about HTAP, it's effectively the world of databases. Sorry, there's maybe a lot of context that needs to be injected here.
[28:06]
Host
I'm happy to be the database podcast that I'm forcing people to learn your databases. Guys, you cannot vibe code with just markdown files.
[28:14]
Reynold Xin
It's one of the most important fundamental systems technologies out there. But the world of databases is effectively split into roughly two halves. There's what we call OLTP databases, which are transactional. Think of your Postgres, your MySQL, your Oracle databases. And the other side is what we call analytics, sometimes referred to by the term OLAP. And the difference is that on OLTP you typically run some transaction or some event that looks up one specific row, and you update that row. So it's a very row-oriented data structure. And on analytics you're trying to reason on the data. You're trying to compute, hey, what's my revenue per store? How's my website doing every day? And then you eventually probably end up wanting to run machine learning on it to predict, hey, how will my sales be going in the future? They are very different architectures, and everybody starts with OLTP databases. Every app, when it becomes serious enough that it needs more than markdown files, needs to have a database. You don't want to lose your data; you want to have some transactional consistency. But once you want to reason on the data, if you only have like 100 rows, it's probably okay to run it on your Postgres or on your MySQL database. But once you have more data and want to run more complicated analysis, the analysis itself might crush your Postgres database. So you start getting data out of the database and replicating it into the analytics systems.
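The two query shapes contrasted above can be seen side by side in a small sketch. It uses Python's built-in `sqlite3` only for convenience; the `orders` table and its columns are made up for illustration and have nothing to do with Databricks' systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, store TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "north", 20.0, "placed"),
        (2, "south", 35.0, "placed"),
        (3, "north", 15.0, "placed"),
    ],
)

# Transactional (OLTP) shape: look up one specific row by key and update it.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (2,))

# Analytical (OLAP) shape: scan many rows and aggregate them.
revenue_per_store = dict(
    conn.execute("SELECT store, SUM(amount) FROM orders GROUP BY store")
)
status_of_order_2 = conn.execute(
    "SELECT status FROM orders WHERE id = 2"
).fetchone()[0]
```

With three rows both queries are trivial; the point in the conversation is that the second shape touches every row, so at scale it competes with the first for the same database.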
[29:40]
Host
Yeah, which for people, Elasticsearch is like a big...
[29:43]
Reynold Xin
Yeah. So some of them actually get it into Elasticsearch for, like, log analysis. A lot of our customers obviously get it into Databricks to run more sophisticated things. And there's this term called CDC, which is change data capture. What it does is read the binlog of the database. And if you don't understand what the binlog is, fine, but it's a little delta of the data, and CDC reconstructs, based on the delta, the state of the database on the analytics side. But CDC is a very painful thing. It's basically standard in the industry, everybody uses it, but I think many data engineers end up being woken up at 3am because there's some pipeline thing.
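As a rough illustration of what "reconstructing state from deltas" means, here is a minimal sketch. The change-record shape (`op`, `key`, `row`) is a simplified assumption for this example; real binlog and CDC formats are considerably richer and differ between databases.

```python
def apply_change(replica: dict, change: dict) -> None:
    """Apply one change record to an in-memory replica keyed by primary key."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        # Merge the changed columns over whatever the replica already has.
        replica[key] = {**replica.get(key, {}), **change["row"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


# A tiny change log in commit order (hypothetical records).
change_log = [
    {"op": "insert", "key": 1, "row": {"store": "north", "amount": 20.0}},
    {"op": "insert", "key": 2, "row": {"store": "south", "amount": 35.0}},
    {"op": "update", "key": 1, "row": {"amount": 25.0}},
    {"op": "delete", "key": 2},
]

replica: dict = {}
for change in change_log:
    apply_change(replica, change)
```

The replica is only correct if every record is applied, in order, with a compatible schema, which is one reason such pipelines are fragile in practice.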
[30:23]
Host
My explanation is, like, everybody became a five-billion-dollar company just doing CDC.
[30:27]
Reynold Xin
Exactly. It's one of the most boring, but one of the most fundamental, operations powering modern society. But it's so brittle that we joke that it should be called continuous data corruption, because you might change your schema on your OLTP database, and then the CDC pipeline fails to handle the schema change, and then everything goes out.
[30:51]
Host
And I mean there's all sorts of tricks that you can do. Like you add in like some versioning or whatever. Yeah, yeah.
[30:55]
Reynold Xin
But it's, in general, very complicated. I think at my keynote I asked the audience to put up their hand if they love their CDC pipeline. Only maybe two people put it up. So with SingleStore, about maybe a decade ago, I think the industry had this idea, hey, what if I built a single database that can handle
[31:11]
Host
both workloads, which, by the way, every database person ever has always dreamed about.
[31:16]
Reynold Xin
This is the holy grail of database engineering. Why not build a single system that can do both of these? But it ends up just being a lot of compromises. I think one of the first issues is that, hey, let's say Postgres has a massive ecosystem.
[31:33]
Matei Zaharia
Right.
[31:33]
Reynold Xin
You want to be using the tools that are built for Postgres, and Spark, for example, has a massive ecosystem. There are a lot of libraries you want to use. If you were to create a new thing now, you don't have an ecosystem, you tend to create a new, smaller, proprietary API, and you're lacking both. And it's also very difficult to make it comparable on performance on either side. So it ends up actually sucking at both. And our whole idea of LTAP, which is obviously a wordplay on the term HTAP, is that we think this is HTAP done right. HTAP wants to build a single engine for both. We think you can get 99% of what you need by unifying the storage and just having a single storage layer. And once you have the single storage layer, if your Postgres databases are writing data in a column-oriented format, everything in analytics can just go read that data directly without any delay. Right. There's no pipeline in between. So all the data will immediately be available for reasoning and analytics. I think I was telling some customers earlier, hey, when we talked about how this is going to be super useful for agents, I actually at first didn't really believe in it myself, even though we wrote that positioning. But then last night I was having dinner with an Australian customer and they actually told me, oh hey, one of the big issues we have is we have all these logs from our services, and we see SLA dips and want to investigate. But then there's no way for those agents to even understand what's going on in the actual databases themselves. All we see is just product telemetry of the database and the services. It would actually make those agents 10 times more powerful if they understood, for example, who's actually placing those orders, what is happening, what exactly they are doing. So now I'm actually sold on our own message.
I think it really gets you basically almost all of the benefits of the HTAP holy grail, which is, hey, make the data available immediately for reasoning and analytics.
[33:27]
Host
Yeah, I think in the way that humans are generally intelligent and want to have the ability and access to query anything while they do the work, they also need history and they need context. And where else is going to get context? That's an analytical workload. Exactly.
[33:43]
Matei Zaharia
I remember when we had incidents with our databases and engineers said, well, I can't just run a giant query on it to see what's going on, because that's going to bring down the database and hurt it even more. That's the kind of stuff that this gets rid of, because you spin up a whole separate fleet of machines that's doing the analytics. You're not overloading the main database that's still trying to serve stuff.
[34:04]
Host
Yeah. So this has been a dream for a while. What had to get done in order to get to today? I feel like you have announced variants of this several times, but it wasn't as clear as LTAP. Yeah, I think LTAP is like, okay, we've got it, guys.
[34:22]
Reynold Xin
I was talking to somebody at Meta and he was asking me, hey, what's the catch? Why is it possible now? And I think the reality is we took a lot of time to actually work on the Lakebase architecture. I mean, obviously a lot of it came from the Neon team, which is the separation of storage from compute. And it turned out it was just a tiny little step going from that to this LTAP idea. In the Neon architecture and in the Lakebase architecture, we're writing data in a row-oriented format to the open data lake, but in there we're writing Postgres pages. Ali and I were actually spending a lot of time debating, hey, can we just change that to write in a column-oriented format? And while we were debating, one day one of our engineers, who's super smart, came in and was like, hey, I just prototyped it. It works,
[35:08]
Host
but it's a prototype.
[35:09]
Reynold Xin
What prototype? Instead of storing the data in the data lake in a row-oriented format like Postgres pages, write it in Parquet. And he just made the observation that, hey, our storage fleet has a lot of extra idle CPUs, and we could use those CPUs to do the transcoding from row to column, where row is good for OLTP but column is good for analytics. So let's do that transcoding at that time. And as a matter of fact, once you transcode the data, the data compresses better. So from those services writing to, e.g., S3 or other data lake object stores, you can actually write faster because the data is now smaller.
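To make the row-to-column transcoding step and the compression remark concrete, here is a toy sketch. It does not write Parquet and is not how the Databricks storage fleet does it; it only shows why grouping a column's values together exposes redundancy. The helper names and sample rows are assumptions for illustration.

```python
from itertools import groupby


def rows_to_columns(rows: list) -> dict:
    """Pivot a list of row dicts into one list of values per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values: list) -> list:
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented layout: each record's fields sit together (good for point updates).
rows = [
    {"id": 1, "country": "US", "amount": 20.0},
    {"id": 2, "country": "US", "amount": 35.0},
    {"id": 3, "country": "US", "amount": 15.0},
    {"id": 4, "country": "DE", "amount": 12.0},
]

# Column-oriented layout: each column's values sit together (good for scans).
columns = rows_to_columns(rows)
encoded_country = run_length_encode(columns["country"])
```

In the row layout the repeated `"US"` values are interleaved with ids and amounts; once they sit next to each other in a column, a simple encoding stores them as one pair. Columnar formats such as Parquet apply encodings in this spirit, among others.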
[35:49]
Matei Zaharia
Yeah.
[35:50]
Reynold Xin
So there's no overhead, it's no compromise in performance.
[35:53]
Host
Some CPU overhead.
[35:54]
Reynold Xin
Yeah, but we had extra CPU in that fleet anyway, so the debate ended. I mean, it's one of the classic tech issues with a lot of debate. But then somebody actually went ahead and just tried prototyping, and it worked.
[36:06]
Host
But for something this strategic and important to the company, I'd expect there to be a kickoff thing, like a design doc. Nothing like that?
[36:13]
Reynold Xin
Nothing like that. He just. We were debating in many, many meetings. And then we're just debating whether it's possible or not from first principle. And then somebody just did it.
[36:23]
Matei Zaharia
Yeah, I mean if you set yourself up so people do that, that'll be great. That happened a bit with Omnigen too. I think if I just had a doc, we can make these together, everyone would think, oh, what about this, what about this? But then if you try it out, it helps. And then if you have real users and they bash it and it's still working, or in this case, if you have the workload, you know what the workload looks like, you can just test the same pattern then tech aside, which is very cool.
[36:49]
Host
This is the most important thing, the culture of innovation. And you don't have to ask my permission. You don't have to do a whole formal process, just do it.
[37:00]
Reynold Xin
Especially these days, I think with AI it's actually easier to.
[37:03]
Host
I think you are very rare. I mean, I've met a lot of the C-suite of large companies, and I think that at scale things slow down, and I'm sure you've felt it already, but somehow you have this core of people that are exempt.
[37:16]
Reynold Xin
I think we hire and we work with really, really good people, and that's a very important part of it, and empowering them. But also us spending a lot of time in the trenches matters a lot.
[37:29]
Matei Zaharia
Yeah, I think first, people can adapt to being in the larger company, so that helps. And we want to make sure they know that they can try stuff and settle debates, and have a lot of examples of how it was done before, or launch a thing in beta or whatever. And then the other thing: I do think as a company, despite the size, we don't launch that many products. We try to keep it pretty coherent. That was actually the whole theory of the company: instead of having 20 Amazon services you need to set up for an analytics and machine learning stack, you just have one, and it's the same API, the same semantics across all of them, the same copy of the data. So that requires unification. And then we basically added one more thing at a time. We added storage with Delta Lake. We didn't use to do any storage. Then we added SQL, we added machine learning platform stuff. But yeah, don't do too many, but do those things well. And that also helps keep it manageable. Yeah.
[38:34]
Reynold Xin
The other thing we encourage a lot is, instead of boiling the ocean for everything, let's figure out how we do it incrementally, how we do it very quickly. Many of our products are built in the span of weeks. Usually my first question to whichever team is building is, who's the target customer? Who are you working with? Are you on a first-name basis with them? Are you texting with them? I think having that very tight loop matters.
[39:00]
Host
Can you bring up another launch that comes to mind in this kind of thing? I just want to... Omnigen itself.
[39:05]
Reynold Xin
And who's the customer?
[39:07]
Matei Zaharia
Yeah, Omnigen was more of an internal thing, actually, because we would use that for our developers. Basically the whole AI team got access to it and was using it, and we made sure it worked from the beginning with our internal code base, which is a monorepo. That's enormous. We gave them some infrastructure, we gave them lots of token capacity. So it's all the developers. Yeah, we had others. I don't know. This is, I think, a public story,
[39:35]
Reynold Xin
but I was going to ask Marketplace Open sharing, all of them had. I just don't remember exactly which ones publicly referenced.
[39:41]
Matei Zaharia
Yeah, they had earlier. Well, very early in the company there was Delta Lake, which is the transactional storage layer we did. Our largest customer at the time said, okay, I want something in the cloud, because if the rest of our network is compromised, this thing needs to be separate to store and query the events. And then he talked to us. He said, okay, this is the rate of events per second. This is the freshness I want. Can you do it? So that was way larger than any workload we had. And we had our engineer working on that, Michael Armbrust, and he worked just to make this work. And once it worked for them, it worked for everyone else. Yeah, this was early in the company, probably like four years in or so.
[40:25]
Reynold Xin
2018.
[40:26]
Matei Zaharia
Yeah, '17, '18 maybe. You have others?
[40:32]
Reynold Xin
Yeah, clean rooms, which is basically how you share data without sharing the underlying data, but you allow specific operations. Those were done effectively initially just for two customers. I think the industry has a sense of, hey, maybe if you overfit to one or two customers, it's going to be really bad for you. But I think the downside of overfitting is much smaller than the upside. And if you try to be too ambitious and boil the ocean, it's a much bigger problem, because you might end up actually having no customer.
[41:00]
Host
Yeah, that's the more likely outcome. Then you can sort of pivot from there. I do think there is such a thing as a bad customer that sometimes
[41:08]
Reynold Xin
you should find they could exist sometimes if you drive. But one of the challenges I think we probably see, and maybe many newer-generation AI companies are seeing, is that tech companies are very, very different from non-tech companies or traditional enterprises. And if you optimize everything just for tech companies, you might have various challenges scaling outside of tech companies.
[41:29]
Host
Okay, what like top three differences that you always think about?
[41:33]
Reynold Xin
Governance is a big one, I think.
[41:34]
Matei Zaharia
Yeah. A big one is, like, security, data privacy, governance, all that stuff. So usually if you're building some kind of B2B or developer tool, your biggest market is going to be enterprises. But it's just very different. A company that's existed for, you know, it's had some form of IT for like 30 years. They have so many legacy systems, or they operate in a regulated space. Whereas at a startup, or even a more recent tech company, everything is new and sort of pristine. So yeah, it's just different. And if you've never worked with enterprises or been in one, you just won't know about it.
[42:14]
Reynold Xin
And the procurement process is probably quite different. There's actually far more stakeholders.
[42:17]
Matei Zaharia
That is one. Yeah. Another piece that's interesting is I think some tech companies, people will say, oh, I can build that myself, right, I'll just build that myself. So then you go, but I don't
[42:29]
Host
think people say that about databricks.
[42:31]
Matei Zaharia
Yeah, they do, yeah. And it depends on the teams and things. But on the other hand, many of the enterprises say, actually, I never want to be in the business of building that. Like, whatever, I'm a retailer or something. I never want to be down because some weird nerd couldn't get streaming pipelines working. That's not what I'm doing.
[42:53]
Host
This makes them great customers to be honest.
[42:55]
Matei Zaharia
Right. But you have to understand it. It's hard without having worked there and stuff; you may not appreciate it.
[43:01]
Reynold Xin
Look, I think they're all great. Don't get me wrong, they have different challenges, but at many of the tech companies, for sure, there's far more DIY.
[43:10]
Matei Zaharia
On the flip side, you have people who are, they're very much experts in their domain. Like they're building airplanes, they're, you know, designing medicines, whatever. And they just want to bridge the technology where like they don't want to learn, you know, databases or whatever. As cool as we think it is, even as interesting as the average software engineer might think it is to read a little bit like they just never want to know it. They just say Think I, I have a giant matrix or whatever with my clinical data, how do I cluster it or whatever.
[43:41]
Host
That's true. Okay. And then I wanted to actually build out the sort of dream engine vision. Where does this all lead?
[43:50]
Reynold Xin
So one other thing we realized maybe a couple of years back is that actually every single database engine out there, especially on the analytics side, is kind of a decade old. Pretty much everything that has reasonable traction is about a decade old. And they all started out targeting some very specific, narrow use cases. Over time, as they became more and more successful, they grew in their ambition and tried to support more and more use cases. But the fastest way to support those use cases tends to be to hack around the abstractions that were initially created, which were not for those use cases. You can kind of support them, more or less okay. And before you know it, after 10 years of organic evolution that way, it becomes a gigantic pile of shit. And that includes Databricks. Very, very few companies or systems, I think, have the guts to say, let's go start from scratch. Let's go back to the drawing board on the design. Knowing everything we know today, after a decade of workloads and probably billions in revenue, let's attempt to rewrite it from scratch and actually make sure it will work. And then we can support all these use cases. So we started doing that. But it's a very ambitious project. By the way, you can search on Wikipedia, there's this thing called second-system syndrome.
[45:09]
Host
Yeah, I know that.
[45:10]
Reynold Xin
Or second-system effect.
[45:11]
Host
Every developer must know what second-system syndrome is.
[45:13]
Reynold Xin
It's basically you build your first thing and it works out great. And the second one is bound to fail because you become too ambitious. And then you ask or like, you
[45:21]
Host
know, you think you know everything and then you're like, I'm gonna design the perfect system this time.
[45:25]
Reynold Xin
And it turned out it's not perfect. And then they start failing and you're too ambitious. Never launch and you get killed. And the engineering team that actually started this, they were brilliant. I think we hired some of the best database engineers on the planet into databricks and they were brilliant. Thank God it's not their second system. Many of them have built more than two in the past.
[45:45]
Host
Nice.
[45:45]
Reynold Xin
But they were still worried about this. Hey, building a database engine from scratch, I think the conventional wisdom is it's going to take like five years to mature. This would be a very long-term project. It could fail. I think one of the engineers kind of jokingly said, hey, maybe we just call it Reynold's dream engine. If we name it after a co-founder, maybe we're going to get canceled or killed. But I think they built something pretty remarkable. They kind of changed the way database engines are built, from a paradigm point of view. Usually when you build a database engine, you read a lot of academic papers, you try to understand what the latest algorithms and data structures are, and you put them together and see if they work or not. And there's a high risk of failure there also, because whatever looks really good on paper might actually work really well in 70% of the workloads, but then it backfires on the other 30%. They actually went and built more of a factory for building the database. So they spent more time building this factory. And the factory takes the decade of traces we have. I think they counted a quadrillion data points in the trace table.
[46:48]
Host
You don't drop anything? Or do you sample?
[46:51]
Reynold Xin
We for sure sample. But there's a massive amount of data, and they use that to build a model. Like a machine learning model. The machine learning model basically can very, very quickly tell us how any algorithm and any implementation will perform for any specific type of query, with very, very high fidelity. And based on that, they can pick the algorithm and data structure most likely to actually help with the different kinds of workloads, both at runtime as well as at implementation time. Because there's like an unlimited number of...
[47:27]
Host
I mean it sounds like you want to route to different data structures.
[47:32]
Reynold Xin
Yeah, I mean, if you think about it, a database is not one thing; it has many things implemented together, and you want to make sure they all work well with each other. And then for any given operation there might be more than one implementation. So we make it actually run really, really well. The reality is that algorithms that work super well for, say, very, very low latency might not work very well for scanning through petabytes of data. Most often there's a trade-off there between throughput and latency.
[47:58]
Host
What are the key dimensions like scale, throughput, latency, anything else and the distribution
[48:05]
Reynold Xin
of data, how sparse the data is. That matters a lot. How frequently do you hit the same data?
[48:11]
Matei Zaharia
How many distinct values, stuff like that.
[48:13]
Reynold Xin
Those things matter a lot. Number of distinct value basically impacts the memory consumption of your aggregation your hash at some point.
[48:20]
Host
There's a hash table. In my write-up, I'm going to try and list all these out, because I really want the taxonomy. To me, taxonomies are so helpful because they cover everything people should think about.
[48:29]
Reynold Xin
I think if you actually try to list it out, probably like a million different features. I always want like, okay, give me
[48:36]
Host
like 12, give me someone did, I think an Oracle paper in like 40 years ago did. Like, these are the eight fallacies of distributed systems. That kind of thing is super useful.
[48:46]
Matei Zaharia
Yeah.
[48:47]
Host
It's like, okay, think through these eight.
[48:48]
Reynold Xin
But let me give you a very weird example that actually has profound implications for performance, which is: is your string just ASCII, or does it have Unicode in it? How should you encode it?
[49:00]
Host
I mean strings are the most complex data types.
[49:01]
Reynold Xin
Yeah. So, for example, if the string is super dense, you could actually convert every string into a... Like, imagine you have to do an aggregation. Instead of having a hash table, you could actually have an array. Because if your string is dense enough, if you only have 256 options, you don't need a hash table. You can just do an array lookup.
[49:22]
Host
Yeah.
[49:23]
Matei Zaharia
If the string is like a country code or something. Yeah, yeah.
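The array-instead-of-hash-table trick can be sketched as follows. The assumption here, which is ours and not stated in the conversation, is that the column is already dictionary-encoded, so each value is a one-byte code into a small dictionary. The function name and sample data are hypothetical.

```python
def sum_by_dense_code(codes: bytes, amounts: list, num_codes: int = 256) -> list:
    """Sum amounts per group using the one-byte code as a direct array index."""
    totals = [0.0] * num_codes          # one slot per possible code, no hashing
    for code, amount in zip(codes, amounts):
        totals[code] += amount
    return totals


# Dictionary-encoded column: the strings are stored once, rows hold small codes.
dictionary = ["DE", "FR", "US"]          # code 0 -> "DE", 1 -> "FR", 2 -> "US"
codes = bytes([2, 2, 0, 1, 2])           # the country column, one byte per row
amounts = [20.0, 35.0, 12.0, 8.0, 15.0]

totals = sum_by_dense_code(codes, amounts)
revenue_by_country = {dictionary[i]: totals[i] for i in range(len(dictionary))}
```

Each row costs one index operation rather than a hash computation and probe. The approach only makes sense when the number of distinct values is small and known, as with country codes.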
[49:26]
Reynold Xin
So there are actually probably millions of features in that model. But using that, they can, one, basically prioritize the different algorithms that might actually have an impact in practice. And many of them are very counterintuitive: things that you think might work super well actually don't work that well in practice. But also, more importantly, at runtime you can dispatch to the right algorithm and data structure.
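A heavily simplified picture of runtime dispatch: estimate the cost of each candidate implementation from a few workload features and pick the cheapest. In the conversation the estimate comes from a learned model with a very large feature set; the hand-written cost formulas, constants, and names below are stand-in assumptions purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadFeatures:
    num_rows: int
    distinct_keys: int


MAX_ARRAY_SLOTS = 65_536   # assumed limit for the array-based strategy


def predicted_cost(strategy: str, f: WorkloadFeatures) -> float:
    """Hand-written stand-in for a learned cost model (made-up constants)."""
    if strategy == "array":
        if f.distinct_keys > MAX_ARRAY_SLOTS:
            return math.inf                      # not applicable at this cardinality
        # Cheap per row, plus a cost per slot to allocate and scan the array.
        return f.num_rows * 1.0 + f.distinct_keys * 0.5
    if strategy == "hash":
        return f.num_rows * 3.0                  # hashing and probing on every row
    raise ValueError(f"unknown strategy: {strategy}")


def choose_strategy(f: WorkloadFeatures) -> str:
    """Dispatch to whichever implementation the cost model predicts is cheapest."""
    return min(("array", "hash"), key=lambda s: predicted_cost(s, f))
```

For example, `choose_strategy(WorkloadFeatures(num_rows=1_000_000, distinct_keys=200))` selects the array path, while a small input with many distinct keys falls back to hashing.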
[49:47]
Host
I'm listening to the dream. I feel like databricks is doing a really good job of the incremental evolution. Do you have to hard cut to a new system at any point or
[49:57]
Reynold Xin
We designed it in a way that it can be incremental. So first we're releasing a new endpoint. This goes back to boiling the ocean versus what we wanted to do: by design, this new engine should be able to do everything we were able to do before, and better. In particular, the better part refers to very low-latency workloads that can finish in tens of milliseconds. But we want to roll it out incrementally, with incremental capabilities, so it doesn't take five years to actually see the light at the end of the tunnel.
[50:30]
Host
I think that's a heroic task. I don't know what other way to say it. I am really interested in any sort of new workload and new databases. I mean, obviously, I think I've maybe established that I'm a little bit of a database nerd. The transactional databases. Sorry, the accounting databases, like TigerBeetle. I don't know if you've seen those.
[50:50]
Reynold Xin
What do they do?
[50:51]
Host
Dual-entry accounting database. It's meant to really model financial accounts and credit systems. It's very specific, very high throughput. Yeah. So when you're talking about how everyone starts with a thing and then they scale up and then they tack on other things, it's exactly that. And then I recently interviewed Simon from turbopuffer.
[51:09]
Matei Zaharia
Same thing.
[51:11]
Host
And Chroma as well. All the vector database companies of 2023 are suddenly now just, "We are just general blob storage."
[51:18]
Reynold Xin
Vector databases should never have been a separate category.
[51:22]
Host
I think it used to be a hot take. Now it's like the conventional wisdom. Nowadays, what should be a separate category if everything becomes LTAP?
[51:31]
Reynold Xin
I think the thesis of LTAP is we're not collapsing the databases at the actual query layer, we're just collapsing the indexing layer. And that's, I think, a very important part. We actually don't think it makes sense to collapse the query layer into a single HTAP-style database. And by the way, the other thought I think a lot of people had is, hey, it would be nice if there's only one query language I have to worry about, instead of worrying about PostgreSQL and maybe Spark SQL. Why not just one? But I don't think that's an issue for agents. Agents are very eloquent in PostgreSQL or Spark SQL. They're never going to get confused. As long as the data is there and it's accessible, agents will do fine. Five years ago, that might have been a problem for humans.
[52:17]
Matei Zaharia
That could arise over time also. And this leads to how to do things incrementally, right? We realized you don't need it right now. We don't need to solve that problem to get a lot of value from the current LTAP.
[52:31]
Host
Yeah. Okay, I'm going to end the pod with some slightly spicier things. Everyone has had the pitch on separation of storage and compute and trying to build the cloud. I had the same pitches from Snowflake. How have you succeeded where they failed? That's rough. Respecting that they are a competitor, objectively you have outpaced them. What is the core insight from your point of view? That you guys just went different directions?
[53:03]
Reynold Xin
Probably the biggest fundamental difference... Both companies started around the same time, both went to the cloud, both focused on an architecture that separates storage from compute. But the biggest difference, one, is open. Databricks has never had a proprietary format, right? We started with the open ecosystem, started with Parquet and then evolved to Delta and Iceberg and all that. That's one big thing, I think, that matters a lot. The other one is AI. I mean, before 2022, October 2022, when ChatGPT came out, we had always pitched Databricks as machine learning plus data. And a lot of the platform was built with machine learning use cases in mind. Obviously AI is a little bit different, and Matei has spent far more time there than I have. But with the whole platform, we never felt, hey, we're just a data infrastructure platform, like analytics only. Yeah.
[53:55]
Matei Zaharia
I think they started with, okay, we'll just manage the most valuable data and try to make it really fast for that. We'll have our own storage, which is optimized with the engine. And then we'll just target the small amount of data that the managers and finance people and so on look at, and make that super fast to serve. And it was a different space. Whereas we started with, we'll do the bulk processing and ingest. You've got a bunch of JSON log files, you've got whatever. We do that very large-scale stuff, because that's what Spark was for, the large-scale MapReduce-like stuff. And then we'll keep the data in an open format. It might be slower, but it's already out there and you can consume it downstream. And it turned out that it's easier to go from that broad thing that's really good at scale and ingesting and is super low cost, and create versions in it that have the speed and features of the super-easy-to-use, smaller-data-for-business-users thing.
[55:04]
Host
Then optimize.
[55:04]
Matei Zaharia
Yeah, start open and start large. In some sense, we started upstream of them, and there was a time, actually, when we both sort of listed each other as partners, because we said, if you use both solutions together, use Databricks for your ingest and compute and then serve the tables out of Snowflake, you get all the visualization, all the very fast stuff. That's great. And then we both realized customers were telling us, why do I need this other thing? Why can't I just query your tables? And we said, no, we're horrible at that. Please use our partner for the SQL warehouse stuff. And then they realized, wait a minute, so much of the compute is moving upstream into this other thing.
[55:43]
Host
You have to grow into each other's territory.
[55:45]
Matei Zaharia
Yeah, but I think we did start with the bigger scope and with the open thing, and that's an important architecture. Again, it goes to enterprises. If your company has existed for 30 years, you've experienced being locked into Oracle and all kinds of crazy things. And if you're the CTO there and you're setting up the architecture for the future of your company, you're going to want to pick a foundation that's open, and you only want one way to manage data in your company. Ideally, you don't want seven different systems.
[56:17]
Reynold Xin
By the way, open data formats have won. I think now every enterprise wants to put data in an open data format. But it was actually very controversial back then, I think five or six years ago. One of the Snowflake co-founders actually wrote a blog called "Choosing Open Wisely," which basically argued against it. Yeah, I think they might have taken it down. You'd have to find it in an archive.
[56:38]
Host
Oh, I mean, it's never going away now. No, no, it's still there. I love the sort of perspective that only you guys will have, because obviously you run the company, and thank you for indulging. That's an incredible perspective.
[56:53]
Reynold Xin
Maybe one last one. As you were talking, I think I have to give Ali a lot of credit. He's an incredible CEO. I think he's the perfect combination of IQ, EQ, technology obsession, execution, and business acumen. And he's also a founder, which makes it a lot easier for him to mobilize and execute. I think that's...
[57:15]
Host
Oh, that was it. So you have Ali and they don't
[57:19]
Reynold Xin
Like, okay, well, I called out other things, but I think Ali played a pretty big role in the...
[57:24]
Host
I thought there was going to be some technical choice that he contributed to.
[57:29]
Matei Zaharia
He saw a lot of these. There were sort of forks in the road where he pushed for one way, and then it became clear that that was the right way.
[57:37]
Host
Yeah, I mean, there's a whole book that needs to be written about how the eight of you work together and all that. I think there have been profiles that people have done. Second one, not a cleared question, again: Mosaic. A lot of people in our community are curious about the model story of Databricks, right? When you guys bought Mosaic, the thing was, okay, well, we're going to do fine-tuning, we're going to do in-house models, because they had the Mosaic models. And it seems like you're not doing that, and it seems like you're going towards more of the LTAP and the harness stuff. What's the story there?
[58:15]
Matei Zaharia
Yeah, I guess Mosaic became most well known for releasing open-source LLMs early on, and they were general models. Actually, before that they were doing other things. They were about optimizing training systems, basically. So they had the fastest image model training stack in the world and stuff like that. And then they decided to do LLMs, which was smart. They moved into it before ChatGPT, so they had some of the first open-source LLMs.
[58:44]
Host
We interviewed Jonathan Frankle and Abhi for MPT-7B.
[58:47]
Matei Zaharia
Yeah, exactly. Oh yeah, very cool. So even though we did launch an open-source model, DBRX, and we went up to sort of above the Llama 3 scale, we decided that, since there will be so many people releasing models, instead of doing the general model, where a big part of the recipe is just throw in a lot of compute and scale, we want to focus on the next step: let's say you have the very smart model, how do you make it useful? For us it was a lot about automating, making it very good at querying data. That's the first-party agent we have, called Genie. So it's like a virtual data scientist. Imagine there's someone who already knows all the stuff in your company inside out and knows all the machine learning libraries, all the data libraries, all the stuff on the web, and you can ask them questions. That's what we wanted to do first. So that meant, let's not focus as much on training some kind of frontier model, but let's build this system using either external models or fine-tuned, customized components. We're still doing quite a bit of model training, though, and in fact we're procuring lots of GPUs all the time to do it. And there are a few places where we're doing it. One is that there are many high-volume use cases where a specialized model is just so much better than any of the general models you get. A nice example of that is understanding documents like PDFs and Word documents, parsing them. If you've ever tried to do that, it's frustrating, because you send it to Claude Fable or whatever and it almost gets it, but it gets some things wrong and it's super expensive. You've just burnt a huge amount of tokens plopping an image in there. So our team built this document vision model that takes a page and gives you back a nice JSON with all the components. And it's very competitive. It's probably like 100x cheaper than those frontier models and still better.
And that was actually done by one of the researchers who came from DeepMind and was a co-founder of Adept, a very early LLM scaling person, but focused on this. Likewise, we're doing specialized sub-agents for part of what the coding agent does. And if you've seen the stuff on advisor models from Harvey, Anthropic has been commissioned and UC Berkeley... actually, one of my grad students there wrote a paper called "Advisor Models," I think before those came out. I mean, I'm sure others had the idea at the same time, but that's something that helps a ton. So yeah, we actually showed some stuff just today at the keynote on... Parth...
[61:39]
Host
Oh, you know Parth?
[61:39]
Matei Zaharia
Yeah, yeah.
[61:40]
Host
Oh, he's speaking at my thing. He's doing continual learning.
[61:43]
Matei Zaharia
Yes, I'm one of his advisors.
[61:45]
Host
We interviewed his brother Chai because he's also at a bridge. Yeah, that family is very smart.
[61:51]
Matei Zaharia
Yeah, they're awesome. So we're doing some of that. And as we get experience with these in the first-party agents, we're also doing them with customers. My feeling is that customizing models is actually going to get way easier over time. That's what we're finding, because the base models are smarter, so they generate better traces in RL already. And then RL is about learning from your own past traces. And synthetic data generation is way better, way easier. Now we have pipelines just using open-source models, where the same model generates training environments and trains itself and beats, like, Opus and GPT-5.5 and stuff at a task. So I do think it's going to pick up. The thing is, the ease of training with these algorithms is only going to go up over time. There's a question of when it crosses into the mainstream. Instead of this specialized document parsing thing we did, where you need a hardcore LLM researcher, when does it get easy enough that anyone can plop in some stuff and describe a task?
[62:54]
Host
Well, you know what makes it easy? Interfaces and unified APIs. Because obviously if it's not interoperable then you cannot switch.
[63:00]
Matei Zaharia
That's what we're seeing with, like, Omnigent and the composable agents. You can have sub-agents with specialized models, and then you can train the whole thing. I think that'll help a lot too.
[63:11]
Host
Yeah, the last thing I was going to leave... actually, I'm sequencing this, so I'm kind of proud of myself. Satya is talking about this. I interviewed him at Microsoft Build a couple of weeks ago, and then he wrote this essay, which I'm sure you've seen, about building the frontier ecosystem. When I was talking to him, he sounded more like a Databricks CEO than I've ever... This thing presumably went viral in my circles. I don't know about your circles. What's the theory of, I guess, tokens as IP, building up the context? He basically said everything but "data is the new oil" or "context is the new oil." Some version of that that you guys have heard before.
[63:54]
Matei Zaharia
Yeah, I agree. I think as you get better technology around the data you have, you can just do more in your domain with it. It's not even just about AI. Even when people started collecting stuff in real time... I remember all the power companies put in smart meters, and all the car manufacturers started putting in sensors and cameras. Any technology that helps you do something with data and make some decisions makes it more valuable and can give you some advantage. And AI is the same way. You had all this stuff that was just sitting there. Now you can have an agent automatically tell you things. For example, instead of me discovering what feature in my product is broken because a customer complained, the agent tells me, "I noticed no one is uploading files anymore because they get errors," or whatever. And as you saw with Radon, as a database company, because we have all the history of all the queries and all the table layouts and how they work, we can build a new engine very quickly that actually is good, and we're confident that it's going to be good. So I think this is right. I think the question is exactly how it will land. But I do think model customization, which Satya talked about, is going to get easier over time.
[65:10]
Host
Which is why, by the way, I brought up the model thing. Because they have their MAI things and you guys don't. That was meant to be the question.
[65:17]
Matei Zaharia
Yeah, we're doing RL fine-tuning as a service with a bunch of customers. Basically, we have preview customers, and we have a general something called AI Runtime, which is, we get you GPU clusters on demand with a software stack in there that makes it easy to do training. That's existed for a while. We've had GPU compute for a while, and that's where a lot of the Mosaic stack went, to help scale that. But yeah, we found that there are two types of customers. There are some who just want GPUs and libraries to get data in and out and monitor. So that's what AI Runtime is. And then there are some that say, hey, can you actually work with me? Build evals, build synthetic data and the...
[66:06]
Host
More forward-deployed solutions architecture.
[66:07]
Matei Zaharia
And then that's what we're doing. And more things will transition from being custom to not. But that's sort of how it is today.
[66:16]
Reynold Xin
Going back to your original question, I think one of the theses we have is that once you can get the data in the right place, the AI models are becoming pretty good. The generic agents are fairly... I mean, Ali talked about how AGI is already here. They have pretty good reasoning capabilities, actually. I think much of the traditional software will be sort of rewritten with this new paradigm, which is: just get the data to be there and then slap some agent on top. Magic will come out. But without the right data, you can't really do that. And actually, our approach to going into security and our approach to going into the customer data platform space is like that. We launched two products at Data and AI Summit, one targeting security teams and the other targeting marketing teams. And those both have a lot of existing technologies out there. I think our approach is just, hey, once you get the data in, everything is a lot easier with agents on top of that.
[67:10]
Host
Yeah, yeah. Well, you guys have been fantastic guests. I just love this discussion. I just love the ability to dive in on the tech side, but also culture and strategy. I hope this isn't the last time we chat. I mean, congrats on all the success so far.
[67:24]
Reynold Xin
Thank you. Congrats on your success also.
[67:27]
Host
Yeah, yeah, yeah. I mean, Databricks is actually supporting my event. So I run an AI conference, and actually, I've been an attendee of Data and AI Summit for a long time. I noticed, and this was back in 2022, that it was like 90% data and then 10% AI. And I was just like, wow, okay, we need the community thing that is just 90% AI, which now everybody is.
[67:51]
Matei Zaharia
Yeah, yeah.
[67:53]
Host
So Databricks will be at the conference. And, I don't know, it's just amazing to see you guys build out the most interesting cloud that I have ever seen outside of the big three. And it's amazing how far you've grown. One of the most insightful... I'm not a VC, but I play one on TV. Ben Horowitz, when he was talking to you guys, advising you on where this company is going, he was like, don't sell until 100 billion, or some version of that story.
[68:22]
Reynold Xin
It was like, the company should be worth a trillion dollars. You're underselling it for 10 billion.
[68:26]
Host
And, like, he doesn't do that for everyone, you know? For some reason, I think he saw the vision, but also the infinite runway that you have.
[68:36]
Reynold Xin
We're lucky to have Ben. He's a big supporter. Yeah.
[68:40]
Host
Amazing. Okay, well, thank you so much.
[68:41]
Reynold Xin
All right. Thank you so much.