Podcast Summary: AI, Data Engineering, and the Modern Data Stack
AI + a16z | June 20, 2025
Guests: Tristan Handy (Co-founder & CEO, DBT Labs), Jennifer Li (a16z General Partner), Matt Bornstein (a16z Partner)
Host: a16z
Episode Overview
This episode explores how artificial intelligence is transforming data engineering and analytics: automating once-manual processes, altering organizational roles, and pushing the “modern data stack” into a new era. Tristan Handy of DBT Labs and a16z partners Jennifer Li and Matt Bornstein discuss the limitations of current AI models in analytics, the sociotechnical work of defining organizational truth through data, the evolution (and stagnation) of the modern data stack, technical debt and tool consolidation in the data infrastructure ecosystem, and the lessons the data world can still borrow from software engineering.
Key Discussion Points & Insights
1. The Limits and Promise of AI in Data Analytics and Engineering
Task Automation and Human Value Add
- Many routine data engineering tasks shouldn’t require highly paid, skilled humans. Automation will increase rapidly, especially for debugging and maintaining data pipelines.
- “If you literally take all the tasks that data engineer does every day...a lot of them, they shouldn't.” (Tristan Handy, [00:00])
AI Writing SQL Isn't Transformative—Business Context is Key
- Writing correct SQL is no longer impressive—what matters is constructing shared organizational definitions of metrics (“socially constructing truth”), which AI can’t do without encoded metadata or a semantic layer.
- “The hard part of analytics is...they are socially constructing truth inside of an organization...and a model just, like, doesn’t have access to that unless you give it very specific instruction. And you would do that through metadata.” (Tristan Handy, [02:23])
The Semantic Layer as the Missing Link
- Handy explains the importance of semantic layers for enabling AI to consistently understand and return accurate organizational information.
- Example: DBT integrating techniques from Transform to allow AI models to answer business data questions correctly when semantic metadata is present.
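The mechanics can be sketched in a few lines. This is a toy illustration of the idea, not DBT's actual semantic layer API; the metric names, tables, and filters below are invented:

```python
# Toy semantic layer: metric definitions live as shared metadata, so every
# consumer (human or AI) compiles a question into the same agreed-upon SQL.
# Metric names, tables, and filters are hypothetical, for illustration only.

METRICS = {
    "revenue": {
        "expression": "SUM(amount)",
        "table": "orders",
        "filters": ["status = 'completed'"],  # the socially agreed definition
    },
    "active_users": {
        "expression": "COUNT(DISTINCT user_id)",
        "table": "events",
        "filters": ["event_type = 'login'"],
    },
}

def compile_metric(name, group_by=None):
    """Turn a metric name into SQL using the shared definition."""
    m = METRICS[name]
    select = m["expression"] + f" AS {name}"
    sql = f"SELECT {select} FROM {m['table']}"
    if m["filters"]:
        sql += " WHERE " + " AND ".join(m["filters"])
    if group_by:
        sql = sql.replace(f"SELECT {select}", f"SELECT {group_by}, {select}")
        sql += f" GROUP BY {group_by}"
    return sql

print(compile_metric("revenue", group_by="region"))
# SELECT region, SUM(amount) AS revenue FROM orders WHERE status = 'completed' GROUP BY region
```

The point of the sketch: an AI model asked “what was revenue by region?” doesn't need to guess what “revenue” means—the definition is encoded once as metadata and reused everywhere.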
AI for Visualization and Data Prep
- Jennifer Li shares positive experiences with AI automating data visualization and prep, though she stresses that organizational and social context still demand human input.
- “There's this organizational social work to do, which I don't believe ever...maybe through a few agents working together they can gather some truth.” (Jennifer Li, [03:54])
Human-in-the-Loop vs. Full Automation
- Complete replacement of analysts isn’t practical; AI accelerates technical users, while self-service for non-technical users is limited by their inability to validate the output.
- “...the way that people think about the AI analyst is...self service inside of businesses...but those folks don’t have the ability to evaluate is this code actually producing the correct result.” (Tristan Handy, [07:06])
2. Automating Data Engineering: Where AI is Effective
Debugging and Pipeline Maintenance
- Debugging pipeline failures is time-consuming yet often intellectually unchallenging—AI agents are proving effective at identifying issues and even suggesting fixes.
- “One of the things that I think is the most time suck-y and produces very little value is debugging pipeline failures...Agents are quite good at identifying the problem and proposing a fix.” (Tristan Handy, [09:10])
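The shape of this triage work can be illustrated with a toy rule table. The error patterns and suggested fixes below are invented for illustration; a real agent would reason over full logs and lineage rather than match regexes:

```python
# Toy triage for pipeline failures: map common error signatures to a
# diagnosis and a proposed fix. Patterns and suggestions are illustrative
# only, not taken from any real agent or product.
import re

RULES = [
    (r"column .* does not exist", "schema drift",
     "regenerate the model against the current source schema"),
    (r"permission denied|access denied", "credential issue",
     "refresh the warehouse credentials for the service account"),
    (r"timeout|deadline exceeded", "resource contention",
     "retry with backoff or move the job off the peak window"),
]

def triage(log_line):
    """Return (diagnosis, proposed_fix) for a failure log line."""
    for pattern, diagnosis, fix in RULES:
        if re.search(pattern, log_line, re.IGNORECASE):
            return diagnosis, fix
    return "unknown", "escalate to a human"

print(triage('ERROR: column "order_total" does not exist'))
# ('schema drift', 'regenerate the model against the current source schema')
```

Even this crude version shows why the task is automatable: most pipeline failures cluster into a small number of recurring, mechanically recognizable causes.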
Automation Across System Boundaries
- AI works best within system boundaries (e.g., schema updates) and less well when external dependencies are involved.
- “If you have to interface with an external system, it’s a lot worse...versus if it’s like, oh, there’s a schema mismatch—it’s actually pretty good at making a guess at trying to align them.” (Matt Bornstein, [10:02])
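A minimal sketch of the “making a guess at aligning them” case: match columns by normalized name and flag whatever can’t be aligned automatically. The column names are invented for illustration:

```python
# Sketch of schema alignment: match source columns to target columns by
# normalized name, and report what needs human (or smarter-model) review.
# Column names are hypothetical examples.

def normalize(name):
    return name.lower().replace("-", "_").strip()

def align_schemas(source, target):
    """Return (mapping, unmatched) aligning source columns to target columns."""
    tgt = {normalize(c): c for c in target}
    mapping, unmatched = {}, []
    for col in source:
        match = tgt.get(normalize(col))
        if match:
            mapping[col] = match
        else:
            unmatched.append(col)  # can't be resolved mechanically
    return mapping, unmatched

mapping, unmatched = align_schemas(
    ["User-ID", "created_at", "plan_tier"],
    ["user_id", "created_at", "region"],
)
print(mapping)    # {'User-ID': 'user_id', 'created_at': 'created_at'}
print(unmatched)  # ['plan_tier']
```

This mirrors the boundary the discussion draws: within one system, alignment is mostly mechanical; the unmatched remainder is where external context (and humans) come in.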
Jevons Paradox in Analytics Automation
- As data engineering processes become more efficient and less costly, organizations expand their analysis to fill the new capacity, not to reduce work.
- “Jevons paradox is coming into effect pretty hard right now...analytics always expands to fill the available budget.” (Tristan Handy, [24:56])
3. The Modern Data Stack: Rise, Plateau, and What’s Next
Origins and Definition
- “Modern data stack” gained steam with the arrival of cloud-based analytic platforms like Redshift around 2013, which enabled a new level of composability and democratization in data tooling.
- “I would put the start of the modern data stack at the launch of Redshift in 2013…you could swipe a credit card and get access to really great analytic technology in the cloud.” (Tristan Handy, [11:44])
The “S-Curve” Model
- The industry advances through “S-curve stacking”: each new wave is adopted rapidly until it plateaus. The modern data stack has now matured (“it won, so what’s next?”).
- “Every technology goes through an S curve…The way you get technological progress is you stack S curves on top of one another.” (Tristan Handy, [13:57])
Open Standards and AI as Next Innovation Areas
- File and table format open standards (e.g., Delta, Iceberg) and AI-driven tooling are highlighted as the next waves of major innovation.
4. Lessons from Software Engineering Yet to Be Learned
Local Development Environments & Compilers
- Unlike software engineering, data work lacks local development options (nearly everything is remote and proprietary), which hampers productivity.
- “Most of the processing engines that we use are proprietary...there’s no such thing as a local development environment.” (Tristan Handy, [19:59])
Reusable Ecosystems and Package Management
- Tristan argues the data ecosystem is “decades behind” software engineering—compilers, interpreters, package management, and shared libraries need to be ported over.
- “We can stop the process of people reinventing the wheel over and over and over again.” (Tristan Handy, [27:33])
DBT Fusion Example
- DBT’s new Fusion engine (built on technology from the SDF acquisition) provides a multi-dialect SQL compiler, enabling local emulation and more efficient, robust pipelines. It also enables explicit PII tracking, a critical new feature.
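Fusion’s internals aren’t covered in the episode, but the core idea of multi-dialect SQL compilation can be shown with a toy rewrite table. The rename map below is a tiny invented subset (open-source tools such as sqlglot do this comprehensively):

```python
# Toy multi-dialect SQL rewrite: translate dialect-specific function names
# (Snowflake/Oracle-style) into Postgres-style equivalents. Illustrative
# subset only; not DBT Fusion's actual compiler logic.
import re

RENAMES = {
    "NVL": "COALESCE",       # Oracle/Snowflake null-default
    "IFNULL": "COALESCE",    # MySQL/Snowflake variant
    "LISTAGG": "STRING_AGG", # Snowflake/Oracle string aggregation
}

def transpile(sql):
    """Rewrite known dialect-specific function calls in a SQL string."""
    for src, dst in RENAMES.items():
        sql = re.sub(rf"\b{src}\s*\(", f"{dst}(", sql, flags=re.IGNORECASE)
    return sql

print(transpile("SELECT NVL(discount, 0), LISTAGG(name, ',') FROM orders"))
# SELECT COALESCE(discount, 0), STRING_AGG(name, ',') FROM orders
```

A real compiler parses SQL into an abstract syntax tree rather than using regexes, which is also what makes features like column-level PII tracking possible: lineage falls out of the parsed tree.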
5. Industry Consolidation and Evolving Tooling
Acquisitions & Platform Convergence
- Recent acquisitions (e.g., DBT acquiring SDF, Databricks buying Neon, Snowflake buying Crunchy Data) signal a push toward consolidated platforms that blend operational (OLTP) and analytical (OLAP) workloads.
- “The idea that you would have the same vendor being able to provide both [OLTP & OLAP] seems like obviously a good idea.” (Tristan Handy, [29:19])
Market and Technical Drivers
- Analytical databases grew in response to internet-scale data, but operational/transactional (OLTP) databases remain larger and steadier markets.
- “The growth of OLTP has been, I think, pretty consistent over time...But the novelty is in analytical databases.” (Tristan Handy, [32:04])
AI Workloads as Catalysts
- The rise of AI workloads (particularly vector embeddings) is driving expansion and synergy between OLTP and OLAP systems.
- “There will be more and more AI agents...the more you can standardize, the better your agents will be able to interface with your data.” (Tristan Handy, [28:24])
Notable Quotes & Memorable Moments
- AI's limits in analytics: “The hard part of analytics is what data analysts are doing is they are socially constructing truth inside of an organization.” — Tristan Handy ([02:23])
- Automation is about expansion: “Analytics always expands to fill the available budget. You want to continue to improve the price to performance ratio not so that at the end of the day people can stop doing things, but so that they can do more things.” — Tristan Handy ([24:56])
- Modern data stack plateau: “I would say that that S curve kind of came to an end in the same way that the S curve around railroads came to an end. We got all the railroads, and we're not in a deployment phase of railroads anymore, circa 1925.” — Tristan Handy ([13:57])
- Software engineering gap: “I've pretty consistently felt like software engineering tool stack was maybe two decades ahead of data...the idea that the only way I could possibly run my workload is in Amazon RDS, like that's not a thing, or it was a thing 25 years ago.” — Tristan Handy ([19:59])
Timestamps for Important Segments
- AI's Role in Automating Data Engineering - [00:00]
- Limits of AI in Business Context/Truth - [02:23]
- AI's Capabilities in Visualization & Data Prep - [03:54]
- Humans in the Loop; Analyst vs. Self-service - [07:06]
- Debugging Automation with AI Agents - [09:10]
- Modern Data Stack Origins & S-Curve Theory - [11:44], [13:57]
- Software Engineering Practices for Data - [19:59], [22:34]
- DBT Fusion & Multi-dialect SQL Compilers - [22:38], [24:56]
- Industry Consolidation and the OLTP/OLAP Divide - [29:12]-[34:49]
Conclusion
This episode offers a comprehensive look at how AI is altering the workflow and value proposition of analytics and data engineering, not by replacing humans entirely but by shifting their focus to higher-leverage work. It identifies where data tooling can still learn from the software development world—particularly in local dev, compilers, and reusable ecosystems. The conversation underscores that while the modern data stack has “won,” the next frontiers are standardization, real automation, and a tighter integration between data platforms driven, in part, by the demands of AI-powered workloads.
