Podcast Summary: AI, Data Engineering, and the Modern Data Stack
AI + a16z | June 20, 2025
Guests: Tristan Handy (Co-founder & CEO, DBT Labs), Jennifer Li (a16z General Partner), Matt Bornstein (a16z Partner)
Host: a16z
Episode Overview
This episode explores how artificial intelligence is transforming data engineering and analytics: automating once-manual processes, altering organizational roles, and pushing the “modern data stack” into a new era. Tristan Handy of DBT Labs and a16z partners Jennifer Li and Matt Bornstein discuss the limitations of current AI models in analytics, the sociotechnical work of defining organizational truth through data, the evolution (and stagnation) of the modern data stack, technical debt and tool consolidation in the data infrastructure ecosystem, and the lessons the data world can still borrow from software engineering.
Key Discussion Points & Insights
1. The Limits and Promise of AI in Data Analytics and Engineering
Task Automation and Human Value Add
- Many routine data engineering tasks shouldn’t require highly paid, skilled humans. Automation will increase rapidly, especially for debugging and maintaining data pipelines.
- “If you literally take all the tasks that data engineer does every day...a lot of them, they shouldn't.” (Tristan Handy, [00:00])
AI Writing SQL Isn't Transformative—Business Context is Key
- Writing correct SQL is no longer impressive—what matters is constructing shared organizational definitions of metrics (“socially constructing truth”), which AI can’t do without encoded metadata or a semantic layer.
- “The hard part of analytics is...they are socially constructing truth inside of an organization...and a model just, like, doesn’t have access to that unless you give it very specific instruction. And you would do that through metadata.” (Tristan Handy, [02:23])
The Semantic Layer as the Missing Link
- Handy explains the importance of semantic layers for enabling AI to consistently understand and return accurate organizational information.
- Example: DBT integrating techniques from Transform to allow AI models to answer business data questions correctly when semantic metadata is present.
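The mechanics can be sketched in a few lines. This is a toy illustration of the idea, not DBT's actual semantic layer API; the metric names, tables, and filters below are invented:

```python
# Toy semantic layer: metric definitions live as shared metadata, so every
# consumer (human or AI) compiles a question into the same agreed-upon SQL.
# Metric names, tables, and filters are hypothetical, for illustration only.

METRICS = {
    "revenue": {
        "expression": "SUM(amount)",
        "table": "orders",
        "filters": ["status = 'completed'"],  # the socially agreed definition
    },
    "active_users": {
        "expression": "COUNT(DISTINCT user_id)",
        "table": "events",
        "filters": ["event_type = 'login'"],
    },
}

def compile_metric(name, group_by=None):
    """Turn a metric name into SQL using the shared definition."""
    m = METRICS[name]
    select = m["expression"] + f" AS {name}"
    sql = f"SELECT {select} FROM {m['table']}"
    if m["filters"]:
        sql += " WHERE " + " AND ".join(m["filters"])
    if group_by:
        sql = sql.replace(f"SELECT {select}", f"SELECT {group_by}, {select}")
        sql += f" GROUP BY {group_by}"
    return sql

print(compile_metric("revenue", group_by="region"))
# SELECT region, SUM(amount) AS revenue FROM orders WHERE status = 'completed' GROUP BY region
```

The point of the sketch: an AI model asked “what was revenue by region?” doesn't need to guess what “revenue” means—the definition is encoded once as metadata and reused everywhere.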
AI for Visualization and Data Prep
- Jennifer Li shares positive experiences with AI automating data visualization and prep, though she stresses that organizational and social context still demand human input.
- “There's this organizational social work to do, which I don't believe ever...maybe through a few agents working together they can gather some truth.” (Jennifer Li, [03:54])
Human-in-the-Loop vs. Full Automation
- Complete replacement of analysts isn’t practical; AI accelerates technical users, while self-service for non-technical users is limited by their inability to validate the output.
- “...the way that people think about the AI analyst is...self service inside of businesses...but those folks don’t have the ability to evaluate is this code actually producing the correct result.” (Tristan Handy, [07:06])
2. Automating Data Engineering: Where AI is Effective
Debugging and Pipeline Maintenance
- Debugging pipeline failures is time-consuming yet often intellectually unchallenging—AI agents are proving effective at identifying issues and even suggesting fixes.
- “One of the things that I think is the most time suck-y and produces very little value is debugging pipeline failures...Agents are quite good at identifying the problem and proposing a fix.” (Tristan Handy, [09:10])
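The shape of this triage work can be illustrated with a toy rule table. The error patterns and suggested fixes below are invented for illustration; a real agent would reason over full logs and lineage rather than match regexes:

```python
# Toy triage for pipeline failures: map common error signatures to a
# diagnosis and a proposed fix. Patterns and suggestions are illustrative
# only, not taken from any real agent or product.
import re

RULES = [
    (r"column .* does not exist", "schema drift",
     "regenerate the model against the current source schema"),
    (r"permission denied|access denied", "credential issue",
     "refresh the warehouse credentials for the service account"),
    (r"timeout|deadline exceeded", "resource contention",
     "retry with backoff or move the job off the peak window"),
]

def triage(log_line):
    """Return (diagnosis, proposed_fix) for a failure log line."""
    for pattern, diagnosis, fix in RULES:
        if re.search(pattern, log_line, re.IGNORECASE):
            return diagnosis, fix
    return "unknown", "escalate to a human"

print(triage('ERROR: column "order_total" does not exist'))
# ('schema drift', 'regenerate the model against the current source schema')
```

Even this crude version shows why the task is automatable: most pipeline failures cluster into a small number of recurring, mechanically recognizable causes.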
Automation Across System Boundaries
- AI works best within system boundaries (e.g., schema updates) and less well when external dependencies are involved.
- “If you have to interface with an external system, it’s a lot worse...versus if it’s like, oh, there’s a schema mismatch—it’s actually pretty good at making a guess at trying to align them.” (Matt Bornstein, [10:02])
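A minimal sketch of the “making a guess at aligning them” case: match columns by normalized name and flag whatever can’t be aligned automatically. The column names are invented for illustration:

```python
# Sketch of schema alignment: match source columns to target columns by
# normalized name, and report what needs human (or smarter-model) review.
# Column names are hypothetical examples.

def normalize(name):
    return name.lower().replace("-", "_").strip()

def align_schemas(source, target):
    """Return (mapping, unmatched) aligning source columns to target columns."""
    tgt = {normalize(c): c for c in target}
    mapping, unmatched = {}, []
    for col in source:
        match = tgt.get(normalize(col))
        if match:
            mapping[col] = match
        else:
            unmatched.append(col)  # can't be resolved mechanically
    return mapping, unmatched

mapping, unmatched = align_schemas(
    ["User-ID", "created_at", "plan_tier"],
    ["user_id", "created_at", "region"],
)
print(mapping)    # {'User-ID': 'user_id', 'created_at': 'created_at'}
print(unmatched)  # ['plan_tier']
```

This mirrors the boundary the discussion draws: within one system, alignment is mostly mechanical; the unmatched remainder is where external context (and humans) come in.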
Jevons Paradox in Analytics Automation
- As data engineering processes become more efficient and less costly, organizations expand their analysis to fill the new capacity, not to reduce work.
- “Jevons paradox is coming into effect pretty hard right now...analytics always expands to fill the available budget.” (Tristan Handy, [24:56])
3. The Modern Data Stack: Rise, Plateau, and What’s Next
Origins and Definition
- “Modern data stack” gained steam with the arrival of cloud-based analytic platforms like Redshift around 2013, which enabled a new level of composability and democratization in data tooling.
- “I would put the start of the modern data stack at the launch of Redshift in 2013…you could swipe a credit card and get access to really great analytic technology in the cloud.” (Tristan Handy, [11:44])
The “S-Curve” Model
- The industry advances through “S-curve stacking”: each new wave is adopted rapidly until it plateaus. The modern data stack has now matured (“it won, so what’s next?”).
- “Every technology goes through an S curve…The way you get technological progress is you stack S curves on top of one another.” (Tristan Handy, [13:57])
Open Standards and AI as Next Innovation Areas
- File and table format open standards (e.g., Delta, Iceberg) and AI-driven tooling are highlighted as the next waves of major innovation.
4. Lessons from Software Engineering Yet to Be Learned
Local Development Environments & Compilers
- Unlike software engineering, data work lacks local development options (nearly everything is remote and proprietary), which hampers productivity.
- “Most of the processing engines that we use are proprietary...there’s no such thing as a local development environment.” (Tristan Handy, [19:59])
Reusable Ecosystems and Package Management
- Tristan argues the data ecosystem is “decades behind” software engineering—compilers, interpreters, package management, and shared libraries need to be ported over.
- “We can stop the process of people reinventing the wheel over and over and over again.” (Tristan Handy, [27:33])
DBT Fusion Example
- DBT’s new Fusion engine (built on technology from the SDF acquisition) provides a multi-dialect SQL compiler, enabling local emulation and more efficient, robust pipelines. It also enables explicit PII tracking, a critical new feature.
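Fusion’s internals aren’t covered in the episode, but the core idea of multi-dialect SQL compilation can be shown with a toy rewrite table. The rename map below is a tiny invented subset (open-source tools such as sqlglot do this comprehensively):

```python
# Toy multi-dialect SQL rewrite: translate dialect-specific function names
# (Snowflake/Oracle-style) into Postgres-style equivalents. Illustrative
# subset only; not DBT Fusion's actual compiler logic.
import re

RENAMES = {
    "NVL": "COALESCE",       # Oracle/Snowflake null-default
    "IFNULL": "COALESCE",    # MySQL/Snowflake variant
    "LISTAGG": "STRING_AGG", # Snowflake/Oracle string aggregation
}

def transpile(sql):
    """Rewrite known dialect-specific function calls in a SQL string."""
    for src, dst in RENAMES.items():
        sql = re.sub(rf"\b{src}\s*\(", f"{dst}(", sql, flags=re.IGNORECASE)
    return sql

print(transpile("SELECT NVL(discount, 0), LISTAGG(name, ',') FROM orders"))
# SELECT COALESCE(discount, 0), STRING_AGG(name, ',') FROM orders
```

A real compiler parses SQL into an abstract syntax tree rather than using regexes, which is also what makes features like column-level PII tracking possible: lineage falls out of the parsed tree.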
5. Industry Consolidation and Evolving Tooling
Acquisitions & Platform Convergence
- Recent acquisitions (e.g., DBT acquiring SDF, Databricks buying Neon, Snowflake buying Crunchy Data) signal a push toward consolidated platforms that blend operational (OLTP) and analytical (OLAP) workloads.
- “The idea that you would have the same vendor being able to provide both [OLTP & OLAP] seems like obviously a good idea.” (Tristan Handy, [29:19])
Market and Technical Drivers
- Analytical databases grew in response to internet-scale data, but operational/transactional (OLTP) databases remain larger and steadier markets.
- “The growth of OLTP has been, I think, pretty consistent over time...But the novelty is in analytical databases.” (Tristan Handy, [32:04])
AI Workloads as Catalysts
- The rise of AI workloads (particularly vector embeddings) is driving expansion and synergy between OLTP and OLAP systems.
- “There will be more and more AI agents...the more you can standardize, the better your agents will be able to interface with your data.” (Tristan Handy, [28:24])
Notable Quotes & Memorable Moments
- AI's limits in analytics: “The hard part of analytics is what data analysts are doing is they are socially constructing truth inside of an organization.” — Tristan Handy ([02:23])
- Automation is about expansion: “Analytics always expands to fill the available budget. You want to continue to improve the price to performance ratio not so that at the end of the day people can stop doing things, but so that they can do more things.” — Tristan Handy ([24:56])
- Modern data stack plateau: “I would say that that S curve kind of came to an end in the same way that the S curve around railroads came to an end. We got all the railroads, and we're not in a deployment phase of railroads anymore, circa 1925.” — Tristan Handy ([13:57])
- Software engineering gap: “I've pretty consistently felt like software engineering tool stack was maybe two decades ahead of data...the idea that the only way I could possibly run my workload is in Amazon RDS, like that's not a thing, or it was a thing 25 years ago.” — Tristan Handy ([19:59])
Timestamps for Important Segments
- AI's Role in Automating Data Engineering - [00:00]
- Limits of AI in Business Context/Truth - [02:23]
- AI's Capabilities in Visualization & Data Prep - [03:54]
- Humans in the Loop; Analyst vs. Self-service - [07:06]
- Debugging Automation with AI Agents - [09:10]
- Modern Data Stack Origins & S-Curve Theory - [11:44], [13:57]
- Software Engineering Practices for Data - [19:59], [22:34]
- DBT Fusion & Multi-dialect SQL Compilers - [22:38], [24:56]
- Industry Consolidation and the OLTP/OLAP Divide - [29:12]-[34:49]
Conclusion
This episode offers a comprehensive look at how AI is altering the workflow and value proposition of analytics and data engineering, not by replacing humans entirely but by shifting their focus to higher-leverage work. It identifies where data tooling can still learn from the software development world—particularly in local dev, compilers, and reusable ecosystems. The conversation underscores that while the modern data stack has “won,” the next frontiers are standardization, real automation, and a tighter integration between data platforms driven, in part, by the demands of AI-powered workloads.
