Open Source Data Analytics with Sameer Al-Sakran - Software Engineering Daily


Podcast Summary: Software Engineering Daily – "Open Source Data Analytics with Sameer Al-Sakran"

Episode Information:

Podcast: Software Engineering Daily
Title: Open Source Data Analytics with Sameer Al-Sakran
Host: Sean Falconer
Guest: Sameer Al-Sakran, Founder and CEO of Metabase
Release Date: December 3, 2024

1. Introduction to Data Analytics and Metabase

Sean Falconer opens the discussion by highlighting the challenges faced by data-focused organizations, particularly making data accessible without large dedicated teams. He introduces Metabase, an open-source business intelligence tool designed for data exploration, visualization, and analysis. Metabase aims to empower users to interact with data effortlessly, regardless of their proficiency in SQL.

Notable Quote:

"Metabase has been around for nearly a decade now... we're less about creating a dashboard that's consumed as is, and more about creating a dashboard that sparks interest or curiosity."
— Sameer Al-Sakran [01:53]

2. The Evolution of the Analytics Stack

Sameer Al-Sakran discusses the evolution of data analytics tools, situating Metabase alongside Tableau, Looker, Streamlit, and dbt. He positions Metabase as the "last mile" solution, enabling everyday employees to access and analyze data without relying heavily on analysts or engineers.

Key Points:

Metabase facilitates a discovery process for non-technical users.
It emphasizes empowering "the poor sucker with the day job" to independently satisfy their data curiosity.
The tool contrasts with platforms like Looker and Tableau by fostering iterative exploration rather than static dashboard consumption.

Notable Quote:

"We're trying to make it really easy for someone to answer those [additional] questions."
— Sameer Al-Sakran [01:53]

3. Target Users and Democratizing Data Access

The discussion emphasizes that Metabase is primarily set up by engineers but serves non-technical end-users. It aims to reduce the dependency on engineers as bottlenecks, allowing broader organizational access to data insights.

Notable Quote:

"Our heart and soul is helping the poor sucker of the day job get their questions answered themselves."
— Sameer Al-Sakran [04:50]

4. Evolution Trends in Data Analytics

Sameer shares insights on significant trends reshaping data analytics:

Natural Language Processing (NLP): NLP has finally produced viable solutions to many problems and is proving more broadly applicable than expected.
Tool Simplification: Tools have become more user-friendly, reducing inherent complexities in data analysis.
Data Shaping: Emphasis on presenting schemas that are intuitive for non-expert users, avoiding overly normalized databases that hinder accessibility.

Notable Quote:

"The general user experience has improved dramatically over the last 10 or 20 years."
— Sameer Al-Sakran [05:15]

5. Designing User-Friendly Schemas

Sameer discusses best practices for creating understandable data schemas:

Simplification: Avoid overly normalized tables with excessive columns.
Clear Naming Conventions: Use language that reflects the business domain, making it easier for non-technical users to comprehend.
Specialized Data Sets: Create views tailored to specific departments or use cases to enhance usability.
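
As an illustration of these points, here is a minimal sketch using Python's built-in sqlite3 module. The table names (usr, addr), their terse column names, and the customers view are hypothetical and not from the episode; the point is only that a readable, denormalized view can sit on top of whatever the data looks like at rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical application tables: normalized, with terse names.
    CREATE TABLE usr (id INTEGER PRIMARY KEY, nm TEXT, crt_ts TEXT);
    CREATE TABLE addr (id INTEGER PRIMARY KEY, usr_id INTEGER REFERENCES usr(id),
                       cty TEXT, is_cur INTEGER);
    INSERT INTO usr VALUES (1, 'Ada', '2024-01-05'), (2, 'Grace', '2024-02-11');
    INSERT INTO addr VALUES (10, 1, 'London', 0), (11, 1, 'Paris', 1),
                            (12, 2, 'New York', 1);

    -- A view for non-technical users: one row per customer,
    -- column names in business language, address history hidden.
    CREATE VIEW customers AS
    SELECT u.id     AS customer_id,
           u.nm     AS customer_name,
           u.crt_ts AS signup_date,
           a.cty    AS current_city
    FROM usr u
    LEFT JOIN addr a ON a.usr_id = u.id AND a.is_cur = 1;
""")

rows = conn.execute(
    "SELECT customer_name, current_city FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```

Someone looking up customer data can query customers directly without knowing that addresses live in a separate table with a foreign key.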

Notable Quote:

"The columns should have English or whatever language your company runs under. You should be able to understand what's in a column without having to look something up."
— Sameer Al-Sakran [07:53]

6. Impact of Generative AI and LLMs on Analytics

The conversation shifts to the role of Generative AI and Large Language Models (LLMs) in data analytics:

Sameer is cautiously optimistic about integrating LLMs into analytics tools. He distinguishes between using natural language as a user interface and relying on LLMs to generate accurate queries. He emphasizes the critical need for accuracy in analytics, suggesting that LLMs should complement deterministic tools rather than replace them entirely.
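
To make the accuracy concern concrete: in the conversation Sameer uses the example of numbers that are wrong 2% of the time. The sketch below is a rough back-of-the-envelope calculation, assuming each number is wrong independently with that probability. The independence assumption and the dashboard sizes are illustrative and not from the episode.

```python
def p_any_wrong(error_rate: float, n_numbers: int) -> float:
    """Probability that at least one of n independent numbers is wrong."""
    return 1.0 - (1.0 - error_rate) ** n_numbers

# Hypothetical dashboard sizes; 0.02 is the 2% figure used in the episode.
for n in (1, 10, 50):
    print(n, round(p_any_wrong(0.02, n), 3))
```

Under these assumptions, a dashboard showing 50 numbers would contain at least one wrong figure more often than not, which is one way to see why a small per-query error rate does not work at organizational scale.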

Notable Quotes:

"An LLM as an analyst... probably after the game has been played and won."
— Sameer Al-Sakran [12:16]

"There's still going to be someone that... for a super weird DSL."
— Sameer Al-Sakran [45:54]

7. Metabase’s Setup and Configuration Process

Sameer explains how Metabase is designed for ease of setup, particularly for early-stage projects. The primary installation involves downloading a Docker image or an Uber JAR file, pointing it to the data warehouse, and creating user accounts—all achievable within minutes.

Key Points:

Installation Options: a Docker image, a JAR file, or Metabase’s cloud service.
User Empowerment: Enables users to run SQL queries, use templates, or leverage the query builder without extensive technical knowledge.
Pre-Analytics Setup: Encourages organizations to implement Metabase early to democratize data access before scaling data operations.

Notable Quote:

"We are the laziest possible option... it's literally a couple of minutes."
— Sameer Al-Sakran [17:09]

8. Technical Architecture and Choice of Clojure

A significant portion of the discussion centers on Metabase’s technical underpinnings:

Language Choice: Metabase moved from Python to Clojure (after a brief attempt at Scala) to ship a single atomic binary that is easier to deploy and maintain.
JVM Benefits: Leveraging Java Virtual Machine (JVM) allows access to robust JDBC drivers and a reliable ecosystem.
Transpiler: Metabase uses an intermediate language called MBQL (Metabase Query Language) to translate user interactions into executable SQL or other database queries.
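
The transpiler idea can be sketched in a few lines. The query structure below is a hypothetical stand-in and not real MBQL, the two dialects differ only in identifier quoting, and the "?" parameter placeholder is an assumption for illustration. It shows the general shape: the UI builds a tree, and a compiler turns that tree into SQL for a given target.

```python
from typing import Any

# Hypothetical query tree; real MBQL has a different, richer structure.
Query = dict[str, Any]

# Identifier quoting is the only dialect difference modeled here.
QUOTE = {"postgres": '"', "mysql": "`"}


def quote(name: str, dialect: str) -> str:
    q = QUOTE[dialect]
    return f"{q}{name}{q}"


def compile_filter(node: list, dialect: str) -> str:
    op, field, _value = node
    if op not in {"=", ">", "<"}:
        raise ValueError(f"unsupported operator: {op}")
    # The value is never spliced into the SQL; it is bound as a parameter.
    return f"{quote(field, dialect)} {op} ?"


def to_sql(query: Query, dialect: str) -> tuple[str, list]:
    table = quote(query["source-table"], dialect)
    group = [quote(c, dialect) for c in query.get("breakout", [])]
    sql = f"SELECT {', '.join(group + ['COUNT(*)'])} FROM {table}"
    params: list = []
    if "filter" in query:
        sql += " WHERE " + compile_filter(query["filter"], dialect)
        params.append(query["filter"][2])
    if group:
        sql += " GROUP BY " + ", ".join(group)
    return sql, params


# What a click-driven UI might emit: count orders over 100, grouped by status.
query = {"source-table": "orders", "breakout": ["status"], "filter": [">", "total", 100]}
print(to_sql(query, "postgres"))
print(to_sql(query, "mysql"))
```

Because the UI only ever manipulates the tree, supporting another database means adding a compiler target rather than changing the front end.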

Notable Quotes:

"The ability to manage the Transpiler and just dealing with parse trees made the choice of Clojure specifically compelling."
— Sameer Al-Sakran [21:26]

"Our whole bag has been that we're the laziest possible option."
— Sameer Al-Sakran [17:09]

9. Caching Strategies and Data Freshness

Sameer outlines Metabase’s multi-layered caching mechanisms:

Query Caching: Storing recent queries to speed up repeated requests.
Pre-Computation: Regularly computed metrics and models to enhance performance.
Data Warehousing: Utilizing centralized data warehouses as read-only caches to aggregate data from multiple sources efficiently.

He acknowledges challenges with data freshness but notes that Metabase manages these through scheduled updates and handling inherent inconsistencies across data sources.
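
The trade-off between speed and freshness in the query-caching layer can be sketched as a time-to-live (TTL) cache. This is a generic illustration and not Metabase's implementation; the QueryCache class, its get_or_run method, and the injectable clock are all hypothetical names.

```python
import time
from typing import Any, Callable


class QueryCache:
    """Caches query results for ttl_seconds; older entries are recomputed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_run(self, sql: str, run: Callable[[str], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(sql)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh enough: skip the warehouse entirely
        result = run(sql)  # stale or missing: run against the warehouse
        self._store[sql] = (now, result)
        return result
```

A longer TTL means fewer warehouse queries but staler numbers; a shorter one does the opposite. Either way the cached result can only be as fresh as the warehouse itself.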

Notable Quote:

"Analytics still is not fully real time... there's often multiple writers into it that have different schedules."
— Sameer Al-Sakran [30:42]

10. Permissioning Model and Data Security

The permissioning model within Metabase is complex yet robust, designed to balance accessibility with security:

Collection-Based Permissions: Utilizing a folder-like structure where permissions can be set at departmental or functional levels.
Data-Level Restrictions: Ability to restrict access to sensitive data (e.g., PII) based on user roles.
Data Sandboxing: Creating secure environments where users can access aggregate data without exposing raw sensitive information.
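
A minimal sketch of how these three layers might compose. The Group class, its fields, and the helper functions are hypothetical and deliberately simplified; they do not reflect Metabase's actual permission model.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    collections: set[str] = field(default_factory=set)         # folders the group may open
    hidden_columns: set[str] = field(default_factory=set)      # e.g. PII columns
    row_filter: dict[str, str] = field(default_factory=dict)   # column -> required value


def can_view(group: Group, collection: str) -> bool:
    """Collection-based check: is this folder visible to the group?"""
    return collection in group.collections


def sandbox(group: Group, rows: list[dict]) -> list[dict]:
    """Keep only rows matching the group's filter, and drop hidden columns."""
    visible = []
    for row in rows:
        if all(row.get(col) == val for col, val in group.row_filter.items()):
            visible.append({k: v for k, v in row.items() if k not in group.hidden_columns})
    return visible


sales_eu = Group("sales-eu", {"Sales"}, {"email"}, {"region": "EU"})
rows = [
    {"id": 1, "email": "a@example.com", "region": "EU"},
    {"id": 2, "email": "b@example.com", "region": "US"},
]
print(can_view(sales_eu, "Finance"))
print(sandbox(sales_eu, rows))
```

Even in this toy version, three independent rules interact on every query, which hints at why the real thing becomes complicated.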

Notable Quote:

"Permissioning is kind of the bane of my existence."
— Sameer Al-Sakran [32:32]

11. Open-Source Philosophy and Business Model

Sameer emphasizes the importance of open-sourcing Metabase, citing benefits like transparency, ease of audits, and community-driven improvements. The open-core model allows Metabase to offer advanced features in its Pro version while maintaining a strong free offering.

Monetization Strategies:

Cloud Services: Providing hosted versions of Metabase for ease of use.
Advanced Features: Offering premium functionalities for larger organizations.
White Labeling: Allowing companies to embed Metabase within their applications under their branding.

Notable Quotes:

"We're open source first and foremost because I think that's the right way to consume software."
— Sameer Al-Sakran [35:31]

"Understand what you're going to charge for very, very early on."
— Sameer Al-Sakran [38:02]

12. Future of Software Development and AI Integration

In the latter part of the conversation, Sameer reflects on how AI, particularly LLMs, will reshape software development and business operations:

Value Shifts: As AI handles more coding tasks, human value shifts towards problem-solving, creative ideation, and system design.
Skill Evolution: Emphasis on strategic thinking over mechanical coding skills.
Continued Human Oversight: Despite AI advancements, human expertise remains crucial for ensuring accuracy and relevance in analytics.

Notable Quote:

"There's still going to be someone that... there's still going to be some number of people."
— Sameer Al-Sakran [41:01]

Conclusion

The episode provides an in-depth exploration of Metabase’s role in democratizing data analytics, the technical decisions behind its development, and the broader trends shaping the future of data tools. Sameer Al-Sakran articulates a vision where ease of access, open-source collaboration, and thoughtful integration of AI technologies drive more organizations towards data-driven decision-making without the overhead of expansive data teams.

Final Notable Quote:

"If you can reduce the barrier to entry... then you're going to get a lot more creative work that's going on."
— Sameer Al-Sakran [45:54]

Resources: For more information on Sean Falconer’s work and to access show notes, please refer to the Software Engineering Daily website.


Transcript

[00:00]
Sean Falconer
Data analytics and business intelligence involve collecting, processing, and interpreting data to guide decision making. A common challenge in data-focused organizations is how to make data accessible to the wider organization without the need for large data teams. Metabase is an open source business intelligence tool that focuses on data exploration, visualization, and analysis. It offers a lightweight deployment strategy and aims to solve common challenges around data-driven decision making. A key aspect of its interface is that it allows users to interact with data with or without SQL. Sameer Al-Sakran is the founder and CEO of Metabase. He joins the show to talk about the challenge of data accessibility, the evolution of the data analytics field, key lessons from his 14 years leading Metabase, why the platform uses the Clojure language, and much more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
[01:09]
Sean Falconer
Sameer, welcome to the show.
[01:10]
Sameer Al-Sakran
Thank you, thank you. It's great to be here again.
[01:12]
Sean Falconer
Yeah, thanks so much. So we're talking analytics today. I think, you know, various approaches to analytics have been around for a long time. There's been a bunch of different generations of BI tools, things like Tableau and Looker. And there's also, I think, different takes on the problem of analyzing data: things like driving insights using things like Streamlit, where you're actually building your own dashboards. And then there's frameworks like dbt that are focused on data transformation and preparing the data before it reaches the visualization stage. And Metabase has been around for nearly a decade now. You've probably seen a few evolutions of products in the space. So what's the background of the company, and where does Metabase kind of fit into this world of analytics?
[01:53]
Sameer Al-Sakran
Yeah, I think we're the last mile. So in general, in most companies there's data that people are interested in that lives in one or more places, depending on how buttoned up you are. It could all live in one really gleaming, perfect data warehouse where everything is perfectly organized and everything's right. Or it could just be this complete defiderated mess. I think we are the thing that lets you get data into the hands of the people that have real jobs. And so I think one of our core center points has always been the poor sucker with the day job. And really it's about how to get as much of their curiosity satisfied by them, without help from anyone else, as possible. So, you know, there's been, like you say, a lot of really compelling ways to instrument your company to have better understanding of what's happening, to have better awareness of what certain segments of your customers are doing, what people are clicking on, who's signing up for what, when. All these things have been pretty dialed in for decades. I think the playground that we're trying to really do well in is just: there's a normal person that has questions, and we don't want there to have to be an analyst or an engineer that has to deal with every single question that comes up. So a way to think about us is that most times you see a dashboard, there's a certain set of questions that dashboard answers, and then there's an easily anticipated set of, like, questions 3 through 20. And we're trying to make it really easy for someone to answer those. So, you know, compared to a Looker or a Streamlit or Tableau, we're less about creating a dashboard that's consumed as is, and more about creating a dashboard that really just sparks some amount of interest or curiosity, and the subsequent clicks and subsequent iterations and refinements are where a lot of the magic of Metabase shows up.
[03:41]
Sean Falconer
So essentially the typical sort of target user then is like a non technical user that needs to be able to not only analyze the data, but perform essentially this discovery process. Because they might not even know necessarily what they're looking for or what questions they have. They want to be able to kind of like mix and match and explore.
[03:58]
Sameer Al-Sakran
Yeah, that's the final person, the final constituent. I do think that analytics is often a multiplayer game and there are different roles people fill. And we are generally set up by engineers. So for the most part, an analyst is not the person who is setting up Metabase. It's usually an engineer, and it's usually someone that has a database that's lying around. And there's people out there in the company that need stuff from the database. And so usually the person that downloads the Docker image and then runs it is not the final user. It's the person that is serving the final user. So we do separate, in our own internal lingo, the installer persona versus the end user versus analysts and pro users. But in general, our heart and soul is helping the poor sucker of the day job get their questions answered themselves, or at least a subset of that, and taking the burden off of the engineer that's currently a bottleneck.
[04:51]
Sean Falconer
Okay, yeah. And I definitely want to get into some of the details on the configuration and setup process, but maybe before jumping there, since you've been in this space for a long time, and I think you founded multiple companies related to analytics: what are your thoughts on the evolution of this analytics stack through your career? And what surprised you? What trends, I guess, are you paying attention to today that maybe were not on your radar a few years ago?
[05:15]
Sameer Al-Sakran
I mean, I think the big one and the easy one is just that natural language processing finally kicked up a viable solution to many things, and it turns out to be even more broad ranging than previously feared. So I think that's an easy knee-jerk response. I do think that's part of a larger secular trend that I've been following for, I don't know, I guess decades at this point, which is just that tools have gotten easier to use. There is an intrinsic complexity in analytics, and there are certain things that are just naturally annoying about how to calculate net revenue retention, for example. There's some math you've got to be aware of, there's some choices you have to make, and the actual equations are kind of annoying, and encoding those in SQL or Python or what have you is just kind of a pain. But there's also a lot of just unnecessary complexity that has over the years been chipped away at. If you were trying to calculate anything in the 70s you had to write a bunch of code, and gradually SQL took over, and you kind of walk that forward to Excel, then SQL, Tableau, and kind of the last couple generations of more, like, post-iPhone software, where everyone just has a much higher bar for interaction quality and just ease of use. And I do think that there has been a very palpable sense of simplification of the tools themselves. And it's not that what users are doing is getting simpler, it's just that there's less self-inflicted annoyance. And so I do think that just the general user experience has improved dramatically over the last 10 or 20 years. I do think there's also a lot of interesting things happening around data shaping itself. I think, you know, there's always been a question of how you should store data, what the appropriate format is, how to deal with consistency. There's all kinds of textbooks about data warehousing.
But I do think one of the things that I wouldn't say is surprising, but that I don't think I would have given as much weight as I currently do, is that the success of a self-service data organization largely revolves around what schema you present to users. And so, given a choice of where to spend time, it's getting the schema cleaned up, and specifically in a way that lets a normal person with a normal cognitive model of their business look at that and recognize what they're looking at. I do think there's often data-at-rest formats that make a ton of sense from the efficiency or consistency or just convenience perspective that essentially make it impossible for anyone that's not eyeballs deep in the actual code base of that application to make sense of.
[07:45]
Sean Falconer
Yeah, so what's that look like in order to present like a schema that's understandable by someone who's not in the weeds of the database or the data warehouse?
[07:54]
Sameer Al-Sakran
I think it's fundamentally about resisting the urge to normalize everything, and having workhorse tables that are both enriched and manageable in size. So anything over, like, 20ish columns becomes harder and harder to use. The columns should have English, or whatever language your company runs under, names. Like, you should be able to understand what's in a column without having to look something up. And there should be a relative, let's just call it simplicity, to how concepts are represented. So, like, users have addresses, and ideally it's not two tables with a foreign key from an address into the user. While that may be accurate, and it may represent the fact that certain people have addresses historically, that sort of thing makes it really difficult for someone that's just trying to look up customer data to make heads or tails of it. And there's a lot of things like that where you probably want to have specialized data sets that are just views on whatever the data looks like at rest, and you probably want to iterate on those based on department or use cases. And so there's a lot of things that are very brilliant ideas that an average database designer would have that essentially make that data set unusable by anyone who is not as smart as them. And I do think there's a need to, I don't want to say smart, because I think people in tactical roles are generally fairly intelligent in my experience, but they know their world. They don't know the world of relational databases, and making them learn the world of relational databases to get their stuff done kind of puts up an initial barrier. If you'll forgive me, I've often used the blogging metaphor for this, where once upon a time, to write something on the Internet, you had to learn PHP and how to do various command line shenanigans and set up this, that, and the other thing. And at some point, the amount of effort it took to set up a blog was more about configuration and installation than it was about the quality of writing.
And hence, you know, you had really, really good writers that weren't able to get their word out. And when you reduce the technical burden of how difficult it is for someone to get the written word out there, the people that are actually skilled at writing, versus skilled at setting up Unicorn or nginx or what have you, are the ones that actually get the word out. And so I think something similar happens with most organizations, where inside of a company the people who have the most nuanced view of, like, revenue retention or active users or the specific mechanics of a checkout funnel are not necessarily people that know how to write Python or SQL. They're the people that are running that funnel or that retention analysis, and that actually are talking to users and have a fairly specific understanding of what people are doing. And they're the ones that know whether you should count a plan upgrade as part of, like, whether you should attribute that to the retention of the originating plan or the final plan.
[10:38]
Sean Falconer
I think also, besides having that domain-specific knowledge that the person who's maybe writing the query doesn't have, sort of the business side of it, the different people also are going to have different perspectives and experiences, where they might be able to solve a problem by bringing in new information from this other domain that feels disconnected. But because they have experience in it, they're able to essentially recognize those patterns across things that on the surface maybe seem disconnected.
[11:08]
Sameer Al-Sakran
Yeah, for sure, yeah. I mean, I mostly agree with that. But I do think that, in terms of just the data set shape, just to kind of pop the stack a little bit, one of the critical things is not prematurely abstracting, and letting the different usages of data sets have different data set shapes. I know that's not exactly what you're talking about, but. Sorry, my brain just went off on a rail there.
[11:31]
Sean Falconer
One of the things you talked about there, I mean you had the analogy about blogging, like if we can essentially reduce the friction to getting up and allowing someone who's good at writing to write and not have to deal with sort of these technical hurdles, then you're going to end up with a lot more people just writing it. So if we can reduce the sort of technical hurdles and configuration steps involved with accessing data, then we're going to have a lot more people who are maybe good at actually driving insights for the business from the data available to do that. Now you know, you mentioned essentially all the interest of course around LLMs and I think there's a number of companies that are trying to leverage Generative AI now as essentially this like interface to democratize access to data. And I'm curious, like what are your thoughts on that and how does sort of metabase fit into that world?
[12:16]
Sameer Al-Sakran
Yeah, I mean, I think there's maybe two different angles on that that I'd cut, and then the remainder. So kind of carve off two pieces and there's some residual. I think that it's pretty clear, to me at least, that some subset of people want to talk to the computer. And the idea of unstructured, just natural language as an interface for existing functionality is pretty much written into the timeline. I suspect that there's going to be some set of things where people will just naturally and organically want to start talking or typing in a way that's conversational natural language. And so I think there's going to increasingly be just the hard expectation for all tools, in analytics or otherwise, to support that as a UX paradigm. Much the same way as when mice existed, all of a sudden you need to have a menu system, and if you don't have a menu system and everything is command shortcuts, you're kind of weird and you have to kind of explain yourself. Now, there's still tools that are 95-plus percent driven by keyboards today, even in a world with phones and touchpads and mice. And I do think there will be, going forward, a need to figure out where squishy natural language is the right user interface. I think there's a separate notion of using LLMs to generate queries or generate analyses or generate deep dive execution plans or whatever you call them. I think I'm somewhat less bullish on that. Let me invert that and say I'm fairly excited about what you can do with agents that are wielding deterministic tools. And I think that there is going to be a lot of ways to push forward what a malleable, squishy agent that is basically working in LLM land, with hallucinations, with all the usual caveats you have there, is able to give users if it is able to then invoke tools that return absolutely correct numbers.
So I think one of the things with analytics is it's a very harsh place in terms of expected accuracy, and if something is wrong 2% of the time at organizational scale, that just doesn't work. Like, if 2% of your numbers in your company are wrong and you just can't tell which 2%, that really doesn't fly. Especially if that 2% changes randomly on you. So I think that trying to generate SQL, or generate whatever target language you have, is probably a rocky road, and that will work well after, I think, the game has been played and won. And I do think that what is exciting, and what I think will start taking hold, is: I have this toolbox of deterministic stuff, I have single or multiple agents that can use that, and then a lot of the heavy lifting is going to come from the actual deterministic tools themselves. And just to kind of bring it all in, I do think it's important that if the machine produces a number, that number is right. And I think that a world where the number is just like, eh, it's kind of cool, rapidly falls apart once you're talking about real operations with real stuff that people care about. I do also think that most people that are not in analytics underestimate how much time goes into understanding why number X is not the same as number Y. So, you know, my revenue number here is 1.25. My revenue number here is 1.27. Which one's right? And working analysts tend to spend a disgusting amount of time dealing with that. And so I think things that make that harder are, net net, a larger burden on analytics. But in terms of what's happening with Metabase, I just think that for us, our target persona has always been the non-engineer or the non-analyst. And I do think those people are rightfully going to want to talk to computers. And so, you know, we've had two different iterations of a chatbot. We're constantly playing around with stuff. We have a couple dark alphas.
I do think that we're also playing a lot with how LLMs can... You know, we've had various classification, clustering, recommendation algorithms woven into the code base for ages. We've gradually played around with replacing some or all of those with LLMs and LLM invocations. But I think that there's some very, very interesting stuff around, again, the new idiom being "I can talk to the computer." And so I think that's where we're putting a lot of our chips. And an LLM as an analyst, I think that'll happen. I just think that'll happen well after a bunch of really, really cool stuff gets produced in other ways.
[16:37]
Sean Falconer
Yeah, I mean, I think that what you're saying is right. You need to start with the types of tasks that LLMs are reliable for today, especially when we're talking about analytics. Like, you can't get the wrong revenue number, you can't get these numbers wrong, or it's going to lead to all kinds of problems. But, you know, back to Metabase: if I'm using this product and I want to get started with it, what is that process? I'm assuming that an engineer is the first person that's working with Metabase to get this set up. What is that setup and configuration process?
[17:10]
Sameer Al-Sakran
Yeah. So our whole bag has been that we're the laziest possible option, and I think that we've tried to make it very easy for someone to spin us up alongside a very early stage project. And so you just pull down a Docker image, you run it, you point us to your data warehouse, or at that point just your database, and give people accounts. There's a couple SSO options in the open source version, there's some better ones in the Pro version, but generally just download a JAR if you run JARs, download the Docker image if you don't. Or, you know, we have a cloud service if you don't want to do either. But I think it's literally a couple of minutes. And for folks that are super early in the cycle of their product or their project inside of a larger company, we actually suggest that you don't do anything else: don't make dashboards, don't write reports, let that happen organically. But setting us up before there is an analyst is usually our very strong recommendation, because it can delay the need for an analyst by just having there be a controlled place where people have accounts. They can run SQL questions if they know how to write SQL. You can give them SQL templates to run, there's a query builder they can use on their own. There's potentially lots of easy ways to click and hunt and peck their way to nirvana. But I do think that for us, the primary thing that we're trying to do is delay the need to get serious about data. So I think that there's a certain desire people have to set up a data warehouse, to set up dbt, to set up a bunch of other stuff. And that all makes a ton of sense. But you should probably have something that lets the normal humans in your company ask questions months or years before that moment.
[18:48]
Sean Falconer
Yeah. And then is the cloud service the main way you commercialize?
[18:52]
Sameer Al-Sakran
It's one of the main ways. I think that there's three ways we commercialize. One of those is just: hey, you don't want to run it yourself, we'll run it for you. We do have an open core model, so there are some features that will help you at a larger scale that you can buy from us. And then if you want to slap your logo on it and embed it in your application, there's a separate license for that. So potentially, if you want to white label us in your application, that is also a thing for you.
[19:15]
Sean Falconer
Okay. And then as a user interacting with this sort of the front end of this, what's that experience like? And then what is going on behind the scenes to essentially pull the data?
[19:24]
Sameer Al-Sakran
Yeah, I mean, there's a couple of different folks that I'll talk about, and I think the person who's setting this up is probably going to be smashing SQL together. So you kind of show up, you hit a button, you can write SQL, you can save that, you can build dashboards. And so there's a power user mode, effectively, where if you know what you're doing, you can do all kinds of rich dashboards, templatized SQL, data transformations, model things, persist models, et cetera. I think there's also, from the end user perspective, just the ability for me to click on stuff. When I click on stuff, it changes. And then I can use a simple query tool and I can just click on buttons and I get answers. And so for that we have a target language called MBQL. It's just kind of a pre-parsed, pseudo-SQL-ish kind of thing. Our user interface generates MBQL; MBQL then gets transpiled to various SQL dialects, or Mongo, or basically a couple other community drivers for non-SQL-based languages, and that gets executed. So everything that is run runs on your database or data warehouse, then it gets pulled back, there's a bit of post-processing, and then it gets chucked over to the client. So for the most part, for a whole host of reasons, we don't want to generate SQL directly and we don't want to force people to have to write SQL directly. So the heart of the application is a transpiler.
[20:37]
Sean Falconer
So who's writing the MBQL statements?
[20:40]
Sameer Al-Sakran
The computer is. So I click on stuff, and then we have essentially React components that do some stuff. They invoke an MBQL lib library; this is in ClojureScript. The ClojureScript manipulates this parse tree, effectively, and then that gets kicked over the wire, and that's how most of our queries get represented.
[20:57]
And then how many different languages do you have to transcompile into?
[21:02]
I always get this wrong, but I want to say there's something like 20 first party drivers and then maybe another 10 third party drivers. So we wrote a bunch of drivers for common databases, and then every once in a while someone in the community writes something for a database we don't support, but it's on the order of 30 different databases or targets of MBQL.
[21:22]
Okay, and why did you choose Clojure as the development language?
[21:26]
I mean originally it was Python, so the first version of this ran in Python. And then we thought about the deployment and installation story. So I kind of glibly mentioned that we made installation and configuration really easy. We actually went through a lot of trouble for that, and we used Clojure to do that in many ways. So we wanted to have a single atomic binary that you can download, we wanted to have mature database drivers, and we really didn't want to be forced to run lots of weird processes in the Python, like, Docker image where there's just multiple modes of failure. And so we ended up deciding to use a JVM language, tried to port the Python to Scala, that didn't go all that well, and then decided to move to Clojure after, like, a week of banging our heads against Scala. And it was specifically the ability to manage the transpiler and just dealing with parse trees that made the choice of Clojure specifically compelling.
[22:20]
So that was the main advantage over, say, writing the code directly in Java against the JVM?
[22:26]
Yeah, so we knew we wanted JDBC drivers. I still think that, in general, the driver ecosystem in the Java world is pretty robust and pretty reliable, especially compared to Go or JavaScript at the time. JavaScript's gotten better. Go is still what it is. It's all right. And yeah, so then there was the choice between writing it in Java or Scala, but the decision to use a JVM was probably the first decision to be made.
[22:49]
Has that choice around the language been a challenge in terms of bringing in new engineers to the company? Like, is it harder to find people who know that language?
[23:01]
No, not at all. I think, I mean, it's actually been beneficial, net. I think a lot of people want to write in Clojure. It's one of those languages which just has a specific set of ergonomics and, you know, if you don't like parentheses, sorry, it's really not going to make you have fun. But I do think it's given us a pretty concrete advantage, where lots of people just want to write Clojure for a living and we have that as a benefit of working on our code base. So I think that from that perspective it's been very, very beneficial. I also think that, I mean this is my personal opinion, but there are good engineers and bad engineers, and good engineers can pick up new languages. Not really as a consequence of being good engineers, but I think that if you're a good engineer in C or F, you can probably learn Clojure. And so in general we have been very cool with people coming in wanting to learn Clojure who won't necessarily have it dialed in yet.
[23:49]
Are there certain advantages or disadvantages to running on the JVM for this particular application?
[23:55]
The main advantage we have is specifically for the open source, self-hosted world, where it's just a single file. You download an uberjar, it's a single download. It either works or it doesn't. You run java -jar, and it either works or it doesn't. And there's just a certain predictability and atomicity to the installation. So that's been a huge, huge thing. And I really don't think that we would have grown as fast or as well had we had a 20 page installation process that required compiling native extensions and, you know, scouring some repository for the right version of something. So our ability to build that single binary has been critical. And I still think that was categorically the right thing to do all along. You know, dealing with the JVM is a dark art, and there are certain times when we've had to deal with strange memory issues. Debugging some things in Clojure land and JVM land has been challenging at times. The ecosystem is definitely leaps and bounds ahead of where it was when we started. And yeah, I'd say that it's probably a bigger, fatter binary than we might have gotten in other places. Because it's an uberjar, because it has everything bundled in it, it is just a heavier file than if it were a stripped-down code base where you go pull in all your dependencies.
[25:15]
With the transpiling to different versions and flavors of SQL in different DBs, were there particular hard engineering challenges with creating that?
[25:25]
I mean it was a pain. Yeah. So it's a lot of code. I've lost track of exactly how much it is, but I want to say it's like 50,000 to 70,000 lines of just fairly dense Clojure. There's a ton of just adjacent stuff we use. So it's highly non-trivial. I think it's fairly gnarly, complicated code. It was a difficult task. I think the folks on the team did it really well. You know, we've gone someplace really cool with it, and I do think it is a fairly difficult undertaking that people managed to pull off, and I do think we've gotten a lot of benefit from it. But it was probably a dumb idea. Like, looking back on it, I was like, hey, we're going to write a compiler. Probably a more sensible, measured person might have been like, yeah, let's try to figure out a way to win without doing that. So, yeah, in some ways it was taking the hard way down the mountain.
[26:17]
What do you think? Like, if you do it again and you go a different direction, like, what's that direction look like?
[26:22]
I still think I would make the big decisions the same way, given what I knew. And I still think that having a target-independent intermediate language is the right way to do it. I think, doing it all over again, I'd probably really change the level of granularity and abstractness of the language, and have it be even further away from SQL than it actually is. And I do think that one of the things that has been challenging is that every once in a while there's a set of conceptual domain models we have about userland, like metrics and models, these things that live in that world, that are hard to map to MBQL primitives. And so there is a tension with the primitives MBQL is built off of. I'd liken it to this: if SQL is assembly, MBQL is like C. And if I were to do it all over again, rather than creating a C compiler, I would have created a Lisp compiler, where there's just the ability to have a higher level DSL closer to what actual userland concepts are, rather than having to try to express things in userland down at the level of abstractness of C on top of assembly. I'd rather have had more scaffolding and, in some ways, more abstract concepts that build up that target language.
[27:51]
Do you think, if you wanted to go in that direction and basically build this different level of abstraction, is that something that would be a reasonable project to take on now, or do essentially too much time and too many product dependencies exist for the MBQL system?
[28:10]
I think it's one of those things where there's a lot that's working that we don't want to mess up. And so rewriting the target language, which I want to say is at the center of something on the order of 200,000 lines of code, is that additional benefit worth it? I'm not sure. I think that given where we got to, things worked out. I think we probably could have gotten here faster. So some of this is not just are we at a place that is good, but it's also that getting here took a while. And I think that we could have speedrun a lot of it by having better abstractions. So I think for things like metrics and models and some of the higher concepts we now have, and the way we deal with dimensions, the way we deal with column abstractions, unifying those across different databases when they point to the same thing. So for example, latitude really means the same thing in any database. It's not like a column is latitude, there's just a latitude concept. I think we could have speedrun to where we got to in maybe half the time by having a higher level scaffolding, but I don't know if I would rip it all out now.
[29:16]
Is there some level of caching of the data that's happening within Metabase as well?
[29:21]
Yeah, so there are a couple of variants of caching. The simplest one is just, hey, you run a query, we'll cache it for you. And that has some speedup at some level. Like, I don't know, this is caching, right? Different vendors have different ways of saying we can speed up your stuff by 2000%, or whatever. So we have in-memory caching, and we do a fair amount of precomputation, especially for models and metrics, where we will essentially precompute on some schedule or in some push-based manner. So those are two different ways of viewing it. And then there's sort of a more manual version where, as you start thinking about cross-database data sets, you just have those live in a centralized data warehouse or some sort of centralized place. And depending on how you structure things, you can do that as a cache where you're pulling things from a database of record, stuffing them into this other place that's much faster, and then using that as kind of a read-only cache. But then it's pulled, or usually pushed, from the data source databases. So two-ish levels of caching, and arguably that third level as well.
[30:31]
Do you run into any challenges with the data getting out of sync, so that what the user's pulling is being pulled from the cache, but the actual underlying data has changed in some significant way?
[30:43]
In theory, yes. In practice, not that often. I think that usually manifests when something's busted. So data staleness is usually the way that this stuff comes up, as opposed to the cache itself being a problem. In general, we cache things for N seconds or N days, but I think a lot of analytics still is not fully real time. So you don't have a single database that is consistently and always and forever up to date. There are often multiple writers into it that have different schedules. And so it's not uncommon to either have daily numbers for some data sets or to have, for example, every 20 minutes you pull Salesforce and you get some stuff. And so the underlying data set often has a distribution of data freshness. And I think that the overall analytics profession has just kind of learned to absorb this and to try to find ways both to live with the fact that there will be different data freshness, and to propagate freshness through lineage or through whatever tools you have, as well as to calculate the numbers that matter in a way that doesn't require you to hit a fully fresh data set that's completely consistent. So just as an example, I think we pull from 20 different data sources into our data warehouse. We have stuff in Stripe, stuff in our CRM, stuff in different services we run. And those are all happening on different schedules, and they're not all happening exactly on the minute that the data point gets generated. So there is often a little bit of soft inconsistency, but for the most part you can get around the implications of that most of the time.
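The "cache things for N seconds" behavior discussed above can be sketched as a small time-to-live cache in Python. This is an illustrative assumption, not Metabase's implementation: `QueryCache`, `get_or_run`, and the injected `clock` parameter are hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple

class QueryCache:
    """Hypothetical in-memory cache that reuses a result for ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_run(self, sql: str, run: Callable[[str], Any]) -> Any:
        """Return the cached result for sql, re-running it once it has expired."""
        now = self.clock()
        entry = self._entries.get(sql)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh enough: skip the database entirely
        result = run(sql)
        self._entries[sql] = (now, result)
        return result

# Usage with a stand-in for the real database call.
cache = QueryCache(ttl_seconds=300)
rows = cache.get_or_run("SELECT 1", lambda sql: [("row for", sql)])
```

A cache like this can only serve data as fresh as its last run, which is the same kind of bounded staleness the answer above describes for the upstream data sources themselves.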
[32:28]
How does the permissioning model work and how fine grained is that?
[32:33]
So permissioning is kind of the bane of my existence. And if you were to ask me, like, what did we mess up? You know, a lot of those roads go to permissions. I do think it's actually really, really hard to construct a permission system that gives everyone the knobs they need without creating a monster. So I think that we've had very different perspectives on this over the years. And so maybe just to make this somewhat entertaining, people can have a good time off our misery. You know, once upon a time we were just really centered around this idea that you give people access to data, and then the actual products of the data figure out whether someone has access to a given report or not. That didn't really go down very well. And we, not rapidly, but after a lot of kicking and screaming, were pulled over into a world where we have a parallel system of folders, where you have collections, collection permissions, and sub-collections. And so there's the ability to lock things down department by department or, you know, function by function. Anything you put in collections, you can kind of use that folder metaphor. And people have read, write, and admin access to those. We simultaneously have the ability to lock things down by data sets. So for example, you can say, you know, these three tables have PII, and these eight groups can't touch that. So you're not able to look up user addresses, for example, if you're an intern. And then on the more paid side, we also have data sandboxing, where you have the ability to lock things down by column or row, where you can basically say, interns are allowed to see aggregate metrics based on users, but they're not allowed to look up phone numbers of customers.
And so there's effectively three different permission systems: just data access, collection permissions, and then lastly, more bespoke and more complicated conditional ways of either creating hierarchical permissions or column- or row-level controls.
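The three layers described above (collection permissions, table-level data access, and column-level sandboxing) can be sketched as a toy model in Python. All names here are hypothetical, and the rule that a sub-collection inherits from its nearest configured parent is an assumption made for the example, not a statement about how Metabase resolves permissions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Group:
    """Hypothetical permission group combining the three layers."""
    name: str
    collection_access: Dict[str, str] = field(default_factory=dict)  # path -> level
    blocked_tables: Set[str] = field(default_factory=set)            # data access
    hidden_columns: Dict[str, Set[str]] = field(default_factory=dict)  # sandboxing

def collection_level(group: Group, path: str) -> str:
    """Return the access level for a collection path such as 'marketing/campaigns'.

    Assumption: a sub-collection inherits from its nearest configured ancestor.
    """
    parts = path.split("/")
    while parts:
        level = group.collection_access.get("/".join(parts))
        if level is not None:
            return level
        parts.pop()
    return "none"

def can_query_table(group: Group, table: str) -> bool:
    """Table-level lock: blocked tables cannot be queried at all."""
    return table not in group.blocked_tables

def visible_columns(group: Group, table: str, columns: List[str]) -> List[str]:
    """Column-level sandbox: drop hidden columns from an allowed table."""
    if not can_query_table(group, table):
        return []
    hidden = group.hidden_columns.get(table, set())
    return [column for column in columns if column not in hidden]

interns = Group(
    name="interns",
    collection_access={"marketing": "read"},
    blocked_tables={"payroll"},
    hidden_columns={"users": {"phone", "address"}},
)
print(collection_level(interns, "marketing/campaigns"))
print(visible_columns(interns, "users", ["id", "phone", "signup_date"]))
```

Keeping the three checks as separate functions mirrors the point made in the answer: these are parallel systems that each answer a different question, which is part of why a combined permission model gets complicated.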
[34:33]
Can I also control, you know, if I create some sort of view of the data, can I control what level of access someone has to manipulate it? So I could essentially create a view of the data that is maybe like a read only view that I embed in my application.
[34:47]
95% of our usage is read only. So in general we do have the ability to do write back, but that's not a common thing. But yeah, there are definitely lots of ways to create safe little sandboxes for people that have differential trust to play in. A lot of what we sell is really things that help you in these various scenarios. So I think for most people that are operating Metabase in a pretty high trust environment, where everyone has the same permissions and you're all part of the same team, the open source version is more than good enough. And then at some point, as you have less trust and less homogeneity in your group, the paid features kind of really kick in.
[35:23]
You have around, I think, 40,000 GitHub stars. So tell me about the motivation behind open sourcing Metabase.
[35:32]
Yeah, I mean, I'd say we probably have fewer stars than we should given our footprint. I think we've never really played much in the way of the GitHub vanity metric games. I think we're open source first and foremost because I think that's the right way to consume software. And I think that if you're running something in your data center and you're touching data warehouses that matter, I actually think that open source is a better format to consume it in. I mean, if you want to consume a service, that's great, those work out really well. But I do think there's something to your data stack being open source first and foremost. I do think there's just a lot of things that that simplifies. It's easy to do audits, it's easy to be paranoid about security measures, it's easy to fix things if you feel like they'll not be fixed by a vendor at a speed you like. Maybe this is just me talking from my own formative career, but I've often had to run software from vendors that was just not being fixed or was breaking in weird ways. And the ability to go into the source code and muck around was something that I really valued. And so just on a personal level, I just think that's how most software should be delivered, at least at this point in time. As the world changes, my opinion there will change. And I think, given that it has a lot of interoperability, so we're targeting 30ish databases, having people be able to inspect the drivers and be like, actually the way you're hitting the index here is kind of hokey, you should do it this way instead, is very beneficial. And I do think that we have gotten a ton from being an open source project in terms of information, adoption, and usage. And so we still very much appreciate people complaining. I know it sounds kind of weird, but we get a lot of value from people complaining because it gives us a pretty clear sense of who wants what and how badly they want it.
And I think it's an amount of information that in other contexts I would have spent a lot of money to generate. And so having something that is in the public eye is actually very valuable just on its own.
[37:31]
For something like this, where you talked about how you feel like open source is essentially the model by which software should be consumed: from a business side, the value that the business is bringing, where they can charge money, is no longer essentially the lines of code that they've written. They have to find other ways of bringing value. So in companies that are open source first or really investing in open source, how do you think they need to think about bringing business value? Because at some point they have to pay bills, essentially.
[38:03]
The general frame that I have there is that you should understand what you're going to charge for very, very early on. So I think that it's dangerous to write the project, release it, run it for a year or two, and be like, gee whiz, how do we make money off this thing? And so I think that most software ideally has a specific user, it has a specific set of constituents and people that get value from it. And you should understand who's using it, why they're using it, what they value, what the other cast of characters are, and then, assuming you're going to commercialize it, what the lines of commercialization are, and then try to do a really good job of sticking to those lines. So I think, you know, we from very early on knew we wanted to charge for white labeling, and that if you wanted to embed us in your application, that's great. But we're an application first and foremost. So if you want to white label us, that's going to be a paid thing. We're not building an open source library to build your own analytics applications. We're explicitly building an application that you can embed. And I think that has created a lot of clarity. It made it easy to just understand how the roadmap should look. It hopefully made us predictable to our users. So I don't think we've ever pulled any rugs out from anyone where we took back features or did anything too capricious. And so I think that if you're planning, as an entrepreneur, as a founder, or as a company, to release software through open source, understand what people will eventually pay for. And I think the clearer that vision is and the more justifiable it is, the more likely you are to get the lines right. And I think there's a lot of projects that have tried to commercialize and it's kind of bombed. Like, for a long time there were no open source companies, then there was a flurry, and then a lot of them kind of had a come-to-Jesus moment.
And I think that one of the things that has separated the people that have won has just been some sense of, okay, this is why someone pays. And I think it's important to separate out the winning products, because I think without a winning product, you're not really playing the open source game. You're just kind of having some weird half-assed marketing side adventure. So it's understanding what you're giving away and why, and why people want it, and making sure that it actually can replace the alternatives and it's not just a crippled version of it. And secondarily, cool, if you win that, what exactly is it you're selling? And for us, I think a lot of that just boiled down to understanding, you know, the installer and then their boss. We try to make the things that installers value free, and the things their bosses will demand paid. And that was kind of the general heuristic we ran with. You know, it's worked in some ways, not in others. But I think having something like that from the very, very early days, that you believe in and that you're able to validate somehow, some way, even before you start charging money, is really important.
[41:02]
Yeah. So I mean, I think that what you can charge for, and how people evaluate the value that they're delivering, has changed over time. There was a time where you could write shrink-wrap software and you were explicitly charging for that software. Obviously you'd bring value, but you were in a lot of ways charging for essentially the lines of code that you had written. And I think now, especially with managed services and other ways of monetizing and commercializing businesses, the model has changed so that you can essentially give away the source code, and the value is not there, it's something else. It's in making it really easy to run, or in certain enterprise features that are maybe not available in the open source model, or whatever it is. Do you think that now, where more and more code is being written with at least the assistance of AI, that in some ways lowers the value of the lines of code even more, so that it makes sense to figure out other ways of delivering value to your customer?
[42:01]
I mean, I think this depends on what the implicit rate of improvement for AI is. So I think there's a version where, like, no humans have any value, therefore don't bother. I'm not quite that extremist, but I think there's another version where it's mostly just going to be where it is today with slightly better ergonomics. And somewhere between those two poles is the path that we'll be on. The reason I bring that up is there are some parts of that spectrum where the ability to turn arbitrary incantations in something resembling natural language into something that works remains very, very valuable. And, you know, LLMs and copilots and all that are really just a higher level language, but you're still fundamentally working in a higher level language. And in some ways the LLM is really just a compiler or interpreter for your super weirdly leveraged DSL. And there's still someone that has to make the incantation, and the people that can make that incantation will have valuable skills. And the people that are able to pull that together to solve actual problems are still valuable. I do think that as the level of skill required to build a certain system decreases or changes, it starts to shift value to the people that are able to understand what to build. And the relative value of someone that knows, actually, I need to build this specific Lego to make money, is even more important. But I still think that for most of that spectrum of how far AI goes, there still needs to be someone holding a wand and speaking the incantation. I just think that the nature of that language will change. And how much of the value is in the prompting versus the actual post-processing or pre-processing, how much of it is in the actual model you train, how much is in fine-tuning, there's going to be a lot of stuff that is still high value that has to get done by somebody.
Unless you assume that LLMs and the systems you build around them get so advanced so fast that all that gets done by them, there are still humans going to be doing all this. And whether we call them a software engineer or a prompt engineer, or a product creation specialist or a magician, it doesn't really matter. There's still going to be some number of people. I do think that it will probably change the leverage. So you will not need a thousand software engineers to build something. You might only need 10 prompt engineers to build something of equal scale. And hopefully this means we do bigger and crazier stuff, and that we have better toys in the future, and that we're able to tackle bigger projects. But I still think that for, again, quite a portion of that spectrum, there will be someone. And companies will still need to figure out what those Legos are, identify them, build their best version of that Lego, and then somehow find a way to get in front of people and have people want to buy from them.
[45:12]
I think this is kind of a nice way to tie things back to what we were talking about at the beginning, where you used the analogy about blogging: if we can reduce the configuration and setup steps to help people who want to write and put their stuff out there, then you're going to get a lot more creative work going on. And I think it's similar here. If you can lower the barrier to entry for creating a lot of code and eventually, you know, products, then I don't think there are fewer people doing that stuff. There are actually more people doing that stuff. Because now, it's not that anybody could do it, but someone who has some level of skill can now create some kind of product experience, or at least we'll get there at some stage.
[45:54]
If I can give maybe a concrete example which might crystallize this, I think, again, barring some weird singularity, we're probably going to still want iPhone workout apps, and someone's going to have to build the best workout app. And the question is whether the primary skill behind the person building those apps will be having at least this level of proficiency with iOS development and Objective-C and blah, blah, blah, or whether it's who has the best idea for a workout app. I think that what's going to happen is that those mechanical skills, which were critical when the iPhone launched, when the best workout app of the first generation came from whoever was able to write a bug-free app, will give way to who has the best ideas around how to structure the thing. But there's still a market for it. You still got to build it, and you still got to build a better one than the next person. And there's still going to be people that build that app. They just might have a different title and might be working in a different editor.
[46:52]
Well, Sameer, thanks so much for being here. I really enjoyed the conversation. We ended up going deep at the end, which I like. I think there's a lot to digest, especially when we're talking about products that are really focused on reducing the barrier to entry or the friction involved with accessing, analyzing, and driving value from data.
[47:11]
Likewise. Had a great time. Thank you for having me on here.
[47:14]
All right, thanks and cheers.