
Podcast Summary: Software Engineering Daily

Episode: Context-Aware SQL and Metadata with Shinji Kim

Date: September 4, 2025
Host: Sean Falconer
Guest: Shinji Kim, Founder & CEO of Select Star

Episode Overview

This episode examines the challenges and opportunities in metadata management, data context, and AI-driven SQL generation in modern organizations. Guest Shinji Kim discusses how Select Star builds dynamic, context-rich knowledge graphs over enterprise data, facilitating improved discovery, trust, and operational efficiency for both teams and AI agents. They delve into the technical hurdles of metadata curation, the importance of semantic layers, and how context-rich metadata is transforming the effectiveness of LLMs in generating SQL queries and democratizing data use.

Key Discussion Points and Insights

The Evolution and Mission of Select Star

Origins & Motivation: Shinji founded Select Star to solve the persistent problem that understanding and using data in enterprises is slow and relies on outdated documentation and tribal knowledge, especially as organizations shift to cloud data warehouses.
Core Value: Select Star provides a continuously updated knowledge graph by analyzing schemas and usage, capturing not just structure but context—popularity, lineage, and semantics—across UI, APIs, and integrations.

"We are almost like drawing a knowledge graph for you in terms of how your data assets are connected and utilized inside the organization today." (Shinji, 03:51)

Tribal Knowledge and the Metadata Gap

Why Metadata is Hard: Documentation is unpopular; data models evolve rapidly; and tribal knowledge gets lost as organizations scale.
Select Star’s Approach: They parse activity and query logs to reverse-engineer knowledge graphs, revealing not just structure but how teams actually use data.
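A minimal sketch of the usage-driven idea, not Select Star's actual implementation: count how many queries and how many distinct users touch each table in a query log. The log format, the sample rows, and the regex-based table extraction are all assumptions made for illustration; a production system would rely on a proper SQL parser.

```python
import re
from collections import defaultdict

# Hypothetical query-log records: (username, SQL text).
QUERY_LOG = [
    ("ana", "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id"),
    ("ben", "SELECT count(*) FROM sales.orders"),
    ("ana", "SELECT id FROM sales.orders WHERE total > 100"),
    ("cy", "SELECT * FROM staging.orders_tmp"),
]

# Naive assumption: a table name is the identifier right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def table_popularity(log):
    """Return {table: (query_count, unique_user_count)} for every table seen."""
    query_counts = defaultdict(int)
    users = defaultdict(set)
    for user, sql in log:
        # De-duplicate so a table mentioned twice in one query counts once.
        for table in {name.lower() for name in TABLE_REF.findall(sql)}:
            query_counts[table] += 1
            users[table].add(user)
    return {t: (query_counts[t], len(users[t])) for t in query_counts}


popularity = table_popularity(QUERY_LOG)
```

With the sample log, `sales.orders` comes out as the most used table: three queries from two distinct users.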

"No one likes documentation, especially I think developers. Most of the databases do not have table column comments... manual documentation doesn't scale and it's always taken as after the fact." (Shinji, 05:06)

Metadata Inference from Usage

Reverse Engineering Relationships: By inspecting user queries and applications, Select Star constructs a map of how tables and columns relate, which is further enriched when integrating BI tools.
Three Layers of Metadata:
- Physical/Operational: Names, descriptions, size, and freshness.
- Usage/Behavior: Popularity metrics, lineage, frequency, and entity relationships.
- Business/Semantics: Domain groupings, tags, business glossaries, and metrics definitions.
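One way to picture the three layers is as a single record per data asset. The sketch below is only an illustration of the layering described above; every class and field name is hypothetical and does not reflect Select Star's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalMetadata:
    """Layer 1: what the asset is and its operational state."""
    name: str
    description: str = ""
    row_count: int = 0
    last_updated: str = ""  # ISO date string, kept simple for the sketch


@dataclass
class UsageMetadata:
    """Layer 2: how the asset is used and where it sits in lineage."""
    popularity: float = 0.0
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)
    common_joins: List[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    """Layer 3: business context and semantics."""
    domain: str = ""
    tags: List[str] = field(default_factory=list)
    glossary_terms: List[str] = field(default_factory=list)


@dataclass
class AssetContext:
    physical: PhysicalMetadata
    usage: UsageMetadata = field(default_factory=UsageMetadata)
    business: BusinessMetadata = field(default_factory=BusinessMetadata)

    def summary(self) -> str:
        """One-line description combining all three layers."""
        tags = ", ".join(self.business.tags) or "untagged"
        domain = self.business.domain or "no domain"
        return (f"{self.physical.name} [{domain}] "
                f"popularity={self.usage.popularity:.2f} tags={tags}")


orders = AssetContext(
    physical=PhysicalMetadata("sales.orders", "One row per order", 1200000, "2025-09-01"),
    usage=UsageMetadata(popularity=0.9, upstream=["raw.orders"], downstream=["mart.daily_sales"]),
    business=BusinessMetadata(domain="Sales", tags=["certified"], glossary_terms=["Order"]),
)
```

Keeping the layers as separate objects mirrors the point that the first can be read from the warehouse, the second is derived from logs, and the third is largely supplied or confirmed by people.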
  
  "There’s the third level...mostly around business context and semantics." (Shinji, 11:48)

Business Value of Usage and Lineage Metrics

Trust and Discoverability: Popularity and lineage help users and AI agents select trustworthy, up-to-date datasets and improve query reliability.
Operational Insights: Monitoring these metrics can highlight unused or redundant datasets, reducing storage and compute costs.
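The cost use case can be sketched as a simple filter that combines the two signals: a table is a cleanup candidate only if nobody queries it directly and none of the dashboards built on it are viewed. The inputs below are invented sample data, and the rule itself is an assumption about how such a check might look, not a description of the product.

```python
# Hypothetical counts of direct queries per table, excluding BI refresh jobs.
ADHOC_QUERY_COUNTS = {
    "raw.events": 120,
    "mart.daily_sales": 0,
    "mart.legacy_kpis": 0,
    "mart.churn": 45,
}

# Hypothetical lineage: table -> dashboards built on top of it.
DOWNSTREAM_DASHBOARDS = {
    "mart.daily_sales": ["Sales Overview"],
    "mart.legacy_kpis": ["Old KPI Board"],
}

# Hypothetical dashboard view counts over the same period.
DASHBOARD_VIEWS = {"Sales Overview": 310, "Old KPI Board": 0}


def unused_tables(query_counts, downstream, views):
    """Tables with no direct queries and no viewed dashboard downstream."""
    candidates = []
    for table, count in query_counts.items():
        if count > 0:
            continue  # someone queries it directly
        dashboards = downstream.get(table, [])
        if any(views.get(d, 0) > 0 for d in dashboards):
            continue  # it feeds a dashboard that people still open
        candidates.append(table)
    return sorted(candidates)


cleanup_candidates = unused_tables(ADHOC_QUERY_COUNTS, DOWNSTREAM_DASHBOARDS, DASHBOARD_VIEWS)
```

Here `mart.daily_sales` is kept because its dashboard is still viewed, while `mart.legacy_kpis` is flagged: it only feeds a dashboard that nobody opens.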

"Combining both lineage and popularity is...a big understanding [of] cost implication." (Shinji, 13:38)

Data Discovery, AI Integration, and Next-Gen Use Cases

Modern Data Discovery: The principal use case is enabling efficient data search and exploration, now increasingly for AI agents needing to generate or edit SQL queries with high accuracy.
AI’s Dependency on Metadata: AI agents need rich, current metadata to understand organizational data and avoid the pitfalls of relying solely on a global LLM’s training.

"This is actually something where a company's...not going to be able to move forward and leverage all the greatest innovations...until they solve this fundamental problem." (Sean, 16:49)

Semantic Layers and Automated Context

Definition & Importance:
- A semantic layer is an abstract data model mapping logical business concepts to underlying tables and columns, used to define certified, reusable metrics for BI or AI.
- Recent interest is fueled by their ability to certify and govern metrics for both people and LLMs.
  
  "Semantic layer and semantic modeling just have gotten a lot more interest recently because that itself can really provide the certification to the AI..." (Shinji, 19:34)
Automation and Human Validation: Select Star combines AI and metadata infrastructure to automate semantic model creation, but always involves human verification to prevent cascading inaccuracies.
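To make the semantic-layer idea concrete, here is a small sketch in which a metric definition is stored once and compiled to SQL on request, and in which compilation is refused until a person has marked the model as verified. The dictionary layout, the `verified` flag, and the table and column names are all hypothetical; real semantic layers use their own formats, and `DATE_TRUNC` is dialect-specific.

```python
# Hypothetical semantic model: logical names mapped to physical SQL expressions.
SEMANTIC_MODEL = {
    "table": "mart.orders",
    "verified": True,  # assumed to be set by a human reviewer
    "dimensions": {
        "region": "region",
        "order_month": "DATE_TRUNC('month', ordered_at)",
    },
    "measures": {
        "revenue": "SUM(amount)",
        "order_count": "COUNT(*)",
    },
}


def compile_metric(model, measure, dimensions=()):
    """Build a SELECT statement for one measure grouped by the given dimensions."""
    if not model.get("verified"):
        raise ValueError("semantic model has not been human-verified")
    select_parts = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select_parts.append(f"{model['measures'][measure]} AS {measure}")
    sql = f"SELECT {', '.join(select_parts)} FROM {model['table']}"
    if dimensions:
        # Group by the positions of the dimension columns (1-based).
        positions = ", ".join(str(i) for i in range(1, len(dimensions) + 1))
        sql += f" GROUP BY {positions}"
    return sql


revenue_by_region = compile_metric(SEMANTIC_MODEL, "revenue", ["region"])
```

The point of the design is that an agent asking for "revenue by region" reuses the certified `SUM(amount)` definition instead of inventing its own calculation.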

"We highly recommend our users to actually take a look at it to actually validate the model." (Shinji, 22:39)

MCP Server and AI-Driven Natural Language to SQL

MCP Server: Provides a programmatic interface and tools for searching metadata, fetching asset details, tracing lineage, and surfacing documentation, enabling AI agents and users to interact naturally with data systems.

"Our MCP server today...is more of an interface...One is for searching the metadata...getting asset details...getting lineage and traversing the lineage." (Shinji, 25:00)
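Accuracy Boost: Customers report that adding business context, usage, and lineage noticeably improves AI-generated query accuracy versus relying only on schema information. Shinji notes that this is based on anecdotes and customer interviews rather than a scientific measurement.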
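The search-then-inspect flow can be sketched with two plain functions standing in for the "search metadata" and "get asset details" tools. This is not the Select Star MCP API: the function names, the in-memory catalog, and the ranking rule (keyword match, then popularity) are assumptions chosen to show how popularity narrows a large candidate set.

```python
# Hypothetical in-memory catalog standing in for a metadata service.
CATALOG = {
    "crm.customers": {
        "description": "One row per customer account",
        "popularity": 0.92,
        "example_queries": ["SELECT segment, COUNT(*) FROM crm.customers GROUP BY segment"],
    },
    "crm.customers_backup_2021": {
        "description": "Old snapshot of customer accounts",
        "popularity": 0.03,
        "example_queries": [],
    },
    "billing.invoices": {
        "description": "Invoices issued to customers",
        "popularity": 0.71,
        "example_queries": ["SELECT SUM(amount) FROM billing.invoices"],
    },
    "ops.servers": {
        "description": "Server inventory",
        "popularity": 0.55,
        "example_queries": [],
    },
}


def search_metadata(catalog, keyword, limit=2):
    """Return the most popular assets whose name or description mentions the keyword."""
    needle = keyword.lower()
    hits = [
        name for name, meta in catalog.items()
        if needle in name.lower() or needle in meta["description"].lower()
    ]
    hits.sort(key=lambda name: catalog[name]["popularity"], reverse=True)
    return hits[:limit]


def get_asset_details(catalog, name):
    """Return the stored description, popularity, and example queries for one asset."""
    return catalog[name]


top_tables = search_metadata(CATALOG, "customer")
details = get_asset_details(CATALOG, top_tables[0])
```

Three assets match "customer", but the rarely used backup table is ranked last and dropped by the limit, which is the kind of narrowing that schema information alone cannot provide.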

The Complexity of Natural Language to SQL

Why It's Hard: Real-world databases are messy, large, and often insufficiently documented compared to the clean, well-structured benchmarks (such as Spider) on which LLMs now score highly.
The Role of Context: Supplying contextual data—popularity, lineage, example queries—reduces hallucinations and improves output relevancy.
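As an illustration of supplying context, the sketch below packs table descriptions and example queries into a prompt fragment, most popular tables first, until a size budget is reached. The budget in characters, the entry format, and the sample tables are assumptions; a real system would more likely count tokens and use richer relevance signals.

```python
# Hypothetical metadata for three similar-looking tables.
TABLES = [
    {"name": "sales.orders_old", "columns": ["id", "amt"],
     "description": "Deprecated copy", "popularity": 0.05, "example_query": ""},
    {"name": "sales.orders", "columns": ["id", "customer_id", "total"],
     "description": "One row per order", "popularity": 0.9,
     "example_query": "SELECT SUM(total) FROM sales.orders"},
    {"name": "sales.customers", "columns": ["id", "region"],
     "description": "One row per customer", "popularity": 0.8, "example_query": ""},
]


def build_context(tables, max_chars=200):
    """Add the most popular tables first and stop once the budget would be exceeded."""
    entries = []
    used = 0
    for table in sorted(tables, key=lambda t: t["popularity"], reverse=True):
        entry = f"Table {table['name']} ({', '.join(table['columns'])}): {table['description']}"
        if table["example_query"]:
            entry += f"\n  Example: {table['example_query']}"
        if used + len(entry) > max_chars:
            break  # budget spent; less popular tables are left out
        entries.append(entry)
        used += len(entry)
    return "\n".join(entries)


context = build_context(TABLES)
```

With the default budget, the two widely used tables fit and the deprecated look-alike is left out, so the model never sees the table it would be most likely to pick by mistake.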

"Real world data is a lot more messy...a lot of similar looking tables and columns...easier for LLMs to hallucinate." (Shinji, 29:12)

Security, Access, and Scaling Context

Security: Select Star supports policies for metadata access but defers enforcement of actual data access to the downstream data warehouse.
Context Optimization: The platform abstracts away the complexity of context window limitations for both AI agents and developers, providing relevant context on demand via the MCP server.

Trends and Emerging Use Cases

Automating Governance: Select Star is developing automated agents to handle tasks like tagging, ownership assignment, and propagating documentation, aiming to keep metadata and semantic layers current as data evolves.
The Growing Value of Metadata: As storage and compute become commoditized, metadata’s “map of data” is becoming more strategic—enabling better operational practices and unlocking AI’s full potential with enterprise data.
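One operational use of such a map is impact analysis: before changing a table, walk the lineage graph to find everything downstream and who owns it. The sketch below does this with a breadth-first traversal over a hypothetical lineage dictionary; the asset names and owners are invented for the example.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
LINEAGE = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.daily_sales", "mart.customer_ltv"],
    "mart.daily_sales": ["dashboard:Sales Overview"],
    "mart.customer_ltv": [],
}

# Hypothetical ownership records.
OWNERS = {
    "mart.daily_sales": "ana",
    "mart.customer_ltv": "ben",
    "dashboard:Sales Overview": "cy",
}


def downstream_assets(lineage, start):
    """Breadth-first walk returning every asset reachable downstream of `start`."""
    seen = set()
    order = []
    queue = deque(lineage.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


def owners_to_notify(lineage, owners, start):
    """Owners of all downstream assets, de-duplicated and sorted."""
    return sorted({owners[a] for a in downstream_assets(lineage, start) if a in owners})


impacted = downstream_assets(LINEAGE, "stg.orders")
notify = owners_to_notify(LINEAGE, OWNERS, "stg.orders")
```

Renaming a column in `stg.orders` would therefore touch two marts and one dashboard, and three owners would need a heads-up.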

"If you have the map, you can actually leverage that for operational purposes like automating impact analysis...There's just a ton of different things that you can add when you have this context." (Shinji, 39:20)

Notable Quotes & Memorable Moments

On the pain of documentation:

"No one likes documentation, especially developers." (Shinji, 05:04)
On the purpose of a semantic layer:

"...so that they can separate out what is considered as verified or certified data sets that should be...used by their business users or in their reporting purposes." (Shinji, 19:07)
On human-AI collaboration in metadata modeling:

"It's more of a way to speed up the process like human in the loop. A human still there to be involved but you can automate a significant amount of the manual work." (Sean, 24:10)
On the “new oil” in AI:

"I would say it [metadata] is the map of where the data is. And that's why metadata is being taken a look at now." (Shinji, 38:52)
Analogy to Semantic Web vs. LLMs:

"...foundation models, they're much more of like an explorer where they don't have that predefined structure. They're just kind of going out, stumbling around and figuring out...what are the associations between these things. But ultimately to make them more useful...they need a map. And that map can be...ontologies or, in the context of what we're talking about, metadata and the semantic layer." (Sean, 40:12)

Key Timestamps

02:10 – The origin story of Select Star and defining the core problem of data context and discovery
05:04 – Why documentation and metadata capture have historically lagged
09:22 – How Select Star infers relationships from activity and query logs
11:48 – Layered approach to metadata: operational, behavioral, and semantic
13:06 – Practical value of popularity and lineage tracking
16:16 – The impact of metadata on enabling AI agents
18:32 – What is a semantic layer and its role in AI and BI
21:21 – How Select Star uses AI internally for semantic model generation
22:35 – Human validation and risk management with AI-generated metadata
25:00 – The MCP server: enabling natural language data exploration
28:35 – Why natural language to SQL is difficult in messy enterprise environments
32:51 – Expanding beyond data warehouses to broader system integrations
34:53 – Context optimization for AI/engineering teams
36:31 – Future directions: automated semantic modeling, agent-driven governance
38:52 – Metadata as the strategic “map” in the cloud era
40:12 – Ontology, semantic web, and LLMs: explorers need a map
41:16 – Wrap-up and thanks

Episode Takeaways

Context-rich metadata is crucial for empowering both humans and AI to leverage enterprise data effectively.
Automated, usage-driven knowledge graphs can make documentation scalable and continuously relevant.
Semantic modeling and the right metadata feed are foundational for reliable AI-generated SQL and analytics.
In the age of commoditized storage and compute, high-quality metadata is becoming the new strategic asset in data-driven organizations.


Transcript

[00:01]
A
A common challenge in data-rich organizations is that critical context about the data is often hard to capture and even harder to keep up to date. As more people across the organization use data and data models get more complex, simply finding the right data set can be slow and create bottlenecks. Select Star is a data discovery and metadata platform that builds a continuously updated knowledge graph of an organization's data by analyzing both its structure and how it's actually used. It enriches data with context such as popularity, lineage and semantic models, making it easier for AI and teams to discover, trust and use the right data. These enriched metadata layers are also highly valuable for large language models, significantly improving the accuracy of generated SQL queries. Shinji Kim is the founder and CEO of Select Star and she joined Sean Falconer to discuss solving metadata curation challenges, managing data context at scale, using LLMs for SQL generation, emerging trends in metadata management, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
[01:29]
B
Shinji, welcome to the show.
[01:31]
A
Thanks Sean. Great to be here.
[01:33]
B
Yeah, I probably should have said welcome back since you've been here before, although it's been a couple years.
[01:37]
A
Yeah, more than three years ago to introduce Select Star. But I am really excited to be back and Software Engineering Daily has always been also morphing and changing a lot.
[01:50]
B
Yeah, well it's been three years so why don't you catch us up? Maybe three years. Especially in the world of tech, the world of startups, and now what's increasingly becoming the world of AI is a lot of time. A lot could happen in three years. So what's happening with Select Star today? Maybe go back even to the beginning sort of. What's the story behind where you guys started and where are you today?
[02:10]
A
Amazing. Sure. Yes. So much changed. I started Select Star five years ago after noticing time and time again that a lot of enterprises collect, store and process data. But to try to use the data, it takes days or weeks to find the right data and actually use it properly. You have to rely on outdated documentation. Usually you need to just find somebody else, rely on tribal knowledge to understand how to use the data. I mean this is something that I saw firsthand at Akamai when I was running the product for their IoT data processing, partnering with consumer electronics and automotive enterprises building their next consumer applications. They were looking to pull a lot more telematics data and especially in enterprise perspective this was an issue and hence there are solutions like traditional enterprise data catalogs that are trying to solve this issue. At the same time, I've noticed that there was a lot more demand around this also as more companies are adopting modern data stack of cloud data warehouses and building their data lakes on the cloud with Snowflake and Databricks. Data discovery, finding and understanding data has been a lot wider issue in organizations. So that's where Select Star is really focused on. We provide a very easy-to-use UI, and now an MCP server, APIs, Chrome extensions, a Slack app, all different places for end users. So whether you are a data scientist, data analyst, software engineer or product managers, whenever you have to touch or see data or data products, you can easily access the context about that data, documentation about their data. Where did the data come from? Who else is using this inside the company? What other data assets or analysis are already attached or have been built on top of? So there's a lot of. I would say we are almost like drawing a knowledge graph for you in terms of how your data assets are connected and utilized inside the organization today. So that's the core of what we do.
[04:34]
B
Why do you think so much of this kind of like metadata has historically been kind of this like tribal knowledge? Why haven't we been focused on capturing that as part of the data we collect? Like we built so much technology for actually collecting data, but then this kind of stuff about why the data exists, how it relates to each other. We've historically I think just relied on communicating within the company to ask people why is it this way? Rather than encapsulating that in some sort of piece of technology.
[05:05]
A
Yeah, I mean, correct question. I can just go back to that. No one likes documentation, especially I think developers. Most of the databases do not have table column comments and it just follows the code. A lot of the data tables have descriptive names, but I think today also more so the proliferation of data models and how easy it is to transform and build your own data models. I think it also adds to that continuing to write manual documentation doesn't scale and it's more of a always taken as after the fact. So in the beginning when you are starting off from scratch, you will have entity relationship diagrams as part of like modeling the data. But afterwards as you are or as you have more people building different types of domain models on top of the data, I think it gets lost very quickly. Now I think the metadata collection in that sense of it. A lot of companies have their own internal tools where they just refer to information schema. That's where most of the metadata resides in and where most of the data catalogs really depend on. But the core part of where we focus on at Select Star and now more modern systems focus on is really what happens in between the data assets. So who is accessing the data, which query is accessing this data and how is this accessing the data? These are the parts of the activity information that I would say if you can parse them through and look at them in aggregate, that analysis of metadata is something that's very valuable and that is the full system that we built around. So like any data warehouse that we connect to, we will parse through all of the activity logs or SQL query logs to understand how the data is actually being created all the way to where it's being used and also how is this accessed, which type of select queries coming from which applications are querying the data and how often is this being queried by how many unique users in the last certain time period. 
Which helps us to understand the trends of the data usage as well. I think these are the parts that I would say hasn't been looked at as much. But as there are more consumers of data and there are more usage of data, I think there is more need to understand this. The other. I think a big part of it is that a lot of companies have now moved to. It's been now easier than ever to have all of your data in one place in your data lake data warehouse or to create data mart system. There are so many connectors of all different SaaS tools and business tools that can share that underlying system of record data into one place so that you can join them and then model them on top. So I think yeah, it comes from multiple places, but I think in the past when we used to rely on primarily relational data warehouses or more like a Hadoop based systems, there were a lot less number of, I would say consumers of data directly. And this is probably maybe why metadata hasn't been as the main highlight that people are looking into primarily.
[08:43]
B
Yeah, I mean I think your explanation of no one likes documentation is a good one. I think even if someone starts out sort of documenting these things, it's just sort of inevitable that it gets stale over time. It's just every company has the best intentions with a lot of this stuff. Even when it comes to like coding or how certain functionality works. Internal tools, we all have internal wikis there where we have documentation that's multiple years out of date. So you described kind of looking at the activity log. So from the activity log are you kind of like reverse engineering what the relationships are between the data based on how queries are run against it?
[09:22]
A
Yeah, so basically like the way that we look at the metadata is that we look at each of the queries that are coming through and we attach them to each mentioned assets and then we run a separate analysis on top in terms of how often this has happened or how many unique users run that through. Did I answer your question?
[09:48]
B
Yeah, it sounded like to me that you're sort of inspecting what the actual behavior of individuals within the organizations or applications that use the data are actually utilizing the data to figure out what the actual knowledge graph behind the data is. How are different concepts related based on the query execution?
[10:08]
A
Yeah, so we can see like a certain amount of information about the user. Like we may see the username, but we may not know who that user is, which team they belong to, so on and so forth. That would come from other places, whether if we were to connect to Active Directory or having our customers to group their users, and so on and so forth. But the main piece of where we are putting together this knowledge graph just primarily comes from tracking the usage. So which tables are joined together, what's the join condition look like? What are the most used to the least used tables and within those tables columns look like. This actually gets a lot more interesting when you connect it to other applications like BI tools such as Power BI, Tableau or Looker. For this sales dashboard that a lot of people are relying on, which are the specific fields and tables that really power them and how are each of the KPIs being measured, actually defined or calculated? So I think there are multiple steps of sort of insights that you can get. So we see it kind of in like three levels. So once we connect and ingest the metadata query logs, there is like first layer which is the core metadata, just the physical asset names, descriptions, the operational metadata of how big the table is, or things like when's the last updated, things like that. And then on top of that there is the second level of usage and behavior signals. So this would include the things like popularity, how widely is this being used and trusted? And this also would include entity relationships and lineage. Where did the data come from, where does it go to? And how is this data model related to one another? What are the common queries and joins that's related to this asset? And then there's the third level that we also see that primarily it will be driven by the users, but like we will help automate, which would be mostly around business context and semantics. 
So this will be something like collections, if you were to group them for a certain business domain, what would that look like? Any tags that we can infer or actually put in so that you can actually govern the data having business glossary and metrics definition. A lot of this is what we see also as part of now the metadata context that you can put on top of physical assets so that you have a lot richer context for any of the access or whenever you're trying to leverage data that you have access to.
[12:53]
B
For things like popularity, usage metrics, how are those used or what is the value of tracking those for like an organization that's using Select Star?
[13:06]
A
So usually what we see the most popular or most interesting have been leveraging both popularity and lineage. The first thing I can think of is just there's always a lot more added benefit when you have customers starting to realize what is the right data to use. Because it's not just about the semantic relativity, it's the trust score. If I'm looking for data related to active users or our sales regions, the types of data that you want to use would need to come from data sets that other people are also using. Right? So popularity really comes in handy for that. And if you were to also leverage lineage with that, then you can also see what other impact that the data also has in other parts of the system or across system. This is actually also interesting when you are thinking about cost perspective of running a data infrastructure. We have number of customers that have saved cost on their cloud billing primarily for their warehouse billing. By looking at what is the popularity meaning they noticed that a lot of models or tables that they have that they thought were being used but they weren't. They weren't either being queried or they are there to load reports on the BI system. But the BI dashboards actually weren't being viewed by the end business users. So combining both lineage and popularity is there is a big understanding cost implication to that.
[14:43]
B
Okay, what are some of the other use cases? Not necessarily restricted to the popularity and usage, but to Select Star in general.
[14:51]
A
I would say the number one use case for us just always comes from data discovery. This is something that also correlates really well with how our customers are using Select Star with their AI agents and doing data work with AI. It's really just providing the right types of results when you are trying to, let's say, build a new model, edit a SQL query or do exploration around data. Popularity score will allow the agents to be able to find and use the right type of tables and columns. It will also be able to provide example queries that are relevant so that the AI agents can actually build queries that are a lot more accurate. And I would say this is a very much of a use case that flew in from, or more of a native or next generation from what our end users used to do. Our end users that are in data teams used to come to Select Star UI to find and understand data so that they can query those data tables directly or build dashboards. Now new use cases that we're seeing is that it's their agents and AI tools that's using our MCP server to find and create queries and model modifications directly.
[16:16]
B
Okay, going back to the original problem that we're talking about, the fact that people haven't historically had a good way of really capturing all this metadata and relationship information or at least keeping it up to date. And maybe historically we've been able to get away with that in some capacity. But do you think now when we start to enter this world of people wanting to have AI agents that can in some capacity leverage this huge amount of data they're collecting, and part of them being able to effectively leverage that data is they need to be able to understand it, which gets back to like sort of the knowledge graph, the metadata that's associated with it. Does that make these pain points, like, I don't know, elevate them to a place where this isn't just kind of in a minor annoyance now. This is actually something where a company's like essentially not going to be able to move forward and leverage all the greatest innovations that are happening in AI until they solve this fundamental problem.
[17:14]
A
Yeah, the way that I see this related to AI, or quote unquote, hydrating AI with enterprise data, trying to use AI on top of your own data. Today, the ways that this has worked in the POC environment has always been by putting their very specific schema information, query examples, synonyms, so on and so forth. Almost like a build your own semantic layer in order for AI to work well. Or yeah, actually like to train the model with just those data. Specifically, this is an approach that I would say kind of gets you to 90%, but is very hard to scale. Without a metadata platform that will continuously evaluate the schema and popularity and lineage and everything else, there's going to be the manual work of human needing to figure out what should be the part of the metadata that the AI should primarily use. So I would say this is starting to become a lot more important and a lot more companies are starting to look for ways that they can actually scale this part.
[18:28]
B
Can you explain what a semantic layer is and what the components are to it?
[18:33]
A
Sure. So semantic layer is usually a separate layer like a lot of. I think now the definition is starting to get blurry. But usually a semantic layer contains an explanation of data model that's laid out as a logical data model. So it will describe which tables, columns it should be part of a logical data model and how those fields will make up of a metrics definition which are the dimensions and measures and facts and how it should be put together hence by an AI or by any tools that support a semantic layer. The biggest I guess difference or the reason why people have their semantic layer on top of their physical data layer is just so that they can separate out what is considered as verified or certified data sets that should be and can be used by their business users or in their reporting purposes. Semantic layer and semantic modeling just have gotten a lot more interest recently because that itself can really provide the certification to the AI and AI can just really follow those definitions to use. And this is a piece that we've noticed how like we are starting to see a lot more automation that we can build on top of by just scanning what is being used by your BI dashboards. So for example, if you have a Power BI dashboard and you have a data set or semantic model defined within Power BI, we can map the lineage for the fields and the calculations that you might have defined within BI tool and then translate that into a SQL model that your AI can also use to query. So semantic model and the semantic layer generally is more of just focused on defining like how should metric X be calculated and what does that definition look like. Whereas that used to be seen as a way to consolidate or govern the metrics calculations when you're connecting multiple different tools together. But today this is something that. 
And from where I'm seeing of the use cases of how the AI can really use that definition to make the queries instead of trying to come up with its own definition for querying the data.
[21:11]
B
For construction of the semantic models and the data lineage and ultimately the knowledge graph, are you leveraging AI internally to automate some of that?
[21:22]
A
Yeah, we are using a number of different models regarding coming up with I guess the generating the queries and also validating the queries related to semantic model. There's also more of like the formatting of the files, whether that's a markdown or YAML in order to have this integratable with other systems as well. But the core part of let's say like where that data comes from when we are defining the logical table or fields or verified queries. Those are, I would say more coming from Select Star's metadata infrastructure system, which is something that we've built over the last five years.
[22:02]
B
Is there any, I guess danger or consequences to using AI to automate some of the construction of this and then AI relying on ultimately the construction to deliver some value, like some AI system is going to leverage this, you know, what Select Star provides in order to be able to say, you know, understand the underlying data better. But since AI is used to construct that, there could be some risk where it's not done 100% accurate. And does that create a situation where you get like a, I don't know, a cascading set of inaccuracies that could impact each other?
[22:35]
A
I think that's an interesting question. So every time we're generating a semantic model for our customers or any metrics definition, this is a part where we will have the user to verify. So it can exist and may be used by AI agents readily. But we highly recommend our users to actually take a look at it to actually validate the model. And then the other side of this that I think is also really important is the evaluation side. So for the business, when someone is considering building any text to SQL, bot or agent, having this set of business questions that are likely to be asked and all the definitions to be correct, you know, I think this is more of, you know, you're trying to build a product that you do need a set of, you know, tests to go along with it. So I guess to answer to that like yeah, I don't think it's something that you should 100% trust. The way that we see this really helps the customers is that you can really kickstart the journey of being able to focus on like actually the important part which is testing and iterating rather than trying to like manually create the YAML files and pick the tables and figure out whether which tables and columns make sense, what their relationship should look like. A lot of the times we see companies going back to doing a ton of data modeling on top of their data mart and a big part of that is almost like a rewriting what they already have in other systems already implemented in bi.
[24:10]
B
Yeah, so it's more of a way to speed up the process like human in the loop. A human still there to be involved but you can automate a significant amount of the manual work.
[24:20]
A
Yeah, that's first I would say benefit to start with this approach and then the second benefit which we're working on is that because we are tracking the underlying metadata when there are changes, such as new calculations being added on the BI front or underlying tables missing, things like that, these are operational issues. And having these semantic models to be up to date with the current data model is another piece that will make the semantic model scale with usage.
[24:54]
B
So you mentioned MCP earlier. Can you talk a little bit about what you're doing, what your MCP server does, and how people use that?
[25:01]
A
Yeah. So our MCP server today is more of an interface of Select Star. We have I think four or five tools today. One is for like searching the metadata. The second one is to get asset details. And then third one is like getting lineage and traversing the lineage. So, you know, just searching the metadata. Like I've had this kind of like a test before. I wanted to understand what the customer distribution looked like and I asked my Claude Desktop, which was connected to our MCP. It will start from getting all the metadata which had like more than 2,300 different tables. But from there, using the Select Star's popularity score and other relevancy metrics, it would narrow it down to like 20. And then from there it will pick the tables and columns that it would use to create a query and will execute the query to get that result. And from here I just talked about like search metadata front every time that there is a table, then it would use an MCP tool for the get asset details that will get back like all the information about that table, including the descriptions, example queries and joins and when it was updated the last. So those are all something that MCP, sorry, that Claude was looking into to determine and also use those examples to actually put together the query. And then other tools like lineage, walking through the lineage and then checking the lineage, usually that is used for checking the impact. So if I were to update my dbt model or SQL query while it's doing that, it can check if there will be any downstream impact from changing the column names or dropping a column. And it can also bring a list of users or owners that may get impacted and needs to be notified from making that change. So those are some of the areas where we've seen a lot of our customers using Select Star MCP server for, with their Claude or Cursor, different IDEs that they use for AI work.
[27:18]
B
Right. So this gives them an interface to kind of speak natural language but be able to interface with the data.
[27:23]
A
That's right. And we are hearing that this has been a really great addition, because they've been using the dbt or Snowflake MCP, or their own homegrown MCP, to just execute queries or to grab the schema metadata. But the schema metadata alone does not produce the queries that they want. The accuracy only really came after Select Star started to provide this direction: popularity score, lineage, example queries, all the documentation, and so on.
[27:59]
B
Can you share anything around like the accuracy boost that you get from using this approach versus only having the bare bones schemas?
[28:06]
A
I would say this is not something that we have a scientific measure for, other than the anecdotes and the numerous customer interviews that we've done, and watching how customers have been using it. But it's more like, you know, once you start using it you would never go back to working without it. That's basically what we've seen.
[28:28]
B
Why is it that natural language to SQL against real-world databases is so difficult?
[28:35]
A
Well, I think that's a really good question. I think there are multiple reasons. If I think about large language models directly, we are at this point only because the foundation models have been trained on basically the whole world's data. All the books and written literature, every single one of them is almost just an example of how language has been used. It's not because you're training the system with instructions for how to speak a language. It's not because you put in a rule for how this should work. It really comes from having a lot of data, or examples, of how things have been used. And I think this is why example queries come in as one of the parts that makes query accuracy much higher. I think the other part is just data models, as data models get bigger. If we're just talking about a relational database where everything is completely normalized and all the names of columns and tables are very accurate, then it might be easy enough to get accuracy with text-to-SQL. I mean, this is why I think we are starting to get really high marks on Spider or any of the industry benchmarks for it. But it's when you try what works on the benchmarks against real-world data that it actually fails, and that really comes from real-world data being a lot messier. There are a lot of similar-looking tables and columns, and differences in how they are being used. There are also second- and third-level calculations and metrics built on top, which you can easily find in a lot of organizations. I think all of those contribute to complexity that makes it easier for LLMs to hallucinate than to actually generate the seemingly easy queries. So I think it fails because of that.
[30:42]
B
Yeah, I think that makes sense. I mean, I always say that foundation models are really, really smart about general information, but they're really dumb when it comes to your specific business information, because they were never trained on it. So to get value out of them for specific tasks, it's all about how you correctly contextualize the prompt. And if you're doing this kind of natural language to SQL generation against complex data models that exist in your warehouse or your lakehouse or something like that, then it comes down to how you encapsulate the tribal knowledge that people have within the company. If you can't feed that into the model, then there's not really a way for the model to accurately run a reasonably complex query against it.
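As a rough illustration of what contextualizing the prompt can mean for SQL generation, the sketch below folds table descriptions, common joins, and example queries into the text handed to a model. The asset dictionary layout, the `build_sql_prompt` helper, and the sample table are hypothetical, chosen only to mirror the kinds of metadata mentioned in the conversation. No model is called here.

```python
def build_sql_prompt(question: str, assets: list[dict]) -> str:
    """Assemble a text-to-SQL prompt from per-table metadata (hypothetical layout)."""
    sections = []
    for asset in assets:
        lines = [
            f"Table: {asset['name']}",
            f"Description: {asset['description']}",
            "Columns: " + ", ".join(asset["columns"]),
        ]
        # Joins and example queries carry the usage knowledge a bare schema lacks.
        for join in asset.get("joins", []):
            lines.append(f"Common join: {join}")
        for query in asset.get("example_queries", []):
            lines.append(f"Example query: {query}")
        sections.append("\n".join(lines))
    context = "\n\n".join(sections)
    return (
        "Write one SQL query that answers the question.\n"
        "Use only the tables and columns listed below.\n\n"
        f"{context}\n\nQuestion: {question}\nSQL:"
    )


# Illustrative asset details for a single table.
assets = [
    {
        "name": "analytics.customers",
        "description": "One row per active customer.",
        "columns": ["customer_id", "region", "signup_date"],
        "joins": ["analytics.orders ON orders.customer_id = customers.customer_id"],
        "example_queries": [
            "SELECT region, COUNT(*) FROM analytics.customers GROUP BY region"
        ],
    }
]
prompt = build_sql_prompt("How are customers distributed by region?", assets)
print(prompt)
```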
[31:27]
A
Yeah, I think that's really well put. Everyone says context is king. And for data, how you structure the context that's actually relevant for SQL generation and analysis has a particular flavor to it. I think that is primarily what we've been focused on, because we understand that something like popularity or lineage has very specific implications for how the data should be retrieved or what type of impact it will have on the use of the data.
[32:07]
B
Have you thought about extending any of this approach? It sounds like, you know, if I'm using Snowflake or something like that, then I can run Select Star against my Snowflake, I can go through some process to make sure that what it produces is accurate, and then I can start to use something like Claude Desktop with your MCP server to explore that data in a natural language way. But what about situations where I might want to pull data from other types of systems, not necessarily the warehouse? I might want to talk to, I don't know, a SaaS API endpoint, or maybe even a transactional database. Is there potentially a role for this approach to extend beyond just understanding the warehouse data?
[32:51]
A
Yeah, for sure. There are now different ETL systems that we connect to, as well as applications that we're starting to connect to, and I'm seeing more of that as we add more integrations. So it's not just data warehouse queries. Even in the future, I think we'll be able to start generating dashboards in Power BI and Tableau once they have their own MCP servers, for example. I think that is kind of the future that we see.
[33:20]
B
With some of the stuff that you're doing around the MCP server, given that you're primarily serving metadata, do you need to be concerned about what a specific user is accessing? Or is that more going to be a security requirement on where they ultimately execute that query, because that's where the actual data lives?
[33:41]
A
Yeah, that's an interesting question. Right now it really comes down to the end user's role where the query gets executed. We do have policy-based access control support, so that you can limit the user to query, or even just look up metadata, within a certain set of schemas, tables, or logical groupings that you may have. But in terms of the actual query execution, we are a little bit decoupled, in that we're leaving that to the data warehouse user, because that's where the query gets executed. We will generate the query, and you can limit the query to only access certain parts. But from the security perspective of end-user querying, this is something that we offload to the data warehouse side today.
[34:35]
B
And then are you offloading the context optimization problems to the engineer that's building this application? Because I would think that some of this metadata could get pretty big, to where it starts to eat up a reasonable amount of the context window. So how does that optimization work?
[34:53]
A
So if the engineer is using the MCP server, this is really something we control on our end. When you say context optimization, it really, I guess, comes down to what context we are exposing. We have our own embedding that we use for Ask AI, which is the AI assistant of Select Star. But that whole context isn't something we necessarily expose fully to the developer. We do this through the MCP server. What I'm saying is, we don't put the full embedding of raw metadata out for AI agents to use, but rather have the MCP server provide that information upon request by the agent. So right now there isn't really a challenge with fitting into the context window. But I think the piece that might actually be interesting to you in this regard is semantic model generation. We now have a way to summarize and build a semantic model for the customer, so that the customer can use that and feed it into their AI application. And I would say that we haven't gotten to a point where it is so large that it doesn't fit into the context window, but it's fairly early days. We've been testing this with a number of customers and haven't really run into that issue.
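The difference between handing an agent the whole catalog and answering its requests one at a time can be sketched as follows. The in-memory `CATALOG`, the two functions, and their names are assumptions that loosely mirror the search and asset-detail tools described earlier in the episode. They are not the actual Select Star MCP tools, and character counts stand in for tokens.

```python
import json

# Hypothetical catalog: 200 filler tables plus one table the agent cares about.
CATALOG = {
    f"analytics.table_{i}": {
        "description": f"Placeholder table number {i}",
        "columns": ["id", "created_at", "value"],
    }
    for i in range(200)
}
CATALOG["analytics.customers"] = {
    "description": "One row per active customer.",
    "columns": ["customer_id", "region", "signup_date"],
}


def search_metadata(term: str, limit: int = 5) -> list[dict]:
    """Return only names and one-line descriptions of matching assets."""
    term = term.lower()
    hits = [
        {"name": name, "description": meta["description"]}
        for name, meta in CATALOG.items()
        if term in name.lower() or term in meta["description"].lower()
    ]
    return hits[:limit]


def get_asset_details(name: str) -> dict:
    """Return the full record for one asset, fetched only when asked for."""
    return CATALOG[name]


# Context cost of dumping everything versus answering two targeted requests.
full_dump = json.dumps(CATALOG)
on_demand = json.dumps(search_metadata("customer")) + json.dumps(
    get_asset_details("analytics.customers")
)
print(len(full_dump), len(on_demand))
```

The agent pays only for what it asks about, which is one way to read the comment that fitting into the context window has not been a problem so far.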
[36:21]
B
Great. In terms of what's next for you guys, where's your focus? Is there anything that you can talk about in terms of the challenges that you're working on now or things that you have coming out relatively soon?
[36:32]
A
Sure, yeah. First and foremost, the semantic model is a big part. We have seen that this really helps text-to-SQL approaches, in getting the LLM to speak the language and surface business questions really well. So we are looking at ways for this to be more general and available to more customers. That's one part. The other part is having Select Star's Ask AI use that model and query the data directly for end users, so that more users can just ask questions about their data and get answers right away. And last but not least, we have different agent workflows coming up that really help build the business context metadata more automatically. We already have ways that we're starting to do a lot of auto-documentation of data assets. But a lot of the things coming up would be tagging the data assets, assigning ownership, or propagating documentation in different ways, which we already do. Putting that into the hands of an agent that maintains the governance is really the direction we're heading today.
[37:53]
B
What are your thoughts on where some of the value of metadata is going? So historically we've put a lot of value in the data that a business collects. And historically, when it came to databases and warehouses, there was sort of tight coupling between the compute and the storage. And then eventually we separated those things, and now we have these open table formats like Iceberg and Delta tables, and we're getting to a place where that data might actually exist just in some cloud storage bucket that's outside of where the actual compute runs. And people want to own the compute work, and there's not as much value attributed to the hosting of the data itself. But now I think one of the things I'm seeing happening in the industry is that a lot of the big data players really want to own the catalog and the metadata. So is the new oil of data, especially in the world of AI, really all about the metadata?
[38:53]
A
I would say it is the map of where the data is. And that's why metadata is being looked at now. It tells you what exists and what's really important. And from the cloud provider perspective, I think it's really about expanding more capabilities under the same umbrella. That's why a lot of the cataloging or metadata features are being introduced by larger players as well. But nonetheless, if you have the map, you can actually leverage it for operational purposes, like automating impact analysis in, let's say, a PR, or letting downstream users know what's going to change. Or even using something like popularity to have an AI agent write the correct query and pick the right columns and tables. There's just a ton of different things that you can add when you have this context. And I think this is why a big focus is starting to be put on metadata, and on high-quality metadata.
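The impact analysis mentioned here amounts to walking a lineage graph downstream from the asset being changed and collecting who owns what is affected. The sketch below assumes lineage is available as a plain adjacency map and ownership as a dictionary. Both structures, the asset names, and the `downstream_impact` helper are hypothetical and only illustrate the traversal.

```python
from collections import deque


def downstream_impact(
    lineage: dict[str, list[str]], owners: dict[str, str], changed: str
) -> tuple[list[str], list[str]]:
    """Breadth-first walk of downstream assets, plus the owners to notify."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child != changed and child not in seen:
                seen.add(child)
                queue.append(child)
    impacted = sorted(seen)
    notify = sorted({owners[a] for a in impacted if a in owners})
    return impacted, notify


# Illustrative lineage: a raw table feeds a staging model, which feeds a mart and a dashboard.
lineage = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue", "dash.sales_overview"],
}
owners = {
    "stg.orders": "data-eng",
    "mart.revenue": "finance-analytics",
    "dash.sales_overview": "finance-analytics",
}
impacted, notify = downstream_impact(lineage, owners, "raw.orders")
print(impacted, notify)
```

A check like this could run when a pull request renames or drops a column, with the owner list used to decide who should be notified before the change merges.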
[40:03]
B
Yeah, that's great. I wrote an article recently comparing the Semantic Web to the world of large language models, and used this analogy of how things like the Semantic Web and ontologies and SPARQL and all the things that came from that world were essentially an architect's view of the world, where we're going to predefine these things and architect the structure of the web at the time. And then foundation models are much more like an explorer, where they don't have that predefined structure. They're just kind of going out, stumbling around and figuring out, based on the patterns of behavior in the way that we write, what the associations between these things are. But ultimately, to make them more useful and accurate and prevent things like hallucinations, they need a map. And that map can be things like ontologies or, in the context of what we're talking about, metadata and a semantic layer. So it kind of all comes together eventually.
[41:01]
A
Yep, exactly. Having that up-to-date knowledge graph of the data is really the key to making it accurate, and also to scaling for multiple use cases as the data changes underneath.
[41:16]
B
Yeah, absolutely. Shinji, I want to thank you so much for your time and for coming back. I really enjoyed this.
[41:22]
A
Thanks so much, Sean.
[41:24]
B
Cheers.