Podcast Summary: Software Engineering Daily
Episode: Context-Aware SQL and Metadata with Shinji Kim
Date: September 4, 2025
Host: Sean Falconer
Guest: Shinji Kim, Founder & CEO of Select Star
Episode Overview
This episode examines the challenges and opportunities in metadata management, data context, and AI-driven SQL generation in modern organizations. Guest Shinji Kim discusses how Select Star builds dynamic, context-rich knowledge graphs over enterprise data to improve discovery, trust, and operational efficiency for both teams and AI agents. The conversation covers the technical hurdles of metadata curation, the importance of semantic layers, and how context-rich metadata makes LLMs more effective at generating SQL queries and broadens access to data.
Key Discussion Points and Insights
The Evolution and Mission of Select Star
- Origins & Motivation: Shinji founded Select Star to solve the persistent problem that understanding and using data in enterprises is slow and relies on outdated documentation and tribal knowledge, especially as organizations shift to cloud data warehouses.
- Core Value: Select Star provides a continuously updated knowledge graph by analyzing schemas and usage, capturing not just structure but also context (popularity, lineage, and semantics) and exposing it through its UI, APIs, and integrations.
"We are almost like drawing a knowledge graph for you in terms of how your data assets are connected and utilized inside the organization today." (Shinji, 03:51)
Tribal Knowledge and the Metadata Gap
- Why Metadata is Hard: Documentation is unpopular; data models evolve rapidly; and tribal knowledge gets lost as organizations scale.
- Select Star’s Approach: They parse activity and query logs to reverse-engineer knowledge graphs, revealing not just structure but how teams actually use data.
"No one likes documentation, especially I think developers. Most of the databases do not have table column comments... manual documentation doesn't scale and it's always taken as after the fact." (Shinji, 05:06)
Metadata Inference from Usage
- Reverse Engineering Relationships: By inspecting user queries and applications, Select Star constructs a map of how tables and columns relate, which is further enriched when integrating BI tools.
- Three Layers of Metadata:
  - Physical/Operational: Names, descriptions, size, and freshness.
  - Usage/Behavior: Popularity metrics, lineage, frequency, and entity relationships.
  - Business/Semantics: Domain groupings, tags, business glossaries, and metrics definitions.
"There’s the third level...mostly around business context and semantics." (Shinji, 11:48)
Business Value of Usage and Lineage Metrics
- Trust and Discoverability: Popularity and lineage help users and AI agents select trustworthy, up-to-date datasets and improve query reliability.
- Operational Insights: Monitoring these metrics can highlight unused or redundant datasets, reducing storage and compute costs.
"Combining both lineage and popularity is...a big understanding [of] cost implication." (Shinji, 13:38)
Data Discovery, AI Integration, and Next-Gen Use Cases
- Modern Data Discovery: The principal use case is enabling efficient data search and exploration, now increasingly for AI agents needing to generate or edit SQL queries with high accuracy.
- AI’s Dependency on Metadata: AI agents need rich, current metadata to understand an organization's data; a general-purpose LLM's training alone does not cover it.
"This is actually something where a company's...not going to be able to move forward and leverage all the greatest innovations...until they solve this fundamental problem." (Sean, 16:49)
Semantic Layers and Automated Context
- Definition & Importance:
  - A semantic layer is an abstract data model that maps logical business concepts to underlying tables and columns; it is used to define certified, reusable metrics for BI or AI.
  - Recent interest is fueled by the semantic layer's ability to certify and govern metrics for both people and LLMs.
"Semantic layer and semantic modeling just have gotten a lot more interest recently because that itself can really provide the certification to the AI..." (Shinji, 19:34)
- Automation and Human Validation: Select Star combines AI and metadata infrastructure to automate semantic model creation, but always involves human verification to prevent cascading inaccuracies.
"We highly recommend our users to actually take a look at it to actually validate the model." (Shinji, 22:39)
MCP Server and AI-Driven Natural Language to SQL
- MCP Server: Provides a programmatic interface and tools for searching metadata, fetching details, tracing lineage, and surfacing documentation, enabling AI agents and users to interact naturally with data systems.
"Our MCP server today...is more of an interface...One is for searching the metadata...getting asset details...getting lineage and traversing the lineage." (Shinji, 25:00)
- Accuracy Boost: Leveraging business context, usage, and lineage drastically improves AI-generated query accuracy versus relying only on schema information.
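The three kinds of tools mentioned in the quote (search, asset details, lineage) can be mocked up over an in-memory catalog to show what an agent would receive. This is a toy sketch only: the function names search_metadata, get_asset_details, and get_lineage, the CATALOG contents, and the return shapes are hypothetical, and the code uses neither the Model Context Protocol itself nor Select Star's real tool definitions.

```python
# Hypothetical catalog; names, descriptions, and popularity values are made up.
CATALOG = {
    "analytics.orders": {
        "description": "One row per customer order",
        "upstream": ["raw.orders"],
        "popularity": 120,
    },
    "raw.orders": {
        "description": "Raw order events from the app database",
        "upstream": [],
        "popularity": 4,
    },
    "analytics.customers": {
        "description": "Deduplicated customer profiles",
        "upstream": ["raw.customers"],
        "popularity": 75,
    },
    "raw.customers": {
        "description": "Raw customer records",
        "upstream": [],
        "popularity": 2,
    },
}


def search_metadata(term):
    """Return matching asset names, most popular first."""
    term = term.lower()
    hits = [
        name
        for name, asset in CATALOG.items()
        if term in name.lower() or term in asset["description"].lower()
    ]
    return sorted(hits, key=lambda name: -CATALOG[name]["popularity"])


def get_asset_details(name):
    """Return the stored metadata for one asset."""
    return {"name": name, **CATALOG[name]}


def get_lineage(name):
    """Walk upstream edges breadth-first and return every ancestor once."""
    lineage, frontier = [], list(CATALOG[name]["upstream"])
    while frontier:
        parent = frontier.pop(0)
        if parent not in lineage:
            lineage.append(parent)
            frontier.extend(CATALOG[parent]["upstream"])
    return lineage


print(search_metadata("order"))
print(get_lineage("analytics.orders"))
```

Ranking search hits by popularity is one plausible way for usage context to steer an agent toward the curated analytics.orders table rather than the raw source with a similar name.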
The Complexity of Natural Language to SQL
- Why It's Hard: Real-world databases are messy, large, and often insufficiently documented compared to the well-structured benchmarks LLMs are trained on.
- The Role of Context: Supplying contextual data—popularity, lineage, example queries—reduces hallucinations and improves output relevancy.
"Real world data is a lot more messy...a lot of similar looking tables and columns...easier for LLMs to hallucinate." (Shinji, 29:12)
Security, Access, and Scaling Context
- Security: Select Star supports policies for metadata access but defers enforcement of actual data access to the downstream data warehouse.
- Context Optimization: The platform abstracts away the complexity of context window limitations for both AI agents and developers, providing relevant context on demand via the MCP server.
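Providing relevant context within a limited context window can be sketched as a ranking-and-budgeting step. The example assumes hypothetical candidate tables with popularity counts and uses a character budget as a rough stand-in for a token limit; select_context, build_prompt, and the prompt wording are illustrative names, not part of any real API.

```python
# Hypothetical candidates that all look relevant to a question about orders.
CANDIDATES = [
    {"table": "staging.orders_tmp", "columns": ["id", "amt"], "popularity": 1},
    {"table": "analytics.orders", "columns": ["order_id", "customer_id", "amount"], "popularity": 120},
    {"table": "analytics.orders_v2_test", "columns": ["order_id", "amount"], "popularity": 3},
]


def select_context(candidates, max_chars):
    """Pick the most popular tables whose descriptions fit in the budget."""
    ranked = sorted(candidates, key=lambda c: c["popularity"], reverse=True)
    chosen, used = [], 0
    for c in ranked:
        snippet = f"Table {c['table']} (query count: {c['popularity']}): {', '.join(c['columns'])}"
        if used + len(snippet) > max_chars:
            break  # stop at the first snippet that would exceed the budget
        chosen.append(snippet)
        used += len(snippet)
    return chosen


def build_prompt(question, candidates, max_chars=150):
    """Assemble a text-to-SQL prompt from the selected context."""
    context = "\n".join(select_context(candidates, max_chars))
    return f"Schema context:\n{context}\n\nQuestion: {question}\nSQL:"


print(build_prompt("What is total revenue?", CANDIDATES))
```

With the 150-character budget, the rarely used staging table is dropped from the prompt, so the model sees fewer look-alike tables to confuse with the one teams actually query.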
Trends and Emerging Use Cases
- Automating Governance: Select Star is developing automated agents to handle tasks like tagging, ownership assignment, and propagating documentation, aiming to keep metadata and semantic layers current as data evolves.
- The Growing Value of Metadata: As storage and compute become commoditized, metadata’s “map of data” is becoming more strategic—enabling better operational practices and unlocking AI’s full potential with enterprise data.
"If you have the map, you can actually leverage that for operational purposes like automating impact analysis...There's just a ton of different things that you can add when you have this context." (Shinji, 39:20)
Notable Quotes & Memorable Moments
- On the pain of documentation:
"No one likes documentation, especially developers." (Shinji, 05:04)
- On the purpose of a semantic layer:
"...so that they can separate out what is considered as verified or certified data sets that should be...used by their business users or in their reporting purposes." (Shinji, 19:07)
- On human-AI collaboration in metadata modeling:
"It's more of a way to speed up the process like human in the loop. A human still there to be involved but you can automate a significant amount of the manual work." (Sean, 24:10)
- On the “new oil” in AI:
"I would say it [metadata] is the map of where the data is. And that's why metadata is being taken a look at now." (Shinji, 38:52)
- Analogy to Semantic Web vs. LLMs:
"...foundation models, they're much more of like an explorer where they don't have that predefined structure. They're just kind of going out, stumbling around and figuring out...what are the associations between these things. But ultimately to make them more useful...they need a map. And that map can be...ontologies or, in the context of what we're talking about, metadata and the semantic layer." (Sean, 40:12)
Key Timestamps
- 02:10 – The origin story of Select Star and defining the core problem of data context and discovery
- 05:04 – Why documentation and metadata capture have historically lagged
- 09:22 – How Select Star infers relationships from activity and query logs
- 11:48 – Layered approach to metadata: operational, behavioral, and semantic
- 13:06 – Practical value of popularity and lineage tracking
- 16:16 – The impact of metadata on enabling AI agents
- 18:32 – What is a semantic layer and its role in AI and BI
- 21:21 – How Select Star uses AI internally for semantic model generation
- 22:35 – Human validation and risk management with AI-generated metadata
- 25:00 – The MCP server: enabling natural language data exploration
- 28:35 – Why natural language to SQL is difficult in messy enterprise environments
- 32:51 – Expanding beyond data warehouses to broader system integrations
- 34:53 – Context optimization for AI/engineering teams
- 36:31 – Future directions: automated semantic modeling, agent-driven governance
- 38:52 – Metadata as the strategic “map” in the cloud era
- 40:12 – Ontology, semantic web, and LLMs: explorers need a map
- 41:16 – Wrap-up and thanks
Episode Takeaways
- Context-rich metadata is crucial for empowering both humans and AI to leverage enterprise data effectively.
- Automated, usage-driven knowledge graphs can make documentation scalable and continuously relevant.
- Semantic modeling and the right metadata feed are foundational for reliable AI-generated SQL and analytics.
- In the age of commoditized storage and compute, high-quality metadata is becoming the new strategic asset in data-driven organizations.
