Summary

Software Engineering Daily: Foundation Models for Structured Data

Hosted by: Sean Falconer
Guest: Jure Leskovec, Professor at Stanford, co-founder of Kumo AI
Date: June 23, 2026

Episode Overview

This episode delves into the emerging field of foundation models for structured (tabular/relational) data. Sean Falconer interviews Jure Leskovec about the limitations of traditional predictive modeling, why structured enterprise data demands unique AI architectures, and how graph-based transformer models can bring the "neural network revolution" to databases. The conversation spans the technical underpinnings, practical applications, and the future of deep learning for enterprise data.

Key Discussion Points & Insights

1. Jure Leskovec’s Background and Academic/Industry Crossover

Research focus: AI for structured/tabular/relational data, large-scale Graph ML
Impact: Models developed by his teams are in use at major companies (Meta, YouTube); ex-Chief Scientist at Pinterest
Academia vs. Industry:
- “The future happens at Stanford… it allows us to flow between industry and academia and really understand what are the biggest problems out there that are worth solving.” [02:46]
- Academia explores riskier, long-term and foundational problems without commercial pressure.

2. Predictive Modeling: Definition & Limitations

What is predictive modeling?
- "It's all about forecasting, risk estimation, filling in some information that doesn't exist yet." [05:57]
- Used across domains: loan approvals, fraud detection, customer churn, recommendations.
Traditional pipeline: Manual feature engineering, domain-specific data scientists, slow to build and deploy.
- “You need about two full time people to build a single model.” [10:38]

3. Why Can't We Use LLMs or CV Models for Tabular Data?

Text/image models succeed due to specialized architectures and massive datasets; tabular data is a distinct modality.
Structured enterprise data: Often multi-table, interlinked (relational), quantitative.
- “To process this tabular data, we need special neural networks. It's a different data modality.” [13:07]

4. The Pain of Manual Feature Engineering

Traditional methods rely on summarizing tabular data (via SQL, aggregations) before feeding to models.
“There is kind of an infinite way of creating these summaries... the number of ways you can summarize the data blows up." [15:00]
This process is manual, brittle, slow, and requires retraining for every new task.
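
The blow-up described above can be made concrete with a small sketch. The menus of aggregations, time windows, times of day, and columns below are hypothetical choices made up for illustration, not anything named in the episode; the point is only that even a short menu multiplies into many hand-built features.

```python
from itertools import product

# Hypothetical menus a data scientist might combine by hand (illustrative only).
AGGREGATIONS = ["count", "sum", "mean", "max"]
WINDOWS_DAYS = [1, 7, 30, 90]
TIME_OF_DAY = ["any", "morning", "evening"]
COLUMNS = ["price", "quantity", "category"]


def candidate_features():
    """Enumerate every (aggregation, column, window, time-of-day) summary name."""
    return [
        f"{agg}_{col}_{days}d_{tod}"
        for agg, col, days, tod in product(
            AGGREGATIONS, COLUMNS, WINDOWS_DAYS, TIME_OF_DAY
        )
    ]


features = candidate_features()
print(len(features))  # 4 * 3 * 4 * 3 = 144 candidate summaries
print(features[0])    # count_price_1d_any
```

Adding one more option to any menu multiplies the total, which is why the choice of summaries has to be fixed in advance in a manual pipeline.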

5. Relational Deep Learning & Transformer Revolution

Proposes treating databases as graphs and directly learning representations via generalized attention.
- “Let the attention mechanism... attend over the raw events in the database… to make the accurate prediction.” [16:40]
Analogous to how transformers revolutionized vision (pixels) and language (tokens).
The "foundation model" for tabular data, like Kumo RFM, generalizes across tasks without redoing feature engineering.
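
As a rough illustration of what "attending over raw events" means, here is a minimal single-query scaled dot-product attention in plain Python. It is a sketch under strong assumptions: the event vectors and the query are toy values invented here, and a real model such as the one discussed would learn query, key, and value projections rather than use raw features directly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    Returns (weights, pooled), where pooled is the attention-weighted
    sum of the value vectors.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, pooled


# Toy raw events for one customer: [scaled amount, scaled recency] (made up).
events = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
# Hypothetical query that emphasizes the amount dimension.
query = [1.0, 0.0]

weights, summary = attend(query, events, events)
```

The weights play the role that a hand-written aggregation plays in feature engineering, except that they are computed from the data instead of being chosen beforehand.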

6. Processing Relational Data as Graphs

Is this new?
- “Any database is a graph... but nobody has put it together” for scalable, foundation model-based deep learning. [22:42]
- Graph ML, once niche (for social networks), now shown to apply broadly to enterprise data.
Can be applied to data in databases, spreadsheets, JSON, or other structured/semi-structured sources. [24:11]
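
The "any database is a graph" idea can be sketched in a few lines: each row becomes a node, and each foreign-key reference becomes an edge. The table names, columns, and the `build_graph` helper below are hypothetical and chosen only for illustration; they do not reflect any particular product's API.

```python
from collections import defaultdict

# Hypothetical toy tables: each row is a dict (illustrative data only).
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "products": [{"product_id": 10}],
    "transactions": [
        {"transaction_id": 100, "customer_id": 1, "product_id": 10},
        {"transaction_id": 101, "customer_id": 2, "product_id": 10},
    ],
}
primary_keys = {
    "customers": "customer_id",
    "products": "product_id",
    "transactions": "transaction_id",
}
# (child table, foreign-key column) -> parent table
foreign_keys = {
    ("transactions", "customer_id"): "customers",
    ("transactions", "product_id"): "products",
}


def build_graph(tables, primary_keys, foreign_keys):
    """Build an undirected adjacency map: one node per row, one edge per FK reference."""
    adjacency = defaultdict(set)
    # Register every row as a node, even if nothing references it.
    for table, rows in tables.items():
        for row in rows:
            adjacency[(table, row[primary_keys[table]])]
    # Link each child row to the parent row its foreign key points at.
    for (child, column), parent in foreign_keys.items():
        for row in tables[child]:
            child_node = (child, row[primary_keys[child]])
            parent_node = (parent, row[column])
            adjacency[child_node].add(parent_node)
            adjacency[parent_node].add(child_node)
    return adjacency


graph = build_graph(tables, primary_keys, foreign_keys)
print(sorted(graph[("products", 10)]))
```

In this toy graph the two customers are connected through the product they both bought, which is the kind of multi-hop structure a graph model can attend over without an explicit join.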

7. How These Models are Trained

Training Approach:
- Build pre-trained, schema/task-agnostic models with in-context learning over subgraphs sampled from data.
- "You need large amounts of structured tabular relational data." [26:56]
User workflow:
- Connect database, define outcome (e.g., "churn" as zero purchases in 30 days), prompt model in NL or with a predictive query. [28:06]
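
To make the churn definition concrete, the sketch below labels a customer as churned when they make zero purchases in the 30 days after a reference date. The function name, the forward-looking window, and the sample dates are assumptions for illustration; the episode does not specify how the window is anchored or what the predictive query syntax looks like.

```python
from datetime import date, timedelta


def churn_label(purchase_dates, as_of, window_days=30):
    """Return 1 if there are zero purchases in (as_of, as_of + window_days], else 0."""
    window_end = as_of + timedelta(days=window_days)
    bought = any(as_of < d <= window_end for d in purchase_dates)
    return 0 if bought else 1


as_of = date(2024, 1, 1)
active = churn_label([date(2023, 12, 20), date(2024, 1, 15)], as_of)
churned = churn_label([date(2023, 12, 20), date(2024, 2, 15)], as_of)
print(active, churned)  # 0 1
```

A label like this is what a model would be asked to predict before the 30-day window has actually elapsed.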

8. Performance, Cost, and Infrastructure Needs

Inference:
- Models are "smaller than large language models... so predictions and output are faster and cheaper." [30:39]
- Requires a strong data backend (efficient graph engine) and GPU runtime.
Embedding generation:
- “Now you have the embeddings… that capture the semantic meaning of those entities that is predictive of the downstream activity.” [32:30]

9. Why Attention > Message Passing in Graphs

Attention mechanisms provide more flexibility and context than traditional message passing.
- “The attention mechanism is so flexible and so contextualized... it really allows for much better generalization.” [34:38]

10. Do Foundation Models Outperform Traditional ML Models?

Yes, often achieving "superhuman" performance, as seen with computer vision.
- "It's very hard to engineer perfect features... If you train in a purely data-driven way... you can get to superhuman level of performance." [35:22]

11. Rules vs. Data-Driven Learning

Encoding rules is brittle and incomplete; neural networks leverage the richness and noise of raw data.
- "The world is so unique, so diverse... Data... is king. Let the neural network learn directly from the data." [37:50]

12. Beyond Enterprise Data: Graphs Everywhere

Biology (molecules, proteins), object-oriented code, and more can be modeled as graphs; relational deep learning is widely applicable.
- “The graph view of the world becomes very important whenever you basically have a set of entities, objects... that interact with each other.” [41:03]

13. Multimodality and the Future

Tabular/structured data will be an integral modality in future multimodal foundation models.
- “We need special domain-specific, modality-specific encoders.... for this structured tabular data.” [42:33]

14. Final Thoughts: The Coming Revolution

The world of structured data has been left behind by AI, but is now primed for the neural network revolution.
- “We have now foundation model for structured relational data… that can now learn directly over this structured tabular data. So actually the structured data world… is ready for the AI revolution to actually take place there as well.” [44:04]

Notable Quotes & Memorable Moments

On the future of structured data and AI:
“The structured data world that has humongous business impact is ready for the AI revolution to actually take place there as well.”
— Jure Leskovec [44:04]
On attention mechanisms for tabular data:
"It's really about generalizing the notion of attention mechanism to this structured multitabular data."
— Jure Leskovec [16:40]
On the need for modality-specific models:
"To process images, we have special neural networks. To process text, we have special neural networks... to process this tabular data, we need special neural networks. It's a different data modality."
— Jure Leskovec [13:07]
On practical inefficiency of traditional ML workflows:
"You need about two full time people to build a single model… I need two employees to support one model."
— Jure Leskovec [10:38]
On the inevitability of the neural network revolution in structured data:
"We shouldn't be surprised by it because we've seen it before... for the machine learning database, structured data world, the same revolution is out there."
— Jure Leskovec [36:35]

Important Timestamps

01:45 — Jure Leskovec introduction and academic background
05:56 — Predictive modeling basics and importance
10:38 — Manual feature engineering; why it’s slow and costly
13:07 — Why tabular data needs its own model architectures
16:40 — Generalizing attention mechanisms to relational databases
22:42 — Viewing databases as graphs; what’s new in this approach
24:41 — How training works and distinction from LLMs
28:06 — How users interact with these models; predictive queries
30:39 — Cost and performance of inference
34:38 — Why attention-based graph models generalize well
35:22 — Why foundation models outperform traditional models
41:03 — Applications of relational deep learning beyond databases
42:33 — Structured data as a future modality in AI
44:04 — Final thoughts on the AI revolution in structured data

Summary Takeaways

Structured, relational (tabular) data underpins much of enterprise predictive analytics; it is overdue for the same neural network advances that have disrupted vision and language applications.
Treating databases as graphs and applying attention-based transformer architectures unlocks generalization and eliminates painful manual processes.
Foundation models for tabular data (like those from Kumo AI) can generalize across tasks, follow prompts, and potentially exceed human performance with less overhead.
This approach is widely generalizable—to domains from biology to software—whenever data naturally forms graphs of entities and relationships.
The adoption of these models could bring transformative efficiency, flexibility, and power to the core data operations of modern enterprises.

Transcript

[00:00]
A
Predictive modeling is a core element in modern systems and powers capabilities such as fraud detection, loan approvals, and recommendation systems. These systems typically operate on structured relational data stored in enterprise databases with rows, columns, and interlinked tables. While computer vision and natural language processing have undergone a neural network revolution, the tabular data layer underpinning predictive modeling still largely relies on manual feature engineering and task-specific models. Relational deep learning proposes a new approach. It treats databases as graphs and applies transformer-style attention mechanisms directly over structured relational data. Researchers are now building foundation models for tabular data that aim to generalize across predictive tasks without painstaking feature engineering. Jure Leskovec is a professor of computer science at Stanford University. He previously served as Chief Scientist at Pinterest and was an investigator at the Chan Zuckerberg Biohub. Most recently he co-founded the machine learning startup Kumo AI. In this episode he joins Sean Falconer to discuss the limitations of traditional predictive modeling, why structured enterprise data requires its own modality-specific neural architectures, how graph transformers generalize attention to relational databases, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
[01:46]
B
Jure, welcome to the show.
[01:47]
C
Thanks for having me. Great to be here.
[01:49]
B
I'm really excited to get into this. I think there's a variety of topics we can dive into today, but maybe before we get there, just kind of ground the audience a little bit. Like, who are you? What do you do, and what's your background?
[02:00]
C
Yeah, I am primarily a professor in the computer science department, in the AI lab, at Stanford. Been there 15 years. My research focuses on AI, especially AI over structured, tabular, relational type data. I work a lot with graph data, and the models we've developed are today used at Facebook, at Meta, at YouTube, and at a number of different places. In my career, I was also chief scientist at Pinterest for 6 years, grew Pinterest from 150 employees to post-IPO, and was basically building large-scale AI machine learning platforms there.
[02:39]
B
What drew you back to academics after having experience at a place like Pinterest?
[02:43]
C
I think Stanford is the most amazing place. It's where the future happens. And what is also amazing about Stanford is that it allows us to kind of flow between industry and academia and really understand what are the biggest problems out there that are worth solving. We go to the industry, we come back with the ideas. The students at Stanford are amazing, and I would really say kind of the future happens at Stanford, and that's the most exciting part to me.
[03:08]
B
Yeah, I spent some time at Stanford myself as a student and then was drawn out to industry and never found my way back. But I guess now that there's so much going on in the space of artificial intelligence, especially with companies, you have your OpenAIs, your Anthropics of the world, and every large cloud provider doing amazing work in the space. How do you think in terms of where research is going to make significant contributions versus where maybe private sector companies are going to make significant contributions?
[03:40]
C
That's a great question. I would say research academia is different, right? We cannot compete on scale, we cannot compete on pushing products to customers and things like that. That's why we have startups, that's why we have industry. And whenever we make some new research breakthrough and we want the world to see that, we spin off a company, we spin out a startup and scale it up there. But then at the same time, there is huge value in academia, in research, in education, because the risk profile for us is very different. Right? Like we can truly explore, we can truly fail. We don't have performance reviews, we don't have all that that is in the industry. So we can always kind of ask about what are the paths not walked yet? What are the interesting new directions that maybe the industry is too conservative to take? Can we show the path there? And there's been examples of this throughout history where basically academia found a new path or researchers found a new path where the industry was just like plowing forward full scale. And right now it's similar, I would say, in the field of AI. Of course, the frontier labs are making humongous progress, scaling up these models and so on, but there is so much unexplored, there is so much more to do, and that's what we are focusing on in our research.
[04:50]
B
Yeah, I mean, I think a big part of that would be there's no sort of commercial obligation in academics. You could chase a problem that may or may not ever have some sort of commercial application just because it's an important problem, or at least to the individual to explore and maybe means something in the long run to how we think about ourselves, our own intelligence, some other type of scientific endeavor.
[05:13]
C
Exactly. And even with that. Right. I would say that we are very careful what kind of questions we ask. And we always ask ourselves, if we solve this problem, who's going to care about it? Who can benefit from the solution? So being connected to the real world is a very important part of the way we think of the research we are doing.
[05:29]
B
I want to get into this a little bit and first talk a little bit about predictive modeling, which I think is something that has a long, rich history in machine learning, artificial intelligence. There's fairly simplistic ways of doing some form of predictive modeling and then there's very sophisticated approaches as well. Can you give a little bit of background in terms of sort of the history? What are we talking about when we say predictive modeling? Why does that matter? And then what is kind of the history of the discipline?
[05:57]
C
Yeah, that's a very interesting question and I think as we go through it will become clear why this is interesting. Right? But predictive modeling has been around forever and it's all about forecasting, risk estimation. It's about filling in some information that doesn't exist yet. And where does this matter? This truly matters for, let's call it quantitative decision making, right? If I go ask for a loan, there is a predictive model that estimates the probability that I'm going to pay back that loan. If I'm in a hospital, the hospital wants to estimate the risk that, if I get discharged, I get readmitted, right? If I am, let's say, dealing with customers, I want to estimate what is the lifetime value of the customer. I want to estimate how likely this customer is going to churn. I want to estimate what next product or what next show item to recommend to the customer. If I'm a, let's say, financial institution, I want to estimate what's the likelihood that this particular transaction is fraudulent. When, let's say, a user logs in, I need to estimate how likely is this a stolen identity, somebody else is logging in, and things like that. These are all predictive type problems where based on the historical patterns, based on the data, we want to estimate something that we don't know yet, we want to forecast something and so on. And this has been around for a very long time. People in machine learning, statistics, data science have been building these predictive models for the last 20, 30 years. And the point is that every percentage improvement in accuracy of these models means humongous business impact, right? Even 1, 2% improvement can have humongous business impact.
[07:34]
B
How would I think about this in terms of, with forecasting, say it's like financial forecasting. I'm taking a bunch of history and I figure out what is the function that describes that history and then I'm projecting that out to see where could this, I don't know, trend in the stock market go or something like that. How is it when you think of something like classification? So I want to go and take a bunch of history as my training set and I'm going to use that to train some sort of classifier to figure out whether an email is spam or not. Is that predictive modeling or is that a different type of classification of how we would think about that AI model?
[08:10]
C
This is all what I would say falls under predictive modeling. Both examples, time series forecasting, any kind of classification churn modeling, as you gave the example, where the idea is that based on some historic patterns you are trying to forecast, is this person going to cancel subscription in the next month? And because we don't have the information what is happening next month, we have to forecast it.
[08:33]
B
If we take a specific example like recommendation systems, I think Amazon's been pretty famous for using recommendation systems for products for a long time. Like, how do those systems typically work?
[08:45]
C
Yeah. So the reason we started by opening with this is because this field has remained practically unchanged for the last 20, 30 years. Right. It's all based on this idea that you bring in, let's say a data point, a unit of something that is described with a set of characteristics or a set of features, and then you are making that prediction. Right? So for example, for a recommender system, the idea would be that you have a description of the user which would be maybe how long ago did the user register, when was the last time the user logged in, what were the last seven products the user visited and what categories are these products from, and so on. And you have this kind of profile of the user. And then you would say, okay, I also have to now build a profile of the product. And now I need to learn some function that takes the profile of the user and the profile of the product and tells me how likely is, let's say, the user to purchase that product. And then if I find top 10 products that the user is most likely to purchase, I show those to the user and my sales go up 20, 30, 50%. Right. That's how this is generally done. And now the majority of the work goes into building that scoring function that takes the user profile and the item profile and gives me the prediction. And if you say, how accurate is this prediction going to be? You have two aspects to it. One is how accurately do I build the user and the product profile, let's call it this way, and then how powerful that predictive model on top is. And the point is that this is like super painful, super slow and super manual to do. Right? Like you need to hire a team of data scientists. They need to build these profiles. This is called feature engineering. They come up with some historical summaries of the user activity, put those into the user profile, they do something similar with the product and then they create these training data sets to build the models on top.
And the models on top can be this kind of decision tree, XGBoost, CatBoost type things that work kind of well in practice and are completely respectable, as well as more sophisticated neural network approaches and so on. But the bottom line is that you need about two full time people to build a single model, right? Like if you say how expensive this is, it's like I need two employees to support one model. So if I now want to have 10 models in production that are making these decisions on the fly, I need that number times two, number of people to support that.
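[Editor's sketch] The traditional pipeline described in this answer (a hand-built user profile, a hand-built product profile, a scoring function, then the top products by score) might look roughly like the following. The feature names, the linear form of the score, the weights, and the sample data are all assumptions made up for illustration; in practice the scoring function would be a trained model such as a gradient-boosted tree ensemble.

```python
import heapq


def score(user_profile, product_profile, weights):
    """Hypothetical linear scoring function over hand-engineered features."""
    features = {
        "category_match": 1.0
        if product_profile["category"] in user_profile["recent_categories"]
        else 0.0,
        "days_since_login": float(user_profile["days_since_login"]),
        "product_popularity": float(product_profile["popularity"]),
    }
    return sum(weights[name] * value for name, value in features.items())


def top_k(user_profile, products, weights, k=10):
    """Return the k products with the highest score for this user."""
    return heapq.nlargest(
        k, products, key=lambda p: score(user_profile, p, weights)
    )


# Made-up profiles and weights for illustration.
user = {"recent_categories": {"shoes"}, "days_since_login": 2}
products = [
    {"id": "A", "category": "shoes", "popularity": 0.3},
    {"id": "B", "category": "hats", "popularity": 0.9},
    {"id": "C", "category": "shoes", "popularity": 0.8},
]
weights = {"category_match": 2.0, "days_since_login": -0.1, "product_popularity": 1.0}

recommended = [p["id"] for p in top_k(user, products, weights, k=2)]
print(recommended)  # ['C', 'A']
```

Every feature in `score` had to be chosen and computed by hand, and a different task such as fraud detection would need its own set, which is the cost the speaker is pointing at.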
[11:06]
B
And if I build like my e-commerce recommendation system, but I also need fraud detection, I can't just pick up my recommendation system model and apply it to fraud detection. I got to go train like essentially a new model. I got to do feature engineering just for the fraud detection, probably use maybe even a different type of model to train and test against and then probably operationalize that model with a different set of people.
[11:28]
C
Exactly, exactly. And I would say now, like what is the exciting thing, right? Why are we talking about the past? I think the exciting thing here is that this entire area hasn't seen real progress in the last 20, 30 years. And if you think about what has happened in the broader AI ecosystem is that we went fully neural network. Right? What I mean by that is on the, let's say in computer vision we used to do some edge detection, some feature detection and then build a classifier on top to say what's in the image. We don't do that anymore, Right. Today a neural network just learns directly from the pixels of the image. The same thing I would say happened on the, let's call it natural language processing area, right? Where we used to do all kinds of parsing and feature extraction from sentences to try to say something about what the text is saying. And today the attention mechanism just attends over the tokens of the text and kind of the AI, the reasoning is born, right? So I think the exciting thing here is that machine learning hasn't gone through this neural network revolution. And that's the exciting new thing here is that there is the neural network revolution for machine learning ready to happen. That completely changes how we are building these models, how accurate they are, how much feature engineering it Takes and things like that.
[12:43]
B
So why is this kind of hard in practice? And why couldn't we just take. We've had this revolution around things like large language models that understand text very well, and now they understand images and audio files and even video. Can we not take those and just apply them to this problem?
[13:01]
C
I think the argument is the following. I would say if you ask, what kind of data are we using when we are solving these predictive modeling problems? It's structured relational data. So this is data that is stored in tables that are interlinked with primary-foreign key relations. This is usually stored in a database in some structured form, right? And this is the most useful data that enterprises have because it's kind of the ground truth of the enterprise. All the events, all the activity, it's all stored there. Right? So now what I'm saying is: to process images, we have special neural networks. To process text, we have special neural networks. And to process this tabular data, we need special neural networks. It's a different data modality. Image has its own set of networks that are different architectures trained in certain ways. Text has a set of networks trained in specific ways and so on. And the tabular data also needs its own set of neural network architectures and its own way to train this that can directly kind of attend over this structured tabular data. So I think it's a different data modality. That's why it needs a different approach.
[14:08]
B
Right? I mean, I think with something like text, where there's probably billions, if not trillions of examples of how sentences come together, and I can take a document from one place and a document from another place, and learning one of those documents can probably help me infer something about the other document. I think with the way I think about this problem, around tables, rows, columns in a database, is there that much I can, from a pattern standpoint, learn from one table versus another table? Those patterns seem like they could be fundamentally different in terms of how the person has modeled the data. So it makes a lot of sense that this is sort of a different class of problem. But how do you go about, I guess, attacking that problem? Does this make sense, in terms of the patterns from one table not necessarily allowing you to infer something about the patterns from another table?
[14:57]
C
That's a great question. I think the way to think of this is, as you have this private data, right, you have patterns, properties in it that are unique to you. I think another point to make is also like, you cannot just textify a table and give it to a large language model. Large language models are amazing at what I would say qualitative, human-like reasoning, but they are not really good with numbers, if you want to say it very simply. And if you think about what are we storing in tables, we store quantitative data. So we need to do quantitative reasoning, not qualitative reasoning, over huge amounts of data. Another point I think that is important here is to say that no enterprise has data in a single table, right? You have data spread across multiple tables. Usually you would have your customer catalog, you would have your product catalog, you would have a set of transaction records, you would have your website browsing, click data, you would have your supplier data, you would have your returns data. And all these tables are interconnected. And the only way to learn over this is to learn over this collection of tables as they are interconnected with each other. And maybe to say more, right? In the past, the way we dealt with this, we would say, oh, let's take the user table, let's take the transaction table, let's join them and then somehow summarize the number of purchases, the number of transactions you had in the last time period. But there is kind of an infinite way of creating these summaries. I can count, I can sum over some time period, over shorter time period, in the mornings, in the evenings, I can add the prices, I can look at product categories, right? So the number of ways you can summarize the data kind of blows up. And the problem with machine learning is that we kind of predetermine how to summarize the data before we start building the model.
So the way to make these things better is to generalize the attention mechanism to attend over the raw events in the database and learn how to summarize them to give you that prediction. And that's the key differentiator. You don't need to be joining tables anymore. Instead, let the attention mechanism do it. Very similarly to a language model, where it attends over the previous words to say, okay, what's truly the meaning of this word here, we are saying: if we are making a prediction, if we are filling in, let's say, some cell in a table, let's attend over the other rows in the same table, other columns, other tables, far out, and figure out how to bring all that information together so that we can make the accurate prediction. So it's really about generalizing the notion of attention mechanism to this structured multitabular data.
[17:31]
B
And if we have that, what does that unlock from an application standpoint? Today I feel like a lot of people are trying to apply large language models to databases for the purposes of being able to do intelligent natural language to SQL conversion. Does this yield a better version of that?
[17:48]
C
That's a great point. Right. So text to SQL is amazingly useful. But if you think about SQL, SQL is summarizing what has happened last month, what has happened last week. So you are kind of aggregating some past and maybe creating a dashboard to understand historical trends, and then maybe you can use those historical trends to do some kind of qualitative decision making about what to do tomorrow. But if you think about predicting transaction fraud, deciding which customers to send an offer to, and so on, for that you need predictive modeling, right? So text to SQL won't get you anywhere with that. You may use SQL to generate historical patterns and then build a model on top. But as we talked about, that's super brittle, manual, and takes a lot of time. So the approach we invented in my research group at Stanford, and then we founded a startup around it called Kumo AI, is this notion of relational deep learning, where we basically take the transformer architecture and generalize it so that it can attend over this structured relational enterprise data. And the key to this approach is to think of the data as a graph, to think of your enterprise's data as a set of connections between the entities in your database. Right? To think of the tables, how they are interconnected, as a graph, and then generalize the attention mechanism to be able to attend over this relational structured information.
[22:12]
B
So, I mean, thinking of a database as a graph: I think people in the data modeling world have been using those concepts for a long time. How is this different? If you think about designing schemas, people used to build entity relationship diagrams where you essentially have your tables as nodes and then relationships defined as edges across foreign keys and so forth. Is this something fundamentally different, or is it using a similar concept as essentially the basis for doing this deep learning?
[22:43]
C
Yeah, it's a great point. Any database is a graph, right? As you said. And these concepts have been around for a very long time. But I feel like nobody has put one and one together, in a sense. We've been working on graph-based machine learning, graph transformers and things like that for a long time, but it was mostly applied to social networks, and that community didn't realize that actually any database is a graph. And I think the database community was so stuck in feature engineering and running historical SQL queries that they did not think about, oh, how can we take these AI tools and apply them to the database? Right. So what I'm saying in some sense is super natural, right? We knew for a very long time that the database is a graph. Nothing changes. But what in some sense changes is that the field of graph machine learning is no longer this obscure thing that only applies to social networks; it applies to any database. And the benefit that comes with it is the same neural network revolution that we have in computer vision, in text, natural language understanding, videos and so on, now brought to another data modality, which is the structured tabular data modality, with the same set of benefits and the same set of amazing outcomes that we have already seen play out in other data modalities.
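A minimal sketch of the "any database is a graph" idea described above, assuming two hypothetical tables (`users` and `orders`) linked by a `user_id` foreign key. The table names, rows, and the `build_graph` helper are illustrative and do not come from the episode: each row becomes a node and each foreign-key reference becomes an edge.

```python
from collections import defaultdict

# Hypothetical toy tables; names and values are illustrative only.
users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Ben"}]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 25.0},
    {"order_id": 11, "user_id": 1, "amount": 40.0},
    {"order_id": 12, "user_id": 2, "amount": 15.0},
]


def build_graph(users, orders):
    """Turn rows into nodes and foreign-key references into undirected edges."""
    adjacency = defaultdict(set)
    for row in users:
        adjacency[("users", row["user_id"])]  # register the node even if isolated
    for row in orders:
        order_node = ("orders", row["order_id"])
        user_node = ("users", row["user_id"])  # follow the foreign key
        adjacency[order_node].add(user_node)
        adjacency[user_node].add(order_node)
    return adjacency


graph = build_graph(users, orders)
print(sorted(graph[("users", 1)]))
```

Nodes are keyed by `(table, primary_key)` so rows from different tables never collide, which is one simple way to keep the schema visible in the graph.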
[24:01]
B
Does this only sort of work against like a traditional database or could you potentially generalize this to other tabular forms of data like spreadsheets?
[24:11]
C
Yeah, I mean, where the data sits kind of doesn't matter that much. It can be in a database, it can be in Salesforce, it can be in Databricks, it can be in Snowflake, in spreadsheets, in JSON, as long as it's structured or semi-structured. And it may include images, it may include text, columns and categorical values, and all kinds of geographic information. All of it works nicely in this framework.
[24:37]
B
How does training work and how is it different than training with traditional transformer models?
[24:42]
C
Yeah, that's a great point. So far we said two things, right? The first thing we said was that any set of enterprise data is structured or semi-structured and is usually stored in relational tables. These relational tables are a graph. In the old days we would take a graph neural network and apply it to the graph. But today we take transformers, in particular graph transformers, which have a generalized notion of attention and positional encodings and can attend over this structure to give you the prediction. Okay, so we have these two steps. Now the third step is: how do we train? And we have, I would say, two options here. One option is to actually build a pre-trained foundation model that is database-schema and predictive-task agnostic. What this means is that you have a single pre-trained model that can connect to any structured data, because it just represents and thinks of it as a graph, and then you give it the specification of the predictive task on the fly and the model is able to make that prediction. So you don't even need to build task-specific models; you basically have the same ChatGPT-type experience, but now for predictive modeling questions: churn, fraud, readmission prediction, all kinds of advertising use cases, customer 360 use cases, marketing, sales and so on. And the way this is trained, as you ask, is that it's trained basically to teach the model how to do in-context learning, right? So basically how to look at subsets of your database and use those examples to generalize and make the prediction. And when we say subsets of the database, these are small local subgraphs around the entity that you are making the prediction about.
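A rough sketch of what in-context prediction without task-specific training could look like. This is not the model discussed in the episode: it swaps in a simple softmax-weighted vote over labeled context examples as a stand-in for a frozen network. The feature layout, the `in_context_predict` name, and the data are all assumptions for illustration.

```python
import math


def in_context_predict(context, query, temperature=1.0):
    """Predict a label for `query` from labeled context examples, with no training.

    Stand-in for a frozen model: a softmax over negative squared distances
    weights each context label, loosely like attention over in-context examples.
    """
    scores = [
        -sum((q - x) ** 2 for q, x in zip(query, feats)) / temperature
        for feats, _ in context
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * label for w, (_, label) in zip(weights, context)) / total


# Hypothetical context rows: ([purchases in last 30 days, days since last purchase], churned?)
context = [
    ([5.0, 2.0], 0.0),
    ([4.0, 3.0], 0.0),
    ([0.0, 40.0], 1.0),
    ([1.0, 35.0], 1.0),
]
active_score = in_context_predict(context, [5.0, 1.0])
lapsed_score = in_context_predict(context, [0.0, 38.0])
```

The point of the sketch is only that nothing is fitted per task: changing the context examples changes the prediction while the function itself stays fixed.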
[26:32]
B
So if you're doing in-context learning, is there learning, though, that goes on first in order to sort of force this attention mechanism? When I think of training traditional large language models against text, the input is all these massive amounts of text that then become tokens. Those are essentially the inputs of the model that help adjust the weights. So is something similar going on? I think I'm not quite following what comes before the in-context learning.
[26:57]
C
So what you have to do here is you also have to amass a humongous amount of pre-training data. This pre-training data now needs to be structured, so you need to amass a number of different tabular databases, things like that. And then you also need to define a number of different predictive tasks on top of this data. What is interesting is that we just published a paper about a method called plural, where we can basically synthetically generate a lot of data and then train these models on top of that. But you are exactly right. The same way that training large language models requires large amounts of text, here you need large amounts of structured tabular relational data.
[27:36]
B
In terms of, say, a user interacting with this, they want to be able to do predictions against their own databases or tabular data. What is the input there? Because, again, going back to the large language model, my input is essentially text that then becomes tokens, which is the same thing that the model started with as a training corpus. Here it's a bunch of database data, graph structures of these databases. So how do I actually ask questions about churn prediction?
[28:06]
C
Yeah, great question. So at setup time you need to connect the model to your database and say, here's my database, these are my tables. And then to prompt the model you need to specify the task. You can specify the task in natural language, or you can specify it in a domain-specific structured language that we called predictive query. And, you know, to say I want to predict churn, you need to specify what that truly means. So you could say, I want to predict whether the count of transactions or count of purchases of this particular user is going to be zero over the next 30 days. Right. That's the proper definition of churn. And if you define it this way, then the model is going to go into the database, take a set of subsamples from the database, and send them through the neural network to give you the prediction for that specific, let's say, user or customer: what's the likelihood they will have zero transactions in the next 30 days? Right. So the way this operates is that the model basically goes and fetches data from the database, and then this data is sent through a frozen model to get an output in the end. So the point is there is no task-specific training needed here. There is no need to train the model for your specific database. It's very similar to large language models, where the model is pre-trained, you give it text, it understands text, you ask it a question, you get the answer. Here it's similar: the model goes to your data, understands the data and gives you this forward-looking predictive answer. Of course, just to say, this kind of ad hoc querying is very useful for, let's say, humans or agents using this type of model, where you don't know the question ahead of time.
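The churn definition above ("zero transactions over the next 30 days") can be made concrete with a small sketch. It computes the label on historical data where the outcome is already known; it is not the predictive query language mentioned in the episode. The `churn_label` function, its parameters, and the transaction log are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical transaction log: (user_id, transaction_date).
transactions = [
    (1, date(2024, 1, 5)),
    (1, date(2024, 2, 10)),
    (2, date(2024, 1, 20)),
]


def churn_label(transactions, user_id, as_of, horizon_days=30):
    """Return True if the user has zero transactions in (as_of, as_of + horizon]."""
    window_end = as_of + timedelta(days=horizon_days)
    count = sum(
        1
        for uid, day in transactions
        if uid == user_id and as_of < day <= window_end
    )
    return count == 0


as_of = date(2024, 2, 1)
user_1_churned = churn_label(transactions, 1, as_of)  # buys again on Feb 10
user_2_churned = churn_label(transactions, 2, as_of)  # no activity in the window
```

Pinning the definition to an explicit `as_of` date and horizon is what turns a loose notion like "churn" into something a model can be asked to predict.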
If you are a large bank and you need to predict 1 million times per second whether a given transaction is fraudulent or not, you would of course go and maybe fine-tune a smaller version of the model to make this faster and more cost effective. So you can use the big pre-trained model or you can fine-tune it for a specific task to get more speed. Again, similar to what we see in large language models.
[30:12]
B
Is the mathematics behind inference for these relational deep learning neural networks similar to that of large language models?
[30:23]
C
You mean in terms of the model sizes and things like that?
[30:26]
B
Yeah, the model size, and I guess the matrix computation that you're going to be doing behind the scenes. The bottom line is: what is the cost of inference, and how does that compare to the other models that have now become familiar to people at large?
[30:40]
C
That's great. Right? So these models are transformer, attention-based models, so you need GPUs to run them effectively. They are smaller than large language models, so the amount of compute that is needed is way smaller. What this also means is that predictions and outputs are faster and cheaper. So that's the way I would describe this. What's the difference? The difference is that you need to have a very strong data backend where the data is represented as a graph. So you need a special, let's say, graph engine that is tuned for this type of AI workload and that allows you to make these computations scalably and quickly. And on the other side you need a GPU for the results to come out. But since the models are smaller than large language models, you know, they are, let's say, sub-billion parameter models or something like that, you can run them quite efficiently and quite quickly.
[31:35]
B
What's involved with getting that graph structure for your own database?
[31:39]
C
Yeah, so what we have built at our startup, Kumo AI, is exactly the infrastructure that allows you to basically take in any database. The engine will internally optimize that representation into this graph form so that these graph transformers can then be run efficiently on top. And that's the hard part, right? Linear structures are kind of easier because you can just feed them through. Graphs are hard because they are these interconnected sets of objects. You cannot chop them, you cannot linearize them. So you really need to be able to do these quick breadth-first searches or subsampling of subgraphs over this database. And that's the hard part. It's basically building the optimized infrastructure that allows you to do this at the scale of tens or hundreds of billions of nodes and edges.
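A small sketch of the breadth-first subgraph sampling mentioned above, under stated assumptions: the graph is a plain in-memory adjacency dictionary, and `sample_subgraph`, `max_hops`, and `max_neighbors` are hypothetical names. It says nothing about how a production graph engine works at billions of nodes; it only illustrates the idea of bounding the neighborhood around a seed entity.

```python
from collections import deque


def sample_subgraph(adjacency, seed, max_hops=2, max_neighbors=2):
    """Breadth-first sample of the neighborhood around `seed`.

    Visits nodes up to `max_hops` away and follows at most `max_neighbors`
    edges per node, which bounds the subgraph size.
    """
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        # Sorted for determinism; a real sampler would likely pick at random.
        for neighbor in sorted(adjacency.get(node, ()))[:max_neighbors]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical user -> orders -> products graph.
adjacency = {
    "user:1": ["order:10", "order:11", "order:12"],
    "order:10": ["user:1", "product:7"],
    "order:11": ["user:1", "product:8"],
    "order:12": ["user:1", "product:9"],
    "product:7": ["order:10"],
    "product:8": ["order:11"],
    "product:9": ["order:12"],
}
subgraph = sample_subgraph(adjacency, "user:1")
```

Capping the fan-out per node keeps the sampled subgraph small even when a single entity, such as a very active user, has a huge number of linked rows.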
[32:25]
B
Is that equivalent to, I guess with text generating the embedding?
[32:30]
C
That's very interesting. Yeah, exactly. What the model does internally is generate the embedding. Right. So the embedding in text kind of captures the, let's say, semantic meaning of the word, if you think of it this way. Here, in this database-as-graph view of the world, you have the embeddings of your entities, which capture the semantic meaning of those entities that is predictive of the downstream activity, whatever the task is that you are trying to solve.
[32:58]
B
The concept of a Graph neural network. How long has that concept been around for?
[33:03]
C
So I think graph neural networks were invented maybe 10 years ago, something like that, even a bit less. That's when the field started. And I would say a graph neural network is this idea that, if you have the data organized in a graph, when a node wants to compute something about itself or make a prediction (maybe the node is a user), then you are not only using the information about that node, but you also use the information from neighbors and neighbors of neighbors and so on. And the idea is that basically neighbors are passing information, messages, to their neighbors, and this way to the node in the center that is of interest. That has worked really well for small-scale, task-specific models, but now the field has moved forward to these generalized graph transformer type architectures, where we are not passing information across the edges; it's actually the attention mechanism that attends over the center node, its neighbors, neighbors of neighbors and so on. And you can see how this nicely applies to the database world, where if we're making a prediction about an entity, maybe a user, maybe a product, we are then attending over the nearby tables, one-hop tables, two-hop tables, and bringing that information all into the same place.
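To make the contrast above concrete, here is a toy sketch assuming tiny hand-written feature vectors. `mean_message_passing` averages a node with its direct neighbors (one round of message passing), while `attention_pool` lets the center node weight every node in its multi-hop context by similarity. Both functions are simplified illustrations with no learned parameters and are not the architectures used in practice.

```python
import math


def mean_message_passing(features, adjacency, node):
    """One round of message passing: average the node with its direct neighbors."""
    group = [node] + adjacency[node]
    dim = len(features[node])
    return [sum(features[n][i] for n in group) / len(group) for i in range(dim)]


def attention_pool(features, context, query_node):
    """Softmax attention of `query_node` over every node in `context`.

    Uses raw dot products as scores; real models apply learned projections.
    """
    query = features[query_node]
    scores = [sum(q * k for q, k in zip(query, features[n])) for n in context]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(query)
    pooled = [
        sum(w * features[n][i] for w, n in zip(weights, context))
        for i in range(dim)
    ]
    return pooled, weights


# Hypothetical chain: user - order - product (product is two hops from user).
features = {"user": [1.0, 0.0], "order": [0.0, 1.0], "product": [1.0, 1.0]}
adjacency = {"user": ["order"], "order": ["user", "product"], "product": ["order"]}

local = mean_message_passing(features, adjacency, "user")
pooled, weights = attention_pool(features, ["user", "order", "product"], "user")
```

In one round, message passing only sees the direct neighbor (`order`), whereas the attention version can put weight on the two-hop `product` node directly, and the weights depend on the content of each node.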
[34:15]
B
So is the value of using the attention mechanism to sort of distribute these weights over the graph, versus the traditional message passing that happens in a graph neural network, that it's essentially more dynamic, so instead of a purpose-built model I can have a generalized model that I can apply to essentially any database problem in this particular context?
[34:39]
C
That's exactly the right intuition, right? The attention mechanism is so flexible and so contextualized and context dependent that it really allows for much better generalization and for very effective pre training. This then means that these types of models can really be applied to a very large diverse set of use cases and can generalize to them very effectively.
[35:04]
B
How does the performance of these more general models compare to a more traditional approach, where I take my two data scientists and build a purpose-built model using my domain-specific data? Obviously it's more rigid. But is there a quality difference?
[35:23]
C
There is a quality difference, and we shouldn't be surprised that there is one. The way it works is the same as it worked in computer vision, and in LLMs as well, where we are basically getting to superhuman-level performance, right? Let's say in computer vision you want to build a model that detects whether there's an elephant in the image or not. You could be saying, hey, I studied zoology, I know elephants inside out, I'll build a perfect elephant detector. And if somebody would go and say, I know how to detect elephants, I'll build these amazing features to detect elephants, everyone would be laughing at them. And now when we go to tabular data, I think the outcome is similar, right? It's very hard to engineer perfect features that give you that accurate prediction, the same way it's very hard to engineer perfect features that detect whether something is an elephant or not. So you train, in a purely data-driven way, a neural network that starts attending from raw pixels (in our case this would be rows and columns and cells of the database) and learns how to aggregate all that information into this representation, this embedding, that then tells me: is this an elephant or not, or will this user churn or not? It's kind of the same effect, right? So the point is that with these types of models we can get to superhuman-level performance. And what I'm trying to say is that we shouldn't be surprised by it, because we've seen it before, just as in the computer vision example I gave, which today nobody is surprised about. For the machine learning, database, structured data world, the same revolution is out there. And it's not surprising, it's amazingly natural. That's what I'm trying to say.
[37:10]
B
Kind of just taking a step back and looking at the larger trend that's happened around the field of AI, especially in probably the last decade or so where we've gone full neural networks: it's a lot more of this bottom-up learning, where we're learning the patterns, versus the early days of rule-based systems, where we were trying to encode all the rules that we could imagine into some sort of tree structure that we then navigate. Why do you think that this approach has ultimately been so much more successful than trying to take our knowledge of that system and encode it into a set of rules that a computer can execute?
[37:50]
C
That's a great question. And exactly what you just said also applies, let's say, to this predictive modeling world, right, where we cannot anticipate all the rules, all the hard-coded signals that give us that prediction. And I think the reason is the following. Right. The world is so unique, so diverse. There are so many inputs that it's very hard to preconceive all of them and write them down as rules or signals or whatever we are talking about. Data, which kind of captures the world, is king. So let the neural network learn directly from the data, directly from those raw signals, as noisy as they are, how to combine them into the final signal. I think the reason it works so well is that when we are designing these rules as humans, we don't understand, let's say, the noise. We don't really understand the richness. And also our visual and perception neural networks are already learning how to turn these raw, noisy inputs that our bodies are sensing into some higher-level representations. Right. So it's very hard for us to reconstruct. The way we think is itself through a neural network. So rather than having one neural network build deterministic rules, we should just train our artificial neural network on the same set of inputs and it will be very good.
[39:13]
B
Yeah, I mean, in a lot of ways it seems to, I guess, mimic the behavior of humans or maybe even animals to some degree in terms of how they learn. If you have a baby and it's trying to learn how to navigate the world, it's not like the parent is giving them, here's your set of rules that you need to execute in order to crawl across the floor. They're kind of doing that more through, I would say, plain experimentation. It's almost like reinforcement learning or something like that.
[39:40]
C
Exactly. And then of course, right, when you talk about rules, I would say those are also in some sense important. But the way I think of those is more like explanations. Right. When the model, when the neural network makes a decision, makes a recommendation, you can always ask why. And at that time, it's important for it to give you some general rules, some general reasoning: what led it to make this type of prediction or recommendation? Also, what is important, I think, on this machine learning or predictive modeling side of things is having very strong accuracy estimates. You cannot have hallucinations. It really needs to be data driven and rooted in the patterns that are in the data. And with these models you can achieve that as well.
[40:24]
B
Now, this idea of relational deep learning: you're applying this to databases, helping understand tabular structure. But there are lots and lots of other domains that can be described in terms of graphs. So can this approach be applied to other domains? Even if I think about object-oriented programming, I can describe the class structure and the relationships between certain objects as a graph. And if you take this to the world of biology and biological structures, I'm sure a lot of those things could be described as graphs. Is there a way to apply this same approach to other types of problem domains?
[41:03]
C
Oh, definitely. I think that's very interesting, right? If you think about biological structures, molecules, proteins and things like that, those are essentially graph structures, right? And even models like AlphaFold and so on are really in the end learning to reason over this graph of amino acids and this graph of proximities as the protein starts to fold. The same happens at the scale of, let's say, small molecules for drug development and so on. So I would say this: the graph view of the world becomes very important whenever you basically have a set of entities, objects (these can be atoms, amino acids, users, products, whatever those entities are) that interact with each other, right? And now what are you predicting? You can be predicting or estimating the toxicity of the molecule, you can be outputting the 3D structure, the 3D coordinates of the protein, or you can be doing fraud detection over this interconnected graph of entities to understand whether a transaction is fraudulent or whether the account has been taken over or something like that. In the end, mathematically it's all the same, and that's what makes this so beautiful.
[42:12]
B
With foundation models today, more and more of these are multimodal, where they can handle images, audio files, text, of course, video even. Are we eventually going to move to a world where tabular data or semi-structured data is also essentially one of those modalities?
[42:33]
C
I think so. I think that's where this is going, right? You cannot textify an image, you cannot textify a table. It just makes it kind of super hard to learn over that, right? But in principle you could, you could just write the image as a sequence of RGB values in ASCII and say, you know, tell me, is this a cat or not? Right? Like so. But the point is we figured out that we need special domain specific, modality specific encoders. And I think where this is going is now that, right? Basically we have these modality specific encoders on top of the reasoning based large language models, right? For images, for videos, and of course now also for this structured tabular data.
[43:14]
B
Yeah, I mean, even in audio we saw a huge performance increase when we stopped textifying audio and actually trained models directly against the audio, because there's so much nuance in audio versus what might be available in just the text. So when you textify it, essentially you run into problems where the model doesn't really understand certain pauses and stuff like that.
[43:37]
C
I think it's a beautiful example, and it also just shows that the two of us right now are just talking, so everything we say should be captured in words, but it's actually not. You see, when you work on the raw signal, on the raw speech signal, there is more information there. The performance goes up.
[43:54]
B
Yeah, absolutely. Well, Jure, is there anything else you'd like to share? What are, from your perspective, some of the big problems that need to be solved in this space?
[44:04]
C
I think the exciting part is that I feel like the structured data world has been kind of a bit left behind. Right. People know how to run SQL, and then for everything else they feel like they need to build manual machine learning models super painfully and super slowly. And I think, from what we discussed today, the exciting thing is that we now have foundation models for structured relational data; Kumo RFM is one such example. We have this approach of relational deep learning. We have the transformer architectures that can now learn directly over this structured tabular data. So actually the structured data world, which has humongous business impact, is ready for the AI revolution to take place there as well. And, you know, today we shed some light on it, and that's what I'm very excited about.
[44:51]
B
Yeah, absolutely. Well, thank you so much for being here. This was really, really interesting.
[44:55]
C
Yeah, amazing. Thank you for having me.
[44:57]
B
Yeah. Cheers.