
NVIDIA RAPIDS is an open-source suite of GPU-accelerated data science and AI libraries. It leverages CUDA and significantly enhances the performance of core Python frameworks including Polars, pandas, scikit-learn and NetworkX.
Shawn Falconer
In an upcoming special podcast miniseries, Software Engineering Daily sits down with recipients of the Turing Award, the most prestigious honor in computer science, to explore their lives, achievements, stories and insights. What inspires these innovators who have transformed the field of computer science, and how do their groundbreaking ideas continue to shape technology? Today, we delve into pioneering work in programming languages, breakthroughs in computing performance, revolutionary advancements in chip architecture, and more. Join us this March and April for rare and thoughtful conversations with Turing Award winners and learn about some of the most influential breakthroughs in computer science.
NVIDIA RAPIDS is an open-source suite of GPU-accelerated data science and AI libraries. It leverages CUDA and significantly enhances the performance of core Python frameworks including Polars, pandas, scikit-learn and NetworkX. Chris Diote is a senior data scientist at Nvidia and Jean Francois Puget is a Director and Distinguished Engineer at Nvidia. Chris and Jean Francois are also Kaggle Grandmasters, which is the highest rank a data scientist or machine learning practitioner can achieve on Kaggle, a competitive platform for data science challenges. In this episode they join the podcast with Shawn Falconer to talk about Kaggle, GPU acceleration for data science applications, where they've achieved the biggest performance gains, the unexpected challenges with tabular data, and much more. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Shawn Falconer
JFP and Chris, welcome to the show.
Chris Diote
Thanks for inviting us.
Jean Francois Puget
Thank you.
Shawn Falconer
Yes, absolutely. I'm excited to get into this. I think we have a lot to cover, but I wanted to start off by talking about Kaggle and being a Grandmaster, which is a distinction that I believe both of you have. For those who are unfamiliar with this concept, can we start there? What does it mean to be a Grandmaster?
Chris Diote
Yeah, so Kaggle for me means many years of entertainment. I've been participating for six years. It's an online community for data science, and there are currently over 20 million users. On this platform you can engage in conversations, you have access to Jupyter notebooks, you can share code, host data sets, and you can also compete in competitions. On the website you can earn achievements and gain titles. And yeah, you've heard people say Kaggle Grandmaster. So what is that? That's one of the titles you can gain. It's the best title you can acquire, and you can actually become a Grandmaster in four categories: discussions, notebooks, competitions and data sets. The most desired one is Competitions Grandmaster. To achieve that, you need to win five gold medals in five separate competitions, and one of them has to be solo, meaning you won it by yourself. And the competition is incredibly difficult. People are competing from around the world, and typical competitions have thousands of participants, so it's very hard to obtain. I think there are only a couple hundred Kaggle Competitions Grandmasters in the world. It's an amazing thing. And then, as I said, you can also get the other Grandmaster titles. So you might have heard the expression double Grandmaster or triple Grandmaster. That's someone who has also received awards for discussions or notebooks and acquired another Grandmaster title. You can sort of stack them up.
Jean Francois Puget
When you cite them, I would add that Competitions Grandmaster is based on your merit, on the quality of the models you build. The other ones are based on community votes, so it's a bit different. I would also define Kaggle as a legal drug. The adrenaline really flows when you compete. It's really addictive. So when you start, you can't stop.
Shawn Falconer
Can you talk a little bit more about the competitions? What does a competition consist of? Is this something that's happening live, or is it more that a problem goes up, people asynchronously put time into it, and you have to try to solve it within that timeframe and come back with a solution?
Jean Francois Puget
A typical competition is like a short data science or machine learning project. You're given some data. Kaggle curates a data set and Kaggle also curates a question. For instance, you want to predict next month's sales for a retail chain, and the data is past sales by product, by store, what have you. Or it could be image classification, like medical image diagnosis: is this a cancer or not? And you have images to train on. The duration is typically three months. Basically you have to submit some code that will be run on a hidden test set, and then a number is computed from your predictions. That's the score. During the competition you see a leaderboard with the score computed on part of the test set. After the competition, they compute the final score on the rest of the test data. The reason they do this is to avoid what is known as overfitting, so they make sure they select good models and not lucky ones. There is another form of competition they call analytics, which is a bit different. It's also a form of data science. You're given some data and you have to find an interesting story in it. And then it's judged by a human jury.
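A minimal sketch of the leaderboard split described above, in plain Python. The 30% public fraction, the toy labels, and the function names are assumptions made for illustration; they are not Kaggle's actual settings or code.

```python
import random

def split_leaderboard(test_ids, public_fraction=0.3, seed=0):
    """Split hidden test rows into a public part (scored during the
    competition) and a private part (scored once, at the end)."""
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = round(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])

def accuracy(predictions, labels, ids):
    """Fraction of correct predictions on the given subset of row ids."""
    return sum(predictions[i] == labels[i] for i in ids) / len(ids)

# Hypothetical hidden test set: 10 rows with binary labels.
labels = {i: i % 2 for i in range(10)}
# Hypothetical submission: correct everywhere except rows 8 and 9.
predictions = {i: (labels[i] if i < 8 else 1 - labels[i]) for i in range(10)}

public_ids, private_ids = split_leaderboard(labels)
public_score = accuracy(predictions, labels, public_ids)    # visible on the leaderboard
private_score = accuracy(predictions, labels, private_ids)  # decides the final ranking
```

Because the final ranking uses rows that never influenced the visible leaderboard, tuning a model against the public score alone does not guarantee a good final result.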
Shawn Falconer
So Chris, who actually creates the competition? What sort of expertise do they need to be able to create these competitions for, presumably, some of the people that are the best in the world at this particular job?
Chris Diote
They're sponsored by actual businesses. So a business approaches Kaggle with a certain challenge they need solved. Maybe a university wants a model that can read student essays and assign scores to them. So they approach Kaggle and say, I would like you to host the competition. They put up money, they put up cash prizes, and then Kaggle helps curate the data and does all the infrastructure and logistics. But it is interesting to note that it actually starts from a business need. So the competitions are real problems. When the company gives out the prizes, it receives the code of the top solutions, and oftentimes it will immediately implement that code. So it's nice to know that when you're competing, you're helping a good cause and your code can actually be used to solve a real-world problem afterwards.
Shawn Falconer
And then what do you personally get out of participating in these beyond just the satisfaction of a job well done?
Jean Francois Puget
I'd say the main thing is you learn. You learn both from the problem, by reading relevant papers, blogs or code bases, and from the community. So if you want to know the state-of-the-art models for a given topic, the best thing is to enter a Kaggle competition on that topic, and you will learn from the top teams. The prize winners usually have to disclose what they did to get the prize, so you can learn from that as well.
Shawn Falconer
And then how did both of you get involved with this? And for those that are maybe interested in dabbling, how would you get started?
Chris Diote
A friend recommended it to me six years ago. But I would say that the purpose it played in my life was the learning process. My formal training is a PhD in mathematics with a specialization in computation and simulation. Then I started learning data science on my own. And after I learned all the ideas, I wanted a way to practice, to test it out, to build some models and to talk with people. And then someone said, hey, do you know about Kaggle? So for me it was wonderful. Immediately I met people to talk with. There were problems to solve, there were competitions, playground competitions. So really I went to it for the learning process. And then, as JFP said earlier, it's highly addictive. Once you're there and you get involved in a comp, or just talking with people, it's tons of fun. Then you're checking it all the time and participating more and more. But it's really helped my learning tremendously.
Jean Francois Puget
Yeah, for me it's a bit different. I did a PhD in machine learning, but in the previous millennium, so it's irrelevant now. Then during my professional life I worked on something else, mathematical optimization. At my previous employer, people saw I had some machine learning background, so they said, oh, why don't you go back to this? We are developing tools. And I said, where can I find an update on the current state of the art in machine learning practice? I found Kaggle, watched a little bit, then jumped in the water and got hooked. And again, for anyone developing machine learning and data science tools, it's a wonderful place to see how things work and what the needs are today.
Shawn Falconer
Yeah, I totally get the addiction component of this. I was never involved in these types of competitions, but I did compete in things like TopCoder and the ACM ICPC programming competitions through university, and I became completely addicted to that experience. Even though they're not necessarily business-driven problems, one of the things I got out of participating is that it made me a lot more comfortable with software engineering. Because I was putting so much time into the act of practicing, all that coding really sharpened my skills and made me way more employable than I was before, even if it wasn't directly the type of problem you'd be solving day to day at work. I'm curious, how has the experience of competing in these types of competitions translated into your day job, and how are you leveraging some of those modeling techniques and other things that you've learned in your day-to-day work?
Chris Diote
So participating on Kaggle has just taught me so much and made me a better data scientist. I've learned tons of new techniques and how to do things correctly. I had read a lot of books, but there are a lot of techniques that you learn on Kaggle that are not really in textbooks yet, and oftentimes a lot of new things. I think even gradient boosted trees was developed on Kaggle. I mean, it is on the fringe of research. It's the latest ideas, you're learning the best techniques, and you also get a chance to work on problems in all different domains, from computer vision to natural language processing to tabular data. So there's all that exposure. And one thing I'll add too: the way they set up a competition with a hidden test set, you really have to make your model generalize to unseen data, which is one of the most important things when building models in the field of data science. So with all the skills you learn, plus repeatedly learning to make models that truly generalize, when I'm inside Nvidia building models and working on projects, all that knowledge just comes, and it all benefits what I do.
Jean Francois Puget
I would add that we both got our jobs at Nvidia because we were Kaggle Competitions Grandmasters. So that's also a nice outcome of all the learning we got. And we have like 15 or 16 Kaggle Grandmasters at Nvidia now.
Shawn Falconer
Yeah, I think that's something that I always think about and recommend, even going back to my own experience in things like TopCoder and ACM ICPC competitions: top-tier companies are paying attention to these types of competitions. So if you're interested in the space or just starting your career in it, competitions like this are a good way to not only learn and build up your own skill set, but sometimes a company might come up to you because you participated in something like this or you've done well in it. And even beyond being ranked in the top 10 in the world, which not everyone is going to achieve, it shows that you have a passion for the space and that you're pushing yourself, learning and working at it. That is also really attractive to companies. So I wanted to talk a little bit about Nvidia and the RAPIDS platform. This is an open source suite of GPU-accelerated data science and AI libraries. So first of all, what problem is this helping data scientists with, and how does that set of libraries actually work? Chris, maybe let's start with you.
Chris Diote
Okay, so it's a whole suite of libraries and it helps with a whole variety of tasks, but its main goal is to speed up all sorts of things. The two libraries that I work with the most are cuDF and cuML. cuDF helps with all your data frame needs. It has an API similar to pandas with the same functionality, and with it you can speed up all your data frame work. I should probably take a step back and say, in my opinion, what role it plays. Today all companies are getting more and more data, and it's getting harder and harder to process all of it. So even for things like computing statistics or doing data frame work, we need to run faster. That's where cuDF comes in. Basically it does all the computations on GPU, and it can be a hundred times faster than using other libraries. And as we move forward with more data, it's going to be getting faster and faster. So that's great. I also use cuML a lot, which has similar functionality to scikit-learn. It does machine learning models, but once again it trains all these models on GPU and does it much faster. So if you're doing tasks requiring support vector machines, KNN and other models like this, it can train the models hundreds of times faster. So basically I would say it helps with things that maybe we've been doing all along, but because it moves the processing to the GPU, it's incredibly faster. And if you're working on experimentation, or you're iterating and trying to make more accurate models, or just trying to get your work done quickly, I would say it's becoming a necessity with how data and everything is growing.
Shawn Falconer
In terms of the GPU acceleration that's happening, what needed to happen in order to make it so that you could do things like KNN, for example, on GPUs rather than traditional CPUs?
Jean Francois Puget
As Chris said, RAPIDS is more than that, but it's also the GPU-accelerated version of pandas, Polars and scikit-learn. There is more: there is graph, there is signal processing. But recently, over the last year, we made the move from CPU-based to GPU-based seamless. If you have nice pandas code in your notebook, you just have to load an extension at the start and then all your code will be GPU accelerated seamlessly. You don't need to change any line. And more recently we did the same with Polars. So this is a way for people to experiment very, very easily with what they gain from moving to GPU.
Shawn Falconer
Is there, I guess, a cost associated with that that you have to take into account, given that this is running on GPUs?
Chris Diote
I guess the cost would be just that, obviously, you need a GPU to run on a GPU. But a lot of modern systems have both a CPU and a GPU. So I think for most people it's a matter of just flipping a flag, and you'll immediately get a speedup because it'll just use your machine's GPU.
Jean Francois Puget
Yeah, I agree. When we say cost, my answer was about the cost in data scientist time. We reduced this to the bare minimum. But there is still a compute infrastructure cost that remains.
Shawn Falconer
Besides getting, you know, 100x better performance, does the fact that you can do these things so much faster, so that you're shortening the learning cycle, also change how you think about building models, or how it might impact your existing work?
Jean Francois Puget
Yeah, I see data science and machine learning as an experimental science, just like physics. Ideally, to build a good model, you have a baseline and you want to improve your design. You say, oh, I have this idea, maybe more data, maybe different parameters, what have you. You design an experiment to test whether the change is really an improvement, then you run it and look at the results, and depending on the result, it becomes your new baseline or not. So if you can do this faster, you will try more ideas, and that will lead to a better model just because you can perform way more experiments in a given time.
Shawn Falconer
And then does that also change, from an experimental standpoint, the types of models that you might be able to try on a given data set, because now you're less worried about how long it's going to take to train something and you can move much faster?
Chris Diote
Yeah, it absolutely does allow you to do new things. I guess there are two things it enables you to do. JFP pointed out that you can do experiments faster, so you can do what you were previously doing, but do it better because you can try out more things. But it's actually doing a second thing: it's allowing people to do things they were not previously able to do at all. For example, you could use KNN, or you could take tabular data, push it through UMAP to create features, then put that into an image model, and build these kind of weird pipelines. Back in the day, running on CPU, you really couldn't use some of these models because they were way too slow. We recently saw a colleague win a competition where he used a combination of deep learning and machine learning. A deep learning model has a backbone, which generates features, and a head, which then does the regression. Because cuML has made machine learning models so fast, he was able to take the features out of the deep learning model and train support vector regression on them, and he was able to do this cycle over and over so fast because of the new speed that in the end his model won first place. It was a hybrid: a deep learning model fused together with a support vector regression head. Besides these hybrid models, there have been other advances in feature engineering. So there are a lot of new techniques I'm seeing that are a direct result of having this speed, and we can do some new model designs and some new techniques.
Jean Francois Puget
And another thing we can do is just run deep learning models on tabular data. There are a lot of papers claiming it's the best. Usually that's not what we find. Gradient boosted trees like XGBoost, LightGBM and CatBoost, all GPU accelerated, still outperform. But when you ensemble, so you take one of these and you take some transformer or some other deep learning model and blend the predictions together, you improve over a single model. On Kaggle it's used a lot. Of course, running deep learning on CPU only is not great.
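A minimal sketch of what blending predictions means, in plain Python. All numbers are hypothetical and chosen so that the two models' errors partly cancel; real blends only help to the extent that the models make different mistakes, and the equal weights are an assumption.

```python
def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return (sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)) ** 0.5

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' predictions, row by row."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]

target = [10.0, 20.0, 30.0, 40.0]
gbdt_pred = [12.0, 18.0, 32.0, 38.0]  # hypothetical boosted-tree predictions
nn_pred = [7.0, 21.0, 27.0, 41.0]     # hypothetical deep learning predictions

blended = blend(gbdt_pred, nn_pred)
scores = {
    "gbdt": rmse(gbdt_pred, target),
    "nn": rmse(nn_pred, target),
    "blend": rmse(blended, target),
}
```

In practice the blend weights are usually chosen on validation data rather than fixed at 0.5.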
Shawn Falconer
Yeah, definitely. You mentioned this hybrid approach. With a lot of data science work, or traditional data science work, we think about predictive ML, and now there's a lot of focus on generative AI and generative deep learning techniques, stuff like that. Do you think that because so many people are excited about what's happening in generative AI, and there's so much hype around it, we sometimes lose sight of the fact that predictive ML can still do a lot of useful things? We maybe try to throw too big a model at something that we could actually solve with a simpler, bespoke, trained predictive model.
Jean Francois Puget
I would say it depends on what you want to predict. With generative AI, as the name indicates, you generate something: text typically, or images with diffusion models. So if that's what you need, of course that's what you should do. But if you want to forecast your sales for next quarter, you need to predict numbers. So classical machine learning, meaning regression or classification models, deep learning or not, is still the way to go in that use case. That being said, we do find that if your input is text and, for instance, you want to do text classification, say spam detection or classifying into a few categories, you can use a generative model, an LLM, but only take one token: just ask it to output one of a few options. This is a great classifier, and that's quite interesting because you benefit from all the investment and progress in LLMs.
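A minimal sketch of the one-token idea, in plain Python. It assumes you can read the scores (logits) the LLM assigns to each allowed answer token at the next position; the logit values, the labels, and the function name are hypothetical, and no real LLM library is called here.

```python
import math

def classify_from_option_logits(option_logits):
    """Softmax restricted to the allowed answer tokens.

    Returns the most likely label and the probability of each label."""
    top = max(option_logits.values())
    exps = {label: math.exp(value - top) for label, value in option_logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical next-token logits after a prompt ending in
# "Answer with exactly one word, spam or ham:"
option_logits = {"spam": 2.0, "ham": 0.0}
label, probs = classify_from_option_logits(option_logits)
```

Restricting the softmax to the allowed tokens gives a probability per class, which can be thresholded like the output of any other classifier.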
Shawn Falconer
The global developer talent shortage is expected to grow to 4 million in 2025, further contributing to developer burnout. With the security and talent shortage growing rapidly, businesses need effective tools to help developers work efficiently and securely. That's where Bitwarden comes in. Bitwarden delivers trusted open source security solutions that empower your developers and security teams to securely manage and share sensitive information online. Protect your infrastructure secrets, API keys, user passwords, mailing addresses, credit cards, passkeys and more with easy-to-use and enterprise-ready Bitwarden solutions. Start your free trial today at bitwarden.com.
Shawn Falconer
So you're talking about essentially instructing the generative model to produce an output that's within a specific range? Like if I only want a value between 0 and 1, based on the probability that this thing is part of a particular category.
Jean Francois Puget
This is beating the encoder-only models like DeBERTa and RoBERTa. The LLMs are much larger, so it's no surprise, but they do improve.
Shawn Falconer
Chris, did you have any thoughts on this?
Chris Diote
Yeah. So your question was about whether people are throwing too big a model at it, and you are right: now with LLMs, because they're getting better and better, it's more tempting to do that. But this has been an age-old problem. I always see this. A lot of people just throw the biggest model they can. But I think it's always been the case that we should try the simple models, and I love doing it because it's really fun. There are definitely times when the simple model can outperform. That's exciting. A lot of times the big model only does as well as a little model, and then it's inefficient. You don't want to use more compute than you need to. So when I'm given a problem, in the early stages I actually like to try a whole range of models, simple and even complex, and lately I have even been throwing LLMs at every problem I can, just to say, can it do this? Can it do this? But you try the whole range, and then in the end I generally try to go with the smaller models, the simpler models.
Shawn Falconer
Yeah, I like the hybrid approach too, where you could use a particular model on the backbone of some generative AI model to essentially check the answer and then do those iterative steps. So I want to talk a little bit about tabular data prediction. Can you talk a little bit about why this is such a challenging problem? Why have people been focused on it and interested in it for such a long time?
Jean Francois Puget
Yeah, so it's really a good question, because we see deep learning becoming the way to go for all sorts of data modalities except tabular data, where the jury is still out. There are many reasons, but it depends on which tabular data. If the data is measured from a physical process, like weather data, a deep learning model is likely to be better. So my hypothesis, and it's not science here, is that if the data is sampled from the physical world, the physical world is smooth and deep learning will work well. If it's sampled from human decisions, like people's behavior of some sort, sales forecasting or what have you, it's much more discrete and, I would say, chaotic in the scientific sense, so hard to predict. And smooth models like deep learning models are not as good as, say, gradient boosted trees, which can handle discontinuity very naturally. But that's just one angle. Chris, you may have another one.
Chris Diote
I've actually been fascinated by this particular question for a very long time. A lot of researchers have been wondering about it, because we saw a transformation in computer vision and natural language processing starting about a decade ago. Before that, in computer vision, humans would actually engineer the features. They would take images, process them, extract features, and then put those through a machine learning model like a support vector machine. And they did similar things with text. But then we invented deep learning, and deep learning, totally on its own, does the feature engineering and does the prediction. That has revolutionized computer vision and natural language processing. You can download pretrained models and fine-tune them. But that's yet to happen in tabular data. I would say the best tabular data models still involve human handcrafted features, where we make new columns. So I'm particularly curious: will the day come when there's some sort of deep learning model that can digest a variety of different tabular data frames and essentially engineer features on its own? I think the reason it's challenging is that the data is much more varied. Images all share the fundamental building blocks of lines and shapes, and text has the fundamental building blocks of words. But what is the fundamental building block of tabular data? Here are statistics from a finance company, here's medical data. The data is so different. It's going to take something that can see how it is all similar, what the common theme is. Maybe the common theme is some kind of cause and effect, or logic, or reasoning. But some model has to understand it and find all this similarity, and then maybe it can engineer features on its own and use past learnings to help with future problems.
Shawn Falconer
And because the data sets are so varied and different, could you also end up in a situation where, even if you had a ton of tabular data to train on, the model might not actually tune itself to recognize the patterns? The pattern recognition has less to do with the data and more to do with the structure, the fact that it's organized in rows and columns. You essentially end up biasing the model in what it's trying to predict, and overfitting the model against the wrong pattern.
Jean Francois Puget
The latter is not happening. When you have an image, you have pixels arranged in 2D. If you have video, 3D. If you have text, you have numbers in one dimension. Same for audio. So it's a very regular organization of the data, and you can train a model once and it works, hopefully, with a lot of the instances. Tabular data, sure, it's 2D, but sometimes the columns are independent, so you can shuffle them. Sometimes they are not, as in time series. Sometimes there is a huge correlation between some columns and not others. So as Chris said, the format is not specific enough at this point. Maybe at some point people will have trained a model on every tabular data set available online and claim it's a foundation model. Actually, some people do claim they have foundation models for tabular data, but on Kaggle we don't find they are the best models yet.
Shawn Falconer
How do boosted trees work on this type of problem? I'm less familiar with that. Is that a variation of the decision tree?
Chris Diote
Yeah, so it's actually an ensemble. It's a linear combination of multiple decision trees. Boosted trees just repeatedly build decision trees. Each new decision tree trains on the previous cumulative error and tries to reduce it. So it keeps adding a new tree, and the purpose of the new tree is to reduce the error a little bit more. And then in the end you basically take an ensemble of all the trees, and that's what it is.
Jean Francois Puget
You could see it, in deep learning terms, as gradient descent. But at each update you don't update existing weights in a model, be it a linear regression or deep learning. You add a tree that implements the gradient delta. And people think it's a recent technology because the first useful implementation is XGBoost, which is only about 10 years old, so quite recent. But the theory was published 25 years ago, more or less.
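A minimal sketch of that idea in plain Python, assuming squared-error loss, a single numeric feature, and one-split trees (stumps). The toy data, the learning rate, and the function names are illustrative assumptions; this is not how XGBoost is implemented.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def fit_boosted_stumps(x, y, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly add a stump fitted to the current error."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)

def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data with a jump between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
model = fit_boosted_stumps(x, y)
baseline_mse = mse(lambda xi: sum(y) / len(y), x, y)
boosted_mse = mse(model, x, y)
```

Existing trees are never modified. Each round only adds a new tree, which is the "add a tree instead of updating weights" step described above.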
Shawn Falconer
Are there libraries within RAPIDS that help with doing some sort of feature engineering?
Chris Diote
Yeah, absolutely. I think that's another advantage of the speed of RAPIDS. We just discussed how cuDF is one, and then newly there's cuDF pandas and cuDF Polars. But specifically to improve model accuracy, what you often do is, given a data frame, you make new columns that are transformations or combinations of old columns. That's what feature engineering is. It's done sort of manually, but with cuDF, which operates on data frames, and the speed at which it works, you can systematically go through a whole set of transformations. Like, let's randomly pick pairs of existing columns, combine them together, target encode the result to make a new column, and then see if that improves the model. So you can basically build these for loops where you systematically go through the typical things that humans would try, and then you train a model and see if it improves. And actually I recently won a competition doing just that. It was a Kaggle playground competition. You had to predict insurance premiums, maybe like car insurance, the annual premium. And basically I just set my computer running overnight, and it actually tried tens of thousands of combinations. The original data set had 23 existing columns, and I randomly picked groups of 2, 3, 4, 5 or 6. I combined them, target encoded the result using cuDF, and then trained a model to see if it improved the validation score, and it just kept going. In total there may have been something like 150,000 combinations to try, and it just randomly tried them. It found hundreds that worked successfully, and then I added them to my final model, and it boosted the score tremendously, so much so that there was even a gap with second place. This was only made possible by the speed. If I had tried to do those data frame operations with a CPU library, the search would literally have taken months, so it would never have finished. But this search just happened overnight. So absolutely, this speed is allowing us to actually do some automated feature engineering.
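The figure of roughly 150,000 combinations can be checked with a short calculation: it is the number of ways to choose 2, 3, 4, 5 or 6 columns out of 23. The sketch below also shows one way to draw random groups; the column names and the sampling scheme are assumptions for illustration, not the code used in the competition.

```python
import random
from math import comb

n_columns = 23
group_sizes = range(2, 7)  # groups of 2, 3, 4, 5 or 6 columns

# Number of distinct column groups of each size, and in total.
counts = {k: comb(n_columns, k) for k in group_sizes}
total = sum(counts.values())

# Hypothetical column names, and a simple way to draw a random candidate group.
columns = [f"col_{i}" for i in range(n_columns)]
rng = random.Random(0)

def random_group():
    """Pick a group size uniformly, then that many distinct columns."""
    size = rng.choice(list(group_sizes))
    return tuple(sorted(rng.sample(columns, size)))

candidates = {random_group() for _ in range(1000)}
```

Note that choosing the size uniformly first makes small groups more likely than they would be under uniform sampling over all 145,475 groups.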
Jean Francois Puget
I would expand on something Chris mentioned. He built on top of RAPIDS, but he used a built-in component called target encoding. In tabular data you have basically two types of data. One is just categories, you know, a color. The others are ordered numbers, the weight of someone or whatever. Categories are very hard to manage for algorithms like linear regression or support vector machines, and basically you have to create additional data. It's called one-hot encoding: one column for each possible value. Then they can be combined linearly. It's a pain because it expands your data tremendously, and you have to use a sparse implementation. It exists in cuDF in RAPIDS. But there is another way which is smarter. Say you have a category with five values and you want to predict some numerical value. You just average, for each category value, the target you want to predict. This gives you an indication of how good this value is. This can be done automatically. It's called target encoding. But if you do it the way I just described, you overfit, because you include the target. It's tricky. You need to use what we call out-of-fold prediction to avoid that, so you never use the target of a row to compute the value for that row. So there are ways to do target encoding for one row using other rows. This is built into RAPIDS. And then what Chris did was apply this target encoding to column combinations. And that's very useful.
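A minimal sketch of out-of-fold target encoding in plain Python. This is not the RAPIDS implementation; the fold assignment, the fallback to the global mean for unseen categories, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=5):
    """Encode each row with the mean target of its category, computed only
    from rows in OTHER folds, so a row never sees its own target."""
    n = len(categories)
    global_mean = sum(targets) / n
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate per-category statistics from the other folds.
        for i in range(n):
            if fold_of[i] != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the rows of the held-out fold.
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

# Toy data: predict a price from a color.
colors = ["red", "blue", "red", "blue", "red", "green"]
prices = [10.0, 20.0, 14.0, 24.0, 12.0, 30.0]
encoded = oof_target_encode(colors, prices, n_folds=3)
```

To apply this to column combinations, as described above, the category can be the tuple of values from several columns, for example `list(zip(column_a, column_b))`.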
Shawn Falconer
This episode of Software Engineering Daily is brought to you by Jellyfish, the leading software engineering intelligence platform. AI Codegen tools can be force multipliers for R and D organizations, but are you making the most of them? Join your peers on April 17th at Glow Live. It's a dynamic 90 minute virtual event that explores the transformative nature and potential impact of AI CodeGen solutions. At GLOW Live, you'll hear expert insights on navigating a constantly shifting landscape, adopting Codegen tools successfully and measuring their impact on your team, your work and your company's long term success. Register today at Jellyfish Co Glow and get glowing.
Chris Diote
So at the upcoming Nvidia GTC conference, we're actually giving a workshop where we're teaching this exact technique, you know, how to target encode, how to use cuDF to do that, and also some other encodings like count encoding. I hope that, yeah, let me advertise that. Y'all check it out, it's gonna be a great time. So we're gonna make the features, and then we'll also train some models and show how the features improve them. I would say for tabular data it's probably the most effective and powerful technique to improve your models. Time and time again it's been the key component to winning these Kaggle competitions. It's actually a hands-on workshop, so if you're there you can work along with us and follow the code, and it's going to be great. So I suggest that everyone checks it out. It'll be taught by some KGMON and some other Nvidians.
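Count encoding, mentioned above alongside target encoding, is simpler: each category value is replaced by how often it occurs in the column. A minimal sketch in plain Python follows. The function name and data are illustrative, not taken from the workshop material.

```python
from collections import Counter


def count_encode(values):
    """Replace each category with the number of times it appears in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


cities = ["paris", "lyon", "paris", "nice", "paris"]
city_count = count_encode(cities)
print(city_count)  # [3, 1, 3, 1, 3]
```

Unlike target encoding, this uses no target values, so there is no leakage to guard against.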
Jean Francois Puget
Yeah, awesome. I'm hoping to be there too, so hopefully I can participate in that. In terms of like even in competitions, how do you determine where you need to focus on making improvements to your output? It could be part of the feature selection process. It could also be part of the model that you're using. There's a lot of things that could go wrong and you only have limited time. So how do you figure out where to actually spend that time?
That's a great question, because basically you have some available budget. There is the time until the end of the competition when you can work on it, you may also have a compute budget, and you need to allocate those resources as wisely as possible. So what I do is I first make sure I have a good test harness, so I can really evaluate my model. Typically I start with a baseline: I create a cross-validation setup, try different baseline models, and submit to the competition to see if my cross-validation correlates with the score. If it does, then I don't need to submit too much and I can work with my local setting. And then between feature engineering, trying different models, and implementing a more complex workflow, it's a combination. You have some feeling based on past experience, you look for the low-hanging fruit, and you estimate the time it takes to code it and run it. There is a bit of luck. If you investigate the right thing first, you do better. And hopefully, after years of doing this every week, we gain some feeling for what might work first.
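The kind of test harness described here, a fixed cross-validation setup in which different baseline models are compared, can be sketched as follows. This is a plain Python illustration under stated assumptions: the helper names, the two toy baselines, and the synthetic data are all hypothetical, and a real setup would plug in actual models and the competition metric.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle row indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_rmse(fit, X, y, k=5):
    """Average held-out RMSE over k folds; fit(X, y) must return a predict function."""
    scores = []
    for held_out in kfold_indices(len(y), k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors = [(predict(X[i]) - y[i]) ** 2 for i in held_out]
        scores.append((sum(errors) / len(errors)) ** 0.5)
    return sum(scores) / k


def fit_mean(X, y):
    """Baseline 1: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean


def fit_line(X, y):
    """Baseline 2: one-variable least-squares line."""
    mx, my = sum(X) / len(X), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(X, y)) / sum((a - mx) ** 2 for a in X)
    return lambda x: my + slope * (x - mx)


def demo(seed=1, n=200):
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = rng.uniform(0, 10)
        X.append(x)
        y.append(3.0 * x + rng.gauss(0, 1))  # hypothetical linear target with noise
    return cross_val_rmse(fit_mean, X, y), cross_val_rmse(fit_line, X, y)


cv_mean, cv_line = demo()
```

Because both baselines are scored on the same folds, their numbers are directly comparable. The remaining step described in the episode, checking that these local scores move together with the leaderboard score, needs real submissions and is not simulated here.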
Yeah, so you start to build an instinct essentially when you see something that maybe feels like it's underperforming and you can understand where that problem might be.
Chris Diote
Yeah, absolutely. So I've been in 80 competitions in the last six years and I have such strong intuitions. You know, I'll train a model, I'll look at its output, and it'll always be getting something wrong. And I oftentimes know exactly where to look. You really start to get a sense of how you should alter the model architecture or the training procedure, or how you should augment the data. Yeah, it's amazing. I actually always make analogies. Before Nvidia, I was a teacher at a university, and I'm always making the analogy that, for me, training a model is actually teaching a student. And you get better with time as a teacher, right? You learn how to listen to your students. So you teach a student and then you have them do a problem and you watch how they do it, and then they get the wrong answer. But you look at their work and you see, oh, I see, they just forgot to divide by two here. And you start to learn what the common mistakes are. And then, you know, when you talk to them again, you have to emphasize the divide by two. And it's the same thing with the models. I'll see models make common errors, and I'll sort of know how to address them. I know how to change things.
Jean Francois Puget
I would just add that I also don't rely on automated optimizing tools. I see people using Optuna to tune parameters. I always do it by hand, because that way I get some intuition. I learn from my experience. If I rely on a black-box optimizer, maybe I will learn how to use the optimizer better, but I will have no understanding of what works under the hood.
In terms of where data science tooling is going. If you can make one prediction, I guess, where are things moving, progressing? Where do you think the next big breakthrough is going to come from?
Chris Diote
I would say one thing, and we're already seeing it, is how large language models are going to completely change the workflow. Basically, we're going to be working together with them. So already we see them helping write our code. We see, you know, copilots: people ask them questions, and they can suggest ideas. So take a project from start to finish. At the start, a company comes to you with a certain task: here's our data, we want you to be able to predict this. And at the finish: here's the finished model and here's what it does. And, you know, there are all these different steps and different roles of people involved in the process. But more and more we're going to see LLMs get involved in all the different steps of the process, from, you know, the beginning, EDA, to writing code at various points, giving suggestions, maybe even taking charge of an experimentation cycle and then running experiments on their own and changing things. So it'll be really exciting to see how they'll be utilized more, and humans will be working, I think, together with language models in the whole process of building a final model.
Jean Francois Puget
Even around generating test data is massively useful.
Yeah, test generation, definitely. I will come back to this. But I would add to what Chris said: with Kaggle, we focus on the modeling part of data science and machine learning. But getting the data, and then using the model we create in production, requires coding, which is something data scientists may not be good at. And if you hand over to software developers who are not good at machine learning, you also lose something. So maybe we'll see LLMs used as coding assistants gaining traction. They could be assistants for data scientists, writing the code they don't want to write, to connect what comes before and after. Back to generating test data. We already do it for text and images. There are people working on it for tabular data, but to me, generating tabular data is not mature enough, except if you model some physical phenomenon. On Kaggle, there were a number of competitions running with synthetic data for astrophysics, for particle physics. And there the simulator, the data generator, was great because it was based on physics principles. There are the playground competitions, but Chris is doing more of these. I always have a bit of a fear that modeling means reverse engineering the data generator. But Chris, you may disagree here. I don't know.
Chris Diote
Yeah, what he's referring to is this. Kaggle has increased their frequency, so every month they're offering a new playground competition. And it's very hard to offer competitions that often, because the most difficult thing is getting data sets. So recently they've been using synthetic data sets that are generated by LLMs. So LLMs essentially make the data. And the risk has always been that when a data set is synthetic, you know, you can actually sort of reverse engineer it, because somehow it's making new data with a target.
So if you can think how it thinks, and work out how it assigned targets and how it made new data, then you don't have to actually forecast the insurance price. You just have to figure out how the data is made. And you do see this from time to time: people figure this out and they win comps because they've reverse engineered some process. Yeah, it's something we have to be careful of. I think as time goes on the synthetic data is getting to be of higher quality, but there are still artifacts that you can take advantage of a little bit. That's always a risk when using synthetic data.
Jean Francois Puget
Yeah, absolutely. Well, we're coming up on time. JFP, Chris, I want to thank you so much for being here. I thought this was really, really interesting. Hopefully we'll see each other at the workshop at Nvidia.
Thank you for inviting us.
Chris Diote
Yeah. Look forward to meeting you in person, Shawn, at the conference.
Podcast Summary: NVIDIA RAPIDS and Open Source ML Acceleration with Chris Deotte and Jean-Francois Puget
The episode kicks off with Shawn Falconer introducing his guests, Chris Deotte and Jean-Francois Puget, both of whom are NVIDIA employees and Kaggle Grandmasters. The conversation begins with an exploration of Kaggle—a premier online community for data science competitions.
Both guests share how their involvement in Kaggle competitions significantly contributed to their professional journeys, including securing positions at NVIDIA.
The discussion transitions to NVIDIA RAPIDS, an open-source suite designed to accelerate data science and AI workflows using GPUs.
Chris and Jean-Francois delve into how RAPIDS facilitates automated feature engineering, a critical aspect of improving model accuracy.
A significant portion of the conversation addresses why tabular data remains a challenging domain for machine learning, contrasting it with the successes of deep learning in fields like computer vision and natural language processing.
Looking ahead, the guests discuss the evolving role of large language models (LLMs) and generative AI in data science workflows.
As the episode wraps up, Chris and Jean-Francois highlight upcoming events and resources for listeners to further engage with NVIDIA RAPIDS and advanced data science techniques.
Final Thoughts:
This episode of Software Engineering Daily offers a comprehensive exploration of how NVIDIA RAPIDS is revolutionizing data science through GPU acceleration. Chris Deotte and Jean-Francois Puget provide invaluable insights into the synergy between competitive platforms like Kaggle and professional applications, emphasizing the importance of speed, automation, and community in advancing machine learning practices. As the field evolves with the integration of large language models and generative AI, tools like RAPIDS will undoubtedly play a central role in shaping the future of data-driven innovation.