
Guests are Avin Regmi and David Shah from Spotify. We spoke to Avin and David about their work building Spotify's machine learning platform, Hendrix. They also talk specifically about how they use Ray to enable inference and batch workloads. Ray was featured on episode...
Kaslin Fields
Hello and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields.
Mofi Rahman
And I'm Mofi Rahman.
Kaslin Fields
In today's episode, our AI correspondent Mofi Rahman talks with Avin Regmi and David Shah from Spotify about their work building Spotify's machine learning platform, Hendrix. They also talk specifically about how they use Ray to enable inference and batch workloads. Ray was featured on episode 235 of our show, so make sure you check out that episode too. But first, let's get to the news.
Mofi Rahman
IBM has acquired the Kubernetes cost management and optimization startup Kubecost. Kubecost has achieved considerable success, being used in production at companies such as Allianz, Audi, Rakuten, and GitLab. In their blog post about the acquisition, the Kubecost team framed it as joining forces with Apptio and Turbonomic, two other acquisitions IBM has made over the last few years that focus on cost and performance optimization. The team also emphasized that they anticipate no interruptions to the products and services they offer during the transition.
Kaslin Fields
The Cloud Native Community Japan, a subgroup of the CNCF, has announced that KubeCon Japan will be held in 2025. While Japan has hosted KubeDay events in the past, usually alongside Open Source Summit Japan, this will be the first time a KubeCon event is held in Japan. The document shared with the announcement states that the event is expected to feature two main conference days, over 100 sessions, and over 1,000 attendees. The event dates and location have not yet been announced.
Mofi Rahman
The call for proposals for KubeCon EU 2025 is now open until November 24th. While the CNCF used to open KubeCon EU CFPs after KubeCon NA, the CFP opening has been moving earlier with the introduction of KubeCon India in 2024 and KubeCon Japan in 2025. There will be lots of KubeCons to apply to next year, so keep an eye out for those CFPs.
Kaslin Fields
Artifact Hub has become a CNCF incubating project. The project is a web-based application for finding, installing, and publishing cloud native packages and configurations. Discovering useful cloud native artifacts like Helm charts can be difficult with general-purpose search engines. Artifact Hub provides an intuitive, easy-to-use solution that allows users to discover and publish multiple kinds of cloud native artifacts from a single place.
Mofi Rahman
OpenMetrics has been archived and merged into Prometheus. In July 2024, the Technical Oversight Committee of the Cloud Native Computing Foundation approved and signed off on archiving OpenMetrics and migrating it under Prometheus. As the author says in the CNCF's blog post, "OpenMetrics is dead. Long live OpenMetrics as Prometheus format."
Kaslin Fields
Kubecolor is an open source kubectl wrapper that adds colorful highlighting to your kubectl output. Originally developed by Hidetatsu Yaginuma, or hidetatz on GitHub, the project has recently been revitalized by Kalle, GitHub user applejag, and Prune Sebastien Thomas, GitHub user prune998. The newly released 0.4.0 version introduces even more fun and useful functionality, like highlighting logs output by kubectl to make them easier to read. The release also features new paging functionality for long output, contributed by Lennart, GitHub user Lennartac. If you use kubectl, you might consider giving kubecolor a try. And that's the news.
Mofi Rahman
David Shah is a senior engineer on Spotify's ML platform team. He has helped build and operate a centralized Ray platform that enables Spotify's ML practitioners to easily start prototyping their ideas and scaling their workloads. Before that, he worked on Spotify's core infrastructure for backend services, specifically on deployment tooling. Welcome to the show, David.
David Shah
Thank you, Mofi.
Mofi Rahman
Avin is an engineering manager at Spotify, leading the ML training and compute team for the Hendrix ML platform. His areas of expertise include training and serving ML models at scale, ML infrastructure, and growing high-performing teams. Prior to joining Spotify, Avin led the ML platform team at Bell, focusing on distributed training and serving. Additionally, Avin is the founder of Panini AI, a cloud solution that serves ML models at low latency using adaptive distributed batching. Welcome to the show, Avin.
Avin Regmi
Thank you. Thank you, Mofi, for having me here today.
Mofi Rahman
So to get started: we read your bios, but tell us a bit more about how you got involved with Spotify and the ML platform itself. Starting with you, David.
David Shah
Yeah, first of all, I'm very excited to be here. I've been working at Spotify for many years now and I've switched teams. Most recently I've worked on the current ML platform team for several years, helping them build out the infrastructure. I first got involved just because of my interest in the area. Recently it's been a very hot area of research and development for a lot of companies, and I wanted to see what all the buzz was about, to work with and learn exciting new technology, and to help Spotify apply it.
Mofi Rahman
And the same question to you, Avin. How did you get involved in the ML platform for Spotify?
Avin Regmi
For me, I joined Spotify almost two years ago. Before that I was with Bell AI Labs, where I also led their ML platform. It was a very interesting journey, because I actually got started in applied ML around 2016, but at that time the platform side, or productionization, was very different. Nobody really did much about it. So trying to productionize a bunch of the models I was working on led me to start my own startup, Panini, and eventually from there to Bell AI Labs and then to Spotify.
Mofi Rahman
So you mentioned a couple of things there that I want to dissect a little bit more. This is a personal interest of mine as well, because every time I meet someone who says they work on an ML platform, I ask the same exact question, and I seem to get different answers from different people. So, in a few sentences, how would you define what an ML platform actually is? Every team I speak to has a slightly different definition of what an ML platform means for them.
David Shah
For me, an ML platform is the tools, the SDK. It's essentially the infrastructure layer that our users build their actual applications on. All of our users are internal: other Spotify employees, like AI researchers and ML practitioners. So we abstract away things like getting compute resources or having to understand certain implementation details. I'll talk more about it later, but it's all built on top of Kubernetes, and we don't want or expect our users to know how to use Kubernetes in order to get access to lots of hardware accelerators or CPUs, or to know certain nitty-gritty details of, in this case, Ray. The platform basically wraps all of it up and makes it super easy for them to quickly get and use our computational resources for training, serving, or any other applications.
Avin Regmi
Yeah, and just expanding on what David said: Mofi, you said the answer is very different from person to person, and I think part of that also depends on the organization and where it is in its ML adoption journey. If you ask someone who is just starting off, or whose ML scope is relatively small, what an ML platform means to them will be very different from what it means at a larger enterprise supporting many customers and many users. So I think the need for an ML platform also changes from org to org, depending on the business case and the scale you're operating at.
Mofi Rahman
And Spotify has obviously been on this journey for a while. Not the ML platform, but the application itself uses a lot of machine learning and deep learning to understand user preferences and things like that, and Spotify has been doing this for over a decade now. How has that changed over time? Neither of you has probably been there from the beginning of the journey, but you have seen more of the evolution. It seems like it has become more of a unified thing now. How was the decision made internally to make it a unified platform for all of Spotify?
David Shah
Yeah, I've been on the platform team for a couple of years now. I don't have the institutional memory from before that, but I can talk a little bit as an outside observer who saw how it was used. Before there was an ML platform team, each team that wanted to do ML basically had to roll their own. One of the earliest was the team that built Discover Weekly, one of the most beloved use cases of ML at Spotify. I think they had to build everything, from the model architecture all the way down to how to get compute resources and how to schedule this to run every week, at scale, to generate for every single user a customized playlist of recent songs we think you might enjoy. Then use cases proliferated, and more teams needed to do similar things for different types of applications. The ML platform team started, I'm going to guess, five years ago, maybe longer, basically trying to build common infrastructure for all these use cases so that those teams didn't have to keep reinventing the wheel and could become more productive by just focusing on their applications. More recently, a big change for the platform team was the choice of technology and frameworks. One of the initial production-ready stacks was based on Kubeflow and TensorFlow. A newer path that we're working on today also supports other types of frameworks, not just TensorFlow but pretty much anything, on top of Ray and Google products like Dynamic Workload Scheduler.
Avin Regmi
And I think the way it has evolved in the last few years follows a very similar line to other companies. Around 2018 or so, the ML landscape was heavily driven by TensorFlow and the TFX suite of tools, and that was the default approach for going into production. That's how we built our system around that time. But as technology evolved, and as the way we do modeling evolved, we had to look into different ways of supporting it outside of TFX as well. From around 2022 onward, we've expanded into Ray and PyTorch, enabling more generative AI use cases, NLP, and so on, which we can definitely talk more about in a bit.
Mofi Rahman
Yeah, and our listeners can't see this, but Avin is wearing a T-shirt right now that says "Hendrix ML platform", which is the ML platform you both work on at Spotify. Tell me a little bit more about the Hendrix platform. How does it operate on top of the tools you mentioned, like Kubernetes and Ray, to bundle together everything people need to run any kind of ML workload at Spotify?
Avin Regmi
At the most foundational layer at Spotify, I would say there is data, compute, and orchestration, which are the core pieces we have, and Hendrix resides on top of that. For compute infrastructure, we use GKE, and we have SKF for that. Recently we've also expanded onto Ray, which resides on top of GKE. That's what practitioners primarily use to actually train their models, do batch inference, and so on. When it comes to serving, because we also allow users to serve, traditionally we worked with TF Serving and similar toolsets. Recently we've expanded into others, like Triton and vLLM, to support a wider range of use cases. Then there has to be a way for users to orchestrate or schedule all this in production. For that we use Flyte, which is very similar to Airflow and lets users orchestrate their various ML workloads. For features, we have an in-house feature store, Jukebox. And there's the Hendrix SDK, a high-level SDK that wraps everything together and that users use to interface with the platform. So that's it at a very high level. David, do you want to expand on any of that?
David Shah
Yeah, I think that's a really good summary at a high level. I can speak a little more about how we actually deploy, run, and operate the Ray-based stack. Everything is on Google Kubernetes Engine, which means we don't have to maintain the Kubernetes clusters ourselves. We're just using the Standard mode, but GKE takes care of a lot of things for us. We deploy Ray in its KubeRay form. KubeRay is the open source way to deploy Ray onto Kubernetes, and that works really well for us. We don't do the VM-based deployment of Ray. For a lot of the other things we mostly use Google products: logging is just Cloud Logging. For metrics we use an internal metrics stack. Sometimes we use Cloud Monitoring, but mostly it's internal metrics.
Mofi Rahman
So you mentioned using Kubernetes, and in this case GKE. How was the decision made? When and how did you decide that this platform should be built on top of something like Kubernetes?
David Shah
It was very early, when we were playing around with Ray and it was just me and one other engineer, Keshi, who's still on the team. I think it came down to our team's bandwidth, needs, and expertise, and also how well the KubeRay version of Ray worked. After playing around a little with both the VM-based deployment of Ray and the Kubernetes-based deployment, it was pretty clear that we could get started a lot faster with Kubernetes. The autoscaling, the monitoring, and the other value-added things like GKE image streaming meant that, given the limited size of our team and the Kubernetes expertise we already had, it just worked. We didn't want to have to build more things ourselves on top of plain VMs. So that was pretty straightforward.
Mofi Rahman
Yeah. I work on the GKE DevRel team, so I definitely have a lot of love for Kubernetes and GKE as a product. But there is a growing sentiment in the community that for many problems Kubernetes might be overkill, or too complex a solution to bring in. It seems like this is a use case where your problem is complex enough that it fits nicely into the Kubernetes model. That's one thing I tell people all the time: Kubernetes is a great solution, but it's not a perfect solution for everything. You get the value when your problem set matches nicely with the features Kubernetes provides. Oftentimes people try to force their problem into the world of Kubernetes, or any other solution that sounds cool, and they get burned because it doesn't really map to the problem they're trying to solve.
Avin Regmi
Absolutely.
Mofi Rahman
The next question I was going to ask: imagine I'm a new user at Spotify, on a new team that's trying to use this ML platform. What does that onboarding journey look like? I have an idea, a thing I want to build that is going to use the ML platform. What do I do to get my ML application running on Hendrix and using its resources?
Avin Regmi
So let's say you're a new team. The way we do it on Kubernetes in our Ray infrastructure is that we create separate namespaces for different teams. So the first step would be to create a namespace. There's a way to do that from Hendrix, and that gives you access to a default level of quota that you can get started with. We've also been working quite a bit on Workbench, which is essentially a managed instance of a dev environment that allows you to create resources and interact with them. With the Hendrix SDK, you can specify that, for a Ray approach, you need a Ray cluster with this many head nodes, or this many workers, resources, and GPUs. This is the CLI route: by typing and entering that, users can provision the Ray cluster they need to get started. From there you can either SSH into the head node of the Ray cluster and start your dev process, or you can start submitting files, and so on. I think the key here is that there isn't one specific approach. I would say there are three different high-level approaches that we allow users to interact with Ray, and it depends on where you are in your journey. If you're relatively early on and you're focusing on EDA, trying to explore data and try different things, then starting up in a notebook and SSHing into a smaller Ray cluster is most likely the fastest way to do so. Once you start training your model and you have a certain level of maturity, perhaps you now want to deploy the model into production. At that point you would want to orchestrate this in a more clearly defined pipeline using Flyte, rather than having a notebook or scheduling out of a notebook.
So at that point a team would move away from the notebook experience toward fully built Docker images that are deployed via Flyte. So depending on where you are in this journey, you would change your entry point to Ray from an ad hoc approach to a scheduled job via Flyte.
Mofi Rahman
That makes a lot of sense. Oftentimes when people think about a dev platform, the example a lot of them have in mind is something like Backstage, where there's a DSL and you define your application in a very strict format. But it seems like what gave Hendrix a lot of its success is giving people the flexibility of a very ad hoc approach as well as a very strictly defined one using Flyte. I think Flyte uses YAML to define its pipelines. So say I have onboarded: I tried things with notebooks, and eventually I wrote my Flyte YAMLs to get my application running on Kubernetes and Ray. How many of the knobs are exposed to the user? I know abstraction often comes with a lot of things hidden. For the ML teams, how much of the underlying platform do they get to see when they deploy something?
David Shah
The principle we're going for is progressive disclosure. For people who are just getting started, we don't want or expect them to even know that there's Kubernetes or GKE, or to have to look at YAML. So, like Avin referenced, we have the Hendrix SDK and also a command-line executable. You just run Hendrix create cluster, and we give them some knobs. But if you don't specify any command-line switches, like the number of CPUs, we give you sane defaults, and then you can just connect to it, start running a notebook, and start writing a very simple Ray function. To get started we also give them notebook tutorials, so they don't even have to look at the upstream open source Ray docs. They can just plop them into a notebook editor, click "run all", and look at it that way. Then, of course, with progressive disclosure, people will want to customize. So they'll probably discover the CLI switches we have for adding hardware accelerators. If they find out they need more workers than the default, they can add more workers. Maybe they even need a custom container image instead of our default one, and we have docs showing them how to do that. And some people even have to drop down to dealing with the RayCluster Kubernetes YAML itself, because there's something we just didn't provide a knob for. It's impossible to provide a knob for everything. So we allow the command-line tool to take a path to your RayCluster YAML. We allow them to drop all the way down to being exposed to Kubernetes, but we make it progressive, so that they only need to know as much as they need to get the job done.
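Editor's sketch: the layering David describes (sane defaults first, then optional switches, then a full raw-spec escape hatch) could look roughly like the Python below. Everything in it is hypothetical: `ClusterSpec`, `create_cluster`, the field names, and the default values are illustrative assumptions, not the real Hendrix SDK or CLI.

```python
from dataclasses import asdict, dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical defaults; the platform's real values are not stated in the episode."""
    workers: int = 2
    cpus_per_worker: int = 4
    gpus_per_worker: int = 0
    image: str = "default-ray-image"


def create_cluster(raw_spec: Optional[Dict[str, Any]] = None, **overrides: Any) -> Dict[str, Any]:
    """Resolve a cluster request at one of three disclosure levels."""
    if raw_spec is not None:
        # Level 3: the user supplies the whole spec and bypasses the defaults.
        if overrides:
            raise ValueError("pass either a raw spec or overrides, not both")
        return dict(raw_spec)
    # Levels 1 and 2: start from the defaults and apply any overrides.
    # dataclasses.replace raises TypeError for a knob that does not exist.
    return asdict(replace(ClusterSpec(), **overrides))


print(create_cluster())                                  # level 1: defaults only
print(create_cluster(workers=8, gpus_per_worker=1))      # level 2: a few knobs
print(create_cluster(raw_spec={"kind": "RayCluster"}))   # level 3: escape hatch
```

The point of the design is that each level is a strict superset of the one before it, so a beginner never has to read about the lower levels to get a working cluster.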
Mofi Rahman
Yeah, I think progressively giving them more exposure is a really good point. If you expose all the Kubernetes knobs, you're back to Kubernetes again, so you're putting them back into the same world you were trying to save them from. That makes a lot of sense. Another thing you mentioned is that an earlier version of this ML platform, if not Hendrix itself, was using Kubeflow, that you eventually moved to Ray, and that a lot of the platform now probably uses PyTorch as well. So that decision to choose Ray, and to move from everything being TensorFlow to a lot of things being PyTorch: was that mostly a bottom-up approach, where ML engineers were asking for these features? Or was it more about you looking at industry trends and seeing where the industry is going? Or was it a mixed approach? How did you decide that the platform should support Ray as a primary orchestrator?
Avin Regmi
I think it's both. We definitely saw trends in the industry. Things were focusing more on transformer-based approaches and NLP, and we had use cases along those lines as well. It is definitely possible, and we still have teams using Hugging Face Transformers packages on SKF, but the experience is not the best, and we noticed that. At the same time, as more NLP models and LLMs were coming in, fine-tuning those kinds of models on SKF was impossible for us. So there were definitely external elements, like which direction the industry was moving in. We saw things going more toward PyTorch, and Ray was evolving quite a bit and a lot of companies were adopting it. But at the same time, there were also business-driven use cases that led us to invest more in that side as well.
Mofi Rahman
Kubernetes as an open source project has evolved quite a bit in the last few years to support use cases like large language models: serving, fine-tuning, and training these massive models at scale. But you talked about starting your journey in 2022, which was still the early days. What kind of changes did you have to make to get Kubernetes to work well enough with the size of models you were working with?
Avin Regmi
There were various optimizations that we had to do. In fact, we'll be expanding more on this at Ray Summit next month. Just to give you an example, take the size of the Ray cluster itself: we went from our traditional SKF cluster, which scaled up to a couple hundred nodes, to a cluster right now that can scale up to over 4,000 nodes. When it comes to actually training the models, we leverage GCP's highly optimized compute node pools. If we're using something like H100, A2, or A3 node pool instances, those have high interconnect bandwidth for GPU-to-GPU communication, which gives us better support for training these larger models. Another one is the compact placement strategy: making sure that these VMs are physically co-located to reduce network latency also helps. Another knob that we turned on is NCCL Fast Socket. For all the NCCL traffic, NVIDIA's collective communication library, GCP has a transport layer plugin that optimizes on top of standard NCCL. It was very easy for us to enable, and in public forums we saw that people were gaining roughly a 30% speedup in training these models. So those are some of the optimizations we had to do on top of our GKE cluster to accommodate these larger model training jobs.
Mofi Rahman
David, you also mentioned Dynamic Workload Scheduler earlier. That's a feature we have to help with getting resources that are difficult to get, like GPUs. We all know how difficult GPUs have been to get. Tell me a little bit more about how your workloads are scheduled so that you can wait to get those resources instead of having to get them right away.
David Shah
Yeah, the current work with Dynamic Workload Scheduler, which I'll refer to as DWS from now on, is pretty experimental. We're just about to test out some of the functionality with some early users, early teams that we work with. So it's a little early to say how they will decide to restructure their workloads, in terms of code, in response to this new functionality. But I think one of the biggest pain points and motivations for adding this functionality now is that a lot of teams want to use the latest, most cutting-edge hardware accelerators, and there are frequently stockouts, either because we haven't acquired enough quota on our end or because there's actually a Google Cloud or Google Compute Engine stockout in a region. Currently a lot of our GKE node pools are on-demand. Sometimes we have reservations. For listeners who don't know, reservations are where you pay to reserve compute instances: whether or not you're using them, you're still paying, but it guarantees availability. With the on-demand ones, teams make requests and might not get them, and so they're blocked. So we're hoping the work with DWS will make things more cost-efficient in terms of how workloads get scheduled. If you need eight H100s, you're not going to schedule your workload until you can atomically acquire all eight. You're not going to be hanging on to four and paying for them while you try, and sometimes fail, to get the other four. So it's not only cost efficiency; we're also hoping that it gets us more availability and avoids those kinds of stockouts, so that our users aren't blocked.
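Editor's sketch: the "all eight or nothing" behavior David describes is an all-or-nothing (atomic) acquisition, as opposed to grabbing accelerators one at a time. The toy Python below contrasts the two on a pool with only four free devices. `AcceleratorPool` and its methods are hypothetical names for illustration and do not correspond to any Dynamic Workload Scheduler or GKE API.

```python
class AcceleratorPool:
    """Toy stand-in for a pool of scarce accelerators (hypothetical, not a DWS API)."""

    def __init__(self, free: int) -> None:
        self.free = free

    def acquire_greedily(self, wanted: int) -> int:
        """Grab whatever is available now; may return fewer than wanted."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def acquire_atomically(self, wanted: int) -> int:
        """Grant the whole request or none of it."""
        if wanted > self.free:
            return 0
        self.free -= wanted
        return wanted


greedy_pool = AcceleratorPool(free=4)
atomic_pool = AcceleratorPool(free=4)

# A job that needs 8 devices arrives while only 4 are free.
held_greedy = greedy_pool.acquire_greedily(8)    # holds 4 devices it cannot use yet
held_atomic = atomic_pool.acquire_atomically(8)  # holds nothing and waits instead

print(held_greedy, held_atomic)  # prints: 4 0
```

In the greedy case the job sits on four devices, and their cost, without being able to start. In the atomic case those four stay free for a smaller job that can actually use them.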
Mofi Rahman
You're talking about an ML platform with Kubernetes clusters of about 4,000 nodes and a lot of different users in different namespaces using the same cluster. You probably have some challenges with resource sharing because they're on the same cluster. How does Hendrix handle multiple people asking for similar resources, and how are those resources shared as fairly as possible?
David Shah
So yes, our platform is multi-tenant: lots of tenants, lots of teams, and everybody's always hungry for "give me CPUs, give me the latest hardware accelerators". We use both several features of Kubernetes and human processes to manage those requests. I'm not sure I'd call it a fair way, but it's a way that currently works for people, or is at least as transparent and visible as possible, and that provides guardrails and avoids noisy neighbors and resource contention as much as possible. So we use Kubernetes namespaces. Each team starts off with a namespace. They can create more than one namespace if they have multiple systems; we go with a namespace-per-system approach. Then we use Kubernetes resource quotas. We start off with a default amount. Users can go and edit it to request more, but that's subject to approval by our team, to check that they're requesting a sane amount and aren't going to hog everything. So it's a combination of a human in the loop plus Kubernetes resource isolation. And then we deploy Kueue ourselves. When we first started playing around with Kueue, it was obviously very powerful and would solve a lot of problems that we currently face, but it was also quickly apparent that it's pretty complex. There are a lot of cool things you can do with it, like borrowing, lending, LocalQueue, ClusterQueue, all these Kubernetes resources that were new even to us. So we deployed it as a team, in a centralized way. I don't think any of our users would have the bandwidth or the know-how to get started with that quickly. And we use a set of repos to centralize and encode the sane defaults we want, and also to promote certain behavior that we want.
For example, we want only workloads that use hardware accelerators to go through the queued provisioning Kubernetes node pools. If you're just using CPUs, it goes through the regular on-demand autoscaling of the cluster. So it's a combination of both human processes and technology features.
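Editor's sketch: the two guardrails David mentions, a per-namespace quota check followed by routing accelerator workloads to queued provisioning and CPU-only workloads to on-demand autoscaling, can be pictured as a small admission function. This is a simplified illustration in plain Python. `NamespaceQuota`, `admit`, and the returned route labels are hypothetical, and in a real cluster this logic lives in Kubernetes ResourceQuota and Kueue objects rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class NamespaceQuota:
    """Hypothetical per-namespace limits and current usage."""
    cpu_limit: int
    gpu_limit: int
    cpu_used: int = 0
    gpu_used: int = 0


def admit(quota: NamespaceQuota, cpus: int, gpus: int) -> str:
    """Check the request against the namespace quota, then pick a route."""
    over_cpu = quota.cpu_used + cpus > quota.cpu_limit
    over_gpu = quota.gpu_used + gpus > quota.gpu_limit
    if over_cpu or over_gpu:
        # A human would review a quota increase request at this point.
        return "rejected-over-quota"
    quota.cpu_used += cpus
    quota.gpu_used += gpus
    # Only accelerator workloads go through the queue.
    return "queued-provisioning" if gpus > 0 else "on-demand-autoscaling"


team = NamespaceQuota(cpu_limit=64, gpu_limit=8)
print(admit(team, cpus=16, gpus=0))   # CPU-only job
print(admit(team, cpus=16, gpus=8))   # accelerator job within quota
print(admit(team, cpus=16, gpus=1))   # would exceed the GPU quota
```

Keeping the routing rule in one central place is what lets the platform team change the default behavior without every tenant having to learn the underlying queueing resources.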
Mofi Rahman
Awesome.
Avin Regmi
I would add that we also have multiple clusters. We have, let's say, a smaller cluster that people get started with for experimentation, and a larger cluster where people deploy larger workloads. For us, Mofi, the perfect vision for the future would be that, as a user, I don't have to worry about context switching between clusters. That entire thing is abstracted away from me. It's more: here is what I'm trying to do, and the platform figures out whether it's going to be a reserved instance or on-demand, and whether it's going to be on the larger cluster or the smaller one. We hope that Kueue will help us simplify this process by navigating that ambiguity.
Mofi Rahman
This question goes to both of you, but more specifically to Avin. You mentioned that over the last few years you have been building Panini and working at Bell AI Labs as well as Spotify, so you have been very closely involved in building ML platforms and leading teams that build ML platforms. What are some of the more interesting or surprising things you've found out about building ML platforms?
Avin Regmi
I would say there are two different aspects to this. If you were to ask me to build a backend team or something like that, the talent is kind of a given. I would go out there, get backend domain expertise, and work on it. The challenging and interesting part of the ML domain is that you need expertise from both ends, the ML side and the infrastructure side, and sometimes deep CUDA-level GPU optimization as well, all coming together. And the very nature of ML is that it's changing rapidly all the time, while the nature of a platform is that you're trying to maintain a certain level of stability. If your platform is also changing rapidly and breaking users' code, that's not a good ML platform. So on one hand you have this ML domain that's constantly coming out with new tools and technologies, and on the other hand, how do you go forward in a way that's more stable, where you're not breaking user infrastructure or user code? I think that is quite challenging. It's a fine balance between when we want to move fast and try new things and when we want to slow down a bit and ask whether this is going to break users' code or their experience.
Moviraman
I think for me personally that was a bit of a learning, because I come from a strong infrastructure background. So when we first started talking about running AI workloads on GKE, my first response was: it's just a workload, put it in a container, just run it on GKE. I don't understand what the big fuss is about. But now that I have spent a bit of time talking to customers as well as listening to folks from the community, I see what you mentioned about that dichotomy: the underlying technology of ML moves so fast, whereas infrastructure is somewhat static by comparison. Take large language models. Surprisingly, the transformer-based architecture was initially proposed in 2017, but in the last seven years the scale of models that people are deploying has grown something like a thousand to a million x in size, right? A few months ago someone was talking about a 2 billion parameter model being a small model, which is such a bizarre way to describe a 2 billion parameter model. But again, I think our perspective of what a big model is changes quite rapidly. Infrastructure itself is not growing as fast, right? It is growing very quickly: we're releasing new GPUs, the memory and everything is growing. But the scale of ML platforms and model sizes is increasing much faster than the infrastructure changes. So that is actually a really interesting finding, and yeah, I agree 100%. And then a similar question to David: previously at Spotify you worked on the team that handled deployment and productionizing, probably backend and other systems, but now you're working on the ML platform. How is that different, or how is it the same?
David Shah
Yeah, a lot of things are the same in terms of what is good to do as an infrastructure team. One thing that's timeless is having actionable error messages so that you help your users. Don't write error messages for yourself, someone with all the context, but for someone who has no idea what this is. Actionable error messages are very important. I think the Ray team at Anyscale does a really good job. If you look at a lot of Ray error messages, they're just so informative. They even suggest things like, "We notice that you're doing this, here's a suggested optimization," or "We notice this thing is running slow," and it just prints that in the logs so you can immediately go do it. That's one. Number two is, again, progressive disclosure: make it really easy to get started, have a quick start and sane defaults, and then let people drop down to lower levels of implementation detail to override things. Another one is not surprising, but because things move so fast in ML versus, you know, backend services or something, things go from prototype to being used in production very, very quickly. So keeping track of your tech debt is really important, because very soon you're going to have to pay it off, especially when things move quickly and so many people start using them, even faster than before. As soon as you write a piece of functionality, people are going to want to use it. Another thing is, at least at Spotify, the types of backgrounds and the levels of expertise among AI researchers and ML practitioners are very broad. So you'll have people who are coming out of PhDs and haven't worked in industry before. They're probably writing Python that's not the best, and they don't know how to SSH into something because they're used to writing everything locally. So at least for me, I've noticed that I have to write documentation and design tools for a much broader array of users.
Of course we do have very sophisticated ML engineers too who could instantly tomorrow work on our team and have no problem. But we do have also like a lot of other people who don't come from a traditional engineering based background. And so for me I have to take that into account.
Moviraman
Yeah, I think for me another learning has happened over time, and I think for a lot of teams this is a revelation too. For the longest time, for many companies, ML was probably this research org that was doing interesting experiments, running one-off experiments to find information out. For Spotify I think it's a lot different, because Spotify has been putting ML at the forefront of its product for a really long time, whereas for a lot of companies ML got thrust into the critical path in the last couple of years, and it's been a big challenge. So I think Spotify definitely had a good head start in that space, because productionized ML workloads were already at the forefront of your product. That probably helped teams know that their application would be seen by other people, not just in their own notebook. The last thing I want to ask both of you: obviously the Hendrix platform is going to grow, other people are going to come in, new workloads are going to come in. But if you had a wish list of features, and infinite time and resources to get them done, what would you like to see coming into the platform, both the things people have been asking for and the things in the Kubernetes space that have been challenging and that you would like to see improved specifically for ML platform building?
David Shah
Yeah. One thing that we could definitely improve: we put a lot of work into the interactive prototyping stage, when you're writing stuff in a notebook. But when people schedule something with Flyte and then something goes wrong, that debugging process is a lot slower and harder for them, because there's a separate team that manages a separate Kubernetes cluster that runs Flyte, and then Flyte kicks off a workload on our team's Kubernetes clusters. So when you're debugging, you're exposed to the organizational gaps in between. They're not the same team, so it's not a very integrated experience. And then there's the fact that everything's running in containers, so when things go wrong, you can't SSH in and poke around at the point in time when the error happened. So that's one part of the experience that could definitely be improved. Another part is that we have an SDK, and it's currently pretty tied to, or at least very opinionated on, PyTorch. We want to make it more framework agnostic. We also want to make it more flexible in terms of the Ray version that you're using. We wrote a bunch of code to abstract away Kubernetes, and it's very tied: this version of Hendrix uses this version of Ray. But it doesn't need to be that way. You should be able to specify whatever version of Ray you want. Because of how we wrote all the other code on top of it, though, it's very Ray-version-specific code, so it's going to require a bit of rethinking to make it Ray version agnostic. If you want to use a newer Ray version, go ahead, and everything should still just work. And another part, in terms of user experience: everything should just be faster. Currently, the images and software artifacts that we provide are a little bit bloated.
There are a lot of things in them. A very frequent ask from people is: I don't need all this, how do I get the minimal thing? How do I get my image or workload to start up a lot faster? How do I avoid a pip install that makes me wait several minutes? And then the virtual environment is many, many gigs big.
Evan Regmi
It's funny you asked this question, because right before this meeting I was in another meeting about the changes we want to bring in over the next six months and so on. And I very much agree with what David said about the Flyte experience. Traditionally, where we came from was the Kubeflow kind of workflow, where orchestration was taken care of by Spotify Kubeflow. Moving to Flyte, I think there were opportunities for us to improve that experience a bit, and that's certainly on our roadmap. The other thing is taking users from experimentation to production much faster, because a lot of projects don't actually end up in production. Very few make it past an A/B test and go all the way to production. So what is the minimum set of steps users can take? Can we reduce the number of steps needed to go from a notebook experience all the way to the point where things are fully scheduled in an orchestration system like Flyte? If we can eliminate some of those steps in between, we can hopefully make that process much faster, not accounting for model performance and so on. And finally, an ideal state, and that's the direction we're working toward as a platform team, from a Ray perspective, is this: as we grow in terms of use cases, we may also grow in terms of the number of clusters. But from an end user's perspective, how do we abstract away all the necessary infrastructure and just have them focus on their model? If I'm a machine learning engineer, I'm trying to train a model and optimize for a certain business lift or business metric. I don't necessarily want to have to worry about how many head or worker nodes I need, or how much memory or how many GPUs I need.
Similar to, I would say, perhaps SageMaker JumpStart: can we have users just focus on specifying the model? "Hey, I'm trying to train a model that has this many billion parameters." From that, can we infer the batch size, and based on that, can we infer the resources needed, so that it's completely abstracted away? Certainly this will help a certain level of users. But then again, there might be other users who want to fine-tune that or have control over it, and for them, that's okay. We can expose those settings to them.
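As a rough illustration of the idea of inferring resources from a parameter count, here is a minimal sketch. It is not how Hendrix works; the function name, the 16 bytes per parameter rule of thumb (mixed-precision training with an Adam-style optimizer), the 80 GiB GPU size, and the usable-memory fraction are all assumptions chosen for the example. It also ignores activation memory and batch size, which a real estimator would have to account for.

```python
import math
from dataclasses import dataclass

# Assumption: ~16 bytes per parameter for mixed-precision training with an
# Adam-style optimizer (half-precision weights and gradients plus
# full-precision master weights and two optimizer moments).
BYTES_PER_PARAM_TRAINING = 16


@dataclass(frozen=True)
class ResourceEstimate:
    model_state_gib: float  # memory for weights, gradients, optimizer state
    num_gpus: int           # GPUs needed to hold that state


def estimate_training_resources(
    num_params: int,
    gpu_memory_gib: float = 80.0,   # hypothetical accelerator size
    usable_fraction: float = 0.6,   # hypothetical headroom for activations
) -> ResourceEstimate:
    """Estimate how many GPUs are needed just to hold the training state."""
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    model_state_gib = num_params * BYTES_PER_PARAM_TRAINING / 2**30
    usable_per_gpu = gpu_memory_gib * usable_fraction
    num_gpus = max(1, math.ceil(model_state_gib / usable_per_gpu))
    return ResourceEstimate(model_state_gib, num_gpus)


if __name__ == "__main__":
    for params in (2_000_000_000, 70_000_000_000):
        print(params, estimate_training_resources(params))
```

A platform could use an estimate like this as the default and still expose the underlying settings to users who want to override them.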
Moviraman
The final question I wanted to ask: in this space of doing ML experimentation, there are a number of different ways to get started. As you mentioned, Ray has its own docs, Kubernetes has its own docs, and people could get started on PyTorch. But when you're building an ML platform, there's that zero-to-one experience of "I have an idea, I just want to get started and understand what this thing is all about." How does Hendrix, or the Spotify ML platform, help users get started, instead of having to learn all the moving parts individually?
David Shah
One thing that we noticed users struggling with, or where there were more sharp edges, is setting up a development environment. It's not ML specific, but it gets a little trickier when it comes to a bunch of the ML frameworks and tools. It's not so much of an issue now, but maybe a year or so ago, or even before that, it was. This is very Spotify specific, but most of our employees use MacBooks, and that's a very different CPU architecture from where you're running your production workloads, which is Linux based, or at least Unix-like. When people had to set up their development environment locally, oftentimes they pip installed stuff and they'd have issues. The behavior was just different. Some gRPC package didn't work, especially after Apple Silicon came out. People would run into strange error messages that required hours of googling to find some obscure compiler flag that you needed to enable so that the dynamic linking between libraries worked. It was just awful. It would be a time sink, an unbelievable black hole of productivity going down the drain. So we partnered with another team at Spotify that owned the cloud developer experience for data scientists and data engineers, because they had an experience where you could just click something and it would open up in your browser and you could start coding. The environment would be set up for you, you'd have jump to definition, and you wouldn't have to set up anything locally. We wanted that same kind of hello world, really nice experience for ML engineers as well. The internal tool that we built for this cloud IDE is called Workbench. We added ML capabilities, specifically the Ray capabilities, to Workbench so that you can say, "I'm doing ML stuff, I want a different workbench," and it just gives you all of that in your browser. It's VS Code based.
We use the open source VS Code Server, and you can get started right away without having to fuss around with pip install and all of this stuff, or look up some obscure error.
Moviraman
Yeah, as someone who is not a day-to-day Python developer, having to deal with pip install error codes and what that red line actually means is a struggle I can relate to very closely. And with that, thank you so much to both of you for spending the last hour or so talking all about building the ML platform at Spotify. Hopefully in a couple of years, once all the wishlist items that you talked about have been implemented, we'll come back and talk about the new challenges you face and how these new solutions prompted even more use cases for ML at Spotify. So thank you for spending the time and sharing all your thoughts with us.
Evan Regmi
Thank you. Thank you for having us. It was a great time.
David Shah
Thanks, Mophie. It's a pleasure.
Kaslan Fields
Welcome back, Mophie, and thank you very much for that interview. I'm really excited that we had our interview about Ray, talking about what Ray is as an open source project and how it works with Kubernetes and all of that. And now we have an episode talking with folks at Spotify who are not just using Ray, they're developing a whole platform. I liked the references to Backstage. It reminded me of a Backstage-esque kind of platform, where the idea is to abstract the underlying hardware, very platform engineering, and allow the end users to use the tools that they need. So really the users of Ray are going to be these folks' users. But the platform engineering aspect of creating a platform that abstracts away the underlying hardware is, I think, a pretty common thing that a lot of companies are at least trying to do, if not doing already.
Moviraman
Yeah, no, thanks for having me as a host once again. I think the key difference for me between something like Backstage and an internal ML platform for your teams is this: Backstage is trying to build a developer platform for a lot of folks who may or may not be in the room where a lot of the Backstage development decisions are made. So Backstage has to be something for everyone, while the people they're building the platform for may not be there voicing their opinions right away. That's the key challenge: as Backstage becomes bigger and more feature rich, you would need another platform to manage Backstage, to abstract away some of its complexities. It's the same thing with Kubernetes. When Kubernetes first came out, its API space was fairly limited, so there were like three or four different Kubernetes resources that you could deploy. As Kubernetes added more and more things, you needed abstraction, because not everybody needed all the different features of Kubernetes, right? This internal developer platform that Spotify is building is built with the input and needs of all the ML teams that exist within the company, and as those teams grow, at some point it is possible that the platform becomes so big that they need some other tools to simplify using the platform. It's kind of like a never-ending ouroboros loop: the platform is successful, more people want to use it, now it's too big, and we need something more abstract and more streamlined to use the thing again. And the cycle continues. But even with that, I think the journey that Spotify has taken is very indicative as well as illuminating for other listeners who are on the same journey now. A lot of the ML platform journey for teams probably started post-2020, in the age of LLMs, right? But Spotify has been on this journey, as we discussed, for years now.
So their maturity in being able to adopt new features, as well as knowing what not to do, is probably generally higher than that of a lot of the newer teams that are just getting started on this journey. So there is a lot to learn from their trial and error. They have been using ML since like 2012 or 2013, when the Discover Weekly feed first came out, and they finally decided to build a unified ML platform in like 2018 or 2019. So in the beginning, just like most other teams now, they were doing similar things: experimentation, trial, build your own thing, until you find out you're spending way too much with every team building its own ML platform, and that you can save a lot of time by having a dedicated centralized team doing this. I thought that was a very common practice and a common path that a lot of teams are taking.
Kaslan Fields
Speaking of the beginning of Spotify's journey and kind of complexity and how Kubernetes fits into all of this. One section I really enjoyed was when you asked David about how they made the decision about using Kubernetes as the base for this and his answer of course was that it was kind of easy and simple for them to decide on that because they were already using Kubernetes and so they already had that expertise. And so just doing Ray on top of Kubernetes was the easier path for them than doing it on top of VMs, which I thought was very poignant. It's like a lot of the folks that I talk to in the Kubernetes space tell me similar things when I ask about complexity and in various areas of using Kubernetes, if you are familiar with that area or similar areas, then using it is not so complex and it's not such a difficult hurdle to get over in the onboarding. So it was exciting to hear that Kubernetes made sense as the baseline for them for this.
Moviraman
Yeah, we also discussed a little bit how part of the complexity of the problem fits the world of Kubernetes very well. Whereas if you're trying to take a problem that is not immediately well defined in the space of Kubernetes and jam it into Kubernetes, you're going to have a hard time. You're going to feel like you are bringing in way more complexity with Kubernetes than just doing it on a VM, or something like Cloud Run, or just running a container on some PaaS solution. So there is that hammer-and-nail problem sometimes in this space. But in this case you're building an ML platform, something very complex, with a lot of multitenant users, different scale-up and scale-down requirements, and requirements for different types of resources. Managing all of that manually in the VM world is probably going to be more work than it's worth, and it's what Kubernetes is really good at. So it is nice to see that for problems that match the dynamics of Kubernetes well, like an ML platform, building on top of Kubernetes makes a lot of sense, for Spotify in this case and for many other folks who are trying to solve similar problems.
Kaslan Fields
And in building up this journey. I liked that Avin really seemed to have a strong concept of what he wanted this ML platform to look like. His background with ML platforms is very impressive and exciting because you don't see a lot of folks who were focused in this kind of area. I think, I think that's a pretty niche area of focus at this point, but it's growing so quickly. But he was so focused on the core tenets of what this platform needed to be. One thing he said that I wrote down and bolded was navigating this fast moving domain of AI while maintaining stability is the challenge. And that's something that really resonates with me in the Kubernetes open source world as well.
Moviraman
Yeah, "move fast and break things" has been a motto in this space. But again, Kubernetes turned 10 this year. It's in the double digits, right? So for pretty much most of our lifetimes, Kubernetes will be in the double digits. For the next 89 years it's going to be at a double-digit age. Hopefully Kubernetes stays around that long.
Kaslan Fields
But the years start coming and they don't stop coming.
Moviraman
Yeah, but I think the point there is that Kubernetes is, as far as we're concerned, a fairly mature system, over 10 years old. A lot of people rely on it. So Kubernetes has to take a lot of care in moving fast but still not breaking things, because it has to continue supporting all these different types of workloads that are coming to it, and we also can't break anything that people rely on Kubernetes for, right? So the thing you bolded in our notes here is that: moving fast but also keeping stability. The ML space is going through that right now. It's "we need to get the new version out as soon as possible," and sometimes it seems like proper testing and proper integration testing get left by the wayside in the name of speed and fast deployment. But over time I think the community will gather around and go for more of a stable system that people can build on top of. Right now, speed is the name of the game. You want to get the new version of your software out as soon as possible: a new version of your model, a new version of your data, a new version of your serving engine, training, what have you. But as you were saying about going from experimentation to production, that is the switch. In production, speed is important, but it's not the most important thing. Stability and correctness are probably more important than just the raw speed of getting things out.
Kaslan Fields
I liked that when you went into the things that they want to do next. A lot of it was really in that space of how do we enable the speed that these engineers need? Because the AI space is moving super fast and it needs to right now. So we need to enable that speed, but we also need to have stability features. Like I really liked the conversation about debugging and making sure that you have solid error messages. So simple, yet so important.
Moviraman
Yeah, I think for me, another big highlight is towards the end when we were talking about like local dev experience.
Kaslan Fields
Yes.
Moviraman
So just setting up your local development environment for using Kubernetes, notebooks, all the Python libraries that exist, and all the different versions of Python libraries that interact with each other. A lot of the Python libraries also fall back to underlying GCC and C libraries, so you have those versions to care about as well. Some of them are OS dependent. Like they mentioned, at Spotify they use mostly Macs, but most of those images get built for Linux environments. If you're using Apple Silicon, the container image that you end up building is an ARM image that may or may not just work in a Kubernetes environment. You can cross-build using Docker Buildx and BuildKit and whatnot, but this is not common knowledge. This is deep container knowledge that people in the infrastructure space, like you and I, have probably learned after much trial and error. But if you're thrusting your ML engineers into that space, now they're spending their precious, precious time learning container skills when they could be using it to build new models, new experiments, notebooks, and what have you. So it's a matter of how you take the toil away from your engineers and allow them to do what they're best at, which is building the product, building the model, and running the experiments, instead of everybody having to learn all the different skills, right? The other thing David mentioned was progressively giving them more information about how the thing works, and there I think people fall on either side of that coin. Some people are strictly of the view that an ML platform, or any platform, should be abstracted, and people should only have access to a CLI, an SDK, or some sort of DSL to talk to the platform. But it seems like in Spotify's case, what is working for them right now is starting users off with the SDK and CLI.
But if they want those knobs and access to the underlying hardware infrastructure, they have the option to drop down to the Kubernetes and Ray settings themselves, which is, I mean, interesting and surprising, but also not that surprising at the same time. You can't really have 100% feature parity for every knob in both Ray and Kubernetes in the same platform without rebuilding all of them from scratch, right?
Kaslan Fields
So that's one of the biggest challenges that we're always talking about with the folks building GKE is how much do you abstract away and how much do you let folks get to. Because you need to have. We got to serve both users. And so I really liked the way that David put it. He used specific words which I definitely wrote down here somewhere, but kind of progressive.
Moviraman
Progressive disclosure. That's the word.
Kaslan Fields
There you go. Thank you.
Moviraman
David is the word.
Kaslan Fields
I like that term, but yeah.
Moviraman
So all in all, as you mentioned, we had a previous episode, 235, on Ray and KubeRay, kind of the open source project itself. But now we're getting to see how Ray fits into a real-world ML platform. So I think that, taken in that order, the two episodes together tell a very compelling story of understanding Ray and all the moving parts, but also taking a step back, zooming out a little bit, and looking at the broader picture of how Ray fits into a larger ML platform.
Kaslan Fields
I think that's awesome and I hope that folks out there are able to relate to a lot of the scenarios that we talked about today.
Moviraman
Yeah.
Kaslan Fields
Thank you very much, Mophie.
Moviraman
Thanks, Kaslan. That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
Podcast Information:
In this episode of the Kubernetes Podcast from Google, hosts Kaslin Fields and Moviraman introduce their guests, David Shah and Evan Regmi from Spotify. David is a Senior Engineer on Spotify's ML Platform team, while Evan serves as an Engineering Manager leading the ML Training and Compute team for the Hendrix ML platform. The episode delves into Spotify’s machine learning infrastructure, focusing on their use of Kubernetes and Ray to support complex AI workloads.
Before diving into the main topic, the hosts share significant updates in the Kubernetes ecosystem:
IBM Acquires Kubecost
Hosted by Moviraman at [00:40]
IBM has acquired Kubecost, a startup specializing in Kubernetes cost management and optimization. Kubecost is widely used by companies like Allianz, Audi, Rakuten, and GitLab. The acquisition aims to integrate Kubecost with IBM’s existing acquisitions, Apptio and Turbonomic, to enhance cost and performance optimization without disrupting current services.
KubeCon Japan Announcement
Hosted by Kaslan Fields at [01:12]
For the first time, KubeCon will be held in Japan in 2025, as announced by Cloud Native Community Japan, a subgroup of the CNCF. The event is expected to feature two main conference days, over 100 sessions, and more than 1,000 attendees, although the dates and location are yet to be announced.
Artifact Hub Becomes a CNCF Incubating Project
Hosted by Kaslan Fields at [02:04]
Artifact Hub, a web application for discovering, installing, and publishing cloud-native packages and configurations, has joined the CNCF as an incubating project. It simplifies the discovery of artifacts like Helm charts, providing a centralized platform for users to find and publish cloud-native resources.
OpenMetrics Merged into Prometheus
Hosted by Moviraman at [02:30]
OpenMetrics has been archived and integrated into Prometheus. This merger signifies the consolidation of metrics standards under Prometheus’ umbrella, ensuring continuity and improvement in metric collection and usage.
kubecolor Enhancements
Hosted by Kaslan Fields at [02:52]
kubecolor, an open-source wrapper for kubectl commands, has been updated to version 0.4.0 by contributors like PruneDebastian Thomas and Lennartac. The new release adds colorful highlighting to outputs and improved paging functionality for lengthy outputs, enhancing user readability and experience.
David Shah
Introduced by Moviraman at [03:35]
David Shah is a Senior Engineer on Spotify's ML Platform team. He has played a pivotal role in building and operating Spotify’s centralized Ray platform, facilitating easy prototyping and scaling of machine learning workloads. His previous experience includes working on Spotify’s core infrastructure and deployment tooling.
Evan Regmi
Introduced by Moviraman at [04:01]
Evan Regmi is an Engineering Manager at Spotify, leading the ML Training and Compute team for the Hendrix ML platform. With expertise in training and serving ML models at scale, ML infrastructure, and team development, Evan previously led the ML platform team at Bell AI Labs and founded Panini AI, a cloud solution for low-latency ML model serving.
David Shah explains that Spotify's ML platform serves as an infrastructure layer, abstracting complexities of Kubernetes and providing seamless access to computational resources for internal ML practitioners.
"It's the infrastructure layer on which our users, most of them, all of them are internal. Other Spotify employees, like AI researchers, ML practitioners, those users actually use it to do the actual application."
(05:23)
Evan Regmi adds that the definition of an ML platform can vary based on organizational size and needs, emphasizing the platform’s adaptability.
"The need for the ML platform also changes from Org to Org depending on the business case and at what scale you're operating at."
(07:36)
The platform initially relied on Kubeflow and TensorFlow, but as technology advanced, Spotify expanded support to include Ray and PyTorch to accommodate diverse ML workloads, including generative AI and NLP applications.
David Shah notes the shift from each team building their own ML infrastructure to a centralized platform enhancing productivity.
"They had to roll their own from the model architecture all the way down to how do we get compute resources...then the ML platform team...started trying to build common infrastructure for all these use cases."
(08:51)
Evan Regmi reflects on the industry's transition from TensorFlow-centric approaches to incorporating Ray and PyTorch, driven by evolving modeling techniques.
"There were use cases, business driven use cases that kind of allowed us to more invest in that side as well."
(22:05)
Spotify chose Kubernetes (GKE) as the foundation for their ML platform due to existing expertise and the advantages Kubernetes offers in scalability and resource management.
David Shah explains the choice was driven by the ease of deploying Ray on Kubernetes compared to VM-based deployments.
"It was pretty simple that we were able to just get started a lot faster with Kubernetes...we didn't want to have to build more things ourselves on top of just plain VM."
(14:14)
Evan Regmi emphasizes that Kubernetes’ dynamic resource handling aligned well with the complex, multi-tenant requirements of their ML workloads.
"Kubernetes is really good at...multitenant user, different scale up, scale down requirements..."
(25:21)
Spotify's Hendrix ML platform offers a streamlined onboarding process through namespaces, Hendrix SDK, and Workbench, allowing users to start with default settings and progressively access more advanced configurations as needed.
Evan Regmi describes the onboarding steps:
"Users can create a namespace, use the Hendrix SDK to provision a Ray cluster, and start with notebooks or submit jobs via CLI."
(16:23)
David Shah reinforces the principle of progressive disclosure, ensuring users aren’t overwhelmed with Kubernetes complexities initially.
"We don't want or expect our users to know how to use Kubernetes in order to get access to lots of hardware accelerators or CPUs."
(21:10)
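To make the idea of progressive disclosure concrete, here is a minimal sketch in Python. It is not the Hendrix SDK, whose API is not described in the episode. The `ClusterRequest` class, its field names, its default values, and the `team-recs` namespace are all hypothetical. The point is only that a new user supplies one value and gets working defaults, while advanced settings stay available but optional.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ClusterRequest:
    """Hypothetical request object for illustration; not the Hendrix SDK."""

    # The only value a new user must supply.
    namespace: str

    # Defaults (assumed values) that let a first-time user get started.
    workers: int = 2
    cpus_per_worker: int = 4
    gpus_per_worker: int = 0

    # Advanced, opt-in settings that most users would never need to touch.
    node_selector: Dict[str, str] = field(default_factory=dict)
    ray_version: Optional[str] = None

    def total_cpus(self) -> int:
        """Total CPUs requested across all workers."""
        return self.workers * self.cpus_per_worker


# A beginner only names their namespace and accepts the defaults.
basic = ClusterRequest(namespace="team-recs")

# An experienced user overrides the same object with advanced settings.
advanced = ClusterRequest(
    namespace="team-recs",
    workers=8,
    gpus_per_worker=1,
    node_selector={"pool": "gpu"},  # hypothetical node label
)
```

Both users work with the same interface; the difference is only in how many fields they choose to set.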
Managing a shared Kubernetes cluster with thousands of nodes requires effective resource scheduling to ensure fairness and efficiency.
David Shah discusses Spotify’s approach using Kubernetes namespaces and resource quotas to manage multi-tenancy:
"We use Kubernetes resource quotas...subject to approval by our team to check that they're requesting like a sane amount."
(27:32)
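The quota-plus-approval approach described above could look roughly like the following sketch. The manifest uses the standard Kubernetes `ResourceQuota` kind and its `spec.hard` keys. Everything else is an assumption made for illustration: the function names, the `-quota` naming convention, the `nvidia.com/gpu` resource, and the threshold values are not taken from the episode, and Spotify's actual review process is not described in detail.

```python
def build_resource_quota(namespace: str, cpu: int, memory_gib: int, gpus: int) -> dict:
    """Build a Kubernetes ResourceQuota manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),
                "requests.memory": f"{memory_gib}Gi",
                # Extended resources such as GPUs are limited via "requests.<name>".
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


# Hypothetical thresholds above which a human on the platform team reviews the request.
REVIEW_THRESHOLDS = {"cpu": 256, "memory_gib": 1024, "gpus": 8}


def needs_human_review(cpu: int, memory_gib: int, gpus: int) -> bool:
    """Return True if any requested amount exceeds its (assumed) threshold."""
    return (
        cpu > REVIEW_THRESHOLDS["cpu"]
        or memory_gib > REVIEW_THRESHOLDS["memory_gib"]
        or gpus > REVIEW_THRESHOLDS["gpus"]
    )


quota = build_resource_quota("team-recs", cpu=64, memory_gib=256, gpus=4)
```

A dictionary like `quota` can be serialized to YAML or JSON and applied to the cluster once the request has been approved.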
Evan Regmi mentions the use of multiple clusters for different workload sizes and the vision to abstract cluster management from end-users:
"Our vision is that as a user, I don't have to worry about context switching between clusters."
(29:48)
Building an ML platform involves balancing rapid technological advancements with the necessity for stable, user-friendly infrastructure.
Evan Regmi highlights the challenge of maintaining platform stability amidst the fast-evolving ML landscape:
"Navigating the ML domain that's constantly coming with new tools and technologies, while providing stability to not break user code."
(30:30)
David Shah adds the importance of actionable error messages, progressive disclosure, managing tech debt, and catering to diverse user expertise levels:
"Actionable error messages are very important...progressive disclosure...keeping track of your tech debt is really important."
(34:03)
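As an illustration of what an actionable error message might mean in practice, the sketch below wraps a low-level failure in a message that states the problem and a next step. The `ActionableError` class, the `explain_failure` function, the matching rule, and the sample error string are all hypothetical; the episode does not describe how Hendrix implements this.

```python
class ActionableError(Exception):
    """An error that pairs what went wrong with what the user can do about it."""

    def __init__(self, problem: str, action: str):
        super().__init__(f"{problem} Next step: {action}")
        self.problem = problem
        self.action = action


def explain_failure(raw_message: str, namespace: str) -> ActionableError:
    """Translate a raw infrastructure error into a user-facing one (assumed rules)."""
    if "exceeded quota" in raw_message:
        return ActionableError(
            problem=f"Namespace '{namespace}' has used up its resource quota.",
            action="Stop idle clusters or request a larger quota from the platform team.",
        )
    # Fall back to passing the raw message through, still with a next step.
    return ActionableError(
        problem=f"Unrecognized error: {raw_message}",
        action="Contact the platform team and include this message.",
    )


# Illustrative raw message, modeled loosely on a quota rejection.
raw = 'pods "ray-worker" is forbidden: exceeded quota: team-recs-quota'
err = explain_failure(raw, namespace="team-recs")
print(err)
```

The user sees which namespace is affected and what to try next, without needing to interpret the underlying Kubernetes message.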
The guests discuss potential enhancements to the Hendrix platform, focusing on improving user experience, flexibility, and performance.
David Shah outlines areas for improvement, such as better debugging capabilities and optimized software artifacts.
Evan Regmi mentions plans to streamline transitions from experimentation to production and abstract infrastructure complexities further:
"As a platform team, we aim to abstract away the necessary infrastructure and let users focus on their models."
(40:30)
Spotify addresses the challenges ML engineers face in setting up local development environments by integrating Workbench, a cloud-based IDE that simplifies environment setup and provides seamless access to Ray capabilities.
David Shah explains the partnership with the cloud developer experience team to introduce Workbench:
"You can just click something and it would open up something in your browser and you could just start coding."
(42:18)
The episode concludes with reflections on Spotify’s journey in building a robust ML platform on Kubernetes and Ray. The hosts and guests emphasize the importance of progressive disclosure, user-centric design, and balancing innovation with stability. Spotify’s experience serves as a valuable case study for organizations aiming to develop scalable, efficient, and user-friendly ML platforms.
Key takeaways include the strategic use of Kubernetes for managing complex workloads, the integration of Ray for scalable ML tasks, and the continuous adaptation to evolving ML technologies while maintaining a stable platform for users.
David Shah at [05:23]:
"It's all built on top of Kubernetes. So we don't want or expect our users to know how to use Kubernetes in order to get access to lots of hardware accelerators or CPUs, or certain nitty-gritty details of, in this case, Ray."
Evan Regmi at [07:36]:
"The need for the ML platform also changes from org to org depending on the business case and at what scale you're operating at."
Kaslan Fields at [15:14]:
"Kubernetes is a great solution, but it's not a perfect solution for everything."
David Shah at [21:10]:
"We use Kubernetes namespaces. Each team starts off with a namespace... It is a combination of both human and technology features."
Evan Regmi at [30:30]:
"From an end user's perspective, how do we abstract away all the necessary infrastructure and just kind of have them focus on their model."
David Shah at [34:03]:
"Actionable error messages are very important...progressive disclosure...keeping track of your tech debt is really important."
Evan Regmi at [37:54]:
"We're hoping that the work with DWS not only will make it more cost efficient...so that our users aren't blocked."
Centralized vs. Decentralized ML Platforms: Spotify’s transition from individual teams building their own ML infrastructure to a centralized platform significantly enhanced productivity and consistency across ML projects.
Kubernetes as a Foundation: Leveraging Kubernetes allowed Spotify to manage complex, multi-tenant ML workloads effectively, benefiting from its scalability and resource management capabilities.
Ray for Scalable ML: Integrating Ray enabled Spotify to handle both inference and batch workloads efficiently, supporting diverse ML use cases like generative AI and NLP.
User-Centric Design: Emphasizing progressive disclosure ensures that users can start with simple interfaces and gradually access more advanced features as needed, accommodating varying expertise levels.
Balancing Innovation and Stability: Maintaining platform stability amidst rapid advancements in ML technologies is crucial. Spotify achieves this by carefully managing tech debt and providing actionable error messages.
Future Enhancements: Spotify aims to further abstract infrastructure complexities, enhance debugging capabilities, and optimize software artifacts to improve user experience and platform performance.
The discussion covers how Spotify built and evolved its ML platform on Kubernetes and Ray, along with the strategies and lessons learned along the way. It is a useful reference for organizations building similar platforms.