Spotify AI Platform, with Avin Regmi and David Xia

Kubernetes Podcast from Google

Published: Tue Sep 24 2024

Guests are Avin Regmi and David Xia from Spotify. We spoke to Avin and David about their work building Spotify’s Machine Learning Platform, Hendrix. They also talk specifically about how they use Ray to enable inference and batch workloads. Ray was featured on episode...

Summary

Podcast Summary: Spotify AI Platform with Avin Regmi and David Xia

Podcast Information:

Title: Kubernetes Podcast from Google
Hosts: Abdel Sghiouar, Kaslin Fields
Episode: Spotify AI Platform, with Avin Regmi and David Xia
Release Date: September 24, 2024

Introduction

In this episode of the Kubernetes Podcast from Google, hosts Kaslin Fields and Mofi Rahman introduce their guests, David Xia and Avin Regmi from Spotify. David is a Senior Engineer on Spotify's ML Platform team, while Avin is an Engineering Manager leading the ML Training and Compute team for the Hendrix ML platform. The episode covers Spotify’s machine learning infrastructure, focusing on how the team uses Kubernetes and Ray to support complex AI workloads.

News Highlights

Before diving into the main topic, the hosts share significant updates in the Kubernetes ecosystem:

IBM Acquires Kubecost
Hosted by Mofi Rahman at [00:40]
IBM has acquired Kubecost, a startup specializing in Kubernetes cost management and optimization. Kubecost is widely used by companies like Allianz, Audi, Rakuten, and GitLab. The acquisition aims to integrate Kubecost with IBM’s existing acquisitions, Apptio and Turbonomic, to enhance cost and performance optimization without disrupting current services.
KubeCon Japan Announcement
Hosted by Kaslin Fields at [01:12]
For the first time, KubeCon will be held in Japan in 2025, organized together with Cloud Native Community Japan. The event is expected to feature over 100 sessions and attract more than 1,000 attendees, although specific dates and locations are yet to be announced.
Artifact Hub Becomes a CNCF Incubating Project
Hosted by Kaslin Fields at [02:04]
Artifact Hub, a web application for discovering, installing, and publishing cloud-native packages and configurations, has joined the CNCF as an incubating project. It simplifies the discovery of artifacts like Helm charts, providing a centralized platform for users to find and publish cloud-native resources.
OpenMetrics Merged into Prometheus
Hosted by Mofi Rahman at [02:30]
OpenMetrics has been archived and integrated into Prometheus. This merger signifies the consolidation of metrics standards under Prometheus’ umbrella, ensuring continuity and improvement in metric collection and usage.
Kubecolor Enhancements
Hosted by Kaslin Fields at [02:52]
Kubecolor, an open-source wrapper that colorizes kubectl output, has been updated to version 0.4.0 by contributors like PruneDebastian Thomas and Lennartac. The new release adds colorful highlighting to outputs and improved paging for lengthy outputs, making results easier to read.

Guest Introductions

David Xia
Introduced by Mofi Rahman at [03:35]
David Xia is a Senior Engineer on Spotify's ML Platform team. He has played a pivotal role in building and operating Spotify’s centralized Ray platform, which makes it easy to prototype and scale machine learning workloads. He previously worked on Spotify’s core infrastructure and deployment tooling.

Avin Regmi
Introduced by Mofi Rahman at [04:01]
Avin Regmi is an Engineering Manager at Spotify, leading the ML Training and Compute team for the Hendrix ML platform. With expertise in training and serving ML models at scale, ML infrastructure, and team development, Avin previously led the ML platform team at Bell AI Labs and founded Panini AI, a cloud solution for low-latency ML model serving.

Main Discussion

1. Spotify’s ML Platform Overview

David Xia explains that Spotify's ML platform serves as an infrastructure layer that abstracts away the complexities of Kubernetes and gives internal ML practitioners access to computational resources.
"It's the infrastructure layer on which our users, most of them, all of them are internal. Other Spotify employees, like AI researchers, ML practitioners, those users actually use it to do the actual application."
(05:23)

Avin Regmi adds that the definition of an ML platform varies with an organization's size and needs, so the platform has to adapt to them.
"The need for the ML platform also changes from Org to Org depending on the business case and at what scale you're operating at."
(07:36)

2. Evolution of the ML Platform

The platform initially relied on Kubeflow and TensorFlow, but as technology advanced, Spotify expanded support to include Ray and PyTorch to accommodate diverse ML workloads, including generative AI and NLP applications.

David Xia notes the shift from each team building its own ML infrastructure to a centralized platform that improves productivity.
"They had to roll their own from the model architecture all the way down to how do we get compute resources...then the ML platform team...started trying to build common infrastructure for all these use cases."
(08:51)

Avin Regmi reflects on the industry's transition from TensorFlow-centric approaches to incorporating Ray and PyTorch, driven by evolving modeling techniques.
"There were use cases, business driven use cases that kind of allowed us to more invest in that side as well."
(22:05)

3. Decision to Use Kubernetes and Ray

Spotify chose Kubernetes (GKE) as the foundation for their ML platform due to existing expertise and the advantages Kubernetes offers in scalability and resource management.

David Xia explains that the choice was driven by how much easier it was to deploy Ray on Kubernetes than on plain VMs.
"It was pretty simple that we were able to just get started a lot faster with Kubernetes...we didn't want to have to build more things ourselves on top of just plain VM."
(14:14)

Avin Regmi emphasizes that Kubernetes’ dynamic resource handling aligned well with the complex, multi-tenant requirements of their ML workloads.
"Kubernetes is really good at...multitenant user, different scale up, scale down requirements..."
(25:21)

4. Onboarding Journey for New Users

Spotify's Hendrix ML platform offers a streamlined onboarding process through namespaces, Hendrix SDK, and Workbench, allowing users to start with default settings and progressively access more advanced configurations as needed.

Avin Regmi describes the onboarding steps:
"Users can create a namespace, use the Hendrix SDK to provision a Ray cluster, and start with notebooks or submit jobs via CLI."
(16:23)

David Xia reinforces the principle of progressive disclosure, so that users are not confronted with Kubernetes complexity at the start.
"We don't want or expect our users to know how to use Kubernetes in order to get access to lots of Hardware, accelerators or CPUs."
(21:10)
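
The idea of progressive disclosure described above can be illustrated with a small sketch. This is not the Hendrix SDK, which is internal to Spotify; the class name `ClusterRequest`, its fields, and its default values are all hypothetical. The point is only that a new user supplies one required value, while advanced users can override defaults or reach a raw escape hatch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterRequest:
    """Hypothetical request for a Ray cluster (illustrative, not a real SDK type)."""

    namespace: str             # the only value a new user must supply
    workers: int = 2           # assumed default, chosen for illustration
    cpus_per_worker: int = 4   # assumed default
    gpus_per_worker: int = 0   # assumed default: no accelerators
    # Advanced escape hatch for users who do know Kubernetes.
    pod_overrides: Dict[str, str] = field(default_factory=dict)

    def summary(self) -> str:
        """Describe the total resources this request would ask for."""
        cpus = self.workers * self.cpus_per_worker
        gpus = self.workers * self.gpus_per_worker
        return f"{self.namespace}: {self.workers} workers, {cpus} CPUs, {gpus} GPUs"


# A beginner relies entirely on defaults.
beginner = ClusterRequest(namespace="team-a")

# An advanced user discloses more knobs only when needed.
advanced = ClusterRequest(
    namespace="team-a",
    workers=8,
    gpus_per_worker=1,
    pod_overrides={"nodeSelector": "gpu-pool"},
)

print(beginner.summary())
print(advanced.summary())
```

Both users go through the same entry point; the second simply sets more fields, which is one way a platform might keep the simple path simple without blocking expert use.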

5. Resource Scheduling and Multi-tenancy

Managing a shared Kubernetes cluster with thousands of nodes requires effective resource scheduling to ensure fairness and efficiency.

David Xia discusses Spotify’s approach of using Kubernetes namespaces and resource quotas to manage multi-tenancy:
"We use Kubernetes resource quotas...subject to approval by our team to check that they're requesting like a sane amount."
(27:32)

Avin Regmi mentions the use of multiple clusters for different workload sizes and the vision of hiding cluster management from end users:
"Our vision is that as a user, I don't have to worry about context switching between clusters."
(29:48)
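
As a rough sketch of the namespace-plus-quota approach mentioned above, the following builds a Kubernetes `ResourceQuota` manifest as a plain Python dict and applies a simple sanity check before approval. The thresholds `MAX_CPU` and `MAX_GPU`, the function names, and the quota naming scheme are assumptions made for illustration; the episode summary does not describe how Spotify's review actually works beyond a human check for a "sane amount".

```python
import json

# Assumed ceilings above which a request gets extra scrutiny (hypothetical values).
MAX_CPU = 512
MAX_GPU = 16


def build_quota(namespace: str, cpu: int, memory_gib: int, gpu: int) -> dict:
    """Return a Kubernetes ResourceQuota manifest for one team namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),
                "requests.memory": f"{memory_gib}Gi",
                # Extended resources such as GPUs are limited via the requests. prefix.
                "requests.nvidia.com/gpu": str(gpu),
            }
        },
    }


def needs_closer_review(cpu: int, gpu: int) -> bool:
    """Flag requests that exceed the assumed 'sane amount' thresholds."""
    return cpu > MAX_CPU or gpu > MAX_GPU


quota = build_quota("team-a", cpu=64, memory_gib=256, gpu=4)
print(json.dumps(quota, indent=2))
print("closer review needed:", needs_closer_review(cpu=64, gpu=4))
```

The manifest could then be applied to the cluster with the usual tooling; the check only models the human approval step as a threshold, which is a simplification.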

6. Challenges in Building ML Platforms

Building an ML platform involves balancing rapid technological advancements with the necessity for stable, user-friendly infrastructure.

Avin Regmi highlights the challenge of keeping the platform stable while the ML landscape changes quickly:
"Navigating the ML domain that's constantly coming with new tools and technologies, while providing stability to not break user code."
(30:30)

David Xia adds that actionable error messages, progressive disclosure, managing tech debt, and catering to users with different levels of expertise all matter:
"Actionable error messages are very important...progressive disclosure...keeping track of your tech debt is really important."
(34:03)
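
One possible reading of "actionable error messages" is an error that states what failed, why, and what the user can do next. The sketch below is an illustration under that assumption; `ActionableError` and `check_gpu_request` are hypothetical names and do not come from the episode or from any Spotify tooling.

```python
class ActionableError(Exception):
    """An error that says what failed, why, and what to try next."""

    def __init__(self, what: str, why: str, fix: str):
        self.what, self.why, self.fix = what, why, fix
        super().__init__(f"{what}\n  Cause: {why}\n  Try:   {fix}")


def check_gpu_request(requested: int, quota: int) -> None:
    """Raise an actionable error when a request exceeds the namespace quota."""
    if requested > quota:
        raise ActionableError(
            what=f"Cannot schedule {requested} GPUs",
            why=f"the namespace quota is {quota} GPUs",
            fix="request fewer GPUs or ask the platform team for a quota increase",
        )


try:
    check_gpu_request(requested=8, quota=4)
except ActionableError as err:
    print(err)
```

Compared with a bare scheduling failure, the message gives the user a next step without requiring them to know how Kubernetes quotas work.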

7. Future Wishlist and Improvements

The guests discuss potential enhancements to the Hendrix platform, focusing on improving user experience, flexibility, and performance.

David Xia outlines areas for improvement:

Integrated Debugging: Enhancing the debugging process for production workloads.
Framework Agnosticism: Making the SDK more flexible to support various ML frameworks beyond PyTorch.
Optimized Software Artifacts: Reducing image sizes and improving startup times.
"Everything should just be faster...the virtual Environment is like many, many gigs big."
(37:54)

Avin Regmi mentions plans to streamline the transition from experimentation to production and to abstract infrastructure complexity further:
"As a platform team, we aim to abstract away the necessary infrastructure and let users focus on their models."
(40:30)

8. Enhancing Local Development Experience

Spotify addresses the challenges ML engineers face in setting up local development environments by integrating Workbench, a cloud-based IDE that simplifies environment setup and provides seamless access to Ray capabilities.

David Xia explains the partnership with the cloud developer experience team to introduce Workbench:
"You can just click something and it would open up something in your browser and you could just start coding."
(42:18)

Conclusions

The episode concludes with reflections on Spotify’s journey in building a robust ML platform on Kubernetes and Ray. The hosts and guests emphasize the importance of progressive disclosure, user-centric design, and balancing innovation with stability. Spotify’s experience serves as a valuable case study for organizations aiming to develop scalable, efficient, and user-friendly ML platforms.

Key takeaways include the strategic use of Kubernetes for managing complex workloads, the integration of Ray for scalable ML tasks, and the continuous adaptation to evolving ML technologies while maintaining a stable platform for users.

Notable Quotes

David Xia at [05:23]:
"It's all built on top of Kubernetes. So we don't want or expect our users to know how to use Kubernetes in order to get access to lots of Hardware, accelerators or CPUs or certain nitty gritty details of in this case Ray."
Avin Regmi at [07:36]:
"The need for the ML platform also changes from Org to Org depending on the business case and at what scale you're operating at."
Kaslin Fields at [15:14]:
"Kubernetes is a great solution, but it's not a perfect solution for everything."
David Xia at [21:10]:
"We use Kubernetes namespaces. Each team starts off with a namespace... It is a combination of both human and technology features."
Avin Regmi at [30:30]:
"From an end user's perspective, how do we abstract away all the necessary infrastructure and just kind of have them focus on their model."
David Xia at [34:03]:
"Actionable error messages are very important...progressive disclosure...keeping track of your tech debt is really important."
Avin Regmi at [37:54]:
"We're hoping that the work with DWS not only will make it more cost efficient...so that our users aren't blocked."

Key Insights and Conclusions

Centralized vs. Decentralized ML Platforms: Spotify’s transition from individual teams building their own ML infrastructure to a centralized platform significantly enhanced productivity and consistency across ML projects.
Kubernetes as a Foundation: Leveraging Kubernetes allowed Spotify to manage complex, multi-tenant ML workloads effectively, benefiting from its scalability and resource management capabilities.
Ray for Scalable ML: Integrating Ray enabled Spotify to handle both inference and batch workloads efficiently, supporting diverse ML use cases like generative AI and NLP.
User-Centric Design: Emphasizing progressive disclosure ensures that users can start with simple interfaces and gradually access more advanced features as needed, accommodating varying expertise levels.
Balancing Innovation and Stability: Maintaining platform stability amidst rapid advancements in ML technologies is crucial. Spotify achieves this by carefully managing tech debt and providing actionable error messages.
Future Enhancements: Spotify aims to further abstract infrastructure complexities, enhance debugging capabilities, and optimize software artifacts to improve user experience and platform performance.

This comprehensive discussion offers valuable insights into building and evolving ML platforms using Kubernetes and Ray, highlighting Spotify’s strategies and lessons learned. It serves as an informative guide for organizations embarking on similar journeys in the AI and machine learning landscape.