
A
We were able to cut almost 76% of our job costs as a result of this migration. 76%! It's phenomenal. I mean, for the engineers out there, we were able to cut down the number of cores required by 62%, and the memory footprint we dropped by about 80%. Phenomenal results. The results speak for themselves.
B
Welcome to the Nvidia AI Podcast. Hi, I'm Noah Kravitz. I'm here with Prudvi Vatala. Pru is the head of engineering platforms at Snap, and we're here to talk about data processing and in particular, how a social platform with more than 940 million active users accelerated their data pipeline. Pru, welcome to the Nvidia AI Podcast. Thanks so much for taking the time to join us.
A
Yeah, thanks for having me here, Noah.
B
So maybe we can start with the basics. Tell us a little bit about, well, about what Snap is now. I'm old, but I still think of it as, you know, the Snap glasses and everything. But Snapchat, obviously, is a huge social platform. So maybe tell us a little bit about Snap, and then your role there.
A
Absolutely, yeah. I mean, Snapchat at this point is pretty much a household name, and Snap is the company. It's interesting that you bring up the Spectacles, because Snap as a company believes that the camera is at the center of improving how people communicate and improving their lives in the digital world, so to speak. We've been steadfast in that belief. Snap right now is at the intersection of augmented reality, AI, and visual communication, like you said, serving close to a billion monthly active users. I've been at Snap for a while now, and I lead a multifaceted organization: a little bit of it has to do with big data infrastructure, a little bit with developer productivity, and a little bit with enterprise AI and whatnot.
B
And so when we talk about accelerating data processing, what does that mean to you? What does that mean for Snap? And thinking about the scale that you operate on, just talk a little bit about what it means to accelerate data at that level.
A
Absolutely. That's a great question. As you can imagine, with as many users as we have, and Snapchat in particular being a very complex application, you can imagine the scale at which we operate, especially on the data processing side. My team's experimentation platform alone deals with 10-plus petabytes each day. It's a massive scale.
B
It's a huge scale.
A
Then we have a strict SLA in the morning, because experimentation results need to be ready for developers, product managers, and data scientists to act on as early as possible, so that they can take appropriate action. For us, accelerating data processing basically means, instead of throwing more and more CPUs at the problem, figuring out a way to flatten that scale curve. In this particular scenario, it was about figuring out how to leverage GPUs to improve our workloads: making sure they run faster and cheaper, and scale linearly or sublinearly, unlike right now, where growth is definitely superlinear with feature areas. That's what accelerating means to us.
B
You mentioned experimentation. What does that mean when you're conducting experiments at Snap? What does that look like? And then maybe, how does it fit in? Is that where the 10 petabytes of data each morning comes from? Or we can talk about that.
A
Yeah, absolutely. So this 10 petabytes of data is only the experimentation platform; big data across Snap is far wider. So experimentation, it's a little bit about Snap's product philosophy. We believe that experimentation, safety, and privacy are core pillars of our product development and iteration. When we are thinking about new product areas, when we are shipping new product features to our half a billion daily active users across the globe, we need to think about how users are receiving it, how they're responding to it, how they're using it, whether or not it's adding value to their daily lives. And we also need to guardrail things: is it regressing their performance, is it causing their devices to slow down? We need to be very particular about protecting their experiences as well.
B
And so, Pru, along those lines, with the experimentation, can you talk a little bit about the importance of A/B testing?
A
So A/B testing, you know, the concept of randomized controlled trials has been around for a long time, especially in the clinical fields and whatnot. But with the digital revolution, it has become the mode of bringing statistical rigor to decision making at scale. That's what A/B testing adds for us. When we are dealing with this massive user base that is diverse by nature, from all walks of life across the globe, and we are trying to delight them, trying to bring experiences to them, we need to make sure what we are delivering is buttoned down, that it's actually adding value the way we think it is. At this scale, a lot of things can happen. That's where statistical rigor, grounded in holdouts, well-defined controls, and statistical methods, comes in. Over the years, my team has added a bunch of statistical methods to our platform. Heterogeneous-treatment-effect detection, for example: you may think a feature is performing well for the global audience, but it may not perform so well for a subset, so figuring out those heterogeneous effects is one thing we focus on. And at this scale, no matter how you slice your experiments, you're still allowing some bias to seep in, as in, some power users may end up on one side of the experiment rather than the other. So how do we make sure the distributions are evened out when the experiment results are read? That's the variance-reduction aspect, something my team built over time. Then sometimes, when we ship a feature, if people don't like it, they might even just stop showing up. That's the sample size mismatch problem, and we handle that rigorously too. So that's what A/B testing brings to the table.
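To make that last guardrail concrete: what Pru calls the sample size mismatch problem is often known as sample ratio mismatch (SRM), and a common way to detect it is a chi-squared test on assignment counts. Here is a minimal Python sketch; the counts, split, and threshold are hypothetical, and this is not Snap's implementation.

```python
# Minimal sketch of a sample-ratio-mismatch (SRM) check. Counts and the
# alpha threshold are hypothetical; this is not Snap's implementation.
from scipy.stats import chisquare

def srm_check(control_users: int, treatment_users: int,
              expected_split: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if observed assignment counts match the intended split."""
    total = control_users + treatment_users
    expected = [total * expected_split, total * (1 - expected_split)]
    stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value >= alpha  # a tiny p-value means users "went missing" from one arm

# Example: users disliking a feature may silently drop out of one arm.
print(srm_check(501_223, 498_777))  # True: split looks healthy
print(srm_check(520_000, 480_000))  # False: investigate before trusting results
```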
B
So with all of the data processing every day, what made you think that maybe some Nvidia tech put into the stack might help things out? How did that process start? And maybe you can talk about, you know, what you've integrated and what you're using.
A
Absolutely. So I'm really proud of this, I'm really proud of my team, because over the years that I've been overseeing our platform, the number of users grew, Snap ballooned in terms of footprint, and the features we shipped, like Spotlight, AR Lenses, and all the AI features in the recent past, kept adding dimensions to the platform. My team was hard at work making sure we scaled appropriately even as all of that grew, and they've done a very good job of it historically, for years now: keeping costs flat, keeping performance predictable, meeting the SLAs and whatnot. And then we came across Nvidia Spark Rapids in a blog post, and we saw Nvidia was shipping a solution to speed up PySpark workloads, citing numbers like 3.6x the performance and 50% reductions in runtime. It was phenomenal on paper. That's what drew us to it.
B
The numbers sound good. I'm waiting to hear the rest.
A
Yeah, so we read those and we got super excited. Our stack was, and still is, entirely Google Cloud for the experimentation platform. We loved working with them; Google Cloud Dataproc was phenomenal, and they've been a fantastic partner to us throughout the scaling journey.
B
Good to hear.
A
Yeah. And when this news came out about Spark Rapids, we wanted to try it out, so we did a bunch of benchmarking. Obviously, like I said, we do a lot of things, so there's a lot of complexity to the nature of the jobs we run, and we had to benchmark each kind of job: jobs that are heavy with joins, repartitions, and shuffling that moves data around, versus jobs that purely union data from various places, versus jobs that purely aggregate, running sums and whatnot. We noticed that even on Google Dataproc with Spark Rapids, we got about, I want to say, a 3x-plus improvement for the join jobs, close to 2x for the union jobs, and a little over 1.5x for aggregations; that's largely because CPUs are already good at aggregation. And the other thing is, GPUs by nature support parallelism and have high-bandwidth memory on the hardware itself. So that made it a very good candidate for us to pursue.
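For context on the benchmarking setup: the RAPIDS Accelerator for Apache Spark is enabled through Spark configuration rather than job code. A minimal PySpark sketch of a join-heavy job under that assumption; the plugin jar is assumed to be on the cluster, and the resource amounts and paths are illustrative, not Snap's settings.

```python
# A minimal sketch of enabling the RAPIDS Accelerator via configuration.
# Versions, resource amounts, and gs:// paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("experimentation-join-benchmark")
    # Load the RAPIDS Accelerator plugin (shipped as a jar on the cluster).
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # One GPU per executor, shared across tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# The job itself is ordinary Spark SQL; joins and shuffles are the kind of
# operations that saw roughly 3x speedups in the benchmarks Pru mentions.
left = spark.read.parquet("gs://bucket/events/")        # hypothetical path
right = spark.read.parquet("gs://bucket/assignments/")  # hypothetical path
joined = left.join(right, on="user_id", how="inner")
joined.write.parquet("gs://bucket/joined/")
```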
B
And so you're running your GPU-accelerated pipelines on Google Kubernetes Engine, is that right?
A
Yes, yes. It has been a very interesting journey, from testing out our pipelines with Dataproc GPUs to where we are today. One other thing about Spark Rapids I want to mention: we didn't have to change a single thing about how we ran the jobs. That was the beauty of it. No changes at all.
B
Oh, it's amazing.
A
Zero code changes. I'm into developer productivity and developer enablement, so for me that was music to my ears.
B
Of course.
A
So that was very impressive. With Dataproc, which abstracts out the Spark runtime for us, and Spark Rapids, which didn't require us to change the jobs, it was phenomenal.
B
Yeah, amazing.
A
So it went very well, and we wanted to productionize it. At our scale, pipelines aren't just monolithic; we do a bunch of sharding and then batching of work. So we migrated one shard to production on Google Dataproc using 300 GPUs, and the results were phenomenal. In the next phase, we wanted to migrate 10 shards of our 50-plus-shard architecture, and that needed about 3,000 GPUs, which was still doable with Dataproc on-demand GPUs, because GPU capacity is on everybody's mind these days, right? So that was well and good, but we didn't have a path forward after that; we kind of hit a roadblock with on-demand GPU capacity. So we had to get creative.

We started looking around: where at Snap do we have GPU capacity that we can borrow? And that's where the real insight came for us. Snap has a global audience, and Snapchatter behavior is cyclical: during the day, people wake up and use Snapchat, and when they go to bed, they don't. What that meant was that when some of our biggest markets went to bed, a lot of our online inference GPU capacity was sitting idle, somewhere between 1am and 5am. That was our opening, our opportunity to go tackle.

And that brought its own set of complexity, because the online serving stack is not built for batch data processing; they were considered fundamentally different worlds. All the online GPUs were tied to Kubernetes and GKE, and since we were already on Google Cloud, GKE wasn't an issue for us at all; it was actually very welcome. But we had to migrate our workloads to a Kubernetes-based Spark runtime and host it on GKE so that we could leverage what the online GPUs had to offer. And for that we had to build a data platform from the ground up, because it's one thing for my team to just use this idle capacity, but at Snap we wanted to make sure that even as the online need for GPUs increased, as our AI footprint increased, any team at Snap could still leverage that capacity for any of their needs, as available. We also had to acknowledge that if a user wanted to see fresh Spotlight content, that supersedes the GPU need for experimentation. Preemption had to be built in, so if we had a sudden spike in traffic, we had to give up GPU capacity. With all of that in mind, we built out a platform from the ground up and started migrating. We had a lot of blockers along the way, and the team got really creative. It was a phenomenal journey.
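To sketch the borrowed-capacity idea in code: run batch shards on inference GPUs only inside the overnight lull, and yield immediately when online traffic needs them. Everything here, the window, the `gpu_pool` object, and its methods, is a hypothetical stand-in for the platform Pru's team built, not its actual API.

```python
# A sketch of borrowing idle inference GPUs for batch shards. The window,
# the `gpu_pool` object, and its methods (submit_gpu, submit_cpu,
# online_traffic_spiked, cancel) are hypothetical placeholders; only the
# scheduling-with-preemption idea is the point.
from datetime import datetime, time
from zoneinfo import ZoneInfo

IDLE_WINDOW = (time(1, 0), time(5, 0))  # roughly 1am-5am, when big markets sleep

def in_idle_window(now: datetime) -> bool:
    start, end = IDLE_WINDOW
    return start <= now.time() < end

def run_shard(shard_id: int, gpu_pool) -> None:
    """Run one experimentation shard, preferring borrowed GPUs."""
    now = datetime.now(ZoneInfo("America/Los_Angeles"))
    if not in_idle_window(now):
        gpu_pool.submit_cpu(shard_id)      # outside the lull: don't compete
        return
    job = gpu_pool.submit_gpu(shard_id)    # batch pods run at low priority...
    if gpu_pool.online_traffic_spiked():   # ...so online inference always wins
        job.cancel()                       # preemption: give the GPUs back
        gpu_pool.submit_cpu(shard_id)      # and fall back to CPU capacity
```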
B
Amazing. Yeah, yeah, yeah. And so you're also running an accelerated Apache Spark pipeline.
A
Yes, yes. At a high level, our pipelines are split into daily and hourly cadences. Hourly is mostly for guardrailing; like I said, we don't want to break users' experience no matter what, and having that hourly feedback cycle goes a long way toward that. Then we also have daily pipelines, which serve as the statistical authority for decision making. Our first migration to GKE plus Nvidia Spark Rapids was the hourly pipeline, because speed mattered there far more. So we migrated and operationalized it, and during that process we ran into a few corner cases. If GPU capacity wasn't available at 11am, when everybody was active on Snap, what do we do? We had to figure out how to gracefully fall back from GPUs to CPUs. And then, if the shared GKE resources themselves were the constraint, we had to gracefully fall back from CPUs to Dataproc clusters. So building all of that with operational reliability in mind was also great.
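The fallback chain is easy to picture as a tiered retry loop: GPUs on shared GKE first, then CPUs on GKE, then a dedicated Dataproc cluster. A minimal sketch, with hypothetical runner methods standing in for real submission logic.

```python
# A minimal sketch of the graceful-fallback chain Pru describes. The
# `pipeline` runner methods are hypothetical stand-ins for real submit
# logic; the tier ordering and error handling are the point.
class CapacityUnavailable(Exception):
    """Raised when a tier cannot schedule the job (e.g., no idle GPUs at 11am)."""

def run_hourly_pipeline(pipeline) -> str:
    tiers = [
        ("gke-gpu", pipeline.run_on_gke_gpu),    # fastest when capacity is idle
        ("gke-cpu", pipeline.run_on_gke_cpu),    # shared CPU pool on GKE
        ("dataproc", pipeline.run_on_dataproc),  # dedicated cluster, last resort
    ]
    for name, run in tiers:
        try:
            run()
            return name          # report which tier actually served the run
        except CapacityUnavailable:
            continue             # degrade gracefully to the next tier
    raise RuntimeError("all tiers exhausted; page the on-call")
```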
B
Yeah. Looking back on it, what learnings would you share? Say there's a listener out there who's embarking on a similar project, or who, like you said, has a daily cycle of when their GPUs are in use for inference and when they're not, and is thinking about borrowing GPUs from other parts of the company. What would you share from this whole process? Is there a big takeaway, something that surprised you?
A
Right, right. The direction that Nvidia is headed in is phenomenal for these kinds of needs. Nvidia Spark Rapids, like I said: zero code written, zero code changed to enable it. We had to figure out the image building, the environment differences and whatnot, and the testing cycles; obviously, any production workload needs to go through a rigorous rollout process, so everybody needs to pay attention to that. But this is a real possibility, the direction Nvidia is going. The other thing Nvidia offered that really helped us a lot was Project Aether. It's another solution that gives us Spark tuning out of the box, because especially when we had this fallback mechanism in place, where we had to go from GPUs to CPUs to Dataproc, the environments are different and the Spark parameters had to be different. So something like Aether giving us a starting point and making sure the tuning stayed consistent across all of these variants was also very helpful.
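One low-tech way to keep tuning consistent across those environments, separate from what Project Aether automates, is to derive every tier's Spark conf from a single shared base plus per-tier overrides. A sketch with illustrative values, not recommendations and not Snap's configuration.

```python
# Per-tier Spark tuning kept consistent and reviewable: a shared base conf
# plus explicit overrides per environment. All values are illustrative.
BASE_CONF = {
    "spark.sql.shuffle.partitions": "2000",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

TIER_OVERRIDES = {
    "gke-gpu": {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": "1",
    },
    "gke-cpu": {"spark.executor.memory": "16g"},
    "dataproc": {"spark.executor.memory": "24g"},
}

def conf_for(tier: str) -> dict:
    """Merged Spark conf for a tier; the base stays the single source of truth."""
    return {**BASE_CONF, **TIER_OVERRIDES[tier]}
```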
B
So you've mentioned, obviously, the work with Nvidia, and Google Cloud as well. Taking a step back, bigger picture: what are these partnerships, working hand in hand so closely with Google Cloud and with Nvidia, doing to the way that you and Snap see your roadmaps for both data and AI going forward?
A
Yeah. I mean, huge props to the Nvidia team and the Google Cloud team. Honestly, it's been a phenomenal three-way partnership like I've never seen in my career before, and the impact speaks for itself. We were able to cut almost 76% of our job costs as a result of this migration. 76%! It's phenomenal. I mean, for the engineers out there, we were able to cut down the number of cores required by 62%.
B
Amazing.
A
The memory footprint, we dropped by about 80%. And for the Spark nerds out there, we were able to cut out almost 120 terabytes of disk and memory spill from our pipelines.
B
Wow.
A
It just vanished once we started doing all of this, and that's one of the biggest headaches any data pipeline at scale runs into. So, phenomenal results; the results speak for themselves. Without the partnership, this would not have been possible on the timescale it happened: migrating a production pipeline handling 10-plus petabytes from prototyping and exploration to full production in about eight to nine months is phenomenal. And without the continuous back and forth, the knowledge sharing, and the partnership across these three companies, it wouldn't have been possible.
B
That's great.
A
Yeah. And in terms of the roadmap, it definitely had an impact. Like I said, my team built this data platform from the ground up to enable any team at Snap to leverage the GPU capacity and what the Nvidia libraries have to offer, and we are already seeing movement with it. Even my own team has started migrating other things we hadn't tried so far and experimenting with them. Because even if we don't have idle capacity to fit all of our workloads all the time, if we can schedule things creatively and move things around, we can maximize the capacity as much as we can. And a lot of other teams are also picking this up.
B
Yeah, it's fantastic. So you've been at Snap for eight years, is that right? Seven?
A
Close to eight, yes.
B
And Snap's been around for about 15 years, give or take. Working at a huge social media platform over this span of time, where social media has become such a core part of the fabric of so many people's lives, what's it been like to be at Snap and to see the changes? As I said at the beginning, I remember the Spectacles; that's my first thought of Snap, and obviously now Snapchat, same lineage, same philosophy, different product. But what's it like to have seen the evolution of social media, and also so many technological changes that impact what you're able to do and how you do it, as you were just describing? What's it been like from the inside?
A
Yeah, it's been an unbelievable experience, Noah. That's what gets me up in the morning every day. In the visual communication and AR landscape, Snap has had a massive impact on the planet, honestly, and having a direct role to play in that is a great feeling. I've seen the company grow from camera messaging, picture messaging, to what it is today. Stories is a format we invented, and the whole world, including some newspapers, picked it up. And to your point about Spectacles, we did it before anybody else was even thinking about it. The company is innovative; we come up with so many new things, and running platforms inside means I have to figure out a way to enable all of this even as the company evolves. Having a front-row seat to that evolution, and playing a big part in it, has been very fulfilling.
B
Fantastic. Pru, for listeners and viewers out there who haven't used Snapchat before, for anyone who wants to get the experience, but also to learn more about Snap and maybe some of the technical work you're doing: obviously there's the website and social media. Is there a research blog? Where can people go?
A
Absolutely. We have an engineering blog that's pretty active; we share a lot of the phenomenal work that engineers in the company are doing. We also participate in events like this one and share our knowledge with the world. And Snapchat, if you haven't used it, you should definitely give it a try. It's different from social media.
B
True story: I got a Snap from my younger son maybe 45 minutes before we sat down to do this, and it made my day. So absolutely, if you haven't, give it a try. Prudvi Vatala, thank you so much. This has been a great conversation, and I'm sure the developers and engineers in the audience have taken a lot from it. Thank you for taking the time to join us, and all the best to you and everybody at Snap. Keep changing the world for the better.
A
Thank you so much, Noah. Thanks for having me. Appreciate it.
Date: May 13, 2026
Host: Noah Kravitz
Guest: Prudvi Vatala (“Pru”), Head of Engineering Platforms at Snap
This episode dives deep into how Snap, the parent company of Snapchat, processes over 10 petabytes of data daily. Pru shares how Snap leveraged NVIDIA’s GPU-accelerated Spark (Spark Rapids on Dataproc and GKE) to dramatically speed up data processing, increase cost-effectiveness, and drive innovation for nearly a billion monthly active users. The conversation highlights data engineering at scale, the evolution of experimentation, and the transformative impact of partnerships with NVIDIA and Google Cloud.
Encouragement for Listeners:
Snap’s story exemplifies how technological creativity, powerful partnerships, and the right tools can fundamentally transform operations at internet scale—delivering faster, cheaper, and smarter outcomes for billions of daily interactions.