AWS Podcast Episode #724: Accelerated Computing – From Fraud Detection to AI Innovation
Released on June 9, 2025
The 724th episode of the AWS Podcast delves into the transformative world of accelerated computing, exploring its pivotal role in powering advanced AI and machine learning (ML) applications across various industries. Hosted by Shruti Koparkar, the episode features insightful discussions with two AWS experts, Ray and Sudhir Kaldendi, who shed light on the challenges, architectural considerations, and real-world applications of GPU-accelerated computing on AWS.
Introduction to Accelerated Computing
Shruti Koparkar opens the episode by introducing the concept of accelerated computing on AWS, emphasizing its significance in AI and ML workloads. Accelerated computing leverages powerful hardware like Nvidia GPUs and AWS’s proprietary AI chips—Trainium and Inferentia—to enhance computational performance for complex tasks.
Segment 1: Accelerated Computing with Ray
Ray’s Role and Expertise
[00:00] Shruti Koparkar: "Hello everyone and welcome to another episode of the AWS Podcast... Today we are going to dive into a couple of different use cases for accelerated computing."
[01:06] Ray: "I'm a container specialist, solutions architect... My primary role is to now work with customers that are trying to build these massive systems on Kubernetes... especially for machine learning and generative AI solutions."
Ray, an experienced solutions architect at AWS, specializes in container orchestration and helps customers architect scalable ML applications using Kubernetes and Amazon EKS (Elastic Kubernetes Service).
Key Challenges: Maximizing GPU Utilization
A significant challenge Ray highlights is maximizing GPU utilization. GPUs are far more expensive than CPUs, so underutilization translates directly into wasted spend. Ray states:
[05:37] Ray: "The biggest challenge... is that the GPUs... are very expensive. Anytime you're using the GPU you need to ensure that you're maximizing its usage."
He points out that applications often fail to fully utilize GPU capacity, sometimes achieving only 30% utilization, which is inefficient given the high costs associated with GPU resources.
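The economics behind the 30% figure can be made concrete with a simple calculation. This is a hedged illustration only: the hourly rate is an assumed placeholder, not a real AWS price, and `effective_cost_per_useful_hour` is a hypothetical helper, not anything mentioned in the episode.

```python
# Illustrative only: why low GPU utilization is costly.
# The $32/hour rate is an assumed placeholder, not an actual AWS price.
def effective_cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Cost paid per hour of *useful* GPU work at a given utilization (0-1)."""
    return hourly_rate / utilization

full = effective_cost_per_useful_hour(32.0, 1.0)   # fully utilized
low = effective_cost_per_useful_hour(32.0, 0.30)   # 30% utilization, as in the episode
print(f"${full:.2f} per useful hour at 100% vs ${low:.2f} per useful hour at 30%")
```

At 30% utilization, every useful GPU-hour effectively costs more than three times the list rate, which is the inefficiency Ray is warning about.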
Architectural Considerations for GPU Workloads
Ray discusses critical architectural considerations for deploying GPU-accelerated workloads on EKS:
- Storage: Utilizing fast, distributed storage solutions like Amazon FSx for Lustre to ensure rapid data access and minimize latency.
[13:51] Ray: "FSx for Lustre is a great service for anyone that's looking to do distributed training... It allows hundreds of instances to read simultaneously."
- Networking: Ensuring low-latency connections between GPUs using Elastic Fabric Adapter (EFA) to facilitate efficient data exchange during distributed training.
- Resource Management: Balancing cost and performance by selecting appropriate GPU types based on workload requirements.
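The resource-management point above can be sketched as a selection problem: pick the cheapest instance that satisfies the workload's requirements. This is a minimal illustration; the instance names, memory sizes, and prices below are invented for the example and do not correspond to real AWS instance types.

```python
# Hypothetical sketch: pick the cheapest GPU instance that meets a workload's
# GPU memory requirement. Catalog entries are illustrative, not real AWS SKUs.
CATALOG = [
    {"name": "small-gpu",  "gpu_mem_gib": 16, "hourly": 1.0},
    {"name": "medium-gpu", "gpu_mem_gib": 40, "hourly": 4.0},
    {"name": "large-gpu",  "gpu_mem_gib": 80, "hourly": 12.0},
]

def pick_instance(required_gpu_mem_gib: int):
    """Return the cheapest catalog entry with enough GPU memory, or None."""
    candidates = [i for i in CATALOG if i["gpu_mem_gib"] >= required_gpu_mem_gib]
    return min(candidates, key=lambda i: i["hourly"]) if candidates else None

print(pick_instance(24)["name"])  # a 24 GiB model fits the 40 GiB instance
```

Real placement decisions also weigh interconnect bandwidth, availability, and spot pricing, but the cost-versus-capability trade-off is the core of it.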
Customer Success Story: Rivian
Ray shares how Rivian, an automotive technology company, leverages accelerated computing on AWS:
[15:24] Ray: "Rivian has built a stack on top of AWS Data on EKS... They use the JARK stack—Jupyter, Argo, Ray on Kubernetes—to run distributed jobs efficiently."
Rivian utilizes Argo Workflows and Ray to manage and orchestrate large-scale ML tasks, ensuring high GPU utilization and streamlined workflow management. By optimizing job scheduling and co-locating related tasks, Rivian maximizes the performance and cost-effectiveness of their GPU resources.
Nvidia NIMS and Karpenter Integration
[19:37] Shruti Koparkar: "Nvidia has launched NIMS... How do those work with EKS?"
[20:13] Ray: "NIMS addresses maximizing GPU efficiency by providing optimized containers that adapt to different hardware configurations."
Ray explains that Nvidia Inference Microservices (NIMS) offer pre-packaged, optimized containers for various ML models, simplifying deployment and ensuring optimal performance across diverse hardware setups.
Furthermore, Ray introduces Karpenter, an open-source Kubernetes autoscaler:
[23:02] Ray: "Karpenter optimizes compute scaling by dynamically provisioning the right EC2 instances based on workload demands."
Karpenter automates the scaling process, allowing EKS to adjust resources seamlessly in response to application needs, thereby enhancing resource utilization and reducing costs.
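The core idea behind that just-in-time provisioning can be sketched in a few lines: given the GPUs requested by pending pods, launch the smallest node that fits them. This is a heavily simplified, assumed model; real Karpenter evaluates many more constraints (zones, architectures, taints, consolidation), and the instance names and prices here are made up.

```python
# Simplified sketch of just-in-time node provisioning, in the spirit of
# Karpenter: choose the cheapest instance type that can host all pending
# GPU requests on one node. Instance types and prices are illustrative.
INSTANCE_TYPES = [
    ("g-small", 1, 1.2),    # (name, gpus, $/hr)
    ("g-medium", 4, 4.5),
    ("g-large", 8, 8.8),
]

def provision(pending_gpu_requests):
    """Return the cheapest instance type fitting all requests, else None."""
    needed = sum(pending_gpu_requests)
    fits = [t for t in INSTANCE_TYPES if t[1] >= needed]
    if not fits:
        return None  # would need multiple nodes; out of scope for this sketch
    return min(fits, key=lambda t: t[2])[0]

print(provision([1, 2]))  # three GPUs needed: the 4-GPU type is the best fit
```

The point of the sketch is the "right-sizing" decision: rather than scaling a fixed node group, the autoscaler picks an instance shape to match the pending work.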
Segment 2: Accelerated Computing in Financial Services with Sudhir Kaldendi
Sudhir’s Role and Expertise
Transitioning to the financial sector, Sudhir Kaldendi, a principal solution architect specializing in payments at AWS, discusses the application of accelerated computing in fraud detection.
[27:03] Sudhir Kaldendi: "Financial institutions face the challenge of processing vast amounts of transaction data in real time, requiring robust infrastructure and efficient data processing."
Fraud Detection with Accelerated Computing
Sudhir highlights how financial services leverage GPU-accelerated instances to build sophisticated fraud detection systems:
- Data Processing: Utilizing Nvidia Rapids integrated with Amazon EMR to accelerate data processing pipelines.
[35:55] Sudhir Kaldendi: "Nvidia Rapids speeds up data processing and machine learning pipelines, enabling faster fraud detection and cost savings."
- Machine Learning Pipelines: Implementing frameworks like Nvidia Morpheus and Triton Inference Server for real-time transaction analysis and model deployment.
- Scalability and Cost Efficiency: Combining AWS services like Amazon SageMaker with Nvidia technologies to achieve up to 14 times faster data processing and model inference, while significantly reducing costs.
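A quick break-even calculation shows why a speedup like the ~14x figure above can reduce total cost even on pricier hardware. The price multiple below is an assumption for illustration, not a number from the episode.

```python
# Back-of-the-envelope break-even: a GPU instance costing k times a CPU
# instance per hour, finishing a job s times faster, changes total job cost
# by a factor of k/s. The 5x price multiple is an assumed example value.
def job_cost_ratio(gpu_price_multiple: float, speedup: float) -> float:
    """Ratio of GPU job cost to CPU job cost; below 1.0 means GPU is cheaper."""
    return gpu_price_multiple / speedup

print(job_cost_ratio(5.0, 14.0))  # well under 1.0: the faster run costs less overall
```

With a ~14x speedup, even an instance that costs several times more per hour completes the job for a fraction of the total spend, which is the cost-efficiency argument Sudhir is making.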
Architectural Considerations for Real-Time Fraud Detection
Sudhir outlines key architectural components essential for building real-time fraud detection systems:
- Data Storage and Retrieval: Efficiently storing and accessing vast volumes of historical and real-time transaction data using Amazon S3 data lakes.
- Data Correlation: Integrating historical data with emerging trends to identify patterns indicative of fraudulent activities.
- High Throughput Processing: Leveraging Triton Inference Server to handle up to 350,000 transactions per second, ensuring swift and accurate fraud detection.
[30:19] Sudhir Kaldendi: "Using the Triton inference server, we could process close to 350,000 transactions per second, which is crucial for identifying fraud in real time."
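Throughput numbers of that magnitude come from batched, parallel inference, and the arithmetic is worth seeing. The batch size, latency, and instance count below are assumed values chosen to land near the quoted figure; they are not measured Triton benchmarks.

```python
# Back-of-the-envelope throughput model for a batched inference server.
# All numbers are assumptions, not measured Triton figures: throughput is
# batch_size / per-batch latency, scaled by concurrent model instances.
def throughput_tps(batch_size: int, batch_latency_s: float, instances: int) -> float:
    """Transactions per second for batched inference across parallel instances."""
    return (batch_size / batch_latency_s) * instances

# e.g. 512-transaction batches at 20 ms per batch across 14 parallel instances
tps = throughput_tps(512, 0.020, 14)
print(f"{tps:,.0f} TPS")  # 358,400 TPS, in the ballpark of the quoted 350k
```

The takeaway is that per-request latency alone does not determine throughput; batching and instance-level parallelism multiply it.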
Customer Success Story: Featurespace’s ARIC Risk Hub
Sudhir shares the success of Featurespace and their ARIC Risk Hub:
[33:25] Sudhir Kaldendi: "Featurespace processes over 100 billion events annually, leveraging AWS scalability and Nvidia GPUs to deliver impressive fraud detection rates."
By utilizing AWS’s elastic computing power and Nvidia GPUs, Featurespace has developed a dynamic platform capable of real-time fraud detection with high accuracy, effectively mitigating financial risks.
Convergence of AI Technologies
Sudhir elaborates on the synergy between various AI models in enhancing fraud detection:
[38:26] Sudhir Kaldendi: "Graph neural networks analyze complex transaction patterns, while large language models process unstructured data like invoices and emails. Together, they provide a comprehensive fraud detection system."
The integration of Graph Neural Networks (GNNs), Large Language Models (LLMs), and Large Transaction Models enables financial institutions to detect fraudulent activities more accurately and efficiently, reducing false positives and enhancing customer trust.
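One simple way such signals can be combined is a weighted blend of model scores with a decision threshold. This sketch is an assumption about how an ensemble might work, not a description of any specific production system; the weights and threshold are invented for illustration.

```python
# Hedged sketch of combining model signals: a weighted blend of a graph-model
# score (transaction-pattern risk) and a language-model score (risk extracted
# from unstructured documents). Weights/threshold are illustrative only.
def fraud_score(gnn_score: float, llm_score: float, w_gnn: float = 0.7) -> float:
    """Blend two risk scores in [0, 1] into one combined score."""
    return w_gnn * gnn_score + (1.0 - w_gnn) * llm_score

def is_flagged(gnn_score: float, llm_score: float, threshold: float = 0.8) -> bool:
    return fraud_score(gnn_score, llm_score) >= threshold

print(is_flagged(0.9, 0.95))  # both models agree on high risk: flagged
print(is_flagged(0.3, 0.9))   # weak graph signal keeps the blend below threshold
```

Requiring corroboration across model families in this way is one mechanism for the reduced false positives the summary mentions: a single noisy signal is less likely to push the combined score over the threshold.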
Conclusion
Throughout the episode, Shruti Koparkar facilitates a deep dive into the intricacies of accelerated computing on AWS, highlighting its critical role in driving AI and ML innovations. From optimizing GPU utilization with Kubernetes and EKS to empowering financial institutions with real-time fraud detection capabilities, accelerated computing stands at the forefront of technological advancements.
[26:44] Shruti Koparkar: "Thank you so much, Ray... you provided a great overview of what folks should be thinking about when they are running GPU workloads on EKS."
[40:56] Shruti Koparkar: "That's it for this episode, everyone... until next time, keep on building."
For more insights and updates, listeners are encouraged to connect with Shruti Koparkar on LinkedIn or X, and provide feedback via email at awspodcast@amazon.com.
This episode underscores the pivotal role of accelerated computing in modern AI applications, providing actionable insights for developers and IT professionals aiming to harness the full potential of AWS’s GPU-powered resources.
