Exposing AI's Achilles heel. [Research Saturday] - CyberWire Daily

Summary

CyberWire Daily: Exposing AI's Achilles Heel – Research Saturday Summary

Episode Details:

Title: Exposing AI's Achilles Heel. [Research Saturday]
Host: Dave Bittner, N2K Networks
Guest: Amy Luttwak, Co-founder and CTO of Wiz
Release Date: November 23, 2024

1. Introduction to Research Saturday

In this episode of CyberWire Daily's Research Saturday, host Dave Bittner talks with Amy Luttwak, the co-founder and CTO of Wiz. Their discussion centers on a critical vulnerability that Wiz discovered in Nvidia's AI software stack, specifically affecting container environments that use Nvidia GPUs.

2. Overview of the Nvidia AI Vulnerability

Amy Luttwak begins by outlining the scope of Wiz's research and the significance of Nvidia's software stack to the AI industry. Dave Bittner introduces the findings by their published title:

"[02:32] Dave Bittner: Wiz research finds critical Nvidia AI vulnerability affecting containers using Nvidia GPUs including over 35% of cloud environments."

Wiz's team identified vulnerabilities within the Nvidia container toolkit, a pivotal software component that enables GPU sharing across multiple users and is integral to AI applications deployed in containerized environments.

3. Technical Details of CVE-2024-0132

Amy then explains the nature of the vulnerability, tracked as CVE-2024-0132.

"[04:15] Amy Luttwak: ...the vulnerability that we found allows the container image to escape from the container and basically take over the entire node."

The Nvidia container toolkit lets containers use GPUs, a common practice given the high cost of GPU resources. According to Amy, a specially crafted container image that uses specific toolkit features causes the toolkit to mistakenly map the host's entire file system into the untrusted container. That gives the attacker read access to any file on the server, and Wiz showed that this read access can be escalated to running a privileged container and taking over the node that houses the GPU.

4. Impact and Scope

Amy emphasizes that although the vulnerability sits in the GPU software stack, its implications extend beyond AI-specific applications.

"[06:52] Amy Luttwak: So this is... affects almost anyone using Nvidia for containers... it's actually any usage of GPU, it can be for gaming."

According to the research title, the flaw affects containers using Nvidia GPUs, including over 35% of cloud environments. In multi-tenant settings, such as Kubernetes clusters, the risk is higher still: a malicious image that escapes its container can potentially reach the resources and data of other tenants.

5. Mitigation and Patching Advice

Turning to remediation, Amy highlights Nvidia's prompt release of a patch.

"[09:00] Amy Luttwak: ...they closed the vulnerability within a few weeks since the time we disclosed it to them and the patch was released."

She advises organizations to prioritize patching, especially those running untrusted container images or operating in multi-tenant environments. The recommendation underscores the importance of not solely relying on containers for isolation, advocating for additional virtualization layers to enhance security.
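
The prioritization advice above can be sketched as a small triage helper. This is an illustration only, not something discussed in the episode: the fixed version number is an assumption taken from Nvidia's public security bulletin for CVE-2024-0132 (toolkit releases up to and including 1.16.1 affected, 1.16.2 the first fixed release), and the function names and priority labels are hypothetical.

```python
import re

# Assumption (not from the episode): first fixed NVIDIA Container Toolkit
# release for CVE-2024-0132, per NVIDIA's public security bulletin.
FIRST_FIXED = (1, 16, 2)


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '1.16.1' or 'version 1.16.1-1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_text):
    """Return True if the reported toolkit version predates the first fixed release."""
    return parse_version(version_text) < FIRST_FIXED


def patch_priority(version_text, runs_untrusted_images, multi_tenant):
    """Hypothetical triage labels following the advice in the episode:
    hosts that run untrusted images or serve multiple tenants come first."""
    if not is_affected(version_text):
        return "patched"
    if runs_untrusted_images or multi_tenant:
        return "urgent"
    return "scheduled"
```

The version string would come from whatever reports the installed toolkit version on a host, for example the package manager. A host reporting 1.16.1 that lets researchers pull arbitrary images would be labeled "urgent", while the same version on a single-tenant host running only vetted images would be labeled "scheduled".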

6. Responsible Disclosure Process

Dave asks what it was like to work with Nvidia on the vulnerability, and Amy describes the responsible disclosure protocol.

"[17:31] Amy Luttwak: ...the entire discussion is highly sensitive and secretive between us and the vendor."

Wiz reported the issue through Nvidia's vulnerability reporting channel and kept all details confidential until a patch was available. Wiz provided a full report and helped Nvidia reproduce anything it could not replicate. Amy adds that even after the patch, Wiz initially withholds the details of how the exploit works to give defenders time to update.

7. Current Exploitation Status

When asked about active exploitation, Amy says that Wiz had not observed exploitation of CVE-2024-0132 in the wild at the time of the discussion.

"[20:31] Amy Luttwak: ...we haven't seen exploitation of this vulnerability in the wild yet, but it doesn't mean that it will not happen soon."

However, she cautions that nobody can be certain: Wiz sees only cloud environments, a large number of GPUs run on premises, and Nvidia cannot observe activity on a customer's local GPU. She also notes that the flaw is not trivial to exploit, because an attacker must build and publish a malicious image first.

8. Best Practices for Organizations

Concluding the technical discourse, Amy outlines several best practices for organizations utilizing AI models in containerized environments:

Inventory AI Tools: Maintain visibility over all AI tools and environments used within the organization.

"[21:46] Amy Luttwak: ...you need to know what AI tools are being used in your company..."

Implement AI Governance: Establish governance processes to oversee AI model usage, including sourcing, testing, and deployment in isolated environments.

"...define AI governance processes... AI discovery, the ability to define AI testing..."

Enhance Isolation Mechanisms: Use isolation barriers stronger than containers, such as virtual machines or tools like gVisor, to mitigate the risk of container escapes.

"[14:03] Amy Luttwak: ...virtual machines are the best way to isolate... tools like gVisor provide additional security..."

Vet Third-Party Models: Exercise caution when running third-party AI models or container images, ensuring they come from trusted sources and are subject to security review.

"[15:25] Amy Luttwak: ...verify what is actually being run as an AI model and where."
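
The isolation and vetting practices above could be expressed as a simple placement policy. The sketch below is a hypothetical illustration, not a process described in the episode: the trusted registry name, the `Workload` type, and the tier labels are all assumptions, and the registry parsing follows the common Docker image-reference convention in a simplified form.

```python
from dataclasses import dataclass

# Hypothetical allow-list; a real one would come from the organization's
# AI governance process.
TRUSTED_REGISTRIES = {"registry.internal.example"}


@dataclass(frozen=True)
class Workload:
    image: str                  # container image reference
    multi_tenant: bool = False  # True if other tenants share the host


def registry_of(image):
    """Return the registry host of an image reference (simplified Docker convention)."""
    first, sep, _ = image.partition("/")
    # The first path component names a registry only if it looks like a host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def required_isolation(workload):
    """Untrusted images and multi-tenant hosts get a VM boundary;
    vetted single-tenant workloads may stay in a plain container."""
    trusted = registry_of(workload.image) in TRUSTED_REGISTRIES
    if not trusted or workload.multi_tenant:
        return "virtual-machine"
    return "container"
```

A policy like this could add an intermediate tier that routes some workloads to a sandboxed runtime such as gVisor, which Amy describes as less secure than a full virtual machine but a strong option that does not require changing the whole architecture.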

9. Conclusions and Recommendations

The episode underscores the critical intersection of AI infrastructure and cybersecurity. As AI becomes increasingly integral to organizational operations, ensuring the security of underlying tools and frameworks is paramount. The discovery of CVE-2024-0132 serves as a stark reminder that vulnerabilities can emerge in unexpected areas, necessitating vigilant security practices and proactive governance.

Amy Luttwak concludes by urging security teams and AI teams to start working together now to define these practices, noting that it is better to do so early than late.

Final Thoughts

This episode of CyberWire Daily provides a comprehensive examination of a significant vulnerability within the AI ecosystem, highlighting both the technical intricacies and broader security implications. Organizations leveraging AI and GPU resources should heed the insights shared, implementing recommended best practices to safeguard their infrastructure against emerging threats.

For more detailed information, listeners are encouraged to review the full transcript and access the research findings through the links provided in the show notes.


Transcript

[00:02]
Dave Bittner
You're listening to the CyberWire network, powered by N2K. Hey everybody, Dave here. I want to talk about our sponsor, LegalZoom. You know, I started my first business back in the early 90s and oh, what I would have done to have been able to have the services of an organization like LegalZoom back then. Just getting all of those business ducks in a row, all of that technical stuff, the legal stuff, the registrations of the business, the taxes, all of those things that you need to go through when you're starting a business, the hard stuff, the stuff that sucks up your time when you just want to get that business launched and out there. Well, LegalZoom has everything you need to launch, run and protect your business all in one place. And they save you from wasting hours making sense of all that legal stuff. Launch, run and protect your business. To make it official today at legalzoom.com, you can use promo code CYBER10 to get 10% off any LegalZoom business information product, excluding subscriptions and renewals. That expires at the end of this year. Get everything you need from setup to success at legalzoom.com and use promo code CYBER10. That's legalzoom.com and promo code CYBER10. LegalZoom provides access to independent attorneys and self-service tools. LegalZoom is not a law firm and does not provide legal advice except where authorized through its subsidiary law firm, LZ Legal Services LLC. Hello everyone and welcome to the CyberWire's Research Saturday. I'm Dave Bittner and this is our weekly conversation with researchers and analysts tracking down the threats and vulnerabilities, solving some of the hard problems and protecting ourselves in our rapidly evolving cyberspace. Thanks for joining us.
[02:20]
Amy Luttwak
So the Wiz research team focuses on finding critical vulnerabilities in cloud environments and recently we focused a lot on AI research.
[02:32]
Dave Bittner
That's Amy Luttwak, co-founder and CTO of Wiz. Today we're discussing their research, "Wiz research finds critical Nvidia AI vulnerability affecting containers using Nvidia GPUs including over 35% of cloud environments."
[02:56]
Amy Luttwak
We published different research efforts that we've done basically finding vulnerabilities in huge AI services, AI services that provide the AI capabilities to most organizations in the world, like Hugging Face, like Replicate and SAP. So we started thinking, okay, what could be an attack surface on the entire AI industry? When we started thinking about it, we got to the software stack of Nvidia because we all know that Nvidia is an amazing company. They have the GPUs that everyone uses for AI. But a little known fact is that there's also a pretty considerable software stack that comes together with the GPUs, and that software stack is actually used by anyone using AI. So we thought if we can find a vulnerability there, this vulnerability can affect the entire AI industry. So that's how we started looking into the Nvidia container toolkit.
[04:02]
Dave Bittner
Well, we're talking today about CVE-2024-0132, which affects Nvidia's container toolkit. Can you walk us through exactly what is involved here with this vulnerability?
[04:16]
Amy Luttwak
Yes. So what is the Nvidia container toolkit? It's basically a piece of software for anyone that wants to use a GPU and share the GPU across multiple users, and that happens a lot because GPUs are expensive. You would basically add to your container support for GPU. So the container itself can access the GPU and leverage the resources from the GPU. So this container toolkit is basically used by almost anyone that builds an application of AI on top of GPUs when the application is containerized. Now, the vulnerability that we found allows the container image to escape from the container and basically take over the entire node. So that means that if the container image runs from a source that is not controlled by the service provider, this container image can escape and read any secret, any file, and even execute code on the actual node that runs the GPU itself, the actual server.
[05:28]
Dave Bittner
Well, how could the attacker escape the container and then gain control of the host system?
[05:36]
Amy Luttwak
Basically, in theory, it's not possible. If I run a container that has no capabilities, no permissions, how can it be that this container can escape and take over the entire server? What we found is a vulnerability within the Nvidia toolkit: if we craft a very specific container image that uses very specific features within the Nvidia container toolkit, what it actually does is that it maps, mistakenly, right? It maps to my container, which is untrusted, the entire file system of the server. It means that we can read any file from the underlying server because of this vulnerability. And we showed that once you have read access to any file on the server, we can actually run a privileged container that can take over the entire server. So this bug, this vulnerability that allowed us to map accidentally into our container the entire server file system, also, of course, allows you to do full takeover if you want.
[06:45]
Dave Bittner
And is this specific to GPU enabled containers? Are they more susceptible to this type of attack?
[06:53]
Amy Luttwak
So this is, I mean, this is obviously wider than AI. I mean, it's actually any usage of GPU, it can be for gaming. So this basically affects almost anyone using Nvidia for containers. The reason that it's relevant for GPUs is just because this is the software stack that is used there, right? So we usually wouldn't find this library when you don't have GPUs because it's a library that allows for GPU integration. It's not actually a bug in the GPU, right? It's just a bug in the software stack that is used by most of the GPU users.
[07:27]
Dave Bittner
What about multi-tenant environments, Kubernetes clusters, those sorts of things?
[07:33]
Amy Luttwak
So I think in multi-tenant environments the risk is much, much higher and this becomes a crucial risk. The exact use case that we started the research for was in environments where you are multi-tenant and you allow others to run their own container images, right? In that scenario, a container image that is malicious can escape the isolation and can potentially access the other images from other users, right? So basically in a multi-tenant environment there's a huge risk here that this container escape vulnerability allows the attacker to get access to anyone using the AI service. And this is why we always recommend in the Wiz research team: when you build applications, remember that containers can be escaped. So do not trust the container as a way to isolate your tenant. So even if you build a multi-tenant service, do not rely on containers. Always add another virtualization layer that is stronger. And this is a good explanation here why this is so crucial. We found the vulnerability and if you didn't build the right isolation, your service is at risk right now.
[08:49]
Dave Bittner
Now my understanding is Nvidia recently released a patch for this vulnerability. How should organizations prioritize their patching?
[09:00]
Amy Luttwak
Yes, so we worked very closely with Nvidia and they responded very fast and they closed the vulnerability within a few weeks since the time we disclosed it to them and the patch was released. This vulnerability affects anyone using GPU. However, if we look at what is really crucial to fix, it's more urgent to fix areas where you allow an untrusted image to run. Because if you trust the image and you know that it's not actually coming from an untrusted source, the ability for the attacker to leverage this vulnerability is highly limited. However, if you have environments where you have researchers that download untrusted images, or you have multi-tenant environments that run images from users, these are environments that are at high risk right now. And that's what we recommend to prioritize and actually fix today.
[09:52]
Dave Bittner
What about the various attack vectors that are possible here? I mean, are there particular attack vectors that folks should be aware of?
[10:02]
Amy Luttwak
Basically, container escape is just the first step of an attack. Once you escape the container, you can steal all of the secrets. You can get access to any AI model on the server. You can start running code on other environments. The container escape on its own is just the beginning of the attack. You can think about it as basically the initial access into the environment. So if you look at a classic attack, this would just be the first step. And any step from there depends on the specific use case and architecture. However, what's important to understand is that many companies do run untrusted AI models, right? And we've talked about it in the past in other research that we've done, researchers download AI models without any way to verify them. So this risk of, hey, someone is running an untrusted AI model and this AI model can now escape the container because we thought it's fine to run AI models in containers. There's nothing going to happen to me. So this assumption is not true.
[11:09]
Dave Bittner
We'll be right back. And now a word from our sponsor, KnowBe4. It's all connected, and we're not talking conspiracy theories. When it comes to infosec tools, effective integrations can make or break your security stack. The same should be true for security awareness training. KnowBe4, provider of the world's largest library of security awareness training, provides a way to integrate your existing security stack tools to help you strengthen your organization's security culture. KnowBe4's SecurityCoach uses standard APIs to quickly and easily integrate with your existing security products from vendors like Microsoft, CrowdStrike and Cisco. 35 vendor integrations and counting. SecurityCoach analyzes your security stack alerts to identify events related to any risky security behavior from your users. Use this information to set up real-time coaching campaigns targeting risky users based on those events from your network, endpoint, identity or web security vendors. Then coach your users at the moment the risky behavior occurs, with contextual security tips delivered via Microsoft Teams, Slack or email. Learn more at knowbe4.com/securitycoach. That's knowbe4.com/securitycoach. And we thank KnowBe4 for sponsoring our show. Do you know the status of your compliance controls right now? Like, right now? We know that real-time visibility is critical for security, but when it comes to our GRC programs, we rely on point-in-time checks. Get this: more than 8,000 companies like Atlassian and Quora have continuous visibility into their controls with Vanta. Here's the gist: Vanta brings automation to evidence collection across 30 frameworks like SOC 2 and ISO 27001. They also centralize key workflows like policies, access reviews and reporting, and helps you get security questionnaires done five times faster with AI. Now that's a new way to GRC. Get $1,000 off Vanta when you go to vanta.com/cyber. That's vanta.com/cyber for $1,000 off.
What are some of the other isolation barriers that people should be using here? Are we talking about things like virtualization?
[14:04]
Amy Luttwak
Exactly. So basically when we design for isolation, especially for multi-tenant services, containers are not a trusted barrier. Virtual machines, virtualizations, are considered a trusted barrier because if you look at the last recent years, right? How many vulnerabilities of container escape we found, how many vulnerabilities in Linux kernel we found, there were a non-negligible number of vulnerabilities. However, in virtualization environments that is very, very rare, right? And that's why as a security practitioner, when I look at a review of an architecture, a virtual machine is the best way to isolate. Now there are tools today like gVisor, which is a tool that you can run that limits the ability of a workload to go outside of a specific set of approved perimeter capabilities, which reduces the risk significantly. gVisor is not as secure as running a full virtual machine, but it's an example of a tool that provides great isolation capabilities without changing your entire architecture.
[15:13]
Dave Bittner
What about organizations that might allow, let's say third party AI models or third party container images to be running on their GPU infrastructure? Do you have any advice for them?
[15:26]
Amy Luttwak
Yeah, so I think that happens a lot, right? So it happens both for AI service providers, but also for anyone that has a GPU and allows anyone in the company to run code. And the implications here are first of all that you have to patch, right? That's number one, just patch for the vulnerability. But the wider implications are that we need to look at AI models and container images that come from third parties, just like we look at the applications that we download from, you know, like when you get an email from someone and you download the attachment, you know, and I know that I would not start running the applications that I get from the email, right? Because we all know that that can be malicious. Same here: why do we trust a container image that is an AI model from an untrusted source? Right. We should be a bit more careful because this is code that we are running and we need to remember that. This is a new attack surface for attackers. Just like downloading applications from emails. It used to be a great attack surface, but no one is, I hope no one is clicking on emails and actually running an application from an email. This is going to be a new attack vector. Right. Everyone talks about AI, so they just run everything that has the name AI and the AI model would be run. No, we have to remember this is a security risk. It's a new attack vector. Anything that we run, either it's fully isolated on a separate VM and so on, or we have actual processes in the company to verify what is actually being run as an AI model and where. Right. If you get an untrusted AI model, okay, you can only run it in this highly isolated environment. Right. If we don't have this kind of guardrails, then we expose ourselves to a lot of risk.
[17:15]
Dave Bittner
You mentioned that Nvidia was a really helpful partner here in this disclosure. Can you walk us through what that process is like for folks who've never been through that? What goes into responsible disclosure with a big organization like Nvidia?
[17:32]
Amy Luttwak
Great. So the Nvidia team, first of all, how do we engage with them? So every company that has a product has a security program of how to report vulnerabilities to them. Usually there is an incident response email that is published. So we approach that vendor and there is a protocol that you have to follow. Right. When we report the vulnerability, we do not provide anyone outside of the vendor information about the vulnerability until it's fully patched. So the entire discussion is highly sensitive and secretive between us and the vendor. During that discussion, we try to provide to the vendor a full disclosure report with all of the information that we found during that attempt. We usually try not to touch actual customer data of that vendor so they don't actually get to any kind of issues with their customers. So what we try to do as researchers is to find the problem, provide the vendor a full report, and then basically we wait. Once we send the email, we just wait until the vendor actually replies to us. In the Nvidia use case, they actually worked really fast. They provided us responses almost within a day and they worked until they fix the vulnerability. As I said, like this was within two, three weeks, fully patched. Now, during that time, we communicate with the vendor. If they have any questions, anything that we found that they didn't know how to replicate, we help them actually reproduce. And the goal again is to make sure that the vendor has all of the information in order to fix the vulnerability. Because it's like when we found the vulnerability and we reported it. Think about it, it's like a weapon, right? Until someone actually patched it, it's very, very secret. And we cannot disclose and talk about it, even with our friends, partners, customers. We cannot talk about it with anyone because I have a weapon now. And until the vendor actually finishes the fix efforts, we have to remain silent on it. 
Now, once the vendor has a patch, our role as researchers is to explain to the world about the vulnerability and why it's important to patch it. Now, something that's important to understand is that although we talk about it, we do not disclose yet in the beginning how the exploit actually works, right? And we do not disclose it because we wanna give the good guys time before any bad guy can leverage the vulnerability. So although Nvidia patched the vulnerability, since we didn't disclose exactly how to exploit it, we are giving time for the good people to fix the vulnerability before anyone can actually exploit it.
[20:24]
Dave Bittner
And do you know if this is being actively exploited? Do you have any methods to be able to track that?
[20:32]
Amy Luttwak
So there is no way to know that for sure, right? We have ways because we are also a security company and we are connected to millions of workloads, so we are actually monitoring the environments that we see for any potential exploitation. So we haven't seen exploitation of this vulnerability in the wild yet, but it doesn't mean that it will not happen soon. And of course our view is limited because we see only cloud environments. There is huge amounts of GPUs deployed on premise environments within the cloud providers. So our view is very limited. And also Nvidia wouldn't see because this is actually happening on a local GPU, right? So no one can tell you for sure if this is actually already being exploited. I do think that again, this is not a vulnerability that is easily exploited because you do need ability to build an image and then you need to publish the image. So it takes time until this kind of vulnerability can be leveraged by an attacker.
[21:34]
Dave Bittner
Well, for those organizations who are running AI models in containers, what are some of the best practices they should follow to help mitigate these risks?
[21:46]
Amy Luttwak
That's a great question. You know, we talk so much, there's so much buzz about AI security and many times people talk about, oh, how AI is going to take over the world or how the attackers are leveraging AI to basically take over my company. But the real risk right now for AI usage is the AI infrastructure they use, right? So, I mean, if you look at this vulnerability, where does this come from? It comes from the AI infrastructure that you have in the company. And everyone now that's starting using AI, they have dozens or hundreds of tools that are used for AI, and these tools are actually bringing real risk right now. So if I think about the best practices from this vulnerability, number one, you need to know what AI tools are being used in your company by the AI researchers. And again, I want to endorse AI usage, but I need to be able to say I have visibility into all of the AI environments and all of the AI tooling across my company. Right. And the second step is, as we saw here, AI models are great, but they're also kind of risky. So you need to define AI governance processes. So basically, which projects are using AI, which models are they using? What's the source of the model? Where are you testing AI models? Is it running in a test, isolated environment? All of those are definitions that each company needs to do. And I call this AI governance. It's composed of AI discovery, the ability to define AI testing. All of those processes are important to define right now. And every team that has, you know, an AI team and a security team, they should start working together to define those kind of practices. It's always better to do it early than to do it later.
[23:48]
Dave Bittner
Our thanks to Amy Luttwak from Wiz for joining us. The research is titled "Wiz Research finds critical Nvidia AI vulnerability affecting containers using Nvidia GPUs including over 35% of cloud environments." We'll have a link in the show notes. We'd love to know what you think of this podcast. Your feedback ensures we deliver the insights that keep you a step ahead in the rapidly changing world of cybersecurity. If you like our show, please share a rating and review in your favorite podcast app. Please also fill out the survey in the show notes or send an email to cyberwire@n2k.com. We're privileged that N2K CyberWire is part of the daily routine of the most influential leaders and operators in the public and private sector. From the Fortune 500 to many of the world's preeminent intelligence and law enforcement agencies, N2K makes it easy for companies to optimize your biggest investment, your people. We make you smarter about your teams while making your team smarter. Learn how at n2k.com. This episode was produced by Liz Stokes. We're mixed by Elliot Peltzman and Trey Hester. Our executive producer is Jennifer Iban. Our executive editor is Brandon Karp. Simone Petrella is our president. Peter Kielpi is our publisher. And I'm Dave Bittner. Thanks for listening. We'll see you back here next time.