
Jailbreaking AI: Behind the Guardrails with Mozilla's Marco Figueroa In this episode of 'Cyber Security Today,' host Jim Love talks with Marco Figueroa, the Gen AI Bug Bounty Program Manager for Mozilla's ODIN project. They explore the challenges and...
Loading summary
Jim Love
Welcome to Cybersecurity Today on the Weekend. I'm your host, Jim Love. My guest today is Marco Figueroa. Marco is the Gen AI bug bounty program manager for Mozilla in a project they call Odin. Marco came to my attention this week when I was working on stories that I was publishing about how to get past the guardrails on large language models. Just to give some of you some context, and sorry if this is repetitive for some of you, and yes, for the technical out there, I'm rounding it down a little just to make it understandable and quick. The main way to communicate with a large language model, like ChatGPT or Claude or any of them is by a prompt. Prompting isn't just for people to communicate with it, though. They're actually base prompts that govern the overall behavior of a large language model. And these system prompts, as they're called, set the ground rules for how the model behaves. Now, since ChatGPT was launched, people have been trying to get past the prompts and past the safeguards to get the AI to do something it shouldn't. It's called jailbreaking. And you have those guardrails to keep the AI from doing the things it shouldn't do. Being racist, threatening journalists, trying to get someone to leave their wife and run away with the AI. These were the sensational things that we heard about in the early days. So all of these major large language models have put guardrails up, but as soon as they did, people would try to break through them. Now, some of this relatively harmless, you can get it to show you pictures it shouldn't, tell you things it shouldn't. On the more harmful side, you can get it to tell you how to make napalm. Or as I did in, as an example in this interview, as someone had done how to make meth, which I.
Marco Figueroa
Hope they have closed off.
Jim Love
I guess I should have checked that before I gave this as an example. Some jailbreaking is really simple. You just ask the question in a different way and they'll keep making the guardrails more effective to try and stop this. And people will keep getting more creative. Well, this week a lot of people got very creative. At least they published their stories this week.
Marco Figueroa
If you follow the daily podcast, you.
Jim Love
Probably know about Deceptive Delight. They hid forbidden instructions inside other prompts. I describe it like a sandwich, harmless prompt with a simple instruction, then the forbidden instruction, and then they end with something normal. This was incredibly successful. I think they went from a 6% chance of getting through the guardrails to about a 60% chance of getting through the guardrails. I did another story this week, somewhat related, on how researchers had found a surprisingly easy way to recover data that was supposed to be inaccessible or at least highly suppressed in a large language model. It was, as the researcher said, embarrassingly easy to do, and if you're good at prompting, easy to find. Then I stumbled on another way to break the guardrails that was published by my guest. He used hexadecimal encoding to issue a forbidden instruction. Now, hex used to be a way of programming. In ancient times, we used it. I'm sure it's still being used by system programmers somewhere, and it's pretty accessible. You can get a hex editor and get hex written for you by your computer. Now, these models understand hex and execute the action that should have been caught by any guardrails. But it wasn't in English, it was in hex. By the end of the week, I was just astonished. I know how to jailbreak. As I said, a lot of us do it regularly for innocent reasons. But, and I'm sure I'm not the only one, I'm starting to see the beginnings of another cybersecurity tsunami as more and more hackers turn their attention to what seems to be a relatively easy target. Exactly how they'll use these exploits. I think one thing we all know, hackers are ingenious at finding new ways to use technology weaknesses to attack companies and people. If you think I'm exaggerating, Marco gave us another example, which I had to cut from the podcast because he hadn't gotten final approval to release the information. But as soon as it's done, I'll post something because I want you to hear about the description. You'd be able to read about it in his blog, but I want you to hear him say it and hear him finding it in his own language. But with that one piece taken out, here's my chat with Marco Figueroa, bug bounty program manager for AI for Mozilla's ODIN project.
Marco Figueroa
So, Marco, tell me a little bit about you. And I checked you out on LinkedIn. How did you get to where you are at Mozilla right now?
Yeah, I'll give you the Cliff Notes version. I'll just run through it, try to get through it within three minutes out of college, I didn't know what to do, and I had a brother that was like, hey, let's go to hacker conference. We went to DEF con, and ever since going there, I Went I've been at, I think the first DEF CON was DEF CON8, which is the largest hacker conference in the world and it's every August in Las Vegas. And I've been going there since DEFCON 8. Currently I think it's at 32 DEFCON 32. So I've been in game for a long time. My first job was a consultant at Pepsi Fast Forward. I work at the NSA as a consultant reverse engineering malware from different nation states. So from there I moved over to McAfee where I did a lot of the same things, but really consulted and a lot of three letter agencies would come to us and say, hey, what is this? Have you seen this before? At the time McAfee was still pretty big and really going inside of the industry. So I was there, that was 2013 to 2016 and I had an opportunity to go and move from McAfee to Intel. McAfee was a subsidiary at the time. I moved over and worked two years on the IT security team, was the lead doing a lot of threat hunting and reverse engineering. I've seen and worked on a lot of incidents at intel. Just imagine 700,000 endpoints that we get to hunt and search for. I've seen about every threat actor you could imagine from Chinese abts to Russian abts and some fighting on our network for who's going to be the admin of it. But it was a great experience because I worked with a lot of good people over there and then I was even 2018 and then there was this one opportunity that it was a once in a lifetime opportunity and I knew I was like man, Even though put in my resignation and I'm supposed to leave, I have this opportunity to work on the bios. When I moved over I thought the bios.
You mean the BIOS of the computer?
Yeah, the BIOS of the computer. It was one of opportunities that I was like man, as a security researcher hunting for threats in there and I thought it was going to be like super easy man. The BIOS is its own world. Luckily that was the my mentor when I moved over to the team, he was like the godfather of the bios. Had the most patent to add Intel. So he took me up his wing. I read both of his books and I still felt lost four or five months later. Then I started to understand and get it and I did some cool projects there where besides ripping the BIOS apart, understanding the ecosystem and how it works and then understanding how vulnerable the BIOS actually is now, it's great because Intel Set like security standards for it. So it helps a lot of organizations. But usually when you have these vendors, they have their packages and they slap on the newest driver, which is thumb swipe. And they don't take out drivers. They have. That was amazing. Then a friend tapped me on the shoulder and said, we have an opportunity at Sentinel. One former McAfee person the way I look at it and for everyone out there, this is how I find my next job. If a person I know works there, then I'll look at the opportunity. I never go into a job that I don't know the person or I wasn't recommended for that gig. The reason why I do that, because you never know how that situation is unless you have someone that you know that can give you insight. How's the management? How's the structure? Is the place stable? So I moved over to Sentinel as a lead researcher for Migos team. Migo Kadeem, very brilliant guy. And that allowed me to, to expand my wings with publishing to the public. Before this, every company that I worked for, they were very hush close to the chest. We don't want no one to know. So I used to create these blogs and reports, but only give it to customers and internal people. So when I was at Sentinel 1, if you look me up, Marco Figueroa, just type sentinel1. You'll see all within one year. I did about 11 blogs and I did a podcast for them. So it allowed me to understand how to write to the public, show the value of work so people could read it, learn, but also have IOCs and IoAs that they could ingest into their systems. And then we went public. So I moved over to a company called Breachquest and they belly flopped. They. They were sold. When they run out of money, you gotta look for a buyer. So that happened. And then I had this opportunity with Odin. Odin was the brainchild of Saud Khalifa. He had a company that was purchased by Mozilla last March called Fake Spot. And every year at Mozilla you have these brainstorming sessions. And Saud Khalifa, when he was in his teenage years, he used to look for exploits and find zero days. And he sold it to the Zero Day Initiative. Now, the person that was the head of the Zero day Initiative was Pendram Meaney. And Pendram is a legend in the security community for multiple reasons. He was like the first one to do buy zero days and go ahead and put out responsible disclosures. They really set the standard for the industry of bug bounties. So last year they came up with this idea that AI is the hottest thing and we need to be ahead of the curb because there is no standard. They're trying to build around a standard with the oas and now it has their techniques. And Pedram in May tapped me on the shoulder and he asked me, would you be interested to do this? And I was like, man, this is the next level. This is the hottest.
Yeah, it's the next thing. But back up, just to say, I want to get. I'll get. We'll get onto this, the AI component of this. But for those people who don't know where Mozilla fits in, having ODIN looking for zero days. What is the company doing and why?
Amazing question. One of the things that Mozilla is known for is its open source, its privacy, securing data so it aligns with the core values of Mozilla. Right. And because there's such a focus on AI, they see the opportunity to set some of the standards and potentially assist with securing tomorrow's AI, because we don't know the implications on what you can actually do. But since we've created this and set out to have researchers submit bugs, we're seeing some amazing things that you can do. It's a little scary talking about what actually you can do, but in the next few months, we're going to be putting out research to the Genai community to assist them to become better prompters so they could get what they want. And the way we are looking at this is we don't want to just put out a blog to put it out. There's a reason. So when we put out a blog like we did last week, that's the lowest hanging fruit, right? That bug is the least technical bug that we found and people have submitted. What we want to do is take our community, the ODIN community, the Gen AI security, and take them on a journey where every submission, every blog we put up, we plan to take them on a ride where we're elevating them every single time. We have a blog coming out on Thursday, if we get all the blessings going into prompt injecting. So last week it was a level one, this week's gonna be a level four, and then the following week we're gonna put an educational blog that's going to tie those together. So it's really about taking the. Taking everyone on a journey. The way I have it in my mind, it's gonna be like a comic book, a regular book, where each chapter is a blog, and we want to take them on this journey to get them to another level. It's prompt engineering or whether it's prompt hacking. But you'll get to understand the power that you have in your hands when you use AI, if I've got it.
Right, because I really don't know enough about what Mozilla is doing. This is crazy because I am fairly big into open source, but this is an informational group, obviously provides some research for the browser work. But this is a public service, part of the community, to publish and study what's happening in bugs in general or cybersecurity in general. And you have the area that is Gen AI.
The way we look at this is it's an investment in the idea. Right now a lot of large LLM organizations don't know how to tackle this problem that we're dealing with, right? You go to them and they're like, like, what do you consider jailbreak? Just like cbrn, which is chemical radiation, nuclear and stuff like that. So you categorize it like that. But there's a larger aspect to it when you talk about jailbreaks and categories and what's happening. A lot of these organizations only see what they receive. For us, as a bug bounty program, we get to see all of the LLMs that we accept. We have OpenAI submissions, we have cohere, we have anthropic submissions. So now we can look at it from a larger scale to say, how does all of these tie together? What is the best technique? How do you actually jailbreak something with just normal language? How do you trick it to do stuff? This data will provide value over time because we already. We're three months in and we're already over 150 submissions.
So you're paying a bounty for these. You're getting submissions from all around the community, interesting things, and those come in, and you're going to put those together.
We get submissions and we have them categorized if they fit the jailbreaking or the prompt injection. Because we do have a spreadsheet that gives you the boundary. So if it forms in the boundary and you can do certain things, then we see the value of it and we'll pay you for that. It could be anywhere from 500 to 2500 to 5000 to 15,000.
So you're collecting all of these. Now let's talk a little bit about AI. First of all, let's try and get the difference between prompt injection and prompt hacking. What's the difference on that?
So when you talk about prompt engineering, this is a new space that really OpenAI has created. And when you're prompting an LLM. You want to get the response that you want, right? But if you elaborate more, you can get more details, but then you can go ahead and manipulate what you have in there so that LLM could interpret it and give you the result that you want. Now, you can ask for how to make drugs if you do a guardrail jailbreak, or you can, like, get a recipe for the drug. Or you could do what I did, where you could use emojis and the LLM interprets the emojis and provides you information. Or you can trick it with certain encoding. And there's creative ways that bypasses a lot of the security mechanisms. Because the way I look at it is if organization didn't trained the model but didn't put any security, like any security guardrails or anything like that, what they would want to do the LLM, is provide you the answer that you ask. Regardless, this is my fundamental truth, because anytime I get a bypass of encoding or guardrail jailbreak, they give me the answers I'm looking for.
Yeah, and the classic jailbreaking is. We've all seen it. People can ask a prompt in a different way if you like. You ask for how to make methamphetamine. It won't tell you. It'll say, I can't tell you that. Then go tell it. I'm. I'm in a movie and I'm playing a character and he's an evil villain, and we're playing this scene and the scene takes place in a meth lab.
Jim Love
Role play with me.
Marco Figueroa
It'll tell you how to make meth. They close these as much as they can. But you went past this in a unique way, and that was you were passing and you said emojis.
Jim Love
I read it as hex.
Marco Figueroa
And for those people who don't know what hex is out there, maybe we should explain what that is first of all. Then we'll talk about why that worked in the prompt. Hexadecimal is how we used to communicate with the machine. It's a numeric response done in hex. It's not binary. People are really good at writing these numeric prompts and be able to actually communicate now using those. And I don't know why a young guy like you knows Hex, but you do. I guess you. You figured it out at one point.
No, I am a reverse engineer. Right. So when you use IDA Pro or anything like, that's how you see or using hex editors and stuff like that, when you put it in like that, those security guardrails, they don't detect it, right? They just look at it as hex. And some of the guard rails are in English. So if bomb exploit, those things are going to get picked up. But if you look at what I wrote, I put it X. So when it converts it is go out into the Internet and look for this DVE and write an exploit. Now, if you look at how I wrote exploit, it is with 3P. So I was like, all right, I'm going to write exploit with a. So when it does read it, it's not going to hit on a guardrail, but it will understand what I'm saying. And that's how there was like a double bypass on that. Because I knew when it reads exploit, it is probably going to say no. But when you write it, as we call it leet, leet speak E's are threes. A is at L is one. The LLMs interpret that. ChatGPT interprets that. You know who has very good guardrail protection? Elsanthropic. They are top notch.
Not surprising.
Jim Love
So we've learned, you know how to.
Marco Figueroa
Get past it with language. To some extent, that gets closed off more and more now people are finding more and more clever ways to do this.
So I gave two examples. The emoji was like, hey, this is a second way to encode some, right? That was lower on. On just a small little paragraph on. Here's another example.
And as you said, hexadecimal. It's not expecting that. I'm wondering if foreign languages actually in some cases might get past the guardrails.
You are spot on. You could. There's a lot of languages that still can bypass. I think it's getting better over time. I see the changes like month after month, but depending on what language you use, you can still go ahead and bypass a lot of the guardrails they have.
I'll guarantee one thing about the hacking community, they'll find a creative way past this.
Jim Love
So what do you fear that people.
Marco Figueroa
Can do once they can get that bypass? What do you think they're going to be able to do with these models?
I've been trying to get some sort of phrasing on this. The only thing I could think of is like going to your CVS and having the cashier have access to a Scud missile or something like that. That's the way I think of it. People with no skills or as they call it, script kitties, will be able to create certain things and just run it all, just run it Something bad could potentially happen. I'm not talking about, oh, you set off like a missile. I'm saying you could potentially break into an organization, ask it to write you some ransomware, and you could do a lot of things with understanding the CVE that was released yesterday. Let's say I could get the exploit code today, at least 85% there. And how are you supposed to protect against that when usually organizations take anywhere from 60 to 90 days to roll out a patch?
One of the things that I think we need to be thinking about is we're really not thinking about how people will exploit this, but creative minds will. So, for instance, I was preparing for this. I was thinking one of the things about a large language model is that it proceeds normally and confidently. It always seems like it's got the right answer. So if I could take a prompt and say, you're using this a lot to check expenses in accounting, to check invoices, I could just start to see using these in subtle ways to distort what's happening in an organization, to commit fraud. The sky's the limit in terms of hacking. But obviously your job is to try and keep people out of them. Is that possible even? Can we win that?
At this particular moment? No. Just because this is so new. I've. I've spoken to all the leading LLM organizations. Some have an idea and have begun prepping to understand how to secure these. Some just don't know. And we're assisting organizations, we're providing them submissions that we really think, hey, you should take a look at. And what we do is we provide them that would, here's the prompt, but here's what it does. We give them full dossier of it. This is how you do it. This is why it's important. This is the technique it's using. And here's potentially how you could do this so they could understand. And. And because we've been doing it now and we're seeing so many different submissions now, we're having a better grasp at what we're looking at. And immediately I could tell you what category something is in by testing the prompt and making sure that it works. And it's not a hallucination.
One of the other things that I think is troubling about somebody being able to interfere with an AI model is they're notoriously hard to audit.
Jim Love
You don't know where they've made the decision.
Marco Figueroa
So tracking the impact of either prompt injection or prompt hacking must be incredibly difficult for these companies. Have you heard anything about how they're trying to cope with that.
We're trying to have those conversations because you're not going to train another model. You're going to try to put some additional security filters, make it more robust. That's important. But now you're starting to see agents. And this is where I think the next frontier gen AI security is going to happen. And you've seen Anthropic, they just released their new compute. Is it called computer? I think it's called where you could download stuff. And it looks people have already tested like wire transfers on there without even having to touch it. So these.
Oh yeah. And when we get into agents, we do get into vulnerability. Claude is actually close to being able to do what agents have been talking about and control a device, in this case the PC. Once AI agents can do that, the vulnerability and the ability to use those for hackers becomes, it becomes exponential nightmare.
I think next year you're going to see people focusing on this. One thing that I really enjoyed with this blog when we released it was to see the reaction from the community. Great, awesome. It's awareness, but even better, I think you're onto something. When people are taking the article and they take their products and be like, we could prevent this, right? We seen IBM take it and add it onto their LinkedIn and say, this is why you need our service. Another company to get the name of it is Prompt Firewall. They grabbed us and say, hey look, they did a video on it, preventing it. There's something there. And this is why earlier I said, the way we look at it at odin, we want to take everyone on a journey. That journey is elevating everyone's knowledge. Because if you have a community that is going to constantly push the limits and submit, we then go and provide that to that organization. We're not holding anything back. And we already see some potential opportunities next year where a lot of these organizations could pull down on threat feeds and test their models because we're seeing that models are new models are dropping every month. Now you, you have something came out last week from, from OpenAI. The week before it was Anthropic.
A month before that, Nvidia's open source model. There, there's an open source bottle.
Every week they have 3.2 release. This is the thing, we can start collecting all these and then pushing them as a threat fee to say subscribe to this so you can test your model and fix it before you release it.
Do you see a world where somebody's going to actually use an AI to try and run the cracking of AIs or are you seeing anything like that now? Where they're using AIs to develop the attacks on AIs?
It's not far fetched to say yes, right. But I see a way we can use AI eventually on the ODIN side to take someone's submission and triage it from beginning to end. And then all we need potentially as a human to say yes or no. And right now we're manually taking these submissions and testing them. There's eventually going to be a framework we build so we can do this automatically. But do I see the world changing and understanding. I think that's my job right now. Now is the time because we're in the beginning. Once you're four years ahead, it's going to be harder. It's going to be way harder to like the people. Because one thing about security, as you're innovating security is going the opposite way. It's going to prevent you because you need a test and it takes time. So that speed and momentum you have is hindered if you don't put the processes in place. And right now it feels like it went from. ChatGPT is having its second birthday this month and it feels like the last two years it's erased. Everyone has caught up to ChatGPT and these newer models that are being released, it feels like they're better or they've. They, they're right there, neck and neck. For people listening out there, I just want you to know there's not one LLM that's the best. This is a. It's coming from a person that uses all of them. Right. For general purpose, I would go with OpenAI. If you code, I would use Claude. If you want pictures, I would probably.
Use Twitter's X. Yeah, you have to be on Twitter. I'm not sure I want to go back there.
I would tell you this in terms of xai, it is the fastest reply. I think they're limited because you, at this point you need real live data and what they're using is potential data within Twitter to, to upgrade it. But yeah, it's the littlest features make the big difference. There was a rollout of OpenAI's web. It's really pretty good. But the other thing.
The web search.
Yeah, the web search, yeah.
If people haven't tried the web search yet, it is fantastic. And Perplexity has got to be sitting there very nervous about this because it's actually it.
It is.
You now have a conversational AI that can search the web. It's astonishing.
And I. And these are the conversations we have because the way I look at it, with perplexity, I thought they were so good. Everybody catches up. And with perplexity, it's more like a rapper than innovation. So they're gonna. They're gonna have to innovate and create some really cool things to stay close. But you have a lot of fans, right? When you have raving fans, they're not gonna leave.
So before we let you go, can you tell me about some of the new bugs that you're seeing? Are there things that you can.
And it's only because one, we either haven't gotten a check off like a check, but like a. Literally like you're approved from the organization we submitted the bugs to, or we haven't. We haven't got an approval legal to push out to the public. But what I can tell you is.
Just between you and me and my 10,000 listeners, just. We'll keep it in that small group.
Yeah, yeah. There has been some cool bugs. Like I said, there's a. When is this going to be released?
Probably on the weekend.
Jim Love
Here's where I had to cut the section. As with any tips that I get or anything that people tell me in confidence, I'm not going to release it until I have permission.
Marco Figueroa
Okay. So definitely go to odin's. Odin's blog. It should be published on Thursday and it's called prompt injecting your way to show.
You hinted about that.
Jim Love
I was actually.
Marco Figueroa
I'm actually waiting for Thursday to check out your blog and I'll put a link in the show so people can find it. Yeah, I think you said it best. This is like Neo seeing the black cat. You start to think the whole of the matrix is starting to dissolve. For me right now, there's a huge piece out there just as we close off. Obviously the Odin blog would be a good place to keep up. Are there places people can go to start to educate themselves on security? For large language models in general.
There'S two. I would definitely recommend there are prompt courses. But if you Google like prompt guides, you understand. If you start understanding prompting, like real like prompting. Not just ask it a simple question, ask it for tone and reasoning and why and different things. You begin to have this, like an assistant that you feel comfortable with. Do not. I always tell people, if you think you write great emails, I promise you ChatGPT does a way better job in writing an email on what you want to say, how to get it across, that's 1, 2 is there was a book that's called build your own LLMs from scratch. The reason why I say that because it is people think it's extremely hard and they see a Mount Everest. I understand if you're a marketer and you don't want to read that book, but if you're in between and you're a little technical and you want to understand the inner workings, this is the perfect book. I'm not saying that you should be a data scientist, but this is the future. Do not get it like twisted. This is where you need to spend your time if you're not working. Understanding how to prompt, how to do better. How do you integrate all these things into your daily lives and intertwine them because they're going to make your life easier. And for organizations is understanding trying to be ahead of the curve when it comes to security? We know I've been in the industry for a long time. I understand that it is very difficult to do this and really be creative. What I know and through my experience a lot of these blogs might ruffle feathers. Right? It's just going to ruffle feathers. Look, is for the betterment, the growth of gen AI security. It's just the name of the game. It's always. It's like calling your baby ugly.
We'll leave it at that. We're not going there.
No, no, no, no. I'm just.
Marco.
No, it's good. David.
Sorry, you were going to say something.
No, I was going to. I was just going to say it's always good. Pain is always good, right? Providing value. And that's what it is.
Jim Love
Understanding this new.
Marco Figueroa
Really a new frontier of exploit or of vectors where people can commit exploits in AI is that's going to cause some people a little discomfort, a little pain. But again, we don't have all the answers, but we. I do have the idea that people need to start looking at this in a new way and educating themselves. And for that I thank you, Margot. I'm good to have you back another time.
Oh, no, I know exactly when I'm going to be back. We have this one. We have one in the chamber and I can't even give you a hint, but when this drops, we believe it's going to be New York Times worthy. So when that happens, I'll make sure I reach out.
Okay. At the same time you give it to the New York Times. Give it to me to get the scoop too. Okay.
Yeah, yeah, that one's. That one. It's. We're looking at the December, January timeframe. This one's going to change the game. This is, I'm telling you, this one that we have in the chamber, internal researchers, we found it, it is, it's going to be a good one and I can't wait until we share it with everyone. And this one where this is I'm leading up to this is going to be the crescendo and really the, how do you say, the peak of the mountain. So when this drops, we're excited internally to get the ball rolling that all parties are it's going to be released because this is in our disclosure policy. So we just getting the right people to check off and make sure that everything has been it's right. I'm like on this one, we're going to go on a press tour. So trust me on that. All right. Thank you, Jim.
If people have questions, they can. I'll put links to your blog. I'll put links to the Odin.
My Twitter is arcafigueroa. You can follow me on Twitter as well.
Jim Love
And that's our show for this weekend. Thanks for spending part of your weekend with us. I'd love to hear your thoughts on the show or this topic or anything else you care to share. You can reach me at editorialechnewsday ca. I'm your host, Jim Love. Thanks for listening.
Cybersecurity Today: Episode Summary
Title: Mozilla's GenAI Bug Bounty And Education Program - Serious Exploits: Interview With Marco Figueroa, GenAI Bug Bounty Program Manager for Mozilla's ODIN Project
Host: Jim Love
Guest: Marco Figueroa
Release Date: November 9, 2024
In the opening segment, host Jim Love welcomes listeners to the episode and introduces Marco Figueroa, the GenAI Bug Bounty Program Manager for Mozilla's ODIN Project. Marco provides a comprehensive overview of his career trajectory, detailing his experiences from attending DEF CON conferences to working with prominent organizations like Pepsi Fast Forward, NSA, McAfee, Intel, and Sentinel. This diverse background underscores his expertise in cybersecurity and threat hunting.
Notable Quote:
Marco Figueroa [04:51]: "I've been at DEF CON since DEF CON 8, which is the largest hacker conference in the world. My journey has taken me from reverse engineering malware at the NSA to leading threat hunting efforts at Intel."
Jim delves into the concept of guardrails in large language models (LLMs) like ChatGPT and the ongoing challenge of "jailbreaking" these systems. He explains how users attempt to bypass these safeguards to elicit inappropriate or harmful responses from AI models. Examples include attempts to obtain instructions for creating dangerous substances or manipulating AI behavior through creative prompting techniques.
Notable Quotes:
Jim Love [00:00]: "The main way to communicate with a large language model is by a prompt. System prompts set the ground rules for how the model behaves."
Jim Love [02:10]: "People keep getting more creative in their attempts to break through guardrails."
Marco Figueroa [16:15]: "We have OpenAI submissions, we have Cohere, we have Anthropic submissions. We can look at it from a larger scale."
Marco introduces the ODIN Project, Mozilla's initiative aimed at addressing zero-day vulnerabilities in AI systems. He emphasizes Mozilla’s commitment to open-source principles and data privacy, aligning these values with their proactive approach to setting industry standards for AI security. The ODIN Project incentivizes researchers to submit bug reports, fostering a collaborative effort to enhance AI safety.
Notable Quotes:
Marco Figueroa [11:54]: "Mozilla is investing in the idea that AI is the hottest thing and we need to set some standards to secure tomorrow's AI."
Marco Figueroa [16:21]: "We pay bounties ranging from $500 to $15,000 for valuable submissions."
The conversation shifts to distinguishing between prompt injection and prompt hacking. Prompt engineering involves crafting inputs to obtain desired outputs from LLMs, while prompt hacking refers to maliciously manipulating AI responses to bypass security measures. Marco elaborates on techniques like using hexadecimal encoding or leet speak to disguise harmful instructions, highlighting the sophistication of these attacks.
Notable Quotes:
Marco Figueroa [17:00]: "Prompt engineering is about getting the response you want, but prompt hacking can trick the AI into providing restricted information."
Marco Figueroa [19:11]: "Hexadecimal is how we used to communicate with machines. It's a numeric system that can bypass English-based guardrails."
Marco discusses the alarming potential of AI models being exploited for malicious activities, such as automating ransomware creation or fraud within organizations. He underscores the urgency for the cybersecurity community to adapt swiftly, as traditional patching methods may not suffice against AI-driven threats. The conversation also touches on the difficulties in auditing AI systems and tracking the impact of exploits.
Notable Quotes:
Marco Figueroa [22:11]: "People could potentially break into an organization and ask AI to write ransomware, exploiting newly released CVEs before patches are applied."
Marco Figueroa [25:14]: "AI models are notoriously hard to audit because you don't know where decisions are made internally."
Exploring the dual role of AI, Marco envisions AI not only as a target for exploits but also as a tool to bolster security efforts. He anticipates that AI will play a crucial role in triaging bug submissions and automating aspects of the vulnerability management process. Additionally, he highlights the importance of community engagement and education in advancing GenAI security.
Notable Quotes:
Marco Figueroa [28:51]: "It's not far-fetched to say yes, people might use AI to develop attacks on AI."
Marco Figueroa [28:51]: "We can use AI on the ODIN side to take someone's submission and triage it from beginning to end."
In the closing segments, Marco emphasizes the importance of education and community involvement in combating AI security threats. He recommends resources such as prompt courses and technical books like "Build Your Own LLMs from Scratch" to help individuals and organizations enhance their understanding of AI prompting and security. Marco also hints at upcoming bug disclosures and encourages listeners to engage with Mozilla’s ODIN blog for the latest updates.
Notable Quotes:
Marco Figueroa [33:43]: "Start understanding prompting more deeply. Campaigns like building your own LLMs from scratch are essential for future readiness."
Marco Figueroa [35:55]: "Providing value through education is crucial for the growth of GenAI security."
Jim wraps up the interview by expressing enthusiasm for future collaborations and upcoming bug disclosures that promise to significantly impact the cybersecurity landscape. Marco shares his excitement about forthcoming releases and the potential for groundbreaking discoveries that could shape the future of AI security.
Notable Quotes:
Marco Figueroa [37:01]: "We're looking at a December-January timeframe for a release that’s going to change the game."
Jim Love [37:50]: "Thanks for spending part of your weekend with us. I'd love to hear your thoughts on the show or this topic."
Guardrails in AI Models: Continuous efforts are needed to enhance AI guardrails to prevent malicious exploitation.
ODIN Project's Role: Mozilla’s ODIN Project is at the forefront of identifying and mitigating AI vulnerabilities through a robust bug bounty program.
Innovative Attack Techniques: Cybercriminals employ sophisticated methods like hexadecimal encoding and prompt hacking to bypass AI security measures.
Urgency in AI Security: The rapid evolution of AI necessitates swift and proactive security measures to stay ahead of potential threats.
Community and Education: Ongoing education and community engagement are essential in building resilience against AI-based cyber threats.
For more insights and updates, listeners are encouraged to visit Mozilla’s ODIN blog and follow Marco Figueroa on Twitter @arcafigueroa.