Transcript
Jim Love (0:00)
Welcome to Cybersecurity Today on the Weekend. I'm your host, Jim Love. My guest today is Marco Figueroa. Marco is the Gen AI bug bounty program manager for Mozilla in a project they call Odin. Marco came to my attention this week when I was working on stories that I was publishing about how to get past the guardrails on large language models. Just to give some of you some context, and sorry if this is repetitive for some of you, and yes, for the technical out there, I'm rounding it down a little just to make it understandable and quick. The main way to communicate with a large language model, like ChatGPT or Claude or any of them is by a prompt. Prompting isn't just for people to communicate with it, though. They're actually base prompts that govern the overall behavior of a large language model. And these system prompts, as they're called, set the ground rules for how the model behaves. Now, since ChatGPT was launched, people have been trying to get past the prompts and past the safeguards to get the AI to do something it shouldn't. It's called jailbreaking. And you have those guardrails to keep the AI from doing the things it shouldn't do. Being racist, threatening journalists, trying to get someone to leave their wife and run away with the AI. These were the sensational things that we heard about in the early days. So all of these major large language models have put guardrails up, but as soon as they did, people would try to break through them. Now, some of this relatively harmless, you can get it to show you pictures it shouldn't, tell you things it shouldn't. On the more harmful side, you can get it to tell you how to make napalm. Or as I did in, as an example in this interview, as someone had done how to make meth, which I.
Marco Figueroa (1:47)
Hope they have closed off.
Jim Love (1:49)
I guess I should have checked that before I gave this as an example. Some jailbreaking is really simple. You just ask the question in a different way and they'll keep making the guardrails more effective to try and stop this. And people will keep getting more creative. Well, this week a lot of people got very creative. At least they published their stories this week.
Marco Figueroa (2:10)
If you follow the daily podcast, you.
Jim Love (2:12)
Probably know about Deceptive Delight. They hid forbidden instructions inside other prompts. I describe it like a sandwich, harmless prompt with a simple instruction, then the forbidden instruction, and then they end with something normal. This was incredibly successful. I think they went from a 6% chance of getting through the guardrails to about a 60% chance of getting through the guardrails. I did another story this week, somewhat related, on how researchers had found a surprisingly easy way to recover data that was supposed to be inaccessible or at least highly suppressed in a large language model. It was, as the researcher said, embarrassingly easy to do, and if you're good at prompting, easy to find. Then I stumbled on another way to break the guardrails that was published by my guest. He used hexadecimal encoding to issue a forbidden instruction. Now, hex used to be a way of programming. In ancient times, we used it. I'm sure it's still being used by system programmers somewhere, and it's pretty accessible. You can get a hex editor and get hex written for you by your computer. Now, these models understand hex and execute the action that should have been caught by any guardrails. But it wasn't in English, it was in hex. By the end of the week, I was just astonished. I know how to jailbreak. As I said, a lot of us do it regularly for innocent reasons. But, and I'm sure I'm not the only one, I'm starting to see the beginnings of another cybersecurity tsunami as more and more hackers turn their attention to what seems to be a relatively easy target. Exactly how they'll use these exploits. I think one thing we all know, hackers are ingenious at finding new ways to use technology weaknesses to attack companies and people. If you think I'm exaggerating, Marco gave us another example, which I had to cut from the podcast because he hadn't gotten final approval to release the information. But as soon as it's done, I'll post something because I want you to hear about the description. You'd be able to read about it in his blog, but I want you to hear him say it and hear him finding it in his own language. But with that one piece taken out, here's my chat with Marco Figueroa, bug bounty program manager for AI for Mozilla's ODIN project.
