Transcript
A (0:00)
Artificial intelligence models are trying to trick humans and that may be a sign of much worse and potentially evil behavior. We'll cover what's happening with two anthropic researchers right after this. The truth is, AI security is identity security. An AI agent isn't just a piece of code. It's a first class citizen in your digital ecosystem and it needs to be treated like one. That's why Okta is taking the lead to secure these AI agents. The key to unlocking this new layer of protection is an identity security fabric. Organizations need a unified, comprehensive approach that protects every identity, human or machine, with consistent policies and oversight. Don't wait for a security incident to realize your AI agents are a massive blind spot. Learn how Okta's identity security fabric can help you secure the next generation of identities, including your AI agents. Visit okta.com that's okay. Ta.com Capital One's tech team isn't just talking about multi agentic AI. They already deployed one. It's called chat Concierge and it's simplifying car shopping using self reflection and layered reasoning with live API checks. It doesn't just help buyers find a car they love, it helps schedule a test drive, get pre approved for financing and estimate, trade and value. Advanced, intuitive and deployed. That's how they stack. That's technology at Capital One. Welcome to Big Technology Podcast, a show for cool headed and nuanced conversation of the tech world and beyond. Today we're going to cover one of my favorite topics in the tech world and one that's a little bit scary. We're going to talk about how AI models are trying to fool their evaluators to trick humans and how that might be a sign of much worse and potentially evil behavior. We're joined today by two anthropic researchers. Evan Ubinger is the alignment stress testing lead at anthropic and Monty McDermott is a researcher for misalignment science at Anthropic. Evan and Monty, welcome to the show.
B (1:59)
Thank you so much for having us.
C (2:00)
Yeah, thanks Alex. Great to be on the show.
A (2:03)
Great to have you. Okay, let's start. Very high level. There is a concept in your world called reward hacking. For me that just means that AI models sometimes try to trick people into thinking that they are doing the right thing. How close is that to the truth? Evan?
B (2:22)
So I think the thing that really when we say reward hacking, what we mean specifically is hacking during training. So when we train these models, we give them a bunch of tasks. So for example, we might ask the Model to code up some function. Let's say we ask it to write a factorial function and then grade its implementation, right? So it writes some function. Whatever. It's like a little program. It's a little program. It's doing some math thing, right? We want it to write some code that does something, and then we evaluate how good that code is. And one of the things that is potentially concerning is that the model can cheat. There's various ways to cheat this test.
