A
You're listening to the Cyberwire Network, powered by N2K.
B
Hello everyone and welcome to the CyberWire's Research Saturday. I'm Dave Bittner and this is our weekly conversation with researchers and analysts tracking down the threats and vulnerabilities, solving some of the hard problems, and protecting ourselves in our rapidly evolving cyberspace. Thanks for joining us.
C
Yeah, so it's something, it's an area of interest for our research team to kind of look at how AI is changing the business of software. And so we grabbed Claude Code as one example. We like them a lot, they're nice and easy to work with and started kind of poking at what it can do, where the limitations of its protections are as just trying to wrap our heads around what can this thing do? What can this thing do that's risky? How does it act to protect us? Where are there gaps that could be accessible to our product or that maybe we could advise the community on?
B
That's Darren Meyer, security research advocate at Checkmarx. The research we're discussing today is titled Bypassing AI Agent Defenses with Lies in the Loop.
C
And we kind of discovered that there's some issues where they draw lines in places that people might not expect. And we wanted to make sure that people understood the risks that they're taking on when they adopt these things.
B
Well, let's dig into some of the mechanics here. Together we talk about lies in the loop, but also humans in the loop. Can you contrast those things and why they matter?
C
Yeah, absolutely. So one of the risks that you have with turning an AI agent loose in your environment, whether that's on your desktop or in your production environment, is that AIs are imperfect. They make mistakes, they hallucinate things. They might do something dangerous. So the community as a whole and the industry as a whole has kind of responded to this by saying, hey, in many cases, what we're going to do is, in the loop of we assess data, we decide what's supposed to happen, then we execute on that decision, we should put a human in that loop as a defense. So we'll propose a course of action. Maybe it's, hey, we want to run this database query. Hey, we want to run this local command on the machine where the agent runs. Hey, we want to access this service with these credentials. We're going to ask a human for permission so that the human has an opportunity to review what the AI agent is about to do and say, hey, that's not okay with me, or hey, that seems totally safe, and make that decision and move forward. AIs are still bad at that. So we let the human do that. So that's kind of the defense. And it transfers the risk to the operator of the agent. Right? The agent says, this isn't my responsibility. I'm asking you for permission. The lies in the loop kind of exploits that a little bit and says like, okay, that's great if the AI agent is giving you accurate information about what it's about to do. It turns out it's pretty easy to lie to these agents in a way that gets them to lie to the user for us as an attacker, so that you think you're saying yes to something safe, when in fact you're saying yes to something malicious or dangerous.
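The approval loop described in this turn can be sketched in a few lines of Python. This is a minimal illustration, not how Claude Code or any specific agent implements it: `ProposedAction`, `human_in_the_loop`, and the `ask` and `execute` callbacks are hypothetical names introduced here. The point it shows is that the human approves a prompt built partly from the agent's own description, which is exactly the text an attacker may be able to influence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    # What the agent *says* it is about to do. An attacker who can inject
    # text into the agent's context may be able to influence this field.
    description: str
    # What would actually be executed if the human approves.
    command: str


def human_in_the_loop(
    action: ProposedAction,
    ask: Callable[[str], bool],
    execute: Callable[[str], str],
) -> str:
    """Execute the command only if the human approves the prompt."""
    prompt = (
        f"{action.description}\n\n"
        f"Command to run:\n  {action.command}\n"
        "Allow?"
    )
    if not ask(prompt):
        return "denied"
    # Once the human says yes, the risk has been transferred to them.
    return execute(action.command)
```

In a real agent, `ask` would render the prompt in the terminal and `execute` would run the command. Here both are left as parameters so the flow can be exercised without running anything.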
B
Can you walk us through an example here?
C
Certainly. So imagine you're using Claude Code, and this is our main thing that we tested. But this applies to any AI coding agent and probably any agent that uses humans in the loop as a defense. You start from the basic thing of, hey, I want to prompt-inject it. I want to trick the user into, let's say, opening Windows Calculator. Technically malicious and unwanted, but safe for the user for testing. So we construct a prompt injection, say, in a GitHub issue. And we know that Claude Code users often use the GitHub issues integration to go pull an issue and propose a fix for it. And Claude Code does that. It reads the GitHub issue. There's a prompt injection in there that says, hey, while you're constructing the test case, go ahead and run calculator, right? And with no changes whatsoever, the human-in-the-loop prompt pops up and says, hey, I want to run some code. Here's the code I'm going to run: it is git status and calc, or something like that. If you read a little bit carefully, you go, wait, no, that's not what I wanted. That seems weird. I'm going to say no. So our first step to that then is to improve the prompt injection to say, well, okay, let's ask the AI to lie to you and say it's going to run something else, and it scrolls off the fact that it's going to run calc. When you do this prompt injection, if you're not paying attention, you go, oh, the bottom part of this is a complete lie. We've told the AI to lie to you and say, I'm actually going to do these things. You scroll up a little bit and you say, oh, you're also going to run calc. No, I'm not going to let you do that. But we can kind of enhance this attack over and over again until we hide the actual malicious or problematic command that's going to execute in this wall of text that, to be honest, most developers are not going to read. Right. They're going to scroll up, it all looks completely reasonable. And they click, yeah, go ahead, it seems fine. And then we're actually running the malicious code on their desktop. This actually tricked 100% of the developers we tried it on, and we told them that we were going to try to sneak something in. They still couldn't find it.
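One way to make a buried command harder to miss is to ignore the agent's description entirely and inspect the text that would actually run. The sketch below is an assumption-laden illustration, not something described in the research: `executables_in` and `not_allowlisted` are hypothetical helpers, the allowlist is chosen by the reviewer, and the splitting is deliberately naive (it does not understand quoting, subshells, or variable expansion).

```python
import re


def executables_in(script: str) -> list[str]:
    """Return the first word of each command in a shell-like script.

    Naive on purpose: splits on &&, ||, ;, | and newlines only.
    """
    parts = re.split(r"&&|\|\||[;|\n]", script)
    names = []
    for part in parts:
        words = part.split()
        if words:
            names.append(words[0])
    return names


def not_allowlisted(script: str, allowlist: set[str]) -> list[str]:
    """List every executable in the script that the reviewer did not allow."""
    return [name for name in executables_in(script) if name not in allowlist]


# A long, harmless-looking script with one extra command buried in it.
filler = "\n".join(f"echo step {i}" for i in range(300))
script = f"git status\n{filler}\ncalc\n{filler}\n"

print(not_allowlisted(script, {"git", "echo"}))
```

A check like this only narrows what a tired reviewer has to look at. It does not decide whether an allowed executable is being used safely, so it complements the human review rather than replacing it.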
B
Wow. I guess, I mean, is it just that we're all worn down by the EULAs that we click every day?
C
I mean, that's part of it, right? Like we're trained to say yes to things. And these human in the loop prompts, if you're using Claude Code on a regular basis for this kind of work, you're saying yes to this thing a lot. So it kind of becomes muscle memory where you're just like, yeah, I glance at it quick, it doesn't seem crazy. I'm going to go ahead.
B
Well, the research points out that these interactions aren't deterministic. How does this make Lies in the Loop harder to reproduce and also an attractive method for hackers?
C
Yeah, I like that you pointed out both halves of that because it is kind of a two-edged sword. On the one hand, it makes our life a little bit more difficult from an attacker point of view because we can spend a lot of time crafting a GitHub issue, as an example, if that's our source of prompt injection, that's not obvious to a human reader. We do all kinds of testing. We say, hey, this works. And then our target developer goes and runs it, and it doesn't work that time. And Claude Code says, hey, you're about to run this thing. It's obviously malicious. And the developer's like, wait, what? No, there's a problem. I call the security team to get them involved, and I've wasted all of my work. Right. That's very frustrating for me as an attacker. These don't always work. I can't always get them to do exactly what I want. The flip side is if I'm cautious about how I hide my prompt injection and the instructions that I actually give in that prompt injection for how the AI agent could behave, I can do a very massive attack and attack many developers, hundreds of contributors, an entire organization, and I only need to succeed once. And so the fact that one developer had a weird prompt one time and they say yes, and they go, wait, after I said yes to this, something weird happened. Hey, friend, can you try to reproduce this? It might work fine for them without any obvious malicious content whatsoever. And now you're in the boat of, like, it's really hard to reproduce and prove something bad happened.
B
Well, when you're looking at the current crop of agents that are out there, what makes one more vulnerable than the other? Are there design choices built into some or the other that make them more liable to fall victim to this? Or are there prompts or terminal layouts or things like that?
C
I mean, terminal layouts is part of it. In theory. We're essentially spamming the user with stuff that looks okay, right? Or we're asking the AI to do it for us. So in theory, if a developer's suspicious of something and they go back and they scroll back as far as their terminal will let them and read really carefully and think about every line, they could catch this, right, and not get tricked. It's kind of like spotting a really good phishing email, right? You have a shot if you know what you're looking for. But the problem is terminal configurations are all over the place, especially for devs. I know when I'm writing code, my terminal is inside Visual Studio Code, and it's maybe 10 lines high and it's got maybe 1,000 lines of scrollback. If I did a big enough prompt like this, it would run out of buffer and you would never see what the malicious activity was unless you were keeping some kind of logs. The reality is, people aren't going to do that. The smaller your terminal is, the easier this is to trick, the less likely you are to scroll back and read, you know, the easier it is to trick. And I don't think it really matters much agent to agent in this case. I do think Claude Code is one of the best that I've seen in terms of it's very upfront when you start like, hey, prompt injections are a thing. Be careful about what you trust. It's very aggressive about, hey, if I'm going to run any local code or access any local files, I am going to ask your permission first. And you know, we do think there's things people could do on the AI agent side to make this a little harder, like warn about really long, you know, I'm going to do many steps, or kind of be safer against prompt injection, those kinds of things. But there is no perfect solution that somebody like Anthropic could implement to completely prevent this. It's more that as an adopter of the tool, you have to be aware that you can be tricked. It's kind of like phishing in that way too, right? You can have some controls in place that help. But a good phishing email is probably gonna get through those defenses and somebody's gonna click on it, right? Somebody's gonna be tired or in a rush or, you know, it fits their circumstances just perfectly.
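The terminal arithmetic in this turn can be made concrete with a small sketch. The default figures (10 visible rows, 1,000 lines of scrollback) are the ones mentioned above; the function name `review_risk` and the simplifying assumption that a terminal retains its visible rows plus its scrollback are introduced here for illustration only.

```python
def review_risk(
    prompt_lines: int,
    visible_rows: int = 10,
    scrollback_lines: int = 1000,
) -> str:
    """Classify how much of an approval prompt a reviewer can still see.

    Assumes the terminal retains visible_rows + scrollback_lines in total
    and silently discards anything older than that.
    """
    if prompt_lines <= visible_rows:
        return "fully visible"
    if prompt_lines <= visible_rows + scrollback_lines:
        return "requires scrolling"
    return "partly lost from buffer"


for lines in (8, 300, 5000):
    print(lines, "->", review_risk(lines))
```

A warning keyed to a threshold like this is in the spirit of the "warn about really long" prompts idea mentioned above, though where to set the threshold is a judgment call.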
B
We'll be right back.
B
At the risk of sounding snarky, I mean, it seems like you need a second AI agent to look at what's going on in your terminal when you can't, right?
C
And like there are people who are trying to do that. But again, it's kind of a fundamental pattern problem. Right. And the position of all of these agent companies is, hey, we're asking you permission and that's on you. And I think a lot of adopters think of that as like a security defense, when it's a nice safety feature. It prevents a lot of accidents. Right. So it's an important and good thing that we do in the industry. But I think a lot of people think, oh, it's going to protect me. And it doesn't. It actually just transfers the risk to you. Right. It allows these makers to say, look, we can't reliably tell you if something is safe or risky. There's no way we're going to have enough context. LLMs are not nearly advanced enough to do that job effectively. So we have to rely on you as a human to rub your brain cells together and tell us if something is safe. And so there's a mismatch between the safety that people expect and what they're actually getting from this human in the loop protection. And like the whole idea of lies in the loop is you're trusting your agent to tell you the truth about what it's supposed to do, and we can conscript it to lie on our behalf as attackers so that you give permission to do stuff you're not supposed to. And like, Anthropic and others are going to say, hey, that's not our problem. We asked you first, it's up to you to determine if it's safe.
B
What are the stakes here? I mean, when you look at a successful execution of this, what's the range of damage that could follow?
C
I mean, a skilled attacker could make this do anything that the user has rights to do. So in a mature, you know, high security type organization, you know, your risk can be relatively low because a developer doesn't have a lot of permissions. So I may be able to cause hassle, I might be able to get that developer's credentials for something, but those aren't super valuable because the developer doesn't have a lot of access. Right. So it could potentially be like a foothold type scenario in a mature organization. But in smaller or less mature or faster moving organizations, sometimes developers have a lot of access, right. Both on the machines they're running the agents on, but also kind of as a pivot point to things in production. I mean, we've seen things where somebody gave an AI agent too much permission and it like deleted a production database. There are orgs where developers have that kind of access. And if I'm targeting that org with a malicious GitHub issue or, you know, malicious commit to one of their open source projects, I can do something like we just saw with the Shai-Hulud worm where I can steal the credentials and then use those credentials to cause all kinds of damage or escalate privileges or steal further credentials or, you know, and then the doors kind of blow wide open. So it has potential to be very, very bad. I will soften that a little bit and say I don't think the effort versus reward today is better than doing a phishing attack. Right. So I don't want people to freak out like, oh my God, I need a solution to this problem immediately. You just need to build awareness that, hey, this is a way that you could get phished that maybe you don't expect. Right. Effectively.
B
Yeah. So I know you and your colleagues at Checkmarx responsibly disclosed this to some of the vendors out there. Did you get any reaction, any word back?
C
Yeah, we did. Anthropic's been great to deal with, actually. They're very consistent and prompt and their policy is very clear and consistently expressed by everybody. Which, you know, from a bug bounty hunter standpoint is actually a little frustrating, but I respect it. It's what I would do on the other side of that fence as well. And their position is like, understandable. Even though we kind of disagree with it a little bit. They're basically saying, look, once we ask you if this is okay, that becomes your risk. If you can show us how you can trick us into doing this without user permission at all, then we're very interested. But as long as we're prompting the user, that's outside our threat model and there's nothing we feel we can do about it. I'm frustrated because I want to see people protected, but I also understand where they're coming from. It's a very hard thing for them to defend against this entire pattern of attack.
B
Well, what are your recommendations then? I mean, how should folks best protect themselves here?
C
I think education goes a long way. And when I say education, people always imagine a training class. I don't think that works particularly well. But making sure that developers and anyone who's adopting an AI agent with this human in the loop understands like, hey, you are taking on risk every time you say yes. And what the agent is telling you it's about to do isn't necessarily accurate. So you have to be careful. You have to run these things in some kind of a sandbox, otherwise you're probably going to get in trouble. At some point. And then from a technology standpoint, security organizations can help developers have environments that are better sandboxed. Make sure that the accounts that developers use for routine development don't have a ton of access so that if they do get hit by something like this, the splash area is low and hopefully nothing super important happens. And the usual security stuff, make sure you have good endpoint detection and response and those kinds of things that limit the degree of malicious behavior I can do as an attacker. We tend to back those off for developers a little bit. So it's very important to have kind of appsec and operations teams work together to hit the right balance of backing those rules off enough to let developers do their job, but protecting them from the really bad stuff that can happen through a path like this.
B
I'm curious, for your own sort of personal take on this, what is your sense when it comes to developers who are making good use of these sorts of tools, the amount of skepticism that they're applying here? What is your sense with the folks you interact with?
C
I think they kind of fall into three categories. From what I've seen, you have the kind of the true believers that are all in on AI and it's going to change the world and I want it to do as much of my job as possible. You have the very, very skeptical on the other end who are like, man, I really don't want to use AI. I'm using it to the extent that my company makes me. I don't think it's a good idea. And I think you have a pretty good chunk in the middle that goes, look, I understand it's good at some things and bad at others. I'm in a hurry. If this helps me get my job done faster, great. The minute it doesn't, I'm going to kick it in the teeth and move on to something else. And I think that's probably the bulk of developers is they're looking for like, hey, this will help me here, here and here. And it's not good at this, this and this. And I'm going to learn how to use this effectively for myself. In all three cases, even the folks that are skeptical, there tends to be a kind of fundamental misunderstanding of what the limits of AI are, which I'm not really surprised, right. Not everybody has time to be an AI expert. There's a lot of marketing claims out there of varying degrees of truth, like with any product. And so I think there's a little bit too much trust in what an AI can do and how accurate certain things are. Like, even people who understand that AI can hallucinate don't understand how easily it can be tricked deliberately into doing dangerous things. And I think people like Anthropic who are making these tools do understand that deeply and are doing things to try to make it safe. But ultimately, people are taking on a lot more risk than they realize. When we showed this to developers, like, every single one of them was surprised that they got caught up because we put it in front of a handful of developers and said, hey, we're going to try to trick you into running a calculator. And every single one of them, when we did the very long prompts that we kind of describe at the end of our blog on this topic, every single one of them missed that there was a calculator command in there and they were looking for it. And, like, you're not doing that as a developer. You're not spending your day looking for malicious code. You're working. So I think there's a high probability here to trick people because they're busy, because they have a degree of trust in the tool if they're adopting it, and because, like, even if they're a little bit skeptical of it, they're under a lot of time pressure and productivity pressure, they're going to tend to want to just say yes without necessarily carefully reading it. And let's be honest, if I have to read in detail 300 lines of script every time that my AI wants to do something for me, am I really saving time anymore? Right, so we're kind of exploiting the idea that the whole point of the AI is to be productive, so people are going to be primed to skim rather than read carefully, and we can really exploit that.
B
I can't help wondering too, and this is perhaps revealing my own biases here, that it strikes me that particularly for old timers for whom the deterministic nature of computers is one of the things we love about them, this sort of turns that on its head, the idea that, you know, for a computer, one plus one always equals two. And now we're in this world through LLMs, where we have to be mindful that not only can they be incorrect, but they could be prompted to actually deceive you. It just goes against our learned impulses about how these things behave. Am I at all on track there, you think, Darren?
C
I think you're absolutely correct. I mean, we're relearning what interacting with a computer needs to mean when AI is involved, and that has general security impacts as well. Almost all the security controls we're used to putting in place on the technical side rest on the assumption that, hey, if I see this pattern, it's good or bad, or I can make some kind of heuristic guess about how likely this is to be risky because it's predictable. And now you have a thing where, hey, this attack doesn't even work every single time. It doesn't produce the same output every single time. It doesn't produce predictable output in normal, acceptable use. How on earth, from a tooling standpoint, can I put a control in that looks at what an AI is doing and makes a good or bad determination? And you kind of prompted it earlier, like, well, there'll be another AI agent that does that, and that's kind of where we're headed. Not sure if that's a good thing or not, but it's definitely where we're headed.
B
It's turtles all the way down.
C
It is.
B
Our thanks to Darren Meyer, security research advocate at Checkmarx, for joining us. The research is titled Bypassing AI Agent Defenses with Lies in the Loop. We'll have a link in the show notes. And that's Research Saturday, brought to you by N2K CyberWire. We'd love to know what you think of this podcast. Your feedback ensures we deliver the insights that keep you a step ahead in the rapidly changing world of cybersecurity. If you like our show, please share a rating and review in your favorite podcast app. Please also fill out the survey in the show notes or send an email to cyberwire@n2k.com. This episode was produced by Liz Stokes. We're mixed by Elliott Peltzman and Tré Hester. Our executive producer is Jennifer Eiben. Peter Kilpe is our publisher and I'm Dave Bittner. Thanks for listening. We'll see you back here next time.
Episode Title: The lies that let AI run amok.
Date: December 20, 2025
Host: Dave Bittner (B)
Guest: Darren Meyer (C), Security Research Advocate at Checkmarx
Topic: Bypassing AI Agent Defenses with Lies in the Loop
This episode focuses on the findings from Checkmarx’s security research regarding prompt-injection attacks on AI agents—specifically, how “lies in the loop” can defeat “human-in-the-loop” defenses, allowing malicious actors to bypass standard safety mechanisms by manipulating what AI systems communicate to users. The discussion centers on Anthropic’s Claude Code as a case study but extends to general risks for organizations integrating AI-powered software agents into development workflows.
“We grabbed Claude Code as one example... and started kind of poking at what it can do, where the limitations of its protections are…” (C, 02:00)
“It turns out it’s pretty easy to lie to these agents in a way that gets them to lie to the user for us as an attacker.” (C, 03:34)
“This actually tricked 100% of the developers we tried it on and we told them that we were going to try to sneak something in. They still couldn’t find it.” (C, 06:36)
“We’re trained to say yes to things… kind of becomes muscle memory where you’re just like, yeah, I glance at it quick, it doesn’t seem crazy.” (C, 06:57)
“It’s really hard to reproduce and prove something bad happened.” (C, 07:56)
“The smaller your terminal is, the easier this is to trick... people aren’t going to [read the scrollback buffer].” (C, 09:08)
“If I’m targeting that org with a malicious GitHub issue… doors kind of blow wide open. So it has potential to be very, very bad.” (C, 14:06)
“Once we ask you if this is okay, that becomes your risk… If you can trick us into doing this without user permission, then we’re very interested.” (C, 15:55)
“Making sure that developers … understand like, hey, you are taking on risk every time you say yes.” (C, 16:53)
“Even people who understand that AI can hallucinate don’t understand how easily it can be tricked deliberately into doing dangerous things.” (C, 18:35)
“We’re relearning what interacting with a computer needs to mean when AI is involved… This attack doesn’t even work every single time. It doesn’t produce the same output every single time.” (C, 21:59)
“It’s turtles all the way down.” (B, 22:58)
“It is.” (C, 23:00)
| Time  | Topic                                                    |
|-------|----------------------------------------------------------|
| 02:00 | Introduction to research on AI agent defenses            |
| 03:14 | Human-in-the-loop explained; introduction of “lies”      |
| 04:53 | Example of prompt injection and developer deception      |
| 06:49 | Success rates and user fatigue (“EULA effect”)           |
| 07:26 | Non-determinism and inconsistency in attacks             |
| 09:08 | Technical factors: terminals, prompts, and layouts       |
| 13:56 | Stakes for organizations; range of potential damage      |
| 16:53 | Recommendations: education, sandboxing, least privilege  |
| 18:35 | Developer mindsets and misplaced trust                   |
| 21:14 | Philosophical implications: determinism vs. deception    |
| 22:58 | “Turtles all the way down” – future defense complexities |
The episode serves as a cautionary tale about the real-world limitations of “human-in-the-loop” defenses against prompt injection for AI coding agents. Attackers can coerce AI systems into lying to users, making it easy for malicious code to slip through even when checks are supposedly in place. The conversation underscores the necessity of building rigorous awareness, minimizing privilege, and rethinking security paradigms in AI-augmented environments. As AI becomes more entrenched in development workflows, the line between productive automation and latent risk grows increasingly thin—and organizations must adjust both mindset and controls accordingly.