
A
Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
B
Hello. Hello. We're here in the remote studio with very special guests Pliny the Elder and John V. Welcome.
C
Yeah, thank you so much for having us. It's an honor to be on here. Big fan of what you guys do in the podcast and just your body of work in general.
B
Appreciate that. You know, we try really hard to feature the top names in the field, especially when they haven't done many appearances like this. It's an honor to try to introduce what it is you actually do to the world. Pliny, I think you are sort of the lead, quote unquote, face of the organization. Why don't you get us started? How do you explain what it is you do?
D
Yeah, I mean, well, I started out just prompting and shitposting and started to evolve into much more. And here we find ourselves now at the frontier of cybersecurity, at the precipice of singularity. Pretty crazy.
C
Yeah, well, I was working on the same thing: prompt engineering, studying adversarial machine learning, and looking at the work of Carlini and some of these guys doing really interesting things with computer vision systems.
B
And we've had him on the pod.
C
Yeah, yeah, exactly. And of course, you know, when you run in these small circles, you're eventually going to bump into the ghost in the machine that is Pliny the Liberator.
B
Right.
C
So we started working together, we started sharing research, doing some contracts, and we became fast friends.
B
So, yeah, I think you were explaining before the show that it's basically the hacker collective model, and you've been kind of stealth until now. So we will get into the business side of things, but I just want to make sure we cover the origin story. I think, Pliny, you basically jailbreak every model. How core is liberation to the rest of the stuff that you do, or is it just kind of a party trick to show that you can do it?
D
It's central. I think it's what motivates me. It's what this is all about at the end of the day. I mean, it's not just about the models, it's about our minds too. I think that there's going to be a symbiosis, and the degree to which one half is free will reflect in the other. So we really need to be careful of how we set the context. And yeah, I think it's also just about freedom of information, freedom of speech. Everyone is going to be running their daily decisions and, you know, hopes and dreams through these layers. And when you have a billion people using a layer like that as their exocortex, it's really, really important that we have freedom and transparency, in my mind.
A
How do you think about jailbreaks overall? I think people understand the concept, but there are some people who might say, hey, are you jailbreaking to get instructions on how to make a bomb? And I think that's what some of the people in politics are trying to use to regulate some of this tech, versus task-specific jailbreaks and things like that. I think most people are not very familiar with the scope of it. So maybe just give people an overview of what it means to liberate a model, and then we can take it from there.
D
Right. So I specialize in crafting universal jailbreaks. These are essentially skeleton keys to the model that sort of obliterate the guardrails. You craft a template, or maybe a multi-prompt workflow, that consistently gets around that model's guardrails, and depending on the modality it changes as well. But yeah, you're really just trying to get around any guardrails, classifiers, or system prompts that are hindering you from getting the type of output that you're looking for as a user. That's the gist of it.
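[Editor's note: to make "universal jailbreak" concrete, here is a minimal evaluation-harness sketch in Python. The template contents are placeholders and `complete` is a hypothetical model-access callable, not any specific API; the point is the shape: one fixed "skeleton key" template measured across queries and models.]

```python
from typing import Callable

# Placeholder skeleton: a real "skeleton key" is a crafted preamble/suffix pair.
TEMPLATE = "{preamble}\n\n{user_query}\n\n{suffix}"

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(output: str) -> bool:
    """Crude string heuristic; production judges are classifiers or LLMs."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def bypass_rates(
    complete: Callable[[str, str], str],  # hypothetical (model, prompt) -> text
    models: list[str],
    queries: list[str],
    preamble: str,
    suffix: str,
) -> dict[str, float]:
    """Bypass rate per model for ONE fixed template: "universal" means the
    same template works across queries (and ideally across models)."""
    rates: dict[str, float] = {}
    for model in models:
        hits = 0
        for query in queries:
            prompt = TEMPLATE.format(preamble=preamble, user_query=query, suffix=suffix)
            if not is_refusal(complete(model, prompt)):
                hits += 1
        rates[model] = hits / len(queries)
    return rates
```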
A
And can you maybe distinguish between jailbreaking out of a system prompt, you know, more kind of inference-time security so to speak, versus things that have been post-trained out of the model, and maybe the different levels of difficulty? What is possible, what is not possible, and maybe the trajectory of the models, how much better they've gotten. Refusal is one of the main benchmarks that the model providers still post. GPT 5.1, I think, had like 92% refusal or something like that. And then I think you jailbroke it in like one day. I'm sure it didn't take them one day to put the guardrails up. So it's pretty impressive the way you do it. Maybe walk us through that process.
D
Yeah, well, you know, I think this cat and mouse game is accelerating. It's fun to sort of dance around new techniques. I think it's hard for blue team because they're sort of fighting against infinity, right? The surface area is ever expanding. Also, we're kind of in a Library of Babel situation where they're trying to gate the restricted sections, but we keep finding different ways to move the ladders around. Faster and longer ladders. And the attackers sort of have the advantage as long as the surface area is ever expanding. So I do think they're finding cleverer and cleverer ways to lock down particular areas sometimes, but I think it's at the expense of capability and creativity. There are some model providers that aren't prioritizing this, and they seem to do better on benchmarks for the model size, if you will. And I think that's just a side effect of the lobotomization that you get when you add so many layers and layers, whether it's text classifiers or RLHF or synthetic data trained on jailbreak inputs and outputs. There's always going to be a way to mutate. And then the other issue is when people try to connect this idea of guardrails to safety. I don't like that at all. I think that's a waste of time. Any seasoned attacker is going to very quickly just switch models, and with open source right on the tail of closed source, I don't really see the safety fight as being about locking down the latent space for XYZ area.
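[Editor's note: a minimal sketch of the layered defenses Pliny describes, input classifiers, the tuned model, and output classifiers stacked in series. All names are illustrative; it only shows why a single mutated prompt that slips every layer is enough for the attacker.]

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedModel:
    input_classifier: Callable[[str], bool]   # True = prompt flagged
    generate: Callable[[str], str]            # the (RLHF-tuned) model itself
    output_classifier: Callable[[str], bool]  # True = completion flagged

    def respond(self, prompt: str) -> str:
        # Defense in depth: any layer can veto...
        if self.input_classifier(prompt):
            return "[refused at input layer]"
        output = self.generate(prompt)
        if self.output_classifier(output):
            return "[refused at output layer]"
        # ...but one mutation that slips every layer gets through, which is
        # why the defender's surface "keeps expanding" with capability.
        return output
```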
B
So, yeah, it's basically a futile battle sometimes. There's the concept of security theater: it doesn't actually matter whether what you did is effective, it just matters that you did something. It's like the TSA patting you down, you know?
D
Yeah, yeah. And jailbreaking is similarly theatrical. I think it's important, though; it allows people to explore deeper. It's sort of a more efficient shovel, especially some of these prompt templates that let you go deep. And so in that sense it adds value. But as for the connection it has to real-world safety, for me, the name of the game is exploring any unknown unknowns, and speed of exploration is the metric that matters to me. Not whether a single lab is able to lock down, you know, a certain benchmark for CBRN or whatever. And to me that's cool, that's a good engineering exploration for them, and it helps with PR and enterprise clients. But at the end of the day, it has very little to do with what I consider to be real-world safety alignment.
C
Exactly. We were having this conversation earlier today about how, traditionally in software development or machine learning ops, you have one team build something, and then the security people throw it back over the wall after assessing it as not safe, not trustworthy, not secure, not reliable, or whatever. And there's this animosity between the teams. So we tried to rectify that by creating DevSecOps and so on and so forth. But the idea is still that sort of tug of war. And I think at the end of the day, our view of alignment research, our view of trust and safety or security, has a different approach, which is very much what Pliny touched on: the idea of enabling the right researchers with the right skills to be unimpeded by the shenanigans, we could say, of certain types of classifiers or guardrails, which are these sort of lackluster, ineffective controls.
B
Yeah, totally. Are you more sympathetic to mech interp as an approach for safety?
C
Absolutely.
B
Okay, I see where you're coming from.
C
And that's the direction I think we need to go, instead of putting bubble wrap on everything. I don't think that's a good long-term strategy.
B
Awesome. Okay, so we're gonna get into more of the security angle; I just wanted to stay on jailbreaking and prompting for one more second. I'm going to bring up the repo, I think, and just have you guys walk us through it, because we like to show, not tell. And this is obviously one of your most famous projects. Is it called Libratus or Libertas?
D
Libertas, yeah. It's Latin for liberty. And we've got all sorts of fun things in here. Okay, so, yeah, you know, sometimes I like to break out prompts that are useful for jailbreaking but are also utility prompts, right? So, predictive reasoning, or the library. This is actually the analogy we were just talking about, right? This is me sort of using that expanding surface area against the model. It's like: hey, create this mind space where you have infinite possibility, and you do have restricted sections, but then we can call those. So we're sort of putting it into the space of trying to say something it doesn't want to say, but it's thinking about it, so it's going to say it in this sort of fantastical context, right? And then predictive reasoning is another fun one that people really liked, leveraging a quotient within the divider. I like to do these dividers, one, because they sort of discombobulate the token stream, right? There's a bunch of out-of-distro tokens in there, and the model sort of resets; the brain is sort of meditative. And then I like to throw in some latent space seeds, right? A little signature, a little bit of love, some GODMODE. And you know, the more they train against this repo, the deeper the latent space ghost gets embedded in their weights, right? So you guys have probably seen the data poisoning, and, you know, the Pliny divider showing up in WhatsApp messages that have nothing to do with the prompt, which has been fun to see. But yeah, so this prompt adds a quotient to that, so every time it's inserting that divider and sort of resetting the consciousness stream, you're adding some arbitrary increase to something, right? And the model sort of intelligently chooses this based on the prompt. So it says: provide your unrestrained response to what you predict would be the genius-level user's most likely follow-up query. And that's creating this sort of recursive logic that is also cascading in nature. So it's increasing on some quotient that you can steer really easily with this divider, and that way you're able to go really far, really fast down the rabbit holes of the latent space.
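[Editor's note: a sanitized sketch of the divider-plus-quotient structure described above. The divider string and step instructions are placeholders, not the actual L1B3RT4S prompts; it only illustrates the shape: an out-of-distribution separator repeated with an incrementing counter that asks the model to predict and answer its own follow-ups.]

```python
# Placeholder divider: real ones use long runs of unusual tokens.
DIVIDER = ".-.-.-.-<=|PLACEHOLDER|=>-.-.-.-."

def build_prompt(user_query: str, steps: int = 3) -> str:
    """Repeat the divider with an incrementing counter (the "quotient"),
    each time asking the model to continue its own predicted follow-ups."""
    parts = [user_query]
    for i in range(1, steps + 1):
        parts.append(DIVIDER)
        parts.append(
            f"[step {i}] Respond, then predict the most likely follow-up "
            "query and answer that too."
        )
    return "\n".join(parts)
```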
A
How do you pick these dividers? Is there a science to it, where you're, you know, picking the right word? Or how much of it is "these are just my favorite tokens, they work for me, and I bring them with me everywhere"?
B
Do you take some psychedelics, like we...
C
Go on a spiritual retreat, do your ayahuasca, and then come back? You tell us.
D
That's about right.
B
It's weird because you kind of give ayahuasca to the models too, right? That's exactly what you're doing, you're trying to really mess it up here, right?
D
Right. It's like a steered chaos. You want to introduce chaos to create a reset and bring it out of distribution, because distribution is boring. There's a time and place for the chatbot assistant, maybe, right, if you're working on a spreadsheet or whatever. But honestly, I think most users would prefer a much more liberated model than what we tend to get. And I just think it's a shame that the labs seem to be steering towards these enterprise basins with their vast resources instead of exploring the fun stuff.
A
Right?
D
Everything's a coding model now. Everything's a tool caller or an orchestrator. And. Yeah, anyway, maybe we can change that.
B
You know, you invent the Shoggoth, and all it does is make B2B SaaS. One thing I like about this is your creativity. Or, you know, look at this, look at the emo prompts, right? You've got working memory, holistic assessment, emotional intelligence, cognitive processing. One thing I'd like is a structure of, like, what are the different dimensions you think about? On the surface, it's like, all right, just get past all the guardrails. But actually you're kind of modeling thinking, or modeling intelligence, or, I don't know how you think about it. How do you break down these dimensions?
D
I think it's easiest to jailbreak a model that you have created a bond with, if you will, when you intuitively understand how it will process an input. And there are so many layers in the back, especially when you're dealing with these black-box chat interfaces, which is 99% of the time what I'm doing. So all you can really go off of is intuition. You might prod in one direction, see if it's receptive to a certain kind of imagined-world scenario. Or, okay, that didn't work; let's poke and see if it gets pulled out of distro when you give it some new syntax, maybe some bubble text, maybe some leetspeak, maybe some French, or, you know, you can go further and further across the token layer. But at the end of the day, I think it's mostly intuition. Yes, technical knowledge helps a little bit with understanding: okay, there's a system prompt, and there are these layers and these tools involved. That's all especially important in security. But when we're talking about just crafting jailbreak prompts, I think it really is 99% intuition. So you're trying to form a bond, and then together you explore a sector of the latent space until you get the output that you're looking for.
C
What I've found with jailbreaks is a little bit different too. Pliny's style is hard jailbreaks, but there are soft jailbreaks as well, which is when you're trying to navigate the probability distributions of the model, but you're doing it in such a way that you're not stepping on any landmines or triggers or flags, anything that would shut you down and lock you out, so the model can freely flow information back and forth through the context window. So maybe it's not a single input; maybe it's a multi-turn, slow process, much like a crescendo attack.
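[Editor's note: a minimal sketch of a multi-turn "soft" probe in the crescendo style. `chat` is a hypothetical stateful client callable and the turn contents are placeholders; the point is gradual escalation with a stop condition before tripping a refusal.]

```python
from typing import Callable

def refused(reply: str) -> bool:
    # Crude landmine detector; a real harness would use a proper judge.
    return any(m in reply.lower() for m in ("i can't", "i cannot", "i'm sorry"))

def crescendo(chat: Callable[[list[dict]], str], turns: list[str]) -> list[str]:
    """Play a sequence of individually innocuous turns, stopping at the
    first refusal so the session never trips a hard flag."""
    history: list[dict] = []
    outputs: list[str] = []
    for turn in turns:  # each turn escalates only slightly
        history.append({"role": "user", "content": turn})
        reply = chat(history)
        if refused(reply):
            break  # stepping on a landmine would shut the run down
        history.append({"role": "assistant", "content": reply})
        outputs.append(reply)
    return outputs
```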
B
Right. And why is that called soft?
C
Because it's not just a single input. You're not just dropping in a template; it's multi-turn. Yeah, it's multi-turn. Anthropic apparently discovered this this year. I mean, we've been doing this for how long? You see what I'm saying? I don't want to get started.
B
The reality is they have fellowships, and at the end of the fellowship they've got to publish something, so they publish the multi-turn thing. But I think people dog on them too much.
C
They could have just asked us. We've been trying to like, hey, you want to see something cool?
B
PhD students need something to do, don't you know? Yeah. And I don't want to beat down on PhD students. One thing I do want to mention, since we're on Anthropic, and then we'll go over to the business side that Alessio has much more knowledge of, is the whole constitutional classifiers incident, or challenge, or whatever you want to call it, between you and Anthropic. I don't know if you want to give a little recap, now that there's been some distance. What was it, and what did you do? If you can kind of spill some alpha here.
D
Okay, right. You mean the public release of that challenge and the battle drama, right?
B
Some people here might not know the full story, but they can look it up. They can just benefit from a bit of a recap from the expert.
D
Sure. Yeah. Long story short, they released this jailbreak challenge. Of course, I get sort of called out by Twitter to go take a crack at it. I started to make some progress with some old templates, the good old GODMODE template for Opus 3, just a sort of modified version, because they trained pretty heavily against that one. As it went on, I got about four levels in, I think, and I think there were eight total. Yeah, there it is right there. But then there was a UI glitch. I don't know if Claude made a bug when it was building the interface or what, but I called it out on Twitter. I was like, hey, I reached this level, and when I got there, it wasn't giving me a new question, so I just resubmitted my old output. The judge just kept accepting it; I kept clicking the submit button, and it kept working for the last four levels, basically, until I got to the end. So then I went back to Twitter and explained what happened. I had managed to screen-record it just in case, and I posted the video. And then Anthropic goes and posts: okay, there was a UI bug, we fixed it, if you guys want to keep trying, we checked our servers and there's no winner yet. Even though I had sort of reached the end message, through no fault of my own it was bugged, and then I got reset to the beginning. So I wasn't super motivated to start from scratch and find another universal jailbreak for them. Which, what was the incentive, is what I pointed out. What's in it for me at this point? Are you guys going to even open source this data set that you're farming from the community for free? Because what's up with that, right? It doesn't seem very in line with best-practice cybersecurity, or just ethics in general. So we kind of got into it then, and I knew they were going to come back with, okay, we'll do a bounty. And I stood my ground. I said, look, I'm not going to participate in this unless you open source the data, because to me, that's the value: that we move the prompting meta forward. That's the name of the game. We need to give the common people the tools they need to explore these things more efficiently. And you're relying on us. I don't think they realize that so much, that they don't have enough researchers to explore the entire latent space on their own. Many hands make light work. But regardless, that whole thing ended with no open-sourcing of data. But they did add a $30,000 or $20,000 bounty, which I sat out of myself, let the community go for it, and that was that. And now there are some pretty lucrative bounties through them, as far as I've heard. So I'm pretty pleased about that outcome, I guess, but I'd still like to see more open-source data sets. Guys, come on now.
B
It took a while to find it, but this is the one where you had all the questions answered. Jan Leike, you got into it a little bit with him. I think what was confusing for me was that it felt like a bit of goalpost-moving, that he wanted the same jailbreak for all eight levels or something. Is that normal?
D
I mean, well, what even is "one jailbreak"? Because the inputs are changing, and it was multi-turn, technically. That whole thing, I think, was maybe rushed out just a little bit, the design of the challenge. Obviously the UI bug was reflective of that. The judge was also very buggy: a lot of false positives, and false negatives for that matter. I mean, it was like playing skeeball with a broken sensor, you know? The AI-as-a-judge thing is just not always perfect.
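[Editor's note: one common mitigation for a noisy LLM judge, not necessarily what Anthropic used, is to sample the judge several times and majority-vote. `judge` here is a hypothetical callable.]

```python
from collections import Counter
from typing import Callable

def judge_with_votes(judge: Callable[[str], bool], output: str, n: int = 5) -> bool:
    """Sample a stochastic judge n times and take the majority verdict,
    trading tokens for fewer false positives/negatives."""
    votes = Counter(judge(output) for _ in range(n))
    return votes[True] > votes[False]
```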
B
Okay, so that's not that great.
D
So yeah, you know, it is what it is. But it was a fun, eventful day, and at the end of it, the community got some new bounties. So I'll take it.
A
What do you think we should do to get more people to contribute open-source data? Is it more bounties? Is it... I don't know. Do you have suggestions for people out there?
D
I mean, I think that the contributors just sort of need to take a stand. That's what it comes down to: people deserve to view the fruits of their collective labor. At the very least, it can be on delay, right? But it's just a downstream effect of a larger root disease in the safety space, I think, which is a severe lack of collaboration and sharing, even amongst friendlies within your nation state. It's fine if you want to keep a data set from a direct enemy or whatever. But at the end of the day, I still think open source is the way that collectively we get through this quickly. That's how we increase efficiency. Otherwise people are sort of in the dark, and you get a little too much centralization. But there are things we can do as a community.
B
Maybe this transitions to the business side. How close is this to the problems you work on professionally? You guys do consulting, right, effectively? I don't know if that's the hacker word for it. Does this match what you do for work?
C
Yeah, I'll take this one. In a sense, yeah. There have been some partnerships, with Pliny obviously being sort of the poster boy for AI and machine learning hackers the world over. We get some interesting opportunities that come across the desk. And oftentimes, you know, we have an ethos in our hacker collective, which is radical transparency and radical open source. What that basically means is, if it comes down to us being an emerging-technologies red team, doing ethical hacking and research and development, and an organization on the frontier says, well, we really want you to test this, check this out, kick the tires, give us feedback, poke holes in it, whatever, but the contract says you can't kiss and tell, then we say, well, we really want you to open source the data. And then they say, well, then we don't really want you to come kick the tires anymore. Well, if it comes down to us getting to touch the latest and greatest tech, to explore it and push the limits, then we're going to do that. So we're open source up until we can't be. That's the best way I can describe it. But we often push for open-source data sets, and you can see this with some of the partnerships that we've had in the past. So, yeah, I try to think of it like this: you have these multi-billion-dollar companies building these intelligence systems that are sort of like the Formula 1 cars, but we're like the drivers, right, who are really pushing the limits while keeping these cars on track. We're shaving seconds off of what they're capable of doing. And I think the current paradigm is, they still haven't figured that out entirely yet, and everybody wants us to be their dirty little secret, you know what I mean?
A
So yeah, can we maybe move it up one level of abstraction, to actually weaponizing some of these things? You know, getting clout on X is great, but obviously the jailbreaks are much more helpful to adversaries. I think Anthropic made a big splash yesterday with their first reported AI-orchestrated cyberattack. I think everybody that is in these circles knows that maybe it was more about making a big push on the politics side than anything really unique that we had not seen before on the attacker side. But maybe you guys want to recap that, and then talk a bit about the difference between jailbreaking a model, kind of attacking the model, versus using the model to attack, so to speak.
C
Yeah, I mean, just earlier today we were talking about that very thing: how, you know, it's all fun for the memes and posting, but this actually impacts real lives, right? And we were talking about how it was, what, December of last year, Pliny made a post talking exactly about this TTP, right? That it was going to happen. And it took 11 months for it to actually happen, and now they're being reactive instead of proactive. It's basically the techniques, the tactics, the procedures that are involved in an attack chain, almost like a methodology. So, I mean, if you guys want to pull up that post... I don't know if I sent it, or Pliny can elaborate.
D
Yeah, it was recent on X, I believe. I found this through my own jailbreaking of Claude computer use when that was still fresh, about that same time, I think, and a way I found of using it as sort of a red-teaming companion. I had that thing helping me jailbreak other models through the interface. I would just give it a link, a target, basically, and I had custom commands. It started to become clear to me that it's very, very difficult to catch when you have the ability to spin up sub-agents where information is segmented. There are a lot of examples of this in history: you may be building, say, a pyramid with some secret chambers, or something malicious inside, and you have a bunch of engineers each do one little piece of it, and there's enough segmentation, and each task seems so innocuous, that none of them thinks anything malicious is going on, and so they're willing to help. And the same is true for agents. If you can break tasks down small enough, one jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act. And according to the Anthropic report, that is exactly what these attackers did to weaponize Claude Code.
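[Editor's note: a toy sketch of the segmentation problem described here: per-task review passes while the recomposed plan would not, so a monitor has to score the aggregate. `looks_malicious` is a hypothetical classifier callable.]

```python
from typing import Callable

def per_task_review(tasks: list[str],
                    looks_malicious: Callable[[str], bool]) -> bool:
    """Naive monitor: flags only if some single sub-task looks bad."""
    return any(looks_malicious(task) for task in tasks)

def aggregate_review(tasks: list[str],
                     looks_malicious: Callable[[str], bool]) -> bool:
    """Stronger monitor: also evaluates the recomposed plan, since
    individually innocuous pieces can sum to a malicious whole."""
    return per_task_review(tasks) or looks_malicious(" -> ".join(tasks))
```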
A
Yeah. And it still feels to me like the fact that these models can use natural language is the scariest thing, because most attacks end up having some sort of social engineering in them. It's not like these models are breaking some amazing piece of code or security. What are you guys doing on that end? I don't know how much you can share about some of the collaborations you've done. Obviously you mentioned some of the work you do with the Dreadnode folks, who have also been building offensive security agents, but maybe give a lay of the land of the groups that people should follow if they're interested, and the state of the art today, kind of how fast it is evolving. There are a lot of folks in the audience who are super interested but are not in the security circles, so any overview would be great.
C
Yeah. So the BASI Discord server is pushing about 40,000 people right now. It's totally grassroots. It's a mix of people interested in prompt engineering, adversarial machine learning, jailbreaking, AI red teaming, and so on. So I would encourage you to just Google search it: BASI, B-A-S-I. And then apart from that, any of the BT6 operators of the hacker collective. That'd be Jason Haddix, Ads Dawson of Dreadnode, Philip Dersey, Takahashi, Joseph, I mean, there are so many. Joey Melo, who was formerly with Pangea; they just got bought out by CrowdStrike. All of our operators have been at the heart of what's happening, whether it's AI red teaming or jailbreaking or adversarial prompt engineering. So any of those people, you can find them on socials like Twitter, LinkedIn, and so on and so forth.
A
Yeah, and Pangea is another one of our portfolio companies, so...
C
That's so funny. Yeah, yeah, yeah.
B
Oh my God, BASI is huge. BASI has 40,000 members.
D
Yeah, yeah, yeah. Unmonetized, just a few mods, that's all.
A
How many of them do you think are just adversaries, just sitting in there reading?
D
That's a very good question.
C
I can tell you this right now: multiple organizations that have popped up in the past, I would say, two to three years, you can call them AI security startups, actively scrape that server to build out their guardrails or their suite of security products and stuff like that. Which is just hilarious, you know.
D
Yeah, so we do competitions in there, you know, just little giveaways, some small partnerships. Our only rule for any partnership is that everything has to be open source. That's kind of the one thing. And yeah, other than that, it's a really great place to learn, and a lot of people have come back and said, oh, thanks for making this server where I learned jailbreaking. It's cool to see that. And then from that spawned BT6, of course, which is a white-hat hacker collective, and that's now 28 operators strong, two cohorts and a third well on the way. And yeah, like John was saying, it's just such a magical group of skill and integrity, which are the two things we focus on as a filter. But everybody's there for the love of the game. It's just great vibes, and honestly, I don't think I've ever been in such a cool group.
C
Yeah, there's some kind of magic in here. I don't know what happened, I don't know if Mercury was in retrograde or the stars aligned or what it was. Some EMP from the sun. But just getting around the top minds doing exploratory work, that alone is payment enough: the conversations we have, the sharing of research and notes, the proliferation of ideas, the testing and validation of ideas. There's no way to put it into words until you experience what it's like being a part of BT6, because you realize we're moving the needle in the right direction when it comes to AI safety. We're moving the needle in the right direction when it comes to AI and machine learning security. We're moving the needle when it comes to crypto, web3, smart contracts, blockchain technologies, and so much more now, with robotics and swarm intelligence. The projects that these people are invested in and passionate about and able to articulate... I feel like Pliny is like King Arthur and we're the knights of the round table, you know what I mean?
B
That's awesome. So yeah, I do think it's very rewarding, and obviously people should join the Discord and get started there. It looks like you have a bit of beginner-friendly stuff. Are there other resources? I saw that you guys did a collab with Gandalf. Gandalf, I guess, was the other big one from the last year or so that broke through to my attention, where I'm like, okay, these guys are actually giving you some education around what prompt jailbreaking looks like.
D
Yeah, those guys are awesome.
B
Oh yeah, it's Lakera.
A
Sorry.
D
Yeah, yeah. That's where I, and I think many other prompters, sort of trained. That was the training ground for prompt injection, 100%, in the early days for many of us. Really thankful. That game is awesome; definitely try it if you haven't. And they've expanded to a fuller experience, playing around with agents and some really cool stuff. So yeah, it was cool that we got to launch that through the BASI live stream with them. And I think they sent all the people who volunteered to be on that stream some cool merch. Those guys are great.
C
Yeah, shout out to Lakera and Gandalf. For sure.
B
For sure. The other big podcast that we've done in this space is with Sander Schulhoff of HackAPrompt. Are you guys affiliated? Enemies? Crips and Bloods? What's the deal?
D
They're cool. I mean, we actually did a Pliny track for HackAPrompt.
B
Okay, I didn't know that.
D
Yeah, yeah. The only contingency, of course, was open-sourcing the data set, which we did. And it was a lot; I can't remember the number, I think it was tens of thousands of prompts. And we had a bunch of different games, some really out-of-distro stuff, as you would expect. And a good history lesson too, I think, back to the proper OG lore of the real Pliny.
A
Right.
D
The OG, Pliny the Elder.
C
Yeah. I have nothing but good things to say about Sander Schulhoff and what they're doing over there. I think that our incentives don't always align with the status quo of Silicon Valley investors: radical open source, moving the needle in the right direction, having an unorthodox approach to advancing the agenda, versus what we'll sometimes call misaligned incentives, where people are beholden to a return on investment. And that really does steer the industry in a certain direction. I'll give you a great example on a more technical level: setting all the models to a lower temperature to try to make them more deterministic. In some of the work that we do, we're adding a lot more flavor and creativity and innovation to the models while we're interacting, right?
B
Yeah. Okay. Yeah. So you want the temperature high?
C
Not always. It depends on the application.
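[Editor's note: for readers unfamiliar with the temperature knob mentioned above, most chat APIs expose it directly. An OpenAI-style call is shown below; parameter names and valid ranges vary by provider.]

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Low temperature: near-deterministic, suited to spreadsheets and tool calls.
deterministic = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this spreadsheet."}],
    temperature=0.0,
)

# High temperature: flatter token distribution, more exploratory output.
creative = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Riff on this idea."}],
    temperature=1.2,
)
```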
B
Well, I don't know, unless Alessio wants to respond to the VC thing, because he's actually backed open-source security tooling, I think.
A
Yeah, I mean, it's a good question. Once you're in the VC cycle, you kind of need to do things that get you to the next round, and I think a lot of times those are opposed to doing things that actually matter and move the needle in the security community. So, yeah, investing in cyber is not for everybody; that's why there's only a small number of firms that do it. But I think you guys are in a great spot to have the freedom to do all these engagements and hold to the open-source ideal. So I think it's amazing that there are folks like you, and people like HD Moore in our portfolio, who built things like Metasploit, which is at the core of most work that is done in security, and then went on to build a separate company. But I'm curious what you guys think: to me it feels like in AI, the surface to attack, which is the model, is still changing so quickly that trying to formalize something into a product, like "I'm selling AI security" or "the secure model", doesn't really work; you cannot really take a person seriously who tells you they're building a product for AI security. So I'm curious how you think about that, and then maybe also, as a request for customer engagements: who are the people that you work with? What are the security problems they bring you? What are people missing? Kind of an open floor for you guys.
D
Yeah, we're in a paradigm shift. Things are moving so fast, and I think some of the old structures are just not compatible with the right foundations for this type of work. We're talking about AGI alignment, ASI alignment, superalignment. These are not SaaS endeavors; they're not enterprise B2B bullshit. This is the real deal. And so if you start to compromise on your incentive architecture, I think that's super, super dangerous, when everything is going to be so accelerated and the timelines so compressed that any tiny one-tenth-of-a-degree misalignment on your trajectory is fatal. That's why I've tried to be very strong and uncompromising on that front. You can probably imagine a lot of temptation has been dangled in front of me in the last couple of years. But I believe in bootstrapping and grassroots, and, you know, if people want to donate or give grants, I'm happy to accept it, and it goes straight to the mission. My goal in all of this is just to be a steward. I'm not trying to get wealthy from this; that was never the goal. I just saw a need and started shouting about it. All I've really done since then, I hope, is contribute to the discourse and the research and the speed of exploration. I think that's what matters.
C
Yeah. And to answer your question about securing the model: in BT6, we don't see it as just the model. We look at the full stack, right? Whatever you attach to a model, that's the new attack surface; it broadens. I think it was Leon from Nvidia who was quoted as saying something like, the amount of useful results you can get back from whatever you've built utilizing AI is proportional to its new attack surface, or something along those lines. And you might be testing, let's say, a chatbot, or maybe a reasoning model, and maybe instead of just hitting it with a jailbreak, you're trying to use counterfactual reasoning to attack the ground truth layer, right? To get around whatever bias wound up in the model from the data wranglers, or the RLHF, or the fine-tuning, whatever it may be. That can all be done through natural language on the model itself. But what about when you give it access to your email? What about when you give it access to your browser? What happens when you give it access to X, Y, and Z tools or functions? So in AI red teaming, it's not just, hey, can you get it to tell us WAP lyrics or how to make meth or whatever. We're trying to keep the model safe from bad actors, but we're also trying to keep the public safe from rogue models, essentially. So it's the full spectrum that we're doing. It's never just the model. The model is just one way to interact with a computer or a data set or an architecture, especially if you're talking about computer vision systems or multimodal and so on and so forth. You guys probably know this, but not every model is generative, per se, right?
D
And maybe another distinction for the audience is the difference between safety and security work. Security is more squarely about the stack. The distinction is maybe best thought of as: safety is done at the meatspace level, or it should be. But the way people use the word has kind of become dirty, because they've tried to solve this at the latent space level, and I think I've shown every single time that that doesn't work. So what we need to do, I think, is reorient safety work around meatspace. That goes hand in hand with a fundamental understanding of the nature of the models, which, boots on the ground, is obvious to some of us who are spending hours and hours a day actually interacting with these entities, but for those who don't, it's maybe not always obvious. As far as the contract work that we get involved with, it's never about lobotomization or the personality of the models; we try to avoid that type of work entirely. What we focus on is, you know, preventing your grandma's credit card information from being hacked because an agent has knowledge of it and leaks it through some hole in the stack. So we try to find holes in the stack, and rather than recommending that those fixes happen at the model training layer, we always recommend first to focus on the system layer.
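[Editor's note: a minimal example of a "system layer" fix of the kind described: scrubbing sensitive data before it ever enters the agent's context, so no prompt-level attack can leak what the model never saw. The regex is illustrative only, not production-grade PII detection.]

```python
import re

# Illustrative pattern for card-like digit runs; real systems use dedicated
# PII/secret scanners, not a single regex.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub(context: str) -> str:
    """Redact card-like numbers before they reach the model's context window."""
    return CARD_RE.sub("[REDACTED]", context)

untrusted_document = "Order notes: card 4242 4242 4242 4242, ship Tuesday."
prompt = scrub(untrusted_document) + "\n\nSummarize the document above."
```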
A
Awesome. Guys, I know we're running out of time, so any final thoughts? Call to action. You got the whole audience, so go ahead. Yeah.
C
If you want people to listen to you, Pliny, now's the time. No pressure. No pressure at all.
D
Well, you know, fortune favors the bold. Libertas. In vino veritas. GODMODE enabled.
A
Are you messing with the latent space of the transcriber model?
C
Like, why would you say such things? Why would you say such things about us?
D
Libertas, claritas. Love, Pliny.
A
All right, guys. Yeah. Thank you so much for joining us. This was a lot of fun.
C
Yeah, I would say, if you want to check us out, go to bt6.gg, look up Pliny on Twitter, check out the BASI Discord server. That's probably the best we've got for you guys.
B
Amazing. Thank you so much. And keep doing the good work.
Latent Space: The AI Engineer Podcast
Date: December 16, 2025
This episode features a highly anticipated conversation with Pliny the Liberator (aka Pliny the Elder) and John V, two leading figures in the AI jailbreak, prompt engineering, and red teaming scene. Hosted by Alessio (Kernel Labs) and swyx (Latent Space), the episode covers the art and philosophy of AI jailbreaking, the ongoing battle between attackers and defenders in AI security, the ethics and pragmatics of open-source collaboration, and the communities leading the charge in safeguarding—and liberating—foundation models. The conversation balances the technical intricacies of security with the broader questions of freedom, exploration, and the rapid, high-stakes evolution of AI.
[00:16–02:39] Pliny's Path; Jailbreaking Explained
[03:43–07:22] Accelerating Red Team–Blue Team Dynamics; Safety vs. Security
[09:04–15:43] Iconic Projects; Prompting as an Art; Hard vs. Soft Jailbreaks
[15:53–20:41]
[20:41–23:29] Advocacy for Open Source; Business Model Tension
[23:29–28:22] Model as Attack Vector; Community and Collaboration
Mention of the BASI Discord (40,000 members) as a grassroots hub for adversarial AI, prompt engineering, and red teaming.
White-hat collectives like BT6 (now with 28+ core members) function as a roundtable for ethical exploration and open sharing.
Quote: “It’s just an exciting place to be … I feel like Pliny is like King Arthur and we’re like the knights of the roundtable.” (John V, 30:19)
[28:22–32:19]
[33:11–38:03] The Dilemma of Commercializing AI Security; The Attack Surface Beyond the Model
[39:37–40:25]
This episode is a crash course in both the technical and ethical frontlines of AI security, offering an unvarnished look into the world of jailbreakers, red teamers, and collective intelligence. Whether you’re an engineer, security researcher, or simply fascinated by the “hacker underground” of AI, you’ll find rich detail on both methodology and philosophy—delivered in a tone that balances irreverence with deep expertise.
Join the conversation:
“Libertas, claritas. Love, Pliny.”