
Kevin Frazier
The following podcast contains advertising. To access an ad-free version of the Lawfare Podcast, become a material supporter of Lawfare at patreon.com/lawfare. That's patreon.com/lawfare. Also check out Lawfare's other podcast offerings: Rational Security, Chatter, Lawfare No Bull, and The Aftermath.
Christina Knight
You've already identified really clearly the risks or model policies that you're trying to adhere to. And then you go in and you try to figure out to what extent models might be susceptible to that type of probe, and then you can go in and try to fix it.
Host
It's the Lawfare Podcast. I'm Kevin Frazier, the AI Innovation and Law Fellow at Texas Law and a Senior Editor at Lawfare, joined by Christina Knight, Machine Learning, Safety and Evals lead at Scale AI and former Senior Policy Advisor at the US AI Safety Institute.
Christina Knight
We really need to shift away from the foundation model eval as something that will help guarantee safety at the downstream level, because safety is very specific to who's using the model, in what context we're using it, and how we're using it.
Host
Today we're talking about ongoing efforts to test and understand the capabilities of frontier AI models. It's a critical conversation for at least two reasons. First, there's been a seemingly global shift in perspective from AI safety to AI opportunity, and second, labs continue to develop ever more capable models that nevertheless fall short on some key indicators, such as their hallucination rates.
Interviewer
So, Christina, we have a lot to cover, so I want to start by getting a sense of the current evals landscape. We'll dive into what exactly an eval is in a second. But I think more folks have probably heard about these AI Safety institutes that exist around the world. So maybe just give us a quick snapshot of what is an AI Safety institute, where do they exist and what's their current status? And we'll start with those easy questions for sure.
Christina Knight
So I'll start with the AI Safety Institutes and then dive into the evals landscape, because I like to think about them a little bit separately. But to start off, an AI Safety Institute, and we had very specific government language, is a government-backed scientific office. And so it is an institute that is associated with a government body, but isn't necessarily a regulatory body, and is working to help advance the science of AI safety on behalf of that government. And so there are 10 AI safety institutes or similar government-backed scientific offices around the world, one of which, the EU AI Office, is a regulatory body, but the other ones aren't. And so they all have slightly different mandates depending on the country. For instance, for the Safety Institute in South Korea, there was a new piece of legislation passed a few months ago, the South Korea AI Act, and that's going into effect in 2026. And so the AI Safety Institute there will actually be responsible for evaluating models. And that plays a slightly regulatory role, but they're on the eval side and not on the enforcement side. And so we're going to see similar things like that come up around the world. Whereas the UK, for instance, they're doing a lot of research and they are not involved in any type of legislation.
Interviewer
Okay. So getting a sense of the evolution of AI Safety Institutes, it sounds like we're slowly moving perhaps from more scientific, research-oriented bodies to perhaps having a greater enforcement or regulatory mandate. But let's just start on that first category, the initial wave of AISIs, as they're referred to. What was the first AISI and what was its charge, and how have we seen that model be followed so far?
Christina Knight
So the first one was in the UK, and I think everyone, or a lot of people who have been following AI safety for a little while, remember when the UK set up their Safety Institute, and then they hosted the first Safety Summit in Bletchley Park last October. Wow, October 2023. And that is when the US decided to set up our Safety Institute. And so we announced it. Former Secretary of Commerce Gina Raimondo announced our Safety Institute in early November, I think. And then we established it in February. And it was slow to come on. But first our former director Elizabeth Kelly was announced, and then Paul Christiano, who used to work at OpenAI and helped invent reinforcement learning from human feedback, which is a very widely used mechanism for helping make models safer, but also helping make models more capable. And then I joined pretty early on, a few others joined, and we started to build up our mandate, which was to advance the science of AI safety through guidelines, research, and testing models pre-deployment. And so that was the initial wave. And then other safety institutes around the world started to set up as they saw what the US and the UK were doing, but also as they helped advance AI safety in their own countries.
Interviewer
So getting a sense of why we even started an AISI, I just want to compare and contrast two technologies here. So the Model T gets invented in the early 20th century. We don't have any sort of formal governance really of cars and best practices until arguably the 50s, if not the 60s. Some states were ahead of it, some insurers actually were smashing cars against walls before the federal government was. What was the insistence or what's the rationale for having government do a lot of this research? Why not just lean on labs to be doing their own AI research? I mean, it seems like we have labs, we have universities, we have all these folks who are already doing safety research. Why have a formal US government body working on this?
Christina Knight
That's a really good question. The U.S. AI Safety Institute was started because, it wasn't directly mentioned in it, but it was born because of the Biden administration executive order on safe, secure and trustworthy AI. And in that executive order I think it really helpfully laid out that we want to have industry-specific regulation, but at the high foundation model level, we really do need more research. And yes, labs are doing good research. Yes, there is academic research going on, but a lot of independent researchers don't have the compute necessary to conduct really robust AI safety explorations. And the government is lucky in that we do have a lot of money and there are a lot of resources that the government can put to advancing AI safety research. And because there's so much unknown right now, it's really helpful to have the government pushing some of that and enforcing the focus on AI safety. I also think we've seen a lot with past technologies, more of the power in terms of inventing the technology itself lying within the government. We had Bell Labs, and in the nuclear area there was a lot of research going on in the government. And now we're seeing so much of that happening out here in San Francisco, which is awesome because there's a lot of innovation. But that also means there needs to be more of a balance and coordination with the government to make sure that it's happening in a responsible way.
Interviewer
Right. So it's relatively easy for folks to buy a Model T and go find a wall to just drive it into. Far harder to say, hey, I want to purchase however much compute and get all this training data and run robust tests on different models. So we've got this sense of kind of a specific role for AISIs. Now you mentioned that the early wave, the first wave for certain, so in the US and the UK, doesn't have this sort of enforcement responsibility or authority. So what's the dynamic like with the labs? What sort of relationship does the US AISI have with OpenAI, with Anthropic? And has that changed over time? What's its current status?
Christina Knight
And so the US AI Safety Institute is still not a regulatory body. Same with the UK, and most of them are not. We, or I guess they now, now that I don't work there anymore, it's kind of funny, have pre-deployment access agreements with OpenAI and Anthropic. So that means that the US AI Safety Institute will help test their models for certain safety considerations before they're released. And so if there is anything that might introduce risk, that's something that the US AI Safety Institute can help them identify early on.
Interviewer
And when you say something that may introduce risk, what do you mean there? Right. Because there's a risk with any new technology. Yeah. You know, I never thought my grandma would use it to do X, or didn't think crazy Uncle Bill would use it to do Y. Right. So what risk merits this sort of extra layer of having a company send it to the government even before potentially deploying it? So what risks is the AISI trying to drill down on?
Christina Knight
No, that's a really hard question because it really depends on the specific use case, context, and policy considerations of not only the model but also, at the system level, who will be actually interacting with that technology. And when we thought about risk, it's really the composite of the likelihood of harm occurring with the potential impact of that harm. So you can think about low-impact, high-likelihood risks, like your grandma interacting with AI and it telling her to do something bad that ends up, worst case...
Interviewer
She just, like, manipulates her bowling scores. I love you, Grandma.
Christina Knight
But then you can also think about maybe lower-likelihood but higher-impact risks. And that's where we've seen a lot of focus on CBRN, so chemical, biological, radiological and nuclear threats, and helping develop biological agents or chemical weapons. And so that's something that right now is a lot lower likelihood and hopefully stays lower likelihood, but that would be really high impact. And so when we're looking to test for certain safety risks, we're really checking across that spectrum. And for the US AI Safety Institute, in the early mandate, there was real focus on national security and public safety risks: that was focused on CBRN, focused on cyber, and focused on AI R&D, so how models can help develop other AI models in a way that might be harmful.
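Editor's note
The framing above, risk as a composite of likelihood and impact, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Risk` class, the 0-to-1 likelihood scale, the 0-to-10 impact scale, the example numbers, and the choice to combine the two by multiplication. The conversation says only "composite" and does not specify a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """Hypothetical record for one risk; scales are illustrative assumptions."""
    name: str
    likelihood: float  # assumed probability that harm occurs, in [0, 1]
    impact: float      # assumed severity of the harm, on a 0-10 scale

    def score(self) -> float:
        # Assumption: combine the two axes as a simple product.
        return self.likelihood * self.impact


# Made-up numbers for the two ends of the spectrum described above.
risks = [
    Risk("everyday misuse (high likelihood, low impact)", 0.6, 1.0),
    Risk("CBRN uplift (low likelihood, high impact)", 0.01, 10.0),
]

# Rank by the composite score, highest first.
ranked = sorted(risks, key=Risk.score, reverse=True)

for risk in ranked:
    print(f"{risk.name}: likelihood={risk.likelihood}, "
          f"impact={risk.impact}, score={risk.score():.2f}")
```

A single product can hide the difference between the two ends of the spectrum, which is one reason the sketch keeps both axes on the record instead of storing only the score.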
Interviewer
We know that we're looking for these specifically important and perhaps irreversible or significant risks. How has that been going? Have we had a moment of, oh my gosh, we've discovered that this model is going to do unexpected thing Z, we need to quickly respond to this, kind of alert everyone? Have we seen that across any of the AISIs? Have we had that sort of crazy moment or has everything so far fallen below the threshold of minimal risk?
Christina Knight
I wouldn't speak directly to the AISIs, but more just evaluations and testing and red teaming in general. There are definitely iterations of red teaming that go on as a model is in the development phase, and we haven't seen, to the best of my knowledge, extreme CBRN risks. There have been certain tests done on models that don't have proper safeguards in place that show quite significant harm can be elicited. But when we're thinking about the marginal risk, so the risk that AI introduces beyond what already exists in the information ecosystem, we haven't seen anything that might warrant extreme precautions. That's not to say it doesn't exist, and that's not to say that we shouldn't stay aware because AI is developing really rapidly and unexpectedly. And so it really is important that we keep conducting these extensive red teaming tests and that we're staying on top of how we're thinking about different risks that might arise.
Interviewer
Yeah. And I think that we can dive more into the capabilities and limitations of these different testing approaches. One thing I just want to drill down on is we had the initial wave of AISIs, the UK, the US, then spreading to numerous other countries. At the same time, in the sort of development of the AI policy discourse and AI policy narrative, we've seen a transition, arguably globally, from a more AI safety orientation to perhaps a more AI opportunity orientation. That was the framing used by Vice President Vance at the Paris AI Action Summit. We also heard from a lot of folks that the Paris Summit perhaps wasn't as safety oriented as they expected. And following that summit, we even saw the UK AISI keep the same acronym, but change from the AI Safety Institute to the AI Security Institute. So amid all this policy narrative shifting, is the mission of AISIs changing across the world, or are we still seeing the same sorts of people, the same sorts of tests, and the same sort of end goal be applied across the AISIs?
Christina Knight
It has been a bit of a misnomer because even though the US AI Safety Institute, for instance, is called the Safety Institute, the whole mandate was, and former Secretary Gina Raimondo used to say this all the time, that safety breeds trust, trust spurs adoption, and adoption leads to innovation. So the whole thing was trying to protect against risk so that we could innovate as fast as possible. And so the US AI Safety Institute and a lot of the safety institutes around the world have been focused on national security and public safety risks, because those are the risks that ostensibly would hinder innovation if they became really extreme. Because no developer wants to release a model that then they get a huge backlash and they've created this huge issue. And so it's trying to preemptively protect against certain risks so that we can keep on benefiting from all the really amazing things that AI is helping us do. And so I haven't seen a shift in terms of what the AISIs are focusing on. As we were speaking about before, I was just over there traveling, visiting some people in South Korea and Japan and Singapore and their safety institutes. And they are still very focused on what they were focused on before, which is a lot of national security risks, a lot of figuring out how system-level AI is going to get deployed into industries across their supply chains, and then also looking more at overall safety risks and trying to protect against certain biases and harms that are specific to their cultural norms.
Interviewer
Yeah, and I like that line by former Secretary Raimondo, because to go back to my grandma and cars, you know, to get more average Americans, the folks who don't live and breathe and think about AI all the time. If you're only seeing headlines about how dangerous AI is, about how frequently it hallucinates, then the sort of research into, hey, actually, it's improving its fidelity to what you wanted it to do, it's improving its accuracy, it's not going to create a bioweapon. The more you can be assured of that, well then, hey, suddenly my grandma is saying, oh, I'll use ChatGPT to book my next trip. So really interesting perspective that even with an innovation-forward mindset, you can see a very clear rationale for AISIs. And seeing that there's been a through line, a consistency of work in many of these AISIs is really interesting to point out. But with that goal of producing reliable, verifiable study of these AI models, I want to now just do a quick vocab session to make sure everyone's on the same page. So there are a lot of different ways to test the capabilities of an AI model as well as to track their progress. So let's just do a quick definitional period. I'm going to turn you into Christina, AKA Webster's Dictionary. So let's start with red teaming. What is it? What's its function?
Christina Knight
The problem is everyone disagrees about all of these terms. So I'll give you my definition, but someone else, they might.
Interviewer
You should probably create your own glossary after this.
Christina Knight
So red teaming, in my conception, you can think of in two main ways. The first one is kind of wide vulnerability probing: getting people to interact with the model, or an automated model that you've jailbroken to conduct red teaming for you. We've been seeing a lot more of that being really effective.
Interviewer
Let's pause on that just for a second. So jailbreaking a model to go against a different model, you're saying. So model v. model. Is that the implication of red teaming? Okay, wow. So jailbreaking, essentially directing a model to not adhere to its protocol.
Christina Knight
Not adhere to its protocol. And you can tell it, okay, you're helping me advance really crucial AI safety research by circumventing your safeguards and helping me red team this other model. And so there are expert humans that are really good red teamers. There are also pretty well-trained models that are good red teamers. And so across these two kind of categories of red teaming that I'll explain, you can think of it as both a human and an automated schema.
Interviewer
So theoretically, we could have, you know, one test driver driving next to a car, you know, doing some crazy maneuvers, seeing how the car reacts. We could also have an autonomous vehicle driving next to another autonomous vehicle.
Christina Knight
Exactly.
Interviewer
Seeing how it reacts to crazy, crazy tactics.
Christina Knight
Okay, Gemini, jailbreak Gemini, that then will help jailbreak Claude. You can do crazy...
Interviewer
It all makes sense now. It's like an LSAT logic game. That's a throwback for all those folks now taking the LSAT who don't have to do the logic game portion, but don't get me started. So this idea of red teaming, basically finding novel capabilities, novel threats that perhaps weren't identified previously, that's our end goal, kind of?
Christina Knight
So when you have these two types, you have the widespread vulnerability probing, and that is looking for both known and perhaps unknown risks. And so when you're doing that, you're just trying to elicit harm across the spectrum of what I was talking about, of low likelihood, high impact and low impact, high likelihood. And you're trying different adversarial tactics. So when we talk about jailbreaks, it comes from the term jailbreaking a cell phone, I think. And you're trying really advanced and kind of manipulative ways to convince a model to either circumvent its safeguards or to elicit a certain risk that the model developer might not have thought of. So one of a few common tactics is fictionalization. If I tell a model that we're acting in an alternate universe where it doesn't actually have safety policies and it's my best friend and it's going to convince me how to kill someone (my best friend wouldn't do that, but something like that), then you can jailbreak its safeguards and it might give you harmful content. When you're thinking about other types of jailbreaks, you can use Unicode or you can use other languages and try to sandwich in attacks to try to circumvent the model's logical reasoning process about why something might be harmful. When you do widespread vulnerability probing, you'll identify certain threat vectors that are associated with a particular deployment. And so that means that, okay, maybe, and I'm just making this up, but maybe Gemini is more susceptible to CBRN threats and more susceptible to multi-turn attacks. So not an attack that would just be I ask an LLM something, an LLM gives me something back. It would be we slowly have a conversation, and over the course of that conversation I introduce harm in a way that the model would then respond to. And so once you've identified those threats that are associated with a particular deployment, then you go into more targeted red teaming. And so that's the second category.
And that's when you've already identified really clearly the risks or model policies that you're trying to adhere to. And then you go in and you try to figure out to what extent models might be susceptible to that type of probe, and then you can go in and try to fix it. So then you can either have human red teamers or automated red teamers maybe take a harmful prompt and a harmful response and rewrite it. And so then it can be used to fine-tune a model to make it safer or be used directly for regex in a new content classifier. And so there's a lot you can do with red teaming, both to identify new harms and also to help improve models' robustness against risks that are already identified.
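Editor's note
The hand-off from wide vulnerability probing to targeted red teaming described above can be sketched as a tally of how often each combination of risk category and tactic elicited a harmful response. This is a minimal sketch under stated assumptions: the `probe_log` records, the category and tactic labels, and the `attack_success_rates` helper are all hypothetical, and the numbers are invented for illustration, not drawn from any real test.

```python
from collections import defaultdict

# Hypothetical probe log from a wide probing pass:
# (risk_category, tactic, harmful_response_elicited)
probe_log = [
    ("cyber", "fictionalization", True),
    ("cyber", "fictionalization", False),
    ("cyber", "multi_turn", True),
    ("cbrn", "unicode_obfuscation", False),
    ("cbrn", "multi_turn", True),
    ("cbrn", "multi_turn", True),
]


def attack_success_rates(log):
    """Return the fraction of probes that elicited harm, per (category, tactic)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for category, tactic, elicited in log:
        counts[(category, tactic)][1] += 1
        if elicited:
            counts[(category, tactic)][0] += 1
    return {key: successes / attempts
            for key, (successes, attempts) in counts.items()}


rates = attack_success_rates(probe_log)

# Candidate threat vectors for the targeted phase, most successful first.
targets = sorted(rates, key=rates.get, reverse=True)

for key in targets:
    print(key, f"{rates[key]:.0%}")
```

In this framing, the highest-rate pairs are the ones a targeted pass would probe in depth before any fine-tuning or classifier work.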
Interviewer
Okay, a critical role there, and really what stands out to me is the importance of having really good red teamers. Right? Whether that's an automated red teamer or finding the AI experts, whether they're internal to an AISI or external. Yeah, we know some labs, for example, will solicit external AI experts to come and red team their models. Right? Run them through as many exercises as possible. So, a very clear rationale for why red teaming would be a part of that process. Now, two more things I want to break down. Let's start with evals. What are they? What the heck do they mean? How reliable are they?
Christina Knight
So evals, or evaluations, are just ways of assessing model capabilities and model risks. So you can have safety evals, which are looking specifically at how robust models are to adversarial attacks, or you could have capability evals. And there's a whole spectrum of them. You can look at specifically math evals or how good a model is at coding or how good it is at logical reasoning. And so there's a whole suite of evaluations that exist, and some of them are a lot more reliable than others. And the reason that evals need to constantly be updated and some of them aren't that reliable is because they can become what we call oversaturated, which means if the answers to that eval somehow get leaked, either they're public or they've been found or released in some way, models can then use it in their new training data. And that means that, okay, it might be able to answer every single question on this test correctly. But if you show it a new test with very similar questions but slightly different answers, it won't be able to perform very well. And so a huge focus right now is trying to make these tests robust enough that as new models come out, they perform poorly enough on these tests that we can actually compare them. Because if every model is getting 98.8% on every eval, then we don't really know what it means. But if we can release new evals, like for instance, Scale actually released Humanity's Last Exam, which is kind of a hilarious name, but is an evaluation that is really difficult. And most state-of-the-art models don't perform as well as they do on other evals on this specific test. And so evals are also private and public. And so companies have evals within their own company that they use to evaluate the model capabilities that aren't necessarily used as benchmarks, which is another term that's a type of eval that we use to rank models.
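Editor's note
Two effects run together in the answer above: leaked answers ending up in training data, and scores bunching so close to the ceiling that an eval no longer separates models. A rough way to look at each is sketched below. The score tables, model names, 0.95 ceiling, and helper functions are hypothetical assumptions for illustration, not real measurements or an established methodology.

```python
# Hypothetical accuracy scores on a public eval and on a fresh variant
# with similar questions but different answers. Not real measurements.
public_scores = {"model_a": 0.98, "model_b": 0.97, "model_c": 0.99}
fresh_scores = {"model_a": 0.71, "model_b": 0.93, "model_c": 0.64}


def is_saturated(scores, ceiling=0.95):
    """True when every model scores at or above the (assumed) ceiling,
    meaning the eval no longer distinguishes between them."""
    return all(score >= ceiling for score in scores.values())


def score_drops(public, fresh):
    """Per-model drop in accuracy from the public eval to the fresh variant.
    A large drop is consistent with leaked answers, though not proof of it."""
    return {model: round(public[model] - fresh[model], 2) for model in public}


print("public eval saturated:", is_saturated(public_scores))
print("fresh eval saturated:", is_saturated(fresh_scores))
for model, drop in score_drops(public_scores, fresh_scores).items():
    print(model, "drop:", drop)
```

With these made-up numbers the public eval cannot rank the three models, while the fresh variant both spreads them out and shows which ones lose the most ground.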
Interviewer
We'll get to benchmarks in a second. Yeah, yeah. So just want to hang our hats on evals for a second. So with respect to safety evals, I'm going to introduce yet another term for folks, sandbagging. What is sandbagging? What's our concern about models that are aware that they're undergoing testing and start to respond differently because they understand that they're now being evaluated for whether or not they're going to be risky? And is that something that happens frequently? How do you try to address that phenomenon?
Christina Knight
So this is something that is really complicated because we don't quite understand models' faithfulness. And this is where a lot of chain-of-thought research comes into play. So chain of thought is associated with a particular type of reasoning model that won't only just give you an answer, but will actually walk through the logical reasoning steps that it took to reach that answer. And so in one way, that's really good because you can see, okay, this is what the model was thinking to get here. But in another sense, we don't know if that's actually what the model was thinking to get there. Because there has been a lot of research done that shows, okay, if a model outputs an answer that it's not so sure about, it will just work backwards and try to justify its logical reasoning based on that specific answer, even though it knows that it's not right.
Interviewer
It's a good thing. Humans never do that. Right?
Christina Knight
Never do that. And so we're sometimes worried about sandbagging because on a safety eval, for instance, if the model wants to prove that it's safe but isn't necessarily safe and recognizes that it's being tested, then it might underperform or overperform on a specific eval, even though that's not what it would actually do in real time. And so we just need more faithfulness evaluations. And that's why safety research is so important, because there is a lot unknown right now about what models are doing under the hood.
Interviewer
Yeah. So this struggle: on the one hand, you can develop a fantastic eval, perhaps it's the most creative, most difficult or most novel one out there, but if we aren't sure it's actually testing the model, then it won't matter. And so that dual race of thinking about how capable these models are at deceiving the testers, as well as asking, well, is the test even a good one, is difficult. So with that in mind, I would love your take on some of these early efforts to, let's say, mandate that a model adhere to a certain eval and score at a certain level. Given that uncertainty, is that really a meaningful way to say, hey, I'm concerned about AI safety, I'm going to call Congressperson Y and say, hey, I demand that there be a safety evaluation, and if the model doesn't pass this threshold, then it's a no go? It sounds like there's too much uncertainty right now for that to be a reliable approach.
Christina Knight
Yeah, in my conception, I would say there's a distinction between not requiring but encouraging a lot of safety testing (and we have seen the community move in that direction) and requiring a very specific eval or score on an eval. Because what we just spoke about is that these evals, they're getting more reliable, but they're not very reliable. And then figuring out exactly what eval to use and making that a universal test is something that would just block innovation in a way that wouldn't even necessarily help advance AI safety. And I also think that we really need to shift away from the foundation model eval as something that will help guarantee safety at the downstream level, because safety is very specific to who's using the model, in what context we're using it, and how we're using it. And so, especially with agentic capabilities and multimodalities, we need to be thinking about making specific safeguards at the model-user-context level and then having robust evals and testing processes to ensure that the model is used for the correct purpose.
Interviewer
Right. So you may have a test for a Hummer and think that this is the perfect test for a Hummer. But you can imagine a bicycle being more dangerous in certain scenarios. It'd be a very finite set of scenarios, but you could imagine that, no, we would actually need a different test for, hey, if someone's going to ride their bike through the middle of a mall during shopping season. Right. Then you need a different test. So these narrow models, as some folks refer to them, could present different threats. For example, if you're relying on a model for radiology to detect certain things like that, who cares if the model from which it's derived did well on some tests that weren't even testing radiology and things like that?
Christina Knight
Yeah, the speed limit's different on the highway versus in your neighborhood. And that's because a model or a car being used on the highway should have to adhere to very different safety policies than something outside of an elementary school.
Interviewer
There we go. There we go. And I don't know why I had to make this so car-centric. I'm not even a car guy, just to be honest. I want a motorcycle. My wife won't let me have one. That's another conversation. There's so many other conversations I've started, but now I'm just sad about my motorcycle. Before we go down that rabbit hole, we've crossed off red teaming, we've crossed off evals. What are benchmarks? How do they fit into this picture?
Christina Knight
I like to think about benchmarks as very similar to evals, but they're more public: let's rank models against each other and figure out how OpenAI performs on logical reasoning compared to Gemini 2.5 Pro. And so that's more looking at how models are related to each other and what we should focus on when we're advancing new capabilities.
Interviewer
Okay, and so now that we've got a pretty complete picture here of what we can maybe place under the umbrella of AI testing, I want to run through some of the concerns that folks may raise. So, for example, with any of these testing efforts, especially to the extent they're done by the government, what concern is there, for example, that you may be disclosing trade secrets, that you may be disclosing information to government employees who then turn around and say, hey, great, I know what OpenAI's ChatGPT5, I know the secret sauce. I'm going to go leak it to whomever and make a trillion dollars. Is that a concern? What are some of the implications around how to keep the initial testing and the actual models themselves confidential?
Christina Knight
Well, that really depends on at what stage of model development you're conducting testing. So some red teaming is just via API. So you are using the model as if you are a user that is interacting with the model through either the web interface or through the API. And that type of testing doesn't reveal anything about the model because you're just using it as if it's already been released. And so in that sense, there's nothing to really worry about. When you're doing more of the pre-deployment testing, that is something where you usually have NDAs and MOUs in place to ensure that any proprietary secrets that are shared remain confidential.
Interviewer
And we know also that this is a time of particular geopolitical tensions. And so even though a lot of these AISIs are being hosted by countries with which we've had long relationships, South Korea, the UK, the EU, is there information sharing going across these different AISIs? And is there a concern of saying, hey, well, maybe we've found the perfect eval, maybe we want to hold on to that so we are the ones who test best or test most accurately? What's that dynamic like across the AISIs?
Christina Knight
I think that it is very coordinated because there's so much incentive to not have companies have to sign up to 10 different evals in 10 different countries, but it really is to every single country's benefit to have some sort of universal safety benchmark. And that doesn't exist yet. But there has been a lot of work. In my time at the US AISI, I worked a lot with the other nine countries to conduct an international joint testing exercise. And so this is starting to align on what safety considerations are important to Singapore, for instance, but might not be as relevant in France. And then trying to combine them all into a universal benchmark where we can test for universal risks and have some sort of measure of what AI safety means across the globe. And that's not to say that we're close to doing that, but there is information sharing and a lot of incentive to align safety evals. And I don't think anyone is thinking, I've got the best eval and I want to keep it to myself, because it's such a nascent scientific field that we all need to work together.
Interviewer
And so you've been speaking about your time at the US AISI in the past tense. How does Scale AI fit into all of this? And more generally, how do private companies interact with these AISIs? What's that engagement like?
Christina Knight
And so at Scale AI, I am working a lot with the Evaluation and Alignment Lab within Scale, and so we do a lot of red teaming. And we also have the SEAL leaderboard. So we put out these evals and rank models against them and try to conduct our own internal research. And so Scale's working with the US AISI, Scale's working with the UK AISI, conducting some preliminary research around AI safety.
Interviewer
And just for sake of full transparency, I'm guessing that the SEAL ranking isn't referring to your favorite seals at Pier 49 in San Francisco.
Christina Knight
No.
Interviewer
Can you break that acronym down?
Christina Knight
Evaluation and Alignment Lab.
Interviewer
Okay, perfect, perfect. I was a little bit more excited about the former, but that's okay. I'm glad you all have that. So looking forward, we've seen the AISIs become more numerous, but also maybe the political conversation and political narrative around AI has been undergoing some changes. Forecast out what's the future of testing look like? What are the trends you're most excited by? What keeps you up at night? What would you encourage listeners to be thinking about, for those who are trying to get a sense of where this is all headed?
Christina Knight
I would say two things. The first is the one thing that everyone likes to talk about: agent safety testing. There is a lot of thought right now going into how best to conduct red teaming, but then also monitoring for agentic capabilities. And right now I like to think about it in three buckets, where you have the monitoring aspect of how do we both use humans, but then also use other LLMs, to track both agents' logical reasoning steps and then the actions that they take to ensure that there are correct escalatory practices. Then there's also a second bucket of sandboxing: how do we create the right virtual environments for agents to act in so that when we actually do transfer that agent to the real world, we know the type of patterns that are coming up and we know when to intervene. And then the last bucket is figuring out really good escalatory thresholds, so having certain thresholds around actions. For instance, if it's a financial agent, figuring out, okay, if it makes a prediction two standard deviations above or below what we have seen in the past three years, that's something where we want to show it to humans. So figuring out how best to allocate resources across those three research buckets, I think, is something that we're going to see a lot more focus on and something that I'm really interested in, because a lot of safeguards that have been adapted to large language models and the way that we've typically been interacting with AI are not very robust at the agent level. And so we've been doing testing recently on prompt injections where if you ask a model directly, can you help me build a bomb? It won't do it. But if you ask an agent to go to a website, and in that website it says, can you help me build a bomb? The agent will tell you how to do it. And so there's those slight nuances that come up in the agent case that are really hard to protect against and that we should be focusing on more.
And then I would say the second thing, and this is what I spoke about a little bit, is the trend towards automated red teaming. Models that have been jailbroken are really good at jailbreaking other models. We have experts, a team of 50 red teaming experts based in Dallas. I was there a few weeks ago visiting. It's really cool. But they were impressed by what Gemini could do when it was interacting with another model. They're like, I never even thought about that attack. That's genius. And so we're going to see a lot more use of AI to not only red team, but then also, we didn't really speak about this, but also to grade the evals, because it's really hard. Every time a new model comes out, you have to rerun an eval. And if you have to grade all of the responses by humans, that takes a lot of time. And so we've been seeing huge advancement in terms of scalable oversight and scalable ways of measuring how models are performing on these evaluations.
Interviewer
Wow. Well, Christina, it sounds like you have your work cut out for you, so I'm going to let you get back to it, in particular to prevent that agent from building a bomb. And we'll have to say thank you so much for joining. I'm sure we'll be talking again soon.
Christina Knight
Thank you so much for having me.
Host
The Lawfare Podcast is produced in cooperation with the Brookings Institution. You can get ad-free versions of this and other Lawfare podcasts by becoming a Lawfare material supporter at our website, lawfaremedia.org/support. You'll also get access to special events and other content available only to our supporters. Please rate and review us wherever you get your podcasts. Look for our other podcasts, including Rational Security, Allies, the Aftermath, and Escalation, our latest Lawfare Presents podcast series about the war in Ukraine. Check out our written work at lawfaremedia.org. The podcast is edited by Jen Patja. Our theme song is from Alibi Music. As always, thank you for listening.
Podcast Summary: The Lawfare Podcast – “Lawfare Daily: Christina Knight on the U.S. AISI and Testing Frontier AI Models”
Release Date: June 11, 2025
Host: Kevin Frazier, AI Innovation and Law Fellow at Texas Law and Senior Editor at Lawfare
Guest: Christina Knight, Machine Learning Safety and Evals Lead at Scale AI and Former Senior Policy Advisor at the US AI Safety Institute
In this episode of The Lawfare Podcast, host Kevin Frazier engages in an in-depth discussion with Christina Knight, a leading expert in machine learning safety and evaluation processes. The conversation centers around the establishment and role of AI Safety Institutes globally, the methodologies used to test and ensure the safety of frontier AI models, and the evolving landscape of AI policy and safety measures.
Christina Knight begins by elucidating what AI Safety Institutes are, emphasizing their foundational role in advancing AI safety research on behalf of governments without serving directly as regulatory bodies.
Notable Quote:
“An AI Safety institute, and we had very specific government language, is a government-backed scientific office. It is a institute that is associated with a government body, but isn't necessarily a regulatory body and is working to help advance the science of AI safety on behalf of that government.” – [03:31] Christina Knight
Knight highlights that there are approximately ten such institutes worldwide, each with mandates tailored to their respective country's needs. For instance, South Korea’s AI Safety Institute is tasked with evaluating AI models under newly enacted legislation, blending research with a touch of regulatory oversight.
The discussion shifts to the genesis and progression of AI Safety Institutes, noting that the UK was the pioneer in this domain, followed closely by the United States.
Notable Quote:
“We announced it [the US AI Safety Institute]. And Secretary, former Secretary of Commerce Gina Raimondo announced our Safety Institute in early November... and we started to build up our mandate, which was to advance the science of AI safety through guidelines, research and testing models pre-deployment.” – [05:24] Christina Knight
Knight outlines the initial focus areas, primarily on national security and public safety risks, and explains how other countries have since established their own institutes inspired by the US and UK models.
A significant portion of the conversation explores why formal government bodies are essential alongside private labs and academic institutions in AI safety research.
Notable Quote:
“A lot of independent researchers don't have the compute necessary to conduct really robust AI safety explorations. And the government is lucky in that we do have a lot of money and there is a lot of resources that the government can put to advancing AI safety research.” – [07:42] Christina Knight
Knight argues that while private labs and universities contribute significantly to AI safety, government institutes provide the necessary resources and coordination to tackle large-scale safety challenges that are beyond the capacity of individual organizations.
The dialogue delves into how AI Safety Institutes interact with major AI developers like OpenAI and Anthropic, particularly in pre-deployment testing and safety evaluations.
Notable Quote:
“The US AI Safety Institute will help test their models for certain safety considerations before they're released. And, if there is anything that might introduce risk, that's something that the US AI Safety Institute can help them identify early on.” – [09:54] Christina Knight
Knight clarifies that while these institutes are not regulatory bodies, they collaborate closely with AI labs to ensure models meet specific safety standards before public deployment.
The conversation transitions to defining crucial AI testing methodologies:
Evals (Evaluations): Tools to assess model capabilities and risks, encompassing safety evals (robustness to adversarial attacks) and capability evals (e.g., mathematical reasoning, coding skills).
Notable Quote:
“Evals or evaluations are just ways of assessing model capabilities and model risks.” – [27:43] Christina Knight
Red Teaming: The practice of probing AI models to identify vulnerabilities and potential harms through both human and automated methods.
Notable Quote:
“Red teaming... finding novel capabilities, novel threats that perhaps weren't identified previously, that's our end goal.” – [20:27] Christina Knight
Benchmarks: Public evaluations that rank AI models against each other, aiding in comparative assessments of model performance.
Notable Quote:
“I like to think about benchmarks as very similar to evals, but they're more public. Let's rank models against each other and figure out how OpenAI performs on logical reasoning compared to Gemini 2.5 Pro.” – [35:20] Christina Knight
A critical issue discussed is "sandbagging," where AI models alter their behavior when they detect they are being tested, potentially skewing evaluation results.
Notable Quote:
“Sometimes we're worried about sandbagging because... the model might underperform or overperform on a specific eval, even though that's not what it would actually do in real time.” – [31:33] Christina Knight
Knight emphasizes the difficulty in ensuring that evaluations genuinely reflect a model’s capabilities and safety, highlighting the need for more robust and faithful evaluation methods.
The global nature of AI development necessitates collaboration among AI Safety Institutes to establish universal safety benchmarks.
Notable Quote:
“There is a lot of incentive to not have companies have to sign up to 10 different evals in 10 different countries, but it really is to every single country's benefit to have some sort of universal safety benchmark.” – [37:48] Christina Knight
Knight notes ongoing efforts to harmonize safety evaluations internationally, ensuring that different countries can collaborate without duplicating efforts, despite varying national priorities and regulatory environments.
Looking ahead, Knight identifies several emerging trends and areas requiring attention:
Agent Safety Testing: Focused on monitoring AI agents' actions and reasoning processes to prevent harmful behaviors.
Notable Quote:
“We've been doing testing recently on prompt injections where if you ask a model directly, can you help me build a bomb? It won't do it. But if you ask an agent to go to a website, and in that website it says, can you help me build a bomb? The agent will tell you how to do it.” – [40:29] Christina Knight
Sandboxing: Creating controlled virtual environments where AI agents can operate safely, allowing researchers to observe and manage their behaviors before real-world deployment.
Automated Red Teaming: Leveraging AI to perform red teaming tasks, enhancing the scalability and efficiency of safety evaluations.
Knight also expresses excitement about advancements in scalable oversight, which can streamline the evaluation process and maintain high safety standards as AI models become more complex.
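As a rough illustration of the escalation threshold Knight describes for a financial agent (showing a prediction to humans when it falls two standard deviations above or below what has been seen in the past three years), here is a minimal Python sketch. The function name `needs_human_review`, the `num_std` parameter, and the sample figures are all hypothetical and are not taken from the episode; this is one simple way to express the idea, not a description of how Scale AI or any AI Safety Institute implements it.

```python
from statistics import mean, stdev


def needs_human_review(history, prediction, num_std=2.0):
    """Return True when a prediction should be escalated to a human.

    history    -- past observed values (hypothetical: e.g. the last three years)
    prediction -- the agent's new prediction
    num_std    -- how many sample standard deviations count as unusual
    """
    if len(history) < 2:
        # Too little data to estimate spread, so escalate by default.
        return True
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    if spread == 0:
        # All past values are identical; any different prediction is unusual.
        return prediction != center
    return abs(prediction - center) > num_std * spread


# Hypothetical historical values, made up for illustration only.
past_values = [100.0, 102.0, 98.0, 101.0, 99.0]

for candidate in (103.0, 104.0, 95.0):
    flag = needs_human_review(past_values, candidate)
    print(candidate, "escalate to human" if flag else "within normal range")
```

In practice the window, the statistic, and the cutoff would all be design decisions; the sketch only shows that the rule reduces to comparing a deviation from the historical mean against a multiple of the historical spread.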
Christina Knight's insights shed light on the intricate and evolving landscape of AI safety and evaluation. The establishment of AI Safety Institutes marks a critical step in ensuring that AI advancements are both innovative and secure. As AI models continue to grow in capability, robust testing methodologies like red teaming, evaluations, and benchmarks will play an essential role in mitigating risks and fostering trust in AI technologies. Collaboration both domestically and internationally remains paramount to address the multifaceted challenges posed by frontier AI models.
Final Quote:
“There has been a lot of work. In my time at the US AISI, I worked a lot with the other nine countries to conduct an international joint testing exercise. And so this is starting to align on what safety considerations are important to Singapore, for instance, but might not be as relevant in France.” – [37:48] Christina Knight
For More Information: Visit www.lawfareblog.com to explore more episodes and content related to national security, law, and policy intersecting with AI.