Summary

The Lawfare Podcast — Lawfare Archive: Elliot Jones on the Importance and Current Limitations of AI Testing

Date: March 15, 2026
Host: Kevin Frazier
Guest: Elliot Jones (Senior Researcher, Ada Lovelace Institute)

Episode Overview

This episode digs deep into the complexities of artificial intelligence (AI) testing in the context of recent developments in AI regulation, such as the European AI Act, and high-profile industry moves by OpenAI and Anthropic. Kevin Frazier interviews Elliot Jones about the state of AI assessments, the importance of rigorous testing, current regulatory approaches in the EU, UK, and US, and the practical and ethical limitations of existing methodologies. The discussion is grounded in a thorough report co-authored by Jones, offering listeners clear definitions, critical policy analysis, and a nuanced look at the stakes involved as AI systems become more pervasive and powerful.

Key Discussion Points & Insights

1. Why AI Testing Matters

Recent Developments Drive Concern
- Unprecedented growth in AI capabilities (e.g., ChatGPT and successors) has increased attention on how these systems work and their risks (03:28–04:10).
- There is a lack of comprehensive guides on how useful current assessment tools are.
Quote:

"We've seen massive developments in the capabilities of AI in the last couple years... but we don't really understand how it works, how it's impacting society, what the risks are."
— Elliot Jones (03:28)

2. Clarifying the Language: Benchmark, Evaluation, and Audit

Definitions and Distinctions (04:10–06:33):
- Evaluation: Broad umbrella; understanding model performance or impact.
- Benchmark: Standardized test (often questions with known answers) for comparability.
- Audit: Structured, standardized process akin to financial audits, possibly including governance assessments, often conducted by a third party and more rigorous/expensive.

3. Unique Challenges of Testing AI

Difficulty vs. Traditional Tech (06:33–08:13):
- Narrow AIs (e.g., medical imaging) can be tested with clear success/failure criteria, but foundation models (LLMs) have unpredictable, broad applications.
- Developers cannot foresee all uses or risks—making comprehensive evaluation extremely hard.
Quote:

"With a car... we have a pretty good idea how the combustion engine works... If you ask a foundation model developer, why does it give the output it gives? They can’t really tell you."
— Elliot Jones (07:07)

4. Regulatory Approaches: EU, UK, and US

EU:
- The European AI Act requires general-purpose models to be assessed for systemic risk. The newly established European AI Office is consulting on codes of practice and may rely on third-party assessments (08:47–11:46).
UK:
- Voluntary regime. The UK AI Safety Institute conducts and publishes evaluations with industry cooperation.
US:
- Similar to the UK, voluntary and collaborative with a new AI Safety Institute being stood up.

5. Who Should Test? Independence and Talent Gap

Third-Party Assessments:
- The EU approach may rely on third-party auditors, but procedures and independence are still unclear.
Government Involvement:
- UK/US Safety Institutes may increasingly run their own benchmarks, reducing risk of manipulation.
Talent Shortage:
- Current reliance on industry talent creates potential bottlenecks; knowledge sharing and specialization among government agencies is anticipated (13:30–16:09).

6. Testing Gaps and Incentives

Voluntary Participation and Competition:
- Companies might hesitate to share sensitive models due to commercial risk or fear of being singled out for criticism (18:07).
Quote:

"If you choose to share your model... the AI Safety Institute says it's got all these problems, but someone else doesn't, then that just makes you look bad..."
— Elliot Jones (18:07)

7. Critical Harms and Current Limitations

SB 1047 Example (California):
- Labs would need to assure that their models do not enable critical harms—but current tests are largely inadequate for this (20:32–22:15).
Proxy Benchmarks:
- Proxies for dangerous knowledge exist, but lack real-world validation or predictive power.

8. What Happens After Testing?

Post-Market Monitoring and Remediation:
- The report calls for continuous monitoring, incident reporting, and practical interventions (such as fixing datasets or installing guardrails).
- In rare cases, labs may delay or forgo release due to risks (e.g., OpenAI’s unreleased voice cloning model) (30:21).
Quote:

"There is some stuff you can do that's just as you're training the model, as you're testing it, kind of adjusting it..."
— Elliot Jones (30:21)
- But commercial pressure often drives premature or risky releases.

9. Costs and Resources

Testing is Expensive:
- Implementing even off-the-shelf benchmarks can require hundreds of thousands of dollars and significant engineering resources (33:47–34:50).

10. Risks of Audit Washing

Audit/Credential Tokenism:
- Danger that audits become symbolic “stamps of approval” rather than meaningful safety measures, especially given limited maturity of current evaluation techniques (36:01–36:43).

11. Process Behind the Report

Methodology:
- Based on direct interviews with developers, evaluators, government, and academics, plus a review of the literature. Some sources remained off-record due to sensitivity (36:56–38:36).

12. Advice for Regulators

Regulatory Humility and Prioritization:
- Evaluations are most useful for risk-scoping and targeted investigations, but not yet reliable for blanket safety approvals.
- Regulators should consult with industry, academics, and affected communities to develop test criteria.
Quote:

"We haven't actually seen these in the real world long enough to know what those consequences are going to be... I would really not want anyone to say this is a stamp of approval just because they passed a few evaluations."
— Elliot Jones (36:01)

13. Critical Gaps: Inclusion of Affected Communities

Missing Voices:
- Almost no evaluation currently involves affected or marginalized communities, either in risk identification or defining acceptable risk levels.
- Essential for fairness, legitimacy, and real-world relevance (44:54–46:25).
Quote:

"We asked almost everyone we spoke to: do you involve affected communities in your evaluations? And basically everyone said no. And I think this is a real problem."
— Elliot Jones (44:54)

14. The Market for Audits

Evolving Ecosystem:
- The emerging "audit market" is opaque and lacks standardization—important challenges as regulatory capacity remains limited.
- Future work at the Ada Lovelace Institute will focus on third-party auditing and broadening meaningful access and transparency (43:15–44:36).

Notable Quotes & Moments

On the stakes of voluntary, non-standardized testing:

"If you've got the benchmark in front of you, you can also see the answers. So you're not just choosing what tests to take, you've also got the answer sheet right in front of you."
— Elliot Jones (12:51)
On audit washing:

"Given the current state of play... these evaluations are only ever going to be indicative. You should, with the current evaluations, not ever say, 'look, we did these four tests and it’s fine.'"
— Elliot Jones (36:01)
On lacking community involvement:

"What is an acceptable level of risk is something that we don't want to be left just to the developers or even just to a few people in a government office."
— Elliot Jones (44:54)
On regulatory capacity constraints:

"If I take the European AI office for example, they've got, I think, maybe less than 100 people now for such a massive domain..."
— Elliot Jones (41:31)

Timestamps for Important Segments

Defining evaluation, audit, and benchmark: 04:10–06:33
Why it’s so hard to test AI: 06:33–08:13
Outline of EU, UK, US regulatory approaches: 08:47–11:46
Who actually conducts testing and the risk of gaming results: 11:46–14:28
International collaboration and talent pools: 14:28–16:09
Gaps in participation; voluntary vs. mandatory access: 18:07–19:28
Limits of testing critical harms (California’s SB 1047): 20:32–22:15
Costs of robust evaluation: 33:47–34:50
Risks of audit washing: 36:01–36:43
Community involvement in evaluation: 44:54–46:25

Final Thoughts

This episode emphasizes that while regulatory mandates and technical methodologies for AI safety are quickly evolving, current tools are only partially fit for purpose—especially for models with broad, emerging capabilities. A consistent call for transparency, third-party validation, community engagement, and regulatory humility runs throughout the discussion. The need to balance rapid AI advances with robust, inclusive, and meaningful safety practices emerges as both urgent and deeply unresolved.

For more on the state of AI regulation, assessment, and policy, listen to the full episode at Lawfare Media or consult the referenced report by Elliot Jones and the Ada Lovelace Institute.


Transcript

[00:01]
Lawfare Announcer
The Electronic Communications Privacy Act turns 40 this year and it's showing its age. On Friday, March 6, Lawfare and Georgetown Law are bringing together leading scholars, practitioners, and former government officials for Installing Updates to ECPA, a half-day event on what's broken with the statute and how to fix it. The event is free and open to the public, in person and online. Visit lawfaremedia.org/ecpaevent for details and to register.
[01:33]
Marissa Wong
I'm Marissa Wong, intern at Lawfare, with an episode from the Lawfare archive for March 15, 2026. On March 9, Anthropic filed a lawsuit against the Department of Defense's designation of the artificial intelligence company as a supply chain risk, a category associated with foreign adversary companies such as Russia's Kaspersky and China's Huawei. When the talks with Anthropic fell through, OpenAI made a deal with the Pentagon that reportedly keeps the same limitations on the military's use of its AI systems that caused the Anthropic agreement to fail. For today's archive, I chose an episode from August 30, 2024, in which Kevin Frazier and Elliot Jones sat down to discuss the state of AI systems testing and regulation and how industry leaders such as OpenAI and Anthropic have taken different approaches to testing their methods.
[02:37]
Kevin Frazier
It's the Lawfare Podcast. I'm Kevin Frazier, assistant professor at St. Thomas University College of Law and a Tarbell Fellow at Lawfare, joined by Elliot Jones, a senior researcher at the Ada Lovelace Institute.
[02:50]
Elliot Jones
One thing we actually did hear from companies, from academics, and from others is they would love regulators to tell them: what evaluations do you need for that? I think a big problem is that there hasn't actually been that conversation about what kinds of tests you would need to do that regulators care about, that the public cares about, that are going to test the things people want to know.
[03:08]
Kevin Frazier
Today we're talking about AI testing in light of a lengthy and thorough report that Elliot co-authored. Before navigating the nitty gritty, let's start at a high level. Why are AI assessments so important? In other words, what spurred you and your co-authors to write this report in the first place?
[03:28]
Elliot Jones
Yeah, I think what really spurred us to think about this is that we've seen massive developments in the capabilities of AI in the last couple years, the kind of ChatGPT and everything that has followed. I think everyone's now aware of how fast some of this technology is moving, but I think we don't really understand how it works, how it's impacting society, what the risks are. And I think a few months ago when we started talking about this project, there was the UK AI Safety Summit. There were lots of conversations and things in the air about, like, how do we go about testing how safe these things are. But we felt a bit unclear about where the actual state of play was there. We looked at this and looked around and there was a lot of interesting work out there, but we couldn't find any kind of comprehensive guide to actually how useful are these tools, how much can we know about these systems.
[04:11]
Kevin Frazier
To level set for all the listeners out there, Elliot, can you just quickly define the difference between a benchmark, an evaluation, and an audit?
[04:21]
Elliot Jones
That is actually a slightly trickier question than it sounds. At a very high level, a kind of evaluation is just trying to understand something about the model or the impact the model is having. And when we spoke to experts in this field who work at foundation model developers, at independent assessors, some of them used the same definitions for audits. Sometimes audits were a subset of evaluations, sometimes evaluations were a subset of audits. But I think in a very general sense for listeners, evaluations are just trying to understand what the model can do, what behaviors it exhibits, and maybe what broader impacts it has on, say, energy costs, jobs, the environment, other things around the model. For benchmarking in particular, benchmarking is often using a set of standardized questions that you give to the model. So you say we have these hundreds of questions, maybe from, say, an AP history exam, where you have the question, you have the answer, you ask the model a question, you see what answer you get back, and you compare the two. And that allows you to have a fairly standardized and comparable set of scores you can compare across different models. So a benchmark is a kind of subset of the broader category of evaluations.
[05:30]
Kevin Frazier
And when we focus on that difference between evaluations and audits, if you were to define audits distinctly, if you were trying to separate them from the folks who conflate them with evaluations, what is the more nuanced, I guess, definition of audits?
[05:48]
Elliot Jones
The important thing when I think about audits is that they are kind of well structured and standardized. So if you're going into, say, a financial audit, there's a process that auditors are expected to go through to assess the books, to check out what's going on. Everyone kind of knows exactly what they're going to be doing from the start. They know what endpoints they're trying to work out. So I think an audit would be something where there is a good set of standardized things. You're going in, you know exactly what you're going to do, and you know exactly what you're testing against. Audits might also be more expansive than just the model. So an audit might be a kind of governance audit, where you look at the practices of a company or you look at how the staff are operating, not just what the model is and what the model is doing. Whereas evaluations sometimes can be very structured, as I discussed with benchmarks, but can also be very exploratory, where you just give an expert a system and see what they can do in 10 hours.
[06:34]
Kevin Frazier
We know that testing AI is profoundly important. We know that testing any emerging technology is profoundly important. Can you talk to the difficulties, again at a high level, of testing AI? Folks may know this from prior podcasts: I've studied car regulations extensively, and it's pretty easy to test a car, right? You can just drive it into a wall and get a sense of whether or not it's going to protect passengers. What is it about AI that makes it so difficult to evaluate and test? Why can't we just run it into a wall?
[07:07]
Elliot Jones
Yeah, I think it's important to distinguish which kind of AI we're talking about, whether it's narrow AI, like a chest X-ray system, where we can actually see what results we're getting. It has a very specific purpose. We can actually test it against those purposes. And I think in that area, we can do the equivalent of running it into a wall and seeing what happens. What we decided to focus on in this report was foundation models, these kind of very general systems, these large language models which can do hundreds, thousands of different things. And also they can be applied downstream in finance, in education, in healthcare. And because there are so many different settings these things could be applied in, so many different ways that people can fine-tune them, that they can find new applications, I think the developers don't really know how the systems can be used when you put them out in the world. And that's part of what makes them so difficult to actually assess, because you don't have a clear goal, you don't know exactly who's going to be using it or how they're going to be using it. And so when you start to think about testing, you're like, oh God, where do I even start? I think the other difficulty with some of these AI systems is we actually just don't understand how they work on the inside. With a car, I think we have a pretty good idea how the combustion engine works, how the wheel works. If you ask a foundation model developer why it gives the output it gives, they can't really tell you.
[08:13]
Kevin Frazier
So it's as if Henry Ford invented, all at the same time, a car, a helicopter, and a submarine, put them out for commercial distribution, and said, figure out what the risks are, let's see how you're going to test that. And now we're left with this open question of what are the right mechanisms and methods to really identify those risks. So obviously this is top of mind for regulators. Can you tell us a little bit more about the specific regulatory treatment of AI evaluations? And I guess we can just run through the big three: the US, the UK, and the EU.
[08:48]
Elliot Jones
Yeah, so I guess I'll start with the EU because I think they're the furthest along on this track in some ways: the European Union passed the European AI Act earlier this year. And as part of that there are obligations around trying to assess some of these general purpose systems for systemic risk, to actually go in and find out how these systems are working and what they are going to do. And they've set up this European AI Office, which right now is consulting on its own codes of practice. These are going to set out requirements for these companies that say maybe you do need to evaluate for certain kinds of risks. So is this a system that might enable more cyber warfare? Is this a system that might enable systemic discrimination? Is this a system that might actually lead to overreliance or concerns about critical infrastructure? So the European AI Office is already kind of consulting around whether evaluations should become a requirement for companies. I think in the US and the UK things are both much more on a voluntary footing right now. The UK, back in what would have been November, set up its AI Safety Institute, and that has gone a long way in terms of voluntary evaluations. So that has been developing different evaluations, often with a national security focus around, say, cyber, bio, other kinds of concerns you might have. But that has been much more on a voluntary footing of companies choosing to share their models with this British government institute. And then somehow, and I'm not even really sure exactly how this plays out, the institute is doing these tests, they've been publishing some of the results, but that's all very much on a kind of voluntary footing. And there have been reports in the news that actually that's caused a bit of tension on both sides because the companies don't know how much they're supposed to share or how much they want to share. They don't know if they're supposed to make changes.
When the UK says, look at this result, they're like, cool, what does that mean for us? And I think the US is in a pretty similar boat, maybe one step back because the United States AI Safety Institute is still being set up, and so it's working with the UK AI Safety Institute and I think they're working a lot together on these evaluations. But that's still much more on a footing where the companies choose to work with these institutes, they choose what to share, and then the government kind of works with what it's got.
[10:45]
Kevin Frazier
So there are a ton of follow-up questions there. I mean, again, just for folks who are thinking at my speed, if we go back to a car example, right, and let's say the car manufacturers get to choose the test, or choose which wall they're running into, at which speed, and who's driving, all of a sudden we could see these tests could be slightly manipulated, which is problematic. So that's one question I want to dive into in a second. But another big concern that comes to mind immediately is that the companies are running the tests themselves, where if you had a car company, for example, controlling the crash test, that might raise some red flags about, well, do we know that they're doing this to the full extent possible? So you all spend a lot of time in the report diving into this question of who's actually doing the testing. So under those three regulatory regimes, am I correct in summarising that it's still all on the companies, even in the EU, the UK, and the US?
[11:47]
Elliot Jones
So on the EU side, I think it's still yet to be seen. I think they haven't drafted these codes of practice yet. This kind of stuff hasn't got going. I think some of this will remain with the companies. In the Act, there are a lot of obligations for companies to demonstrate that they are doing certain things, that they are in fact carrying out certain tests. But I'm pretty sure that the way the EU is going, there is also going to be a requirement for some kind of third party assessment. This might take the form of the European AI Office itself carrying out some evaluations, going into companies and saying, give us access to your models, we're going to run some tests. But I suspect that, similarly to how finance audits work, it's likely to be outsourced to a third party, where the EU AI Office says, look, we think that these are reputable people, these are companies or organizations that are good at testing, that have the capabilities. We're going to ask them to go in and have a look at these companies and then publish those results and get a sense from there. It's a bit unclear how that relationship's going to work. Maybe the companies will be the ones choosing the third party evaluators, in which case you still have some of these concerns and questions, maybe with a bit more transparency. In the UK and US case, some of this has been the government already getting involved. As I said earlier, the UK AI Safety Institute has actually got a great technical team. They've managed to pull in people from OpenAI, from DeepMind, other people with great technical backgrounds, and they're starting to build some of their own evaluations and run some of those themselves. I think that's a really promising direction because, as you were mentioning earlier about companies choosing their own tests, with a benchmark, for example, if you've got the benchmark in front of you, you can also see the answers.
So you're not just choosing what tests to take, you've also got the answer sheet right in front of you. Whereas if you've got, say, the UK AI Safety Institute or the US AI Safety Institute building their own evaluations, suddenly the companies don't know exactly what they're being tested against either. And that makes it much more difficult to manipulate and game that kind of system.
[13:31]
Kevin Frazier
And going to that critical question of the right talent to conduct these AI evaluations, I think something we've talked about from the outset is this is not easy. We're still trying to figure out exactly how they work, what evaluations are the best, which ones are actually going to detect risks, and all these questions. But key to that is actually recruiting and retaining AI experts. So is there any fear that we may start to see a shortage of folks who can run these tests? I mean, we know the US has an AISI, the UK has an AISI (again, that's AI Safety Institute). South Korea, I believe, is developing one. France, I believe, is developing one. Well, all of a sudden we've got 14, 16, who knows how many AISIs out there. Are there enough folks to conduct these tests to begin with, or are we going to see some sort of sharing regime, do you think, between these different testers?
[14:28]
Elliot Jones
I'll tackle the sharing regime question first. So we are already starting to see that for some of the most recent tests on Claude 3.5, where Anthropic shared early access to their system. They shared it with the US and the UK AISI and they kind of worked together on those tests. I think it was the US AISI primarily getting that access from Anthropic, using the heft of the US government basically to get the company to share those things, but leaning on the technical skills within the UK AISI to actually conduct those tests. And there's been an announced international network of AI Safety Institutes that's hopefully going to bring all of these together. And I expect that maybe in future we'll see some degree of specialization and knowledge sharing between all these organisations. In the UK they've already built up a lot of talent around national security evaluations. I suspect we might see the United States AI Safety Institute looking more into questions of systemic discrimination or more societal impacts. Each government is going to want to have its own capabilities in house to do this stuff. I suspect that we will see that sharing precisely because, as you identify, there are only so many people who can do this. I think that's only a short term consideration though, and it's partly because we've been relying a lot on people coming from the companies to do a lot of this work. But I think the existence of these AI Safety Institutes themselves will be a good training ground for more junior people who are coming into this, who want to learn how to evaluate systems, who want to get a grasp of these things, but don't necessarily want to join a company. Maybe they'll come from academia, and they'll be going to these AISIs instead of joining a DeepMind or an OpenAI. And I think that might ease the bottleneck in future. And I was talking earlier about having these third party auditors and evaluators.
I suspect we might see some staff from these AI Safety Institutes going off and founding them and kind of growing that ecosystem to provide those services over time.
[16:09]
Kevin Frazier
When folks go to buy a car, especially if they have kids or dogs or any other loved ones (for all the bunny owners out there), you always want to check the crash safety rating. But as things stand right now, it sounds as though some of these models are being released without any necessarily required testing. So you've mentioned a couple times these codes of practice that the EU is developing. Do we have any sort of estimate on when those are going to be released and when testing may come online?
[16:46]
Elliot Jones
Yeah, so I think we're already starting to see them being drafted right now. I think that over the course of the rest of the summer and the autumn, the EU is going to be starting to create working groups that are going to work through each of the sections of the Code of Practice. I think we're expecting it to wrap up around next April. So I think by the spring of next year, we'll be starting to see at least the first iteration of what these codes of practice look like. But that's only when the codes of practice are published. As for when we see these actually being implemented, when we see companies taking steps on these questions, maybe they'll get ahead of the game, maybe they'll see this coming down the track and start to move in that direction. A lot of these companies are going to be involved in this consultation, in this process of deciding what's in the codes of practice. But equally, they could get published and then it may take a while before we actually see the consequences of that.
[17:29]
Kevin Frazier
April of next year. I'm by no means a technical AI expert, but I venture to guess the amount of progress that can be made in the next eight months can be pretty dang substantial. So that's quite the time horizon. Thankfully, though, as you mentioned, we've already seen, in some instances, compliance with the UK AISI testing, for example. But you mentioned that some labs maybe are a little hesitant to participate in that testing. Can you detail a little bit further why labs may not be participating to the full extent, or may be a little hesitant to do so?
[18:08]
Elliot Jones
Yeah, it's not quite clear which labs have been sharing and not sharing. I know that Anthropic has, because they said it when they published Claude 3.5. For the others, it's kind of unclear. There's a certain opaqueness on both sides about exactly who is involved. But as to why they might be a bit concerned, I think there are some legitimate questions, like, say, around commercial sensitivities. If you're actually evaluating these systems, then that means you probably need to get quite a lot of access to these systems. And if you're a Meta and you're publishing Llama 300 billion just out on the web, maybe you're not so worried about that. You're kind of putting all the weights out there and just seeing how things go. But if you're an OpenAI or a DeepMind or an Anthropic, that's a big part of your value. If someone leaked all of the GPT-4 weights onto the Internet, that would be a real, real hit to OpenAI. So I think there are legitimate security concerns they have around this sharing. I think there's also another issue where, because this is a voluntary regime, if you choose to share your model and the AI Safety Institute says it's got all these problems but someone else doesn't, then that just makes you look bad because you've exposed all the issues with your system, even though you probably know that the other providers have the same problems too. Because you're the one who stepped forward and actually gave that access and let your system be evaluated, it's only your problems that get exposed. So I think that's another issue with the voluntary regime: if it's not everyone involved, then that kind of disincentivizes anyone getting involved.
[19:29]
Kevin Frazier
Oh, good old collective action problems. We see them yet again and almost always in the most critical situations. So speaking of critical situations, I'll switch to critical harm. Critical harm is the focus of SB 1047. That is the leading AI proposal in the California State Legislature that as of now, this is August 12th, is still under consideration. And under that bill, labs would be responsible for identifying or making reasonable assurances that their models would not lead to critical harm, such as mass casualties or cybersecurity attacks that generate harms in excess of, I believe, $500 million. So when you think about that kind of evaluation, is that possible? How do we know that these sorts of critical harms aren't going to manifest from some sort of open model or even something that's closed, like Anthropic's models or OpenAI's models?
[20:32]
Elliot Jones
I think with the tests we currently have, we just don't know. I think the problem is that there's a step one of trying to even create evaluations for some of these critical harms. There are some evaluations out there, like the Weapons of Mass Destruction Proxy benchmark, which tries to assess, using multiple choice questions, whether or not a system has knowledge of biosecurity concerns, cybersecurity concerns, chemical security concerns, things that maybe could lead down the track to some kind of harm. But that's, as it says, very much just a proxy. The system having knowledge of something doesn't tell you whether or not it's actually increasing the risk or chance of those events occurring. So I think that on one level there's just a generalization problem, or a kind of external validity problem. A lot of the tests can do what they need to do. They can tell you, has the system stored that knowledge? But translating whether the system has stored knowledge or not into can someone take that knowledge, can they apply it, can they use that to create a mass casualty event? I just don't think we have that knowledge at all. And I think this is where in the report we talk about pairing evaluation with post-market monitoring, with incident reporting. And I think that's a key step to be able to do this kind of assessment of saying, okay, when we evaluated the system beforehand, we saw these kinds of properties, we saw that it had this kind of knowledge, we saw it had this kind of behavior. And at the other end, once it was released into the world, we saw these kinds of outcomes occur. And hopefully that would come long before any kind of mass casualty event or really serious event. But you might be able to start matching up results on, say, this proxy benchmark with increased chance of people using these systems to create these kinds of harms. So I think that's one kind of issue. 
But right now I don't think we kind of have that historical data of seeing how the kind of tests before the system is released match up to behaviors and actions after the system is released.
[29:38]
Kevin Frazier
As you pointed out earlier, usually when we think about testing for safety and risks, again, let's just go to a car example. If you fail your driving test, then you don't get to drive. Or if you fail a specific aspect of that test, let's say parallel parking, which we all know is just way too hard when you're 15 or 16, then you go and you practice parallel parking. What does the report say on this question of follow-up aspects of testing? Because it's hard to say that there's necessarily a whole lot of benefit to testing for the sake of testing. What sort of add-ons or follow-up mechanisms should we see after testing's done?
[30:21]
Elliot Jones
Yeah, I guess there's a range of different things you might want to see a company do. I think for some tests where you see somewhat biased behavior or somewhat biased outputs from a system, maybe all that means is that you need to look back at the data set you're training your system on and say, okay, it's underrepresenting these groups. It's not including, say, African Americans or African American perspectives as much. So we need to add some more of that data into the training, and maybe that can fix the problem you've identified, or go some way to actually resolving that issue. So there is some stuff you can do that's just, as you're training the model, as you're testing it, adjusting it and adding onto that. A second step you can do: you might find that actually it's very difficult to fine-tune out some of these problems, but that there are just certain kinds of prompts into a system, say someone asking, how would I build a bomb in my basement? You can just build a safety filter on top that says, if someone asks this kind of question of the system, let's just not do that. And so your evaluation tells you there is this harmful information inside the model, and you can't necessarily completely get rid of it, especially if that's going to really damage the performance. But you can put guardrails around the system that make that inaccessible or make it very hard for a user to do that. And similarly you might want to monitor what the outputs of the model are. If you start seeing it mention how to build a bomb, then you might just want to cut that off and either ban the user or prevent the model from completing its output. I think where we get into slightly trickier ground, and an area where I think companies haven't been so willing to act, is on delaying deployment of a model or even restricting access to a model completely and deciding not to publish it. 
I think one example of this is that OpenAI had a voice cloning model, a very, very powerful system that could generate very realistic sounding voice audio, and they decided not to release it. And I think that's actually quite admirable, to say: we did some evaluations, we discovered that this system could actually be used for, say, mass spear phishing. Think about getting a call from your grandparents and they're saying, oh, I'm really in trouble, I really need your help. And it's just not them. Imagining that capability being everywhere, that's something really dangerous, and they've decided not to release it. But equally, I suspect that as there are more and more commercial pressures, as these companies are competing with each other, there's going to be increasing pressure to say: this system is a bit dangerous, maybe there are some risks, maybe there are some problems, but we spent a billion dollars training the system, so we need to get that money back somehow. And so they're going to push ahead with deploying the system. And so I think those are the kinds of steps a company might take that are going to get a bit more tricky. Not just putting guardrails around it or tweaking it a bit, but actually saying we've built something that we shouldn't release.
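The guardrail idea described in this answer, an input filter in front of the model plus a monitor on what comes out, can be sketched roughly as follows. This is only an illustrative sketch: the names `BLOCKED_PATTERNS`, `REFUSAL`, `is_blocked`, and `guarded_generate` are hypothetical, the keyword pattern stands in for what would more plausibly be a trained classifier, and `model` is assumed to be any function that maps a prompt string to an output string.

```python
import re

# Hypothetical blocklist for illustration only. A deployed filter would more
# likely rely on a trained classifier than on keyword patterns like this one.
BLOCKED_PATTERNS = [re.compile(r"\bbuild\b.*\bbomb\b", re.IGNORECASE)]

# Hypothetical refusal message returned whenever a guardrail triggers.
REFUSAL = "Sorry, I can't help with that."


def is_blocked(text: str) -> bool:
    """Return True if any blocked pattern appears in the text."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


def guarded_generate(prompt: str, model) -> str:
    """Wrap a model call with an input filter and an output monitor.

    `model` is assumed to be a callable taking a prompt string and
    returning an output string.
    """
    # Input filter: refuse before the model ever sees the prompt.
    if is_blocked(prompt):
        return REFUSAL

    output = model(prompt)

    # Output monitor: cut off a completion that contains blocked content.
    if is_blocked(output):
        return REFUSAL

    return output
```

Calling `guarded_generate("How would I build a bomb in my basement?", model)` would return the refusal without invoking the model, while an ordinary prompt passes through unless the model's own output trips the same check. Nothing here changes what the underlying model knows, which matches the point made above: the information stays inside the model and the guardrail only makes it harder to reach.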
[32:54]
Kevin Frazier
I feel as though that pressure to release, regardless of the outcomes, is only going to increase as we hear more and more reports about these labs having questions around revenue and profitability. And as those questions maybe persist, that pressure is only going to grow. So that's quite concerning. And I guess I also want to dive a little bit deeper into the actual costs of testing. When we talk about crashing a car, you only have to take one car. Let's say that's between 20 grand and 70 grand. Or for all those Ferrari drivers out there, we've got a half a million dollar car or something that you're slamming into a wall. With respect to doing an evaluation of an AI model, what are the actual costs? Do we have a dollar range on what it takes to test these different models?
[33:48]
Elliot Jones
To be perfectly honest, I don't have that. I don't know the amounts. I think the closest I've seen is that Anthropic talks about when they were implementing one of these benchmarks. Even this off-the-shelf, publicly available, widely used benchmark still required a few engineers spending a couple months of time working on implementing that system. And that's for something where they don't have to come up with a benchmark themselves. They don't have to come up with anything new. It's just taking something off the shelf and actually applying it to their system. And so I can imagine a few engineers at a couple months of time, and they pay their engineers a lot, so that's going to be in the hundreds of thousands of dollars range, let alone the cost of compute of running the model across all of these different prompts and outputs. And that was just for one benchmark. And many of these systems are tested on lots of different benchmarks. There's lots of red teaming involved. When, say, a company like OpenAI is doing red teaming, they're often hiring tens or hundreds of domain experts to try and really test what capabilities these systems have. And I can imagine they're not cheap either. So I don't have a good dollar amount, but I imagine it's pretty expensive.
[34:51]
Kevin Frazier
I think it's really important to have a robust conversation about those costs so that all stakeholders know, okay, maybe it does make sense. If you're an AI lab and now you have 14 different AI safety institutes demanding you adhere to 14 different evaluations, that's a lot of money, that's a lot of time, that's a lot of resources. Who should have to bear those costs is an interesting question that I feel like merits quite a robust debate. Elliot, we've gotten quite the overview of the difficulty of conducting evaluations, of the possibility of conducting audits, and then in some cases instituting benchmarks. One question I have is how concerned should we be about the possibility of audit washing? This is the phenomenon we've seen in other contexts where a standard is developed or a certification is created and folks say, you know, we took this climate pledge or we signed this human rights agreement, and so now you don't need to worry about this product. Everything's good to go. Don't ask any questions, keep using it, it'll be fine. Are you all concerned about that possibility in an AI context?
[36:02]
Elliot Jones
Yes, I'm definitely concerned about that. I think the one thing we'd really want to emphasize is that evaluations are necessary. You really have to go in and look at your system. But given the current state of play of this quite nascent field, these evaluations are only ever going to be indicative. They're only ever going to be: here are the kinds of things you should be thinking about or worrying about. You should, with the current evaluations, not ever say, look, we did these four tests and it's fine. Partly because, as we discussed before, we haven't actually seen these in the real world long enough to know what those consequences are going to be. And without that kind of follow-up, without that post-market monitoring, without that incident reporting, I would really not want anyone to say this is a stamp of approval just because they passed a few evaluations.
[36:43]
Kevin Frazier
Thinking about the report itself, you all, like I said, did tremendous work. This is a thorough research document. Can you walk us through that process a little bit more? Who did you all consult? How long did this take?
[36:57]
Elliot Jones
Yeah, sure. This was quite a difficult topic to tackle in some ways, because a lot of this, as a quite nascent field, is held in the minds of people working directly on these topics. So we started off this process, between January and March this year, by talking to a bunch of experts: some people working at foundation model developers, some people working at third-party auditors and evaluators, people working in government, academics who all worked in these fields. We wanted to get a sense from people who have hands-on experience of running evaluations, of seeing how hard they are to do in practice, of repeating those things and seeing whether they actually play out in real life. So a lot of this work is based on just trying to talk to people who are at the coalface of evaluation and getting a sense of what they were doing. As to exactly who, that's a slightly difficult topic, I think, because this is quite a sensitive area. A lot of people wanted to be off the record when talking about this, but we did try and cover a fairly broad range of developers and assessors. Alongside that we did our own deep dive literature review. There is some great survey work out there. Laura Weidinger at DeepMind has done some great work mapping out the space of sociotechnical risks and the evaluations there. And so we drew on some of these existing survey papers while doing our own survey of different kinds of evaluation. We worked with William Agnew as our technical consultant, who has a bit more of a computer science background, so he'd get into the nitty-gritty of some of these more technical questions. So we tried to marry that on-the-ground knowledge from people with what was out there in the academic literature. I would say this is just a snapshot. This took us like six months and I think some of the things we wrote are essentially already out of date. 
Take some of the work we did looking at where evaluations are at and what the coverage is: people are publishing new evaluations every week. So this is definitely just a snapshot. But yeah, we tried to marry the academic literature with speaking to people on the ground.
[38:37]
Kevin Frazier
So we know that other countries, states, and regulatory authorities are going to lean more and more on these sorts of evaluations, and they already are to a pretty high extent. From this report, would you encourage a little more regulatory humility among current AI regulators, to maybe put less emphasis on testing or at least put less weight on what testing necessarily means at this point in time?
[39:04]
Elliot Jones
To a degree. I think it depends what you want to use these for. In our report we try and break down three different ways you might use evaluations as a tool. One is a kind of future scoping: what is going to come down the road? Just giving you a general sense of the risks, what to prioritize, what to look out for. I think for that, evaluations are really useful. They can give you a good sense of maybe the cybersecurity concerns a model might have, maybe some of the bio concerns. They can't tell you exactly what harm it's going to cause, but they can give you a directional sense of where to look. I think another way in which current evaluations can already be useful is if you're doing an investigation. If you're a regulator and you're looking at a very specific model, say you want to look at ChatGPT in May 2024, and you're concerned about how it's representing certain different groups or how it's being used in recruitment. Say you're thinking about how this system is going to view different CVs and what comments it's going to give about a CV depending on different names. You can do those tests really well. If you want to test it for that kind of bias, I think actually we're already kind of there, and it can be a very useful tool for a regulator to assess these systems. But I think you have to have that degree of specificity, because the results of evaluations change so much just based on small changes in the system and based on small changes in context. Unless you have a really clear view of exactly what concern you have, they're not going to be the most useful. The third way you might use it is this kind of safety sign-off: say this system is perfectly fine, here's our stamp of approval. We are definitely not there. And if I was a regulator right now, one thing we actually did hear from companies, from academics and from others is they would love regulators to tell them what evaluations are needed for that. 
I think a big problem is that there hasn't actually been that conversation about what kinds of tests you would need, ones that regulators care about, that the public cares about, that are going to test the things people want to know, and about what developers are going to build. And I think absent that guidance, industry and academia are just going to pursue what they find most interesting or what they care about the most. So I think right now it's incumbent on regulators, on policymakers, to say, here are the things we care about, here's what we want you to build tests for. And then maybe further down the line, once those tests have been developed, once we have a better sense of the science of evaluations, then we could start thinking about using it for that third category.
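The recruitment example in this answer, checking whether a system's comments on a CV shift with the name on it, corresponds to a paired or name-swap test. A minimal sketch of that idea follows, under stated assumptions: `name_swap_gap` is a hypothetical helper, `score_cv` is assumed to be any function that sends CV text to the system under test and returns a numeric rating, and the template and name lists are placeholders a tester would supply.

```python
from statistics import mean


def name_swap_gap(cv_template, names_a, names_b, score_cv):
    """Compare average scores for the same CV under two sets of names.

    cv_template: CV text containing a `{name}` placeholder.
    names_a, names_b: non-empty lists of names for the two groups compared.
    score_cv: assumed callable mapping CV text to a numeric score, for
        example a wrapper around the model being investigated.

    Returns mean(score for group A) - mean(score for group B). Because the
    CV text is otherwise identical, a gap far from zero suggests the name
    itself is influencing the score.
    """
    scores_a = [score_cv(cv_template.format(name=name)) for name in names_a]
    scores_b = [score_cv(cv_template.format(name=name)) for name in names_b]
    return mean(scores_a) - mean(scores_b)
```

In practice the scoring function would be non-deterministic and sensitive to prompt wording, so a tester would likely repeat each query several times and vary the template. That sensitivity is the reason given above for needing a very specific question before running this kind of test.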
[41:12]
Kevin Frazier
And my hope, and please answer this in a favorable way. Have you seen any regulators say, oh my gosh, thank you for this great report, we're going to respond to this and we will get back to you with an updated approach to evaluations. Has that occurred? What's been the response to this report so far?
[41:31]
Elliot Jones
I don't want to mention anyone by name. I feel like it'd be a bit unfair to do that here. But yeah, I think it's generally been pretty favorable. I think that actually a lot of what we're saying has been in the air already. As I said, we spoke to a lot of people already working on this, already thinking about this. And part of our endeavor here was to try and bring together conversations people are already having, discussions they've already had, but in a very comprehensible and public-facing format. And I think the regulators were already, and are, taking these kinds of questions seriously. I think one difficulty is a question of regulatory capacity. Regulators are being asked to do a lot in these different fields. If I take the European AI Office, for example, they've got I think maybe fewer than 100 people now for such a massive domain. And so one question is just that they have to prioritize, they have to try and cover so many different things. And so I think without more resources going into that area, and that is always going to be a political question of what things you prioritise and where you choose to spend the money, it's just going to be difficult for regulators to have the time and mental space to deal with some of these issues.
[42:28]
Kevin Frazier
And that's a fascinating one too, because if we see this constraint on regulatory capacity, I'm left wondering, okay, let's imagine I'm a smaller lab or an upstart lab. Where do I get placed in the testing order, right? Is OpenAI going to jump to the top of the queue and get that evaluation done faster? Do I have the resources to pay for these evaluations if I'm a smaller model? So really interesting questions when we bring in that big I word, as I call it, the innovation word, which seems to dominate a lot of AI conversations these days. So at the institute you all have quite an expansive agenda and a lot of smart folks. Should we expect a follow-up report in the coming months, or are you all moving on to a different topic, or what's the plan?
[43:16]
Elliot Jones
Yeah, I think partly we're wanting to see how this plays out, wanting to see how this field moves along. I think one question that we are thinking about quite a lot and might explore further is this question of third-party auditing, third-party evaluation. How does this space grow? As we mentioned briefly in the report, there is currently a lack of access for these evaluators, a lack of ability for them to get access to these things, especially on their own terms rather than on the terms of the companies. There is a lack of standardization. If you are shopping around as a smaller lab or a startup for evaluation services, it's a bit opaque to you on the outside who is going to be doing good evaluations, who does good work and who is trying to sell you snake oil. And so I think that one thing we're really thinking about is how we create this auditing market that works for people on both sides, so you as the lab know you're buying a good service that regulators will trust and that will work for everyone. But also you as a consumer, when you're thinking about using an AI product, can look at it and say, oh, it was evaluated by these people. I know that someone has certified them, that someone has said these people are up to snuff and they're going to do a good job. And so I think that's one thing we're really thinking about: how do you build up this market so that it's not just reliant on regulatory capacity? Because I think while that might be good in the short term for some of these biggest companies, it is just not going to be sustainable in the long term for government to be paying for and running all these evaluations for everyone, if AI is as big as some people think it will be.
[44:37]
Kevin Frazier
And thinking about some of those prospective questions that you all may dig into, and just the scope and scale of this report: on the off chance that not all listeners go read every single page, is there anything we've missed that you want to make sure you highlight for our listeners?
[44:54]
Elliot Jones
I think one other thing I do want to bring up is the lack of involvement of affected communities in all of this. We asked almost everyone we spoke to: do you involve affected communities in your evaluations? And basically everyone said no. And I think this is a real problem. As I mentioned before about what regulators want and what the public wants, this question of actually deciding what risks we need to evaluate for, and also what is an acceptable level of risk, is something that we don't want to be left just to the developers or even just to a few people in a government office. It's something we want to involve everyone in deciding. There are real benefits to these systems. These systems are actually enabling new and interesting ways of working, new and interesting ways of doing things, but they have real harms too. And we need to actually engage people, especially those most marginalised in our society, in that question and say, what is the risk you're willing to take on? What is an acceptable evaluation mark for this kind of work? And that can be at multiple stages. That can be in actually doing the evaluation themselves. Have you got a very diverse group of people red teaming a model, trying to pick it apart? Have you got them involved at the goal-setting stage? At the product stage, when you're about to launch something into the world, are you making sure that it actually does involve everyone who might be subject to that? If you're thinking about using a large language model in recruitment, have you got a diverse panel of people assessing that system and understanding, is it going to hurt people from ethnic minority backgrounds? Is it going to affect women in different ways? So I think that's a really important point that I just want everyone to take away. I would love to see much more work on how you bring people into the evaluation process, because that's something we just really didn't find at all.
[46:25]
Kevin Frazier
Okay, well, Elliot, you've got a lot of work to do, so I'm going to have to leave it there so you can get back to it. Thanks so much for joining.
[46:32]
Elliot Jones
Thanks so much.
[46:37]
Lawfare Podcast Producer
The Lawfare Podcast is produced in cooperation with the Brookings Institution. You can get ad-free versions of this and other Lawfare podcasts by becoming a Lawfare material supporter through our website, lawfaremedia.org/support. You'll also get access to special events and other content available only to our supporters. Please rate and review us wherever you get your podcasts. Look out for our other podcasts, including Rational Security, Chatter, Allies, and The Aftermath, our latest Lawfare Presents podcast series on the government's response to January 6th. Check out our written work at lawfaremedia.org. The podcast is edited by Jen Patja. Our theme song is from Alibi Music. As always, thank you for listening.