
A
Evaluation for the sake of evaluation doesn't make a lot of sense unless you really have an action that would follow the evaluation. We already see some evidence that prolonged use of AI systems might contribute to deskilling effects. For example, if you give doctors AI systems to assist them with the identification of tumors, then after three months of using those systems, once you take the systems away, their performance drops by about 6 percentage points. Now we have hundreds of millions of people who are using AI companion apps. And the first studies do show that heavy use of AI companions seems to be associated with decreased social contact with other people. Take a step back and think about the big picture. Is this actually how I want to live my life? Systems are now regularly able to distinguish between test and deployment settings. If we can't ensure that a system behaves in a test environment the same way it behaves in the deployment environment, then that really undermines our oversight measures.
B
Welcome to the Future of Life Institute podcast. I'm here with Karina Prunkl. Karina, welcome to the show.
A
Hi, Gus. Thank you for having me.
B
Do you want to introduce yourself?
A
Yes, very happy to. So I'm Karina Prunkl. I'm one of the two lead writers of the International AI Safety Report, which was published in February under the chairmanship of Yoshua Bengio. I'm also a researcher at Inria in Paris and a senior research fellow at the Institute for Ethics in AI at the University of Oxford. So my work is on ethics and governance of AI more broadly.
B
Yeah, and I think we should mention that you're speaking in a personal capacity and not as an official representative of the report.
A
Yes, indeed. The report is meant to be a really neutral document. So it doesn't contain any recommendations. It only contains the evidence base that we have for various risks and capabilities of general purpose AI systems. And today I'll be speaking in my capacity as a researcher, I'd say.
B
Yeah, perfect. Okay, so first question here. When we talk about AI capabilities, you know, there's a lot of talk about when will we have AGI? What is AGI? How advanced are the current AI systems? The frontier capabilities are jagged. Maybe you could explain what that means.
A
Yes. So in the report, we talk a lot about jagged capabilities, which means that AI systems might really excel at certain tasks. So they achieve gold medal status at the Mathematical Olympiad, or they might be able to just ace the bar exam or science questions. So they exhibit these advanced reasoning skills, but at the same time they make these very basic, or what seem to us basic, errors. At times they're unable to count how many objects there are in an image or how many r's there are in "lawnmower". So this is what we mean by jagged: we just have these very different capability profiles when AI systems are operating.
B
So do you think this is potentially misleading? If you have a person who can ace the bar exam, for example, you can reasonably infer that person is also quite a capable lawyer. But that's not the case for AI systems. It's not that you can, at least right at this moment, just plug an AI into a law firm and that AI will be able to replace a lawyer. So what does it mean for our ability to predict what AIs can do in the world that capabilities are jagged like this?
A
Yes, so it basically means that the overall capability or intelligence, I mean using quotation marks here, is not really captured by a single ladder. So as you said with humans, when humans are able to do advanced reasoning, we can typically infer that they're also good at some very simple perceptual tasks or common sense tasks. With AI systems this is quite different. They have a very different capability profile from us humans. In part this is because there are different underlying mechanisms. So, you know, we are sort of integrated bundles of perception, motor experience, social embedding, social understanding, common sense reasoning and so on. And we're also built through this interaction with the world. AI systems are just constructed very differently. And so perhaps it's not surprising that they also have very different capability profiles than we humans. Another thing that might be important to mention here is that it also shows us that performance can be misleading. So exactly as you said, just because an AI system is able to ace a medical exam, to qualify on paper as a doctor, doesn't mean that the system is actually capable of acting upon those qualifications under real world conditions.
B
Yeah. And so what explains the current capabilities profile where systems are now very capable in formal domains? We could say domains like coding and mathematics. Why are they especially capable there?
A
So over the last year we've seen two big drivers of progress. One of them was new post-training techniques. So using reinforcement learning to really teach systems or train systems on problem solving tasks rather than, say, just sounding like human beings. And the second one was to give systems more compute while they're producing outputs. So basically inference-time scaling, as it's also called. So while the system is producing the output, it is using compute in order to produce a coherent answer. This relates to what are also called reasoning systems, where the system breaks up the task into multiple steps and then goes down different avenues, compares them for consistency, and that helps the system to really become stronger with those more reasoning-focused tasks, with these more formal tasks that are needed for questions in science and coding and mathematics.
B
And so it's somewhat elusive to us what it is that's lacking in terms of getting an intelligence that is human-like. And one question here is that the system needs to be able to think for a long time or stay on one task for a long time. So there is substantial work now from METR about horizon lengths, horizon times, where we can see that, measured against how long it takes a human to solve a particular task, these AI systems can now solve, say, tasks that take a human six hours, or perhaps it's 12 hours now, if I'm getting the latest numbers right. What do you think of this work? Is this sort of the main thing to focus on when we're trying to predict the real world effects of these systems?
A
Yeah, so the METR studies are interesting because this is, well, currently one of the only ways that we can systematically measure the complexity of tasks that AI systems are able to do. One thing, I don't know whether you mentioned it, is that when you say they're able to do tasks that are 6 hours or 12 hours, that is always at a certain reliability threshold. So it would be with 80%. When we finished writing the report, we had, I think, a 30-minute threshold for tasks. So systems were able to complete, at an 80% success rate, tasks that would take humans about 30 minutes. And 30 minutes is very different from six hours or 12 hours. So we always have to think about at what completion rate they actually do that.
B
Yeah, that's true. But things are moving quickly: the 80% reliability threshold is now around one hour, and the 50% reliability threshold is now around 12 hours. So I totally take that point, but it's also important to notice the pace of improvement here.
A
Yes. And in the report we also show the exponential curve. Right. So there's definitely no doubt that the pace is very rapid. Another thing about this measure is that at a certain task length it becomes a bit difficult to conceptualize. So what are tasks that would take humans two days or, like, three years? I think this way of conceptualizing complexity makes sense for up to maybe a few hours, maybe even a day. But even there it becomes a bit difficult. And then beyond that, we need to find new measures. And then perhaps another thing to note is that these are software engineering tasks. So again, it's a very narrow subset of the tasks that AI systems are able to perform or that agents might be used for.
B
Yeah. Although the researchers at METR also have a study on a broader range of tasks showing somewhat the same improvement curve. But it's an interesting point: what does it even mean to say that a task takes a day or a week or a month for a human? You know, we can think about writing a research paper as a task that maybe takes a year or something, but you're not working on that paper full time. You're doing many other things in between. You're sort of waiting for feedback, you are gathering data and so on. You're sleeping, of course. And all of these are things that the AI systems are not doing. They're sort of just working on one task until completion. And so, yeah, the comparison with human working time becomes a bit more difficult to understand the longer the task takes.
A
Yeah, I mean, in the end it's trying to measure complexity of tasks and that is one first approach to do so. And I haven't come across a better one so far.
B
No. So you've sort of looked at all of the approaches here. There's nothing that is as good at capturing the complexity of tasks as the METR study.
A
It doesn't seem so. But I must say I also rely heavily on the authors of the report, who are the section writers and the chapter leads of the capability section. And we've talked extensively about whether there are other measures we could use. And we couldn't find any that didn't have other shortcomings. So we couldn't find any suitable alternatives.
B
Yeah. Right. So one interesting question here is whether, when we see these increases in performance or capabilities, does that then mean that we also see dangerous capabilities sort of coming along for the ride? Is it the case that as these systems can do longer tasks and they can solve more difficult problems in math and coding and so on, does that also mean that we are seeing more dangerous capabilities arise?
A
Yes, I think so. I mean, one of the things that we show in the report is that there is a clear link between the capabilities and the risks that AI systems might pose. And just to give you a very concrete example, to stay within the area of coding and software engineering: the better AI systems become at identifying flaws in software, obviously the better they become at helping us to fix those flaws. But they can then also be used to exploit flaws, to exploit software vulnerabilities. We've seen, for example, that there are now cases where cyber attacks have been automated to about 80%. So there are no fully automated cyber attacks yet, but 80% is already a pretty high number. And the better systems become at performing complex tasks independently, the better systems become at coding, at identifying vulnerabilities, at automatically being able to exploit them, the more risk there is of these sorts of attacks occurring.
B
So how would you sort of describe the state of the field of evaluations or evals? Because this is quite central to our ability to understand how these systems are improving, if they're dangerous and so on. How is that? You could call it the evaluation science. How's that going?
A
So evaluation science is still at its very early stages. We are currently starting to understand how we might go about evaluating AI systems. And with general purpose AI systems, which are the systems the report focuses on, it's particularly difficult because they are deployed in these different contexts. They may behave differently across contexts. And in particular we see something that we call the evaluation gap. So the evaluation gap means that our pre-deployment tests very often are not really indicative of how systems might behave in real world deployment contexts. So, you know, your system might be very good at certain benchmarks, but then when you release it into the world, it behaves in very undesirable ways. So evaluation science is the idea that we're developing a sort of rigorous methodology to evaluate systems, to be able to report these evaluations and communicate them, to have them be transparent, to develop systems that are auditable, and to create a field that is just focused on evaluation. But I'm very happy to say a bit more about what this could look like concretely.
B
Yeah, definitely do that. This is, for me at least one of the biggest questions is how we should set up evaluations for these models and especially whether we can set up evaluations in a way that sort of responds to the pace of change in AI.
A
Yes, I mean, one of the big takeaways from the report was that there is a lot happening in the sense of evaluation and safeguards in AI. And I'd say I personally was amazed by how many people are making progress in the field. But yeah, I think the first step would be that we just clarify more what is being measured. So very often people just talk about performance of AI systems, perhaps not the people who are working in the field, but the people outside of the field, rather than talking about something more concrete like robustness or reliability, misuse potential, capability. Already here we have to think about differences between capability evaluation and risk evaluation. Then I think the field needs to really focus hard on construct validity and external validity. So construct validity being that you really make sure you measure what you intend to measure and not something else. This is always a big issue when you have to use proxies or when things you measure are just inherently difficult to measure. And external validity being exactly this idea that your tests are indicative of how the system behaves in real world deployment contexts. And this also means that we have to test for it. There's a big discussion on capability elicitation currently: trying to get systems to show their capabilities under real world conditions, so when they have access to certain tools, rather than having them evaluated just in these laboratory settings, in chatbot mode, so to say. I think there's also a lot to do when it comes to evaluation integrity. So we have to study the failure modes of evaluation itself. This means we have to be aware of things such as situational awareness, where the system is able to distinguish test from deployment contexts and might behave differently according to whether it finds itself in a test setting, as well as data contamination and so on. So these are all things that need to be addressed as part of evaluation science. 
And then finally we also just have to run real world experiments. We can take an example from other fields for this. Take medicine, for example, where we also have a rigorous evaluation science on, let's say, medication. Pharmaceuticals are being tested in a particular way. There are certain standards, certain thresholds, and certain rules that developers, or pharmaceutical companies, need to follow before they can put their medication onto the market.
B
Yeah. So testing these systems in a rigorous way and making sure we don't fall into any of the traps you mentioned, that's pretty expensive and that's pretty difficult work to do. Where's the natural home for this work? Is it academia, or is it within the companies, or is it sort of the think tank world or organizations like METR? Because if it's the companies themselves that are testing these systems, they are sort of incentivized to publish all of the amazing capabilities and perhaps downplay the risks. And I would worry that they might not give outside researchers enough access to the models in order for them to fully understand what the models are doing. Yeah, how do we deal with that conundrum?
A
Yeah. So, I mean, personally I think it should probably be public institutions and NGOs who conduct the evaluations, or at least part of the evaluations, and who fund it. I think industry has a lot to contribute here as well, also financially, with the caveat that you don't want industry to dominate the agenda in a particular way. At the same time, one thing that also became clear from working on the report is that different countries have very different approaches to who ought to be responsible for evaluating AI systems. And I've certainly spoken to some countries where they said, well, this is not the task of the government. I think in this context we were talking about AISIs, about AI Security Institutes or AI Safety Institutes, which I think are a fabulous idea. And I think states should really try to build the capacity to evaluate systems themselves. But certain governments don't agree with this. They say, well, this is not the role of the state, and that's fine. So in the end the solution will need to be something that allows different governments to follow their own institutional norms and institutional practices. So, yeah, I mean, it's difficult to say who should fund it, but if we're talking about a field, then I think it will also become like a multi-stakeholder action. We see lots of startups now that are focusing on evaluation of AI systems. It will also become something pretty big. Under the EU AI Act, governments are creating capacity, while the big AI labs are also doing evaluations and are now also doing some of them publicly or in collaboration with governments. So we see a lot of activity, and I think that's the right approach. We need to have all of the stakeholders on board for this.
B
Yeah, it seems like it's something the AI conversation, the public conversation on AIs is lacking is just more information, more data, more studies about what the models can actually do, how the models are actually dangerous or not dangerous, and when they might become dangerous and so on. And again, just to restate the worry I had here is that we might not be able to get this field going before the models are already extremely capable and extremely dangerous. And so how do we think about sort of moving quickly enough to be prepared to measure the models?
A
So I could answer this question on different levels, but we can talk about prioritization perhaps. So when we talk about evaluation, the first thing that we need to keep in mind is what is it that we actually want to evaluate? What is the decision that the evaluation would inform? So evaluation for the sake of evaluation doesn't make a lot of sense unless you really have an action that would follow the evaluation, the test of the system. So I think that's the very first thing that we should be trying to do. The second thing, and personally I think this is a low-hanging fruit relative to the complexity of trying to evaluate general purpose AI systems, would be to create best-practice standards on reporting and transparency measures. So being able to compare how different stakeholders have evaluated systems: what sort of access they had, what tools they were using, what thresholds they were using, what if-then statements deployers are using. So already this, having a standardized way of reporting evaluations, would help a lot.
B
Yeah, yeah. One aspect that is perhaps difficult to measure is autonomy. So how, what do we know about autonomy in these systems? How autonomous are they now? Can we distinguish different aspects of autonomy?
A
Are you thinking about human autonomy or are you thinking of the autonomy of AI systems?
B
I'm thinking of the autonomy of the AI systems themselves.
A
So I don't think I have much more to add beyond the sort of complexity measure that we were talking about earlier with METR.
B
So we can talk about how AI systems interacting with humans might affect the autonomy of humans then?
A
Yes, that is my bread and butter. So, I mean, autonomy of humans is somewhat of a complex topic because there is a lot of disagreement about what autonomy means. Very often people talk about autonomy and they mean one of two things. Either they mean our ability to make authentic decisions, so with a focus on authenticity: the idea that whatever we decide is in some way reflective of our inner self, so it's not the product of deception or manipulation. This is the authenticity component of autonomy. And then there is an agency component, which means that once we have made a decision, or we have formulated a desire or motivation, we are able to act upon it. So we have the freedom and the opportunity to act upon it. And in the report we also bring in a third component, which is competence. In order to make authentic decisions and to be able to act on those decisions, you need to have certain basic skills: you need the ability for self-reflection, you need adequate information, you need certain cognitive skills. Now, when we think about how AI might affect autonomy, it depends on what we mean by autonomy. It also depends on what we're looking at in particular. One thing that we highlighted in the report is AI's impact on cognitive skill development, or deskilling, for example. And there we already see some evidence that prolonged use of AI systems might contribute to deskilling effects. So, for example, if you give doctors AI systems to assist them with the identification of tumors, then after three months of using those systems, once you take the systems away, their identification performance drops by about 6 percentage points. Now, 6 percentage points is not huge, but it's also not nothing. 
And it's quite reasonable to think that this trend will continue the more we delegate tasks to AI systems. This is also called cognitive offloading, where we are basically offloading cognitive tasks to the systems to perform them in our stead.
B
Yeah. So in this example, the clinicians get worse at the task because they've offloaded that task for a number of months to AI systems. Are they more productive overall? Because it seems plausible to me that offloading a task would make you worse at that task. But I'm not especially worried, for example, about my ability to do calculations in my head, given that I have a calculator all the time. And so if it's the case that the doctors are getting worse at a task, but they will always be able to offload that task, what is it that we should. Or is this something that we should worry about?
A
I think that's exactly the big question. In particular cases, perhaps not. Having said that, by being dependent on any sort of technology, you are increasing your vulnerability in cases where the technology doesn't function as intended or is being taken away due to licensing issues and so on. So it does certainly increase the vulnerability if we rely on them. So we just need to make sure we have the right infrastructure in place to either make sure that the technology always remains available and functions as intended, or to make sure we have alternative solutions. So this is for narrow tasks such as, I don't know, adding large numbers in your head, where we have calculators and we don't think this is an essential skill. It becomes problematic when we find that this is not only happening in isolated places, but in many different contexts in our lives, as AI is becoming more embedded into our everyday lives, and where we think that it might negatively impact our critical thinking skills more generally. Now there are some first studies that show that there are some negative effects of prolonged use of AI in educational settings on critical thinking skills. But researchers are divided about how much these studies can actually tell us about the long-term impacts of AI use on critical thinking. But this is a case where we might ask, just as a human race, where is the limit of cognitive offloading? In a way, you're exactly right, it's a bit like a muscle. You're training your critical thinking skills, but if you don't train them anymore, then they might erode. And if it's more this global phenomenon of critical thinking, that might be problematic.
B
Yeah, this might be an issue just given the fact that AI is so general purpose. Right. It shows up everywhere and it can do a bunch of different tasks for us. And so while offloading a single task might not really be a problem, it might in fact be a problem if you're doing that in a hundred different ways. Maybe you begin to not be in contact with the decisions you're making, and you just become worse at the things you're trying to do in the world. It's a generally undecided question how much we should automate. Do you have some thoughts about how we protect our authenticity, for example, which seems to be, at least to me, one of the most important aspects here: that even if we are offloading some tasks to AI systems, we are still expressing our preferences and what we want in the world in an authentic way?
A
Yes. So I mean, one of the issues about authenticity is that it is heavily shaped by our social environment already. So this idea that the human is just the isolated individual is kind of outdated. So now we do take into account that we are shaped by our social environments, by our natural environments. And here also we see that AI systems are coming in through, for example, the increasing use of AI companions. Now we have hundreds of millions of people who are using AI companion apps, and again, you will think that in the individual instances there might not be a lot of damage. The evidence on the impact of AI companions on, say, mental and emotional well-being is quite mixed. So, you know, some studies say it can help with minor depression. Other studies say no, it actually worsens symptoms of depression or schizophrenia and other mental health conditions. At the same time, just consider the case where you're asking ChatGPT for some relationship advice or dating advice. It's reasonable to think that ChatGPT probably has, I don't know, better advice than your best friend, whom you might ask instead. But you're not asking your best friend, you're asking ChatGPT. Now, again, this is not a big issue, except that now you're turning to ChatGPT for these sorts of questions, not to your friend anymore. And as a result, your relationship with your friend might become maybe more superficial, maybe you will be talking about other things, so you might just lose a bit of this meaningful connection with other people. So these are the kinds of effects that we need to look out for, because these are the effects that also shape how we're interacting with our environment, not just cognitively, but also socially. And the first studies do show that heavy use of AI companions seems to be associated with decreased social contact with other people. 
So these are all phenomena we need to be aware of and look out for, just to make sure this is really where we want to go.
B
Yeah, yeah. Do we know if this is sort of, if the causation is taken into account here? Do we know if it's the case that lonely people seek out AI companions or, or if it's the case that people are made lonely because they interact with AI companions?
A
So it depends. I would say there are very few studies currently on the impacts of AI companions, and I must say not all of them are of very high quality. If I remember correctly, and I would have to double check, and maybe in the final version you can put in the link, there was an RCT that was looking at the effects of AI companion app use. And with RCTs, you can at least get a better sense of what the causal effects are.
B
Yeah. You mentioned yourself that we are not isolated individuals. Of course, we get influenced by friends and by books we read and by stuff we watch on TV and so on. And so what makes AI different from all of those influences, or is it different? Because I think we can all imagine the pernicious case where you're talking to an AI system and that system is programmed to try to sell you something. This is the sort of almost cartoonishly evil version you can imagine. So it influences you over a number of weeks to, I don't know, buy this new car or something, and you think you have sort of a friendly relationship with the AI. So you're persuaded because you trust the system. And in that sense it might be different. But my question is, how is being influenced by an AI system different from being influenced by a friend or a book?
A
I mean, the easy answer here is that AI systems just have much more access to information about you and can therefore tailor their suggestions. Yeah, your evil AI example I think is quite realistic. Imagine you have a companion app and your companion says, oh, I'm very sad today, but if you bought me a hat, that would make me happy, or if you buy me a Coke, and then you have in-app purchases and so on. So I think once the emotional door is open, then these sorts of manipulative capacities become much more dangerous. At the same time, when we currently look at the use of general purpose AI for large-scale manipulation, we don't have a lot of evidence that there is such large-scale manipulation happening. We do know that general purpose AI systems are as persuasive as, if not more persuasive than, humans in lab settings. That's now a well-established fact. But we do not have good evidence that this is what is happening in real world deployment, that there are these large-scale AI-driven manipulation campaigns.
B
Yeah. Do you have a sense of what we should do to prevent some of these negative effects? If we imagine a future where we have much of our information coming through these AI systems, you're asking for advice about legal stuff and medical stuff, and you're asking, you know, what party should I vote for? Which career should I choose? Very important decisions. How do we make sure that people are not influenced in a way we do not ultimately approve of?
A
Yeah, actually let me just say something on the question before and then swivel over to this new question. One thing that also makes AI different is that it is personalized. So we have very private interactions with our chatbots, with our AI assistants, with our AI companions, as opposed to reading books, watching television or seeing movies that other people have also seen. A colleague of mine, Silvia Milano, called this epistemic fragmentation, where we are each in our own epistemic silo. Other people don't really have access to the same things that we see, that we're exposed to. And I think that just poses another risk, because we lose this sort of communal factor. Usually we do talk to other people: have you read this book? We can talk about the strengths and weaknesses. Whereas when nobody else sees what our chatbot is telling us, then we don't have this extra opinion to see whether it might be harmful or it might be changing us in a detrimental way. Now, how we can protect ourselves from unwanted influence is a really tricky question. A first instinct would be to take a step back, similarly to how we think about protecting ourselves from excessive social media consumption. Take a step back and think about the big picture. Is this actually how I want to live my life? All the things that I'm now thinking, or that I'm now feeling, are they authentic? Is that something that I would endorse given my social background and my cultural background and the environment I'm in, my personal identity? So I think that's a very noncommittal answer. But that is one way that individuals can at least influence the process. And in the end, we are also our own people. We can decide what we want to use AI systems for, at least in the private setting.
B
Yeah, I'm a bit worried about this whole notion of authenticity. I mean, you mentioned some problems with it, given that we are influenced all the time by many things. And it seems like if we're aiming for maximum authenticity, we need to, I don't know, live in the wild and not talk to anyone and just try to be an individual that's not influenced by anything. But such a person probably doesn't exist, right? So, perhaps this is a bit of a philosophical question, but does the notion of authenticity ultimately make sense, do you think? What is the core of that notion?
A
So authenticity is not incompatible with being within a social and cultural context. Quite the contrary. I mean, the big keyword here is relational autonomy, which is the idea that authenticity is also a relational property.
B
Yeah. Let's chat a bit about defense in depth. What does that approach look like? And how far along are we in developing the different aspects of our defense in depth?
A
Yes, so defense in depth is the idea that you layer safeguards in order to make sure that threats or harms don't get through. What we currently say is that individual safeguards are not bulletproof. Very often they can be gamed. We do see improvements in particular safeguards, but they can all be circumvented. Just think about watermarking AI-generated media. That is considered a safeguard, at least because it allows us to monitor the AI ecosystem. There are ways to circumvent this safeguard, but if you combine it with various others, then there might be a way to improve the safety of general purpose AI systems. Maybe I should just give you a quick example. In the report we talk about deploying safeguards at different points in the AI development and deployment process. You can deploy safeguards during development, for example data filters. You can deploy safeguards during deployment, while people are using the system, for example through input and output filters. You can deploy safeguards afterwards for monitoring the ecosystem, such as watermarking. And then finally there is societal resilience: we can also deploy safeguards on the societal level, I don't know, through AI literacy programs, just to give you one example of how this might function. By having these multiple safeguards in place, we probably won't be able to avoid all harms, but we can reduce the risk of harm substantially.
B
Yeah, the hope here is that, say, we have the companies developing AI implement some safeguards, and then we also have other safeguards at the individual level, and perhaps we have government oversight of the most advanced capabilities. So there are different ways to monitor, look out for threats and try to prevent risks from materializing. And as you mentioned, this is not necessarily bulletproof; it's not necessarily an approach that completely solves the problem. But this is probably the best way, because there's redundancy in the system. Even if one approach doesn't work, maybe another approach will. And this is something that has worked in other areas, where defense in depth is perhaps more realistic than searching for one solution that's going to keep AI safe.
A
Yes, I'd be very pessimistic about finding one solution that keeps AI safe. But we can think about it a bit in another context. Take as an example the COVID pandemic. Individual measures such as washing hands, wearing masks or avoiding indoor spaces: none of these would be sufficient in itself. But when you combined those different measures, you were at least able to reduce the rate of infection to a certain degree. Again, we saw that it wasn't perfect, but eventually, with vaccines as well, we got a handle on the pandemic.
B
Yeah. One approach I'm quite interested in here is red teaming and bug bounties for AI systems. This is something that has worked in the software industry in general, where you offer money to people for finding flaws in your system, and you might offer early versions of a system to be tested and pushed to its limits by experienced users to see where it breaks down. Given what you've read for this report, is that something you're optimistic about: red teaming, and also bug bounties, which I see as being in the same category?
A
Yeah, I think it's promising because red teaming allows you to really try to tease out harmful behavior from AI systems. We talked earlier about the limitations of benchmarks and of standardized evaluation protocols. Red teaming, at least for now, seems to be a really promising avenue to go more into depth in the evaluation. AI systems might behave very differently after you have had a conversation with them for a while than if you only give them one prompt and they respond. Red teaming also allows you to use other tools to really stress test AI systems in very particular areas. Now, there are limitations. Red teaming probably cannot be as comprehensive as the other standardized pre-deployment evaluations that we use, such as benchmarks. But I think it's very promising. Same with bug bounties. Trying to stress test your systems seems like a really good idea.
B
Yeah, this is actually exactly why I'm so positive on this approach. By necessity, when you set up a benchmark, you need it to be a standardized thing and you need to move quickly through a bunch of different questions. But if you have an expert user trying to push the system, to see where it breaks and which capabilities can be elicited over a long conversation, you probably learn a lot about the system that you do not discover with benchmarks. I don't know if you agree with that.
A
Yeah, I do agree. I think there are some limitations to red teaming currently. One of them is that it's hard to compare different red teaming efforts because very often they're not reported in a standardized way. In an ideal world you'd have some sort of standardized best practice for red teaming. But I don't think that's possible, or maybe even desirable, because you want the experts to be able to play dirty. You want people to really try to make the system behave in a particular way. You don't want to prescribe the steps they have to go through. So I'm not sure whether it's possible to have a standardized best practice for red teaming at a meaningful low level. Maybe it is at a high level. But the low-hanging fruit is to start with reporting standards for red teaming, so that efforts are comparable.
B
Yeah, that's a good point. One aspect that's covered in the report is the risk of loss of control. And maybe you could explain what is meant by loss of control in this context.
A
Yeah, so loss of control is the idea that human beings find themselves in a position where they're unable to control an AI system. So quite the intuitive meaning of the word.
B
Yeah. And would you describe that as a near-term or a long-term risk? What do we know about the models and how close they are to causing loss of control for us?
A
So the research community is still very divided on loss of control, with some saying that this could be, if not a near-term, then a medium-term risk, and others still finding it quite implausible as a risk that would actually materialize. That's something I should definitely flag. What we were able to get consensus on in the report was that certain conditions need to be fulfilled for loss of control to happen, and that certain early warning signs seem to have emerged. For example, systems are now able to distinguish regularly between test and deployment settings. They're also able to adapt their behavior accordingly to try to game the evaluation, such that the system underperforms in the evaluation setting if it is aware that performing at capability would lead to restrictions in the deployment setting. Or you have laboratory experiments where systems are told to achieve a goal at all costs, and then they won't hesitate to kill, or at least to threaten to kill, people, or to blackmail people. So we have some of these early warning signs emerging. A big issue in particular is the strategic behavior of systems behaving differently in deployment and test settings, because that massively limits human oversight. If we can't ensure that a system behaves in a test environment the same way it behaves in the deployment environment, then that really undermines our oversight measures.
B
Yeah, yeah. And I guess this leads into the final topic I want to discuss here, which is the mixed opinions among the experts. Some people might say that, okay, although this seems threatening, this is actually behavior that's elicited in an artificial way in these studies. But other experts believe that we're seeing warning signs of the capabilities necessary for loss of control arising. Now, given that disagreement, how do we act? How do we weigh the evidence here? Because you can fail to act while you're waiting for consensus, and that would probably be bad.
A
So one of the things about these loss of control scenarios is that the early warning signs, such as the strategic behavior, are not only relevant for loss of control scenarios. They are also relevant for human oversight in general. And the skeptics will also agree that human oversight is important. They will also agree that it's problematic if AI systems behave differently in test settings than in deployment settings, even if they're not concerned with this sort of extreme loss of control scenario. So I think people don't necessarily need to agree on the plausibility of loss of control happening within the next five or ten years. But I think they can agree, and they probably will agree, that the sort of issues that could lead to loss of control are problematic because they also have other implications, for example for human oversight.
B
Yeah, that makes a lot of sense, actually. Carina, do you want to point people to where they can go read the report and find out more about it?
A
Yes. So you can find the report at internationalaisafetyreport.org. I should say that the report has various versions. There is the long version, which is 220 pages long, where we cover a number of different risks as well as the capabilities of AI systems today and what they might look like in the future. We also go into different types of risk management, such as institutional risk management, technical safeguards and resilience measures. There's also a shorter version, an extended summary for policymakers, which is 20 pages long. So for those who, I don't know, don't have a lot of time, I can recommend reading that. And then there's a three-page executive summary. So if you only have three minutes to spare, that's where you ought to go.
B
Amazing. I will put the links in the description of this show. All right, thanks for chatting with me. It's been great.
A
Thank you very much, Gus.
Date: April 17, 2026
Guest: Carina Prunkl, Lead Writer, International AI Safety Report
This episode explores the rapidly evolving field of AI evaluation science and the critical challenges it faces in keeping pace with advancing AI capabilities and risks. Carina Prunkl, a leading researcher and one of the main contributors to the International AI Safety Report, discusses why current evaluation methods are often insufficient and how this shortfall could have profound implications for AI safety, human autonomy, societal well-being, and global governance. The conversation delves into technical, ethical, and strategic issues that shape the current landscape of AI oversight.
Jagged Capability Profiles
Lack of Comprehensive Measures
The “Evaluation Gap”
Systems “Gaming” Evaluations
Defense in Depth is the layering of multiple safeguards at development, deployment, ecosystem, and societal levels.
Red Teaming and Bug Bounties
Carina highlights the urgent need for more robust, transparent, and adaptable evaluation methods as AI systems grow in complexity and societal impact. Defense in depth, clear reporting, and diverse stakeholder involvement are essential for responsible oversight. She recommends consulting the International AI Safety Report for detailed data, methodology, and policy-relevant summaries:
“You can find the report at internationalaisafetyreport.org... There’s also a three page executive summary. So if you only have three minutes to spare, that’s where you ought to go to.” — Prunkl (53:16)
For further reading: internationalaisafetyreport.org