![Navigating AI Safety and Security Challenges with Yonatan Zunger [The BlueHat Podcast] — CyberWire Daily cover](https://megaphone.imgix.net/podcasts/58ab7ae0-def8-11ea-b34c-b35b208b0539/image/daily-podcast-cover-art-cw.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress)
Loading summary
Nick Fillingham
Since 2005, Blue Hat has been where the security research community and Microsoft come.
Wendy Zanoni
Together as peers to debate and discuss, share and challenge, celebrate and learn on.
Nick Fillingham
The Blue Hat Podcast. Join me, Nick Fillingham, and me, Wendy.
Wendy Zanoni
Zanoni, for conversations with researchers, responders and industry leaders both inside and outside of.
Nick Fillingham
Microsoft, working to secure the planet's technology and create a safer world for all.
Wendy Zanoni
And now on with the Blue Hat Podcast. Welcome to the Blue Hat Podcast Podcast. Today we have Yonatan Zenger and we are thrilled to have you here. Yonatan, would you introduce yourself, tell us who you are, what do you do?
Jonathan Sanger
Well, hi. Thank you so much for having me on the show. So my name is Jonathan Sanger. I'm currently CVP of AI Safety and Security at Microsoft, as well as Deputy CISO for AI. And you know, my job is to try to think of all of the things that could possibly go wrong involving AI and figure out how we're going to try to prevent them from happening. I think that's sort of the short version of it. I came to this from a career originally as a theoretical physicist, moved over into CS sort of full time back in the early Zeros, where I started out building heavy infra. I built a lot of the core part of search at Google, a lot of planet scale storage, things like that. And then in 2011 I became CTO of Social. And this was just at the time that Google was about to launch. This was also the time that GDPR was being drafted. And within three weeks of taking that job, it suddenly became very clear that the hard part of this job wasn't going to be software infrastruct. It was going to be people's safety, it was security, privacy, abuse, harassment policy, all of these things. And I discovered that I genuinely loved that. I fell in love with the field of trying to really solve these problems. And that's been, I would say, one of my biggest foci professionally ever since. And so now I'm really excited that I'm getting to work on one of the craziest, hardest problems, even by the standards of a pretty strange career. One of the strangest and hardest things I've ever worked on. And yeah, that's what I'm doing now.
Wendy Zanoni
I love it. I love all the nuances and the human side of what you're doing. If you could let the audience know, for some that are still learning about the AI field, what is generative AI?
Jonathan Sanger
Well, yeah, that's a really good question because we've had AI of various sorts for a very long time. And generative AI has also existed for a long time, but it only became a really big deal in AI a little more than like a year and a half ago or so. So the way to think about it, the sort of the traditional kind of AI, I'm referring to it nowadays as predictive AI. What does your world look like in the world of this traditional AI? Typically, if I want to use a model for something, I'm going to build a model. So the model user and the model builder is the same person. And you take a bunch of examples, you train a model. What are these models generally good at? They're good at looking at a really large field of data and making a prediction or a classification or a recommendation or something like that. They're good at looking at these very, very large spaces and analyzing. And of course, the problems you're dealing with now, because you're both model builder and model user, is you now really have to worry about, is my model biased? Did I pick the right training data? Does this thing have really weird, nuanced failure modes? And then you have to think about all the safety aspects of your integrated system. Am I using it wisely? Et cetera, et cetera. Generative AI is a bit of a different world. At the very deep technical level, it's the same basic approaches. We have neural networks, all these structures. But in practice, it's often better to just think about it as a complete, completely different technology. From a practical basis, the idea of generative AI, sort of at the very technical level, of course, what you're doing is you're predicting character sequences, token sequences, images, things like that. In practice, the way I would think about it is you've got a model, and first thing to realize, in most cases, you've got a generic model. It's a model where one person trains it, and you're going to use the same model for a huge range of applications. So the model trainer and the model user are now two completely different people. And what are these things good at? Basically, there are two things that generative AI is good at. One of them is it's good at summarizing or analyzing a piece of human type content, so natural language or an image or something like that. It's very good at saying, here's a paragraph of text, give me a summary, extract the key ideas, something like that. And the other thing it's really good at is roleplaying a character. And this is the foundation of most of what we do with generative AI is basically a Lot of creative use of role playing. You tell it you are accustomed customer service agent for Wombat company and you're about to be asked a question by a customer and you know how to search through the following databases of information, et cetera. Or you say you're a programmer, you're a Python programmer and you've been asked for your advice on this piece of code. And like you need to write a function to do something, you're a security expert and you need to help analyze this forensic, this set of forensic logs, something like that. So this sort of creative use of role playing is one of those fundamental engines to it. So I guess the way I would say is that what is generative AI really at the innermost loop, it is a combination analysis and roleplaying action which you can then build up to build all sorts of cool things out of.
Nick Fillingham
Jonathan, this may be too large a question, but what I wanted to ask was it almost sounded like you described the entire breadth of AI we were talking about just generative AI. So what's beyond that in terms of. So you talked about role playing and you talked about sort of the ability to synthesise or summarise data, obviously paraphrasing heavily, what else does AI do that's not generative AI? Again, probably a very large question, but how do we sort of think about these different roles and functions that AI can take?
Jonathan Sanger
Well, that's the predictive AI I was talking about. The generative AI is the one that does the analysis and the role playing. The predictive AI is the stuff that does the classification and the recommendation and the analysis of that sort. I'd actually give a really good analogy from the human brain if you think about how the vision system works. So the human vision system is a stack where the very first input to the stack is the retinal neurons. You have the direct things that are measuring brightness or color or something like that. You have several stages in this stack which then go from pixels to very small curves, to large curves and shapes, to two dimensional shape recognition to three dimensional object recognition and so on. This is very similar to how predictive AI works, right? You have a model that is just scanning at a tremendously large range of things and pulling a small number of features out of it. Once you've pulled out those first things though, your next layers of the stack is saying, oh, I see a three dimensional shape. Wait a moment, I recognize that shape. That's Wendy's face. It's then the thing that's starting to go off and identify what's happening around you. The sort of things you can articulate in words like, holy crap, there's a tiger and it's about to jump on me. That kind of higher level processing is in a lot of ways more similar to what generative AI is doing. It's narrative. You can think of it most usefully as processing that happens either directly in words at the highest level or in almost word like concepts at the level below it. And the reason I bring up this analogy is because it highlights the way in which the two kinds of AI actually complement each other. These are not replacements at all. What happens is that this predictive AI, you can really think of it as AI that specializes in looking at very large fields of data. The model tends to be very specific to the problem being solved. So the vision centers of your brain, if you tried to plug them into your ears, they would not work correctly. That's not what they're there for. Whereas these higher level abstractions, the generative AI is really good for dealing with the higher level abstractions of the things that you can narrativize, the things that you can turn into words. So in a good healthy environment, what you're generally doing is that you're using these predictive AIs to scan very large fields of data and reduce a mass of pixels into a statement of, oh, here's a picture of somebody's face. And then you're taking that information and you're handing it to the general, to the narrative layer, the one that speaks in words and so on. And it then starts to assemble these, reason about them, talk about them, have these very generic kinds of conversation about them. So that's sort of the difference. That's the way we see.
Nick Fillingham
That is a wonderful analogy. Thank you so much. I do want a quick pause. As an Australian, I noticed that you used wombat as your example there. Why wombats? Is there a story there?
Jonathan Sanger
Why not Wombats?
Nick Fillingham
I love it. Why not wombats? We should put that on the sticker. So I want to sort of like help people sort of still continue to wrap their head around generative versus predictive and other forms of AI. Can you give us some examples of sort of positive uses of this or to juxtapose against negative use cases here. How do we sort of think about the good and the bad? And I'm using air quotes of this technology.
Jonathan Sanger
Well, you know, the good and the bad is very much in how you use it. What are some examples of good uses? I mean, there's so many of them. Honestly, you know, I'll just give some random things that pop into my head. Dynamic temperature control for factories and data centers. I remember that was an example that came up a decade or more ago. But it turns out that you can have a system that stares at all of the temperature sensors across the building and controls whether to open windows or not and how to run fans and so on. And you can make a building spectacularly more energy efficient by doing that Self driving cars when they're not designed by maniacs. This is a technology that can save an awful lot of lives. I mean, well, social media is always kind of a complicated mixed bag. But if you think about this idea of helping people meet other people, the actual driving purpose of this, and I think we often forget, given how many problems have emerged in social media, we tend to forget just how much good this has actually done in people's lives. How many people have formed and maintained their friendships, their jobs, their entire profession, sometimes their romantic relationships. There's so many things that people have formed through this. And if you think about this, this is really about a lot of this is the use of algorithms to try to help you find who are the people you might want to actually be with. Who are the people you might want to be around with. The generative AI, it's still a very new technology. So I think we haven't yet seen the killer app of generative AI. I think we're in a stage where I think the single most important piece of office Software of the 2030s, the category has not been invented yet. We're really at that new stage. The thing that is going to be the equivalent of what the spreadsheet was for the personal computer or what direct messaging was for mobile phones. We haven't even invented it. We don't even know what that thing is yet. Right now we're sort of seeing these very early examples with generative AI. I think we're finding that it's really good as an interlocutor and brainstorming partner, for example, I think there's a lot of very interesting potential there. It's also something that you can combine with a lot of more traditional techniques. Like for example, one of the classical challenges, things that you really can't do with AI today is that you can't start with predictive AI is it's not really good at understanding language. Understanding natural language is actually a very, very difficult problem. It's what we used to call an AI complete problem. In fact, it turns out even pronoun resolution, that is knowing what A pronoun refers to in a sentence. Is AI complete in the sense that it actually requires a full model of the world and a theory of mind in order to do. There's sort of a classic example. I think this example might be due to Steven Pinker. I can't remember for sure, but here's a sample dialogue for you, woman. I'm leaving you, man. Who is he? Now, I'll bet you probably had no trouble understanding those two sentences in that dialogue, and you could probably tell me exactly who the he in that second sentence refers to. Yeah. Now explain to me who that he was. Without a complete theory of mind of both of the people and of what the people are thinking that the other people are thinking and so on, of two characters that I've literally identified as just man and woman already, you had to solve that complicated a problem. So one of the traditional challenges that we've had in all sorts of data science is that understanding human language is really, really, really hard. And what's one of the genuinely stunning things about the recent revolution in generative AI, the one that's happened in year and a half or so has been that we finally have software that's capable of just looking at a piece of human text and actually understanding it and extracting information from it. So if I ask it to resolve that and to then transform that into a structured form, I can. Which means that I can potentially apply this sort of analysis at scale to large amounts of human data and interact with human information in entirely novel ways and of course, interact with people directly. So I think there's tremendous possibility for some really wonderful things to happen here. Another suggestion that I've heard a lot talked about is personalized education as a service. Imagine I want to learn about X, and this thing will help put together a syllabus. It'll do all of the research it needs in order to find all of the right information, figuring out how best to teach it, and then it can teach me interactively, because it's not just going to create a PDF or a PowerPoint presentation. It can actually go back and forth and work with me and teach me all of the things I need to know. Imagine what this could do for the world. Give everyone access to a teacher. So there's tremendous possibility for good, and of course there's tremendous possibility for bad, because there is no single technology humans can come up with that can't be horrifyingly misused. Just to take a simple example, we were just talking about education that's entirely wonderful until what the person wants to learn is how to weaponize anthrax or how to kill people, how to encourage a genocide. I mean, there's so many things that people might want to learn that are really horrible. And this is the point where we start to really run into deep nuances and we sort of have to ask ourselves, well, these problems already exist in the world. How do we prevent them? Right. There are people in this world who do know how to weaponize antifax. But I can promise you that if you went to one of them and asked them, hey, would you teach me how to do that? They would say no, they would probably not do that. But their judgment about when and how they're willing to do that is an interesting nuance. It's something that we need to figure out a way to capture and express and formalize. There's a lot of other very simple ways you can misuse AI in basically any way you can imagine misusing any technology. One of my favorite example might be the wrong word for this, but one of my classic examples of a misuse is there has been a whole business of using artificial intelligence to help make sentencing recommendations in criminal law. And ProPublica had an expose of this back in 2016, which I think is really worth reading. This works as badly as you might imagine. So for example, if you look at these companies, they were very careful not to take race as an input signal because that would be horrible, you shouldn't use that. But they did take income and your quantized address, like the neighborhood you lived in. And if you know anything at all about how American politics works, your income and the neighborhood you live in is a really good proxy for race in most of this country. And then you end up with sort of a proxy signal problem. The theory behind it was, well, we're going to predict who is most likely to commit another crime, and we will recommend harsher sentences for people more likely to commit further crimes. There are a couple of obvious problems with this one. First of all, the sentence you give someone does affect their probability of committing a crime in the future, right? If you make sure that someone can't get any sort of non criminal job in the future, they're probably going to be criminals. But the even deeper one, the big problem that really kills this is when someone commits a crime, it's not like a giant light bulb goes off over their head saying, attention, this person has committed a crime. You can't actually measure the variable you care about. So they picked a proxy variable. They measured whether someone was charged with A crime. And the thing is the difference between committing a crime and being arrested for a crime and being charged for a crime. That is not a uniform translation matrix. If you ask the question, who is more likely to be charged through the crime? The answer is the black person. What they basically built was a system to predict race. They picked a system that was modeled exactly to capture measure the nature of institutional racism in the United States and then implement that as sentencing guidelines. And this is a system that proceeded to go off and destroy a bunch of lives. I think this is a really good example of how not to use AI, of really, really dangerous, ill conceived decisions. And in this one, the obvious thing that people didn't think about was the basic question of what happens when this thing makes a mistake. This is the basic question you need to ask with any piece of software that you're building, any machine you're building, what happens if something goes wrong and in this case something very bad happens, especially because they send it up in a way where basically its recommendations were almost automatically acceptable. So you have to really architect your systems around the possibility of failure. And especially for things like AI, where the system is inherently non deterministic, where all of its categorizations or predictions or outputs are always going to be probabilistic, you have to be very, very careful and make sure that your system is robust. Your system, the integrated system, including all of the people who are using it and the people who will interact with it, are robust against the system being wrong.
Nick Fillingham
Wow. Gosh, so many directions we could go here. My first question is jumping off from that example. It's too simplistic to ask where did they go wrong? I think I want to ask more about sort of ethics. When you design a system that has these kind potentials for these kind of sort of significant outcomes, do you hard code in a bunch of ethical rules or do you give the system the ability to monitor the outcomes, to then sort of adjust those sort of ethical guidelines that it's functioning on? Or do you still need in 2024 human beings with their own ethical guide to be able to monitor and control it or something else? How do ethics as a guiding force, but then also a component in an AI system play into this?
Jonathan Sanger
That's a really wonderful question and I think there is no single answer to it. I think the correct answer to that question is very much dependent on the exact system that you're building. The way I would frame the approach to this, this is, I think, one of the most basic lessons that I always try to teach people there are two parts to engineering. Product engineering is the study of how your systems will work, and safety engineering is the study of how your systems will fail. You can't do just one or the other. And I think one of the great curses of modern computer science, the way that the field is working, which we desperately, actively, urgently need to fix, is that these are treated as two separate things rather than part of the same discipline. If you go talk to civil engineers, you will see a very different story. Civil engineers are safety engineers who occasionally build bridges. It's a very different culture and I think a much healthier one. What does a safety engineering culture mean when you are working with AI or something? Well, in fact, it turns out AI and social media, I think are very similar also. Search gaming. Any software that really intimately involves humans and AI, you get very similar problems and you need a similar approach. So what do you do? Very, very first thing, from the moment you even start to conceive of, hey, I've got a crazy idea, what if we did? At the same time that you're thinking about what it could do, you're also thinking about what might go wrong in this situation. I have a whole set of things that I try to teach people about how to think of ways that things can go wrong. And we're actually working with my team right now on writing up training materials and creating things to help people learn how to do this. But the very, very first thing you're doing is you're coming up with a list of things that could fail, a list of ways in which this thing could go badly. Your basic approach to this, by the way, that's a three pass approach. First you go system. First you look at each component of the system and ask, what happens if this thing fails? That might mean what happened if it makes an error? What happens if it gets a malformed input? What happens if it gets an actively malicious input? What happens if it gets an unexpected input? Just all of the ways in which some component could go wrong. Your second way of looking at it is attacker first. What if someone is trying to misuse this system? What if someone is trying to use the system, let's not even say maliciously, but for a purpose other than the one you intended. What might they be trying to accomplish? How might they use your system in order to accomplish that? And your third pass is the target first pass. That's where you're looking at it as who are the people who might be affected by this system? What aspects of their lives might cause them to be affected by this system differently from other people and what are there particular vulnerabilities in their lives that might be around? So one of the things we're working on here also is checklists of ideas to help people think of different possibilities here. I'll also say this is the place where I always say that this is the place where diversity, equity and inclusion makes such a big difference in your ability to actually correctly do your job. Because the one thing I can promise you is that you cannot think of what every possible attacker or affected person might be experiencing in their lives. They are very different from you because you are one person. There are a lot of people in the world and they're very different from each other. They have very different lived experiences. And having a broad team, a team with a really wide range of lived experiences and a team that's empowered to speak up about those things is critical to actually being able to do this analysis correctly. Your very first step at the very beginning of all of this is you think about what could go wrong. Now you've done that, you've got your list of threats, you bounce this off like a bunch of people. By the way, this is not a one off process. This is the process you're going to be continually doing every single day of your life. From the day you first conceive of the project until the day it gets shut down for the last time. You're thinking about what can go wrong. And then you're thinking, well, okay, for each one of these things, I need to have a plan. And your plan might involve mitigation, like preventing it from going wrong or making it less serious. And there's always going to be some aspect of it that you can't mitigate. There are problems in this world where they look at this and say, oh, I'm going to change the design of my system so that this thing is impossible. That's wonderful. When you can do that, that's your best choice. And by the way, this is also why it's very important to do this sort of analysis from day one. Because often you can make a small change in the design of your product and just in the basic shape of it. It eliminates whole swaths of potential problems while leaving the core product function that you care about intact. And that's often a really easy thing to do in the early design phase and is almost impossible to do after you've built your entire. So don't wait for that. I have seen projects get two weeks from launch and then someone points out a basic Problem with this and surprise, you have to go all the way back to architecture. See you again in six months. Don't do that. That's awful. It's a terrible experience for everybody. So how do you do this next step? Sorry, I'm going off into the complete spiel of how you do safety engineering. Love it.
Nick Fillingham
Please, please. This is wonderful.
Wendy Zanoni
Yeah.
Jonathan Sanger
So what do you do next? The next thing you do is for each one of these threat scenarios, you walk through the way that the threat scenario actually happens. You walk through the exact what are the sequence of events that have to happen for this to go wrong. And the reason you do that is you start to then highlight possible intervention points. Where are things that you could do that would prevent that step from happening or would change the outcome of that step? Once you've done that for each of your threat scenarios, then you compare those intervention points across all the threat scenarios. And then what you'll often discover is there's a few intervention points that actually help you with a lot of different threats. And that's the point where we can start thinking about mitigations. How might you change your system, harden it, make it more robust to make those things less likely? You keep sort of doing this in a loop until you now have a hardened system. But at every stage of this you've got sort of a residual threat, right? You have events that could punch through all of those defenses and still happen. And so your last stage is always when this happens, not if this happens. When this happens, what are you going to do about it? Right? And that's the how will you know that something has happened? How will you respond to it? For example, with a lot of user facing software, this is the point where you start really thinking a lot about the user experience. By the way, you cannot treat UX as being distinct from any other aspect of your system. One example of this one. Let's talk about abuse on social media, right? So turns out there's a lot of harassment and abuse on social media. That is like one of the primary things people use social media for. Sadly now you are. So you've got various things to try to prevent it, but that's going to get through. It's going to happen a lot. So now you say, okay, let's say I'm running a social network and someone can create posts and they can get comments on it and the comments can be really terrible in various ways. Now the objective of the user, when they encounter one of these comments, like they're going to be very upset right now first thing, they want to get rid of this thing and they want to make sure that this person goes away and never comes back. That is their objective. Now, you've actually got a bit of attention now because the goal of the system operator is not just to get that detection, but to get enough information to figure out like, did this violate policies? Is this thing a signal of a broader problem? Is the user who made this comment a serious problem that we need to be kicking off the service? Or conversely, is this something entirely personal between these two people? That has nothing to do with this. Also, I mean, I mean, abuse reporting is often actually done as an attack vector. People will mass abuse flag people that they don't like, not because those people are being abusive, but just as a way to try to get them kicked off the service. So in fact, false reporting is a big issue. So the system operators really want to collect as much information and context as they possibly can about an abusive incident so they can make a good decision. But these two goals are intentional. So because in fact, one really useful way to think about this, if you think about emotional activation curves, they tend to spike very rapidly. You're looking at a timescale of between 500 and 2000 milliseconds, typically to see an emotional activation curve rise. They decay, people calm down much, much, much more slowly. That is a timescale of typically for a small oscillation like decay, is there minutes? Actually minutes, not miles. But if the event keeps happening, you keep moving up, right? Imagine sort of a curve where you can either every time a bad incident happens, you add an exponentially rising curve. And whenever anything isn't happening, the thing decays with a very long time constant. So you can keep going up, up, up, up, up. So user has seen this upsetting comment, they get their first spike. If there is a big red button they can hit to make that thing go away, then they can go right back into decay mode almost instantly. If there isn't, then for every time they look back at that comment, you're going to be getting another spike and it's just going to keep going up, up, up, up, up. So what's actually really important is that the user needs to be able to dismiss that thing on a timescale of seconds. What you really want is the time from them experiencing it to the time they're done with that problem to be five seconds or less. I would say sort of a good not rule file. That means that the report abuse button, if report abuse makes you go through this whole abuse reporting flow where you have to now declare which category of abuse is it? And et cetera, et cetera, et cetera. You're getting good data for your team, but you're actually not achieving the core user need of getting into a safe state quickly. So the correct design of this kind of system becomes very, very subtle and nuanced. And this is actually sort of the core. Now going back to, to where we started, you had a threat scenario of users experiencing abuse on the platform. You need to think of intervention detection response in a way that solves the user's problem first and then separately the question of how do you now get the signals that let you do a more detailed analysis. Because now what you've seen is okay, this user flagged this comment as being problematic. Most likely, that's all of the information you've got. If you now want to look for broader patterns, you now have to actually think like a data scientist. You have to think how do I analyze this situation to figure out is there a larger pattern I need to care about? Then there's all sorts of things you can do in order to do this. So for example, here's one simple rule. Let's say that you have one user, that there's a real pattern that every time they interact with someone that they don't have a pre existing relationship with, the probability that that person is going to report their comment as abusive is unusually high. That's a really good sign that you are dealing with an asshole. That's a real, and there's a lot of things like this. Actually one of the most important rules in abuse detection is sometimes you're looking at reports of things and let's say that you're dealing with, well, if I'm dealing with comments on someone else's post, then the post owner should just have the right to remove anything they want to remove, period. But if I'm looking at posts, like sort of top level posts or things in the general forum, the criteria for removal is probably going to be a product level criteria. Now one thing that we have learned is that bad actors in social media are really, really good at figuring out the exact line of what they can get away with and skirting really, really close to it. So they will always figure out some way to be just working around the rules so that each individual post never quite gets removed. A really important rule when you're doing abuse detection is if you don't remove something nonetheless log that this thing was close to the edge. Because one pattern you will notice is that hey, none of this user stuff got removed, but wow, do they have a lot of stuff close to the edge. And that is one of your biggest red flags for an account. That account you kick off the system. So it's this kind of thinking. And so now, okay, let's go from the specific back to the general. How do you approach this? What you were doing is you had a threat scenario. In fact, you have quite a few threat scenarios tied to each other. You have various intervention points. You have intervention at the point that someone is making a comment. When they're seeing a comment, who do you introduce to to each other? What do you let them see? You have the whole response pattern and so on. You keep adjusting this thing to try to reduce the level of threat until overall, you look at your overall, here's my plan for the thing. And you decide, okay, this plan is reasonable. I think overall this thing is safe to launch. And you're doing some really interesting trade offs here because, for example, if you have to have humans reviewing your abuse cues, which you absolutely have to have, because computers are not yet at the stage where they can do this automatically. In fact, humans are barely at the stage where they can do this automatically. There's a whole side conversation there that means you're okay. So you actually have to do this expensive thing. You need people monitoring this system continuously and maintaining it. And how many people you need, well, that increases your cost. And if you have a failure that requires human intervention happening an awful lot, you've got a big problem. Maybe you need to rearchitect your system to make that failure happen less often. That's actually how you go about making the engineering trade off of how much do I need to mitigate this particular threat? What you say is, here's the residual cost of actually managing all of the failure modes. After I go through this, is that cost reasonable? If it's not, better go back to better, keep tweaking. And if you look at this whole integrated story, I'm telling you, this is actually best understood as an alternative to the traditional risk management idea of likelihood and impact. If you started out in the world of risk management, you're used to taking each risk, each threat scenario. It's the same kind of thing here, assigning it a likelihood and a severity. And typically sort of the product of those two is how important the risk is. And you go from there. And that's actually a terrible way to approach this if you're an engineer, because that multiplication is really what it's designed for, is for insurers, it's designed for someone who needs to manage a large portfolio of risks and sort of manage the overall risk budget. It's great for that. If you're trying to manage specific risks, it is terrible because you're dealing with either things with very high risk. In fact, you can't even say likelihood. You have to talk about frequency. A friend and colleague of mine, Andy Scow, he put it really nicely when we were at Google, he said if something happens to one in a million people once a year, here at Google we call that six times a day. Which he was right. I mean, he had done the math on that one for a particular service. And so what this means is you can't even be talking about rare likelihoods in this case. You're talking about things that are happening continuously and there are continuous costs. Or alternatively, you've got failure modes that are incredibly rare and whose impact is really, really high and the product of a very small number. And a very large number is not a medium number. It is statistical noise. There is no way to plan for any of this. If you actually try to design your safety plan by doing likelihood and impact, you will just end up in a complete madhouse. This is what I call the when, not if, method. And don't ask if this thing is going to happen. It's going to happen for each one of these threats. What's your plan? That's the question I want to know. Going all the way back to your original question. How do you deal with ethics and artificial intelligence? I think that the real answer is you deal with it by looking at what are the threats in your system, what are the things that can go wrong, and having a plan for each of them and the nature of the correct plan, whether that's putting explicit rules in your system or having humans checking various things and so on, that's always very specific to the system you're building. You really need a solution that is designed and customized to your problem space. And you need to continually be observing, monitoring what's happening in production, updating your model of the threats, updating your plan for response so that you're actually dealing with the things that matter. Sorry, that is my entirely not short answer.
Wendy Zanoni
That's a great answer. I'm looking at your sign behind you. It looks like Smokey the Bear, but it's not Smokey the Bear. It is a.
Jonathan Sanger
It is Roki the Raccoon. Roki the AI Safety Raccoon. Only you can prevent AI apocalypses. This is the logo of our AI Red Team, which I absolutely Love, I love that.
Wendy Zanoni
And that kind of ties into my next question. It's like, you know, fight fire with fire. You see all these products, every product you're using, it's like, hey, you know, tap into AI. AI can help you. We can help you write this, we can help you do this. But on the back end, are we using we as in humanity using the AI to help secure AI? Is that like fight fire with fire kind of thing or protect with protection of the AI?
Jonathan Sanger
We are, but we have just barely begun to scratch the surface of possibility here. There are many different ways in which we do this. Let me start with one of the simplest. When we actually think about how do we secure AI systems? And there's a whole. We could spend an entire hour talking about just how you actually practically secure them. One of the key mechanisms, one of the most powerful mechanisms is something called metacognition. And so to actually understand this one, let's go back to what we said earlier on that generative AI is good at role playing and it's good at summarizing things. A single pass through a generative AI system, that's basically what it does. It bold plays a character. There are no guarantees at this stage that it will be correct, that it will be achieving what you're trying to do. Anything. It's really, this thing is dreaming and that's okay. What people refer to as the hallucination problem, which is actually a much more complex problem. I think that's a very important name for it. What this really is, if you take a single pass through the system and you're expecting the output to be grounded in some factual basis. Yeah, no, that's not going to happen. What do you take? One of the really powerful things you can do is you can ask AI to roleplay an editor of various sorts. So let's say that you're trying to do. Let's give a concrete example. I'm trying to build a chatbot that is going to be a customer facing chatbot that answers questions about my products. And I have some kind of large website full of documentation about my system. But because people are bad at information architecture, it's really hard to actually find the answers you need in this website. Especially if you don't know the exact question you have to app. This is a great job for generative AI. So what does the generative AI do? Going to get a question from the user and it's basically going to follow sort of a fixed kind of plan. The first plan is it needs to look at this question, first of all, figure out, is this even a question it knows how to deal with or answer, right? If I am asking, you know, if I'm asking Microsoft's customer service AI about how to, how to make good pancakes, it should tell you that it has no idea. That is really not a strategy job. Just bounce that off. Then it says, okay, well, in order to answer this question, what's the stuff I'm going to look up? It needs to come up with some search queries. So here it's role playing a customer service agent, right? Who's the subject matter expert? And saying, what are the right search queries to do to find the answer to this Comes up with a list. Now we're actually going to execute searches. This is not an AI step. This is the point where you just run searches and you tell it to look at the results and maybe you have some stage where you're sort of judging which results you want to grab. You want to summarize each one of those results. Again, one of the things that AI is good at, then you're going to pull all of those things together and you make an answer. And once you've got this nice answer, what you can also do now is a metacognitive step. A step where you tell it, look over this as an editor. Make sure that every statement in the output is actually factually grounded in one of the source pages and attach a footnote to every single statement. Attach a footnote with a link, and if you can't footnote a statement correctly, take that statement out that editing pass. That's actually how you eliminate fabrications from all of this. Now, there's all sorts of other ways you can do this, and so I could talk about this at tremendous length, but this concept of metacognition is really powerful. And part of the reason it's so powerful is because of role playing. Because this is just one of these magical things about generative AI. Generative AI was trained on human data and it has cultural assumptions baked into it. So let's say I tell it, you are a compliance officer, or you are, I mean, for definitely compliance. It's like a very beige sort of use case. You are a responsible adult who's really cared about the safety of their community. You tell it a story like that, then you tell it. Look over the following thing and tell me, is this going to be a problem? What's really amazing is I tell it like this short. I give like one or two sentences describing the character that's playing and all of these assumptions that come with that character description, we've actually kind of encoded into it, right? If you tell it it's playing a rabbi or a compliance officer, something like that, if you're telling it to place this character, all sorts of assumptions intrinsically come in. Because it was trained on that it knows what these characters are. And so you can sort of adjust that, train that, tweak that so that you don't have to specify 5,000 rules, you don't have to explicitly specify its ethical code. Rather, you give it a character that you describe well enough that it has an ethical code, and then you tell it to apply that to the outputs and analyze that way. And this is a technique that actually is proving very effective.
Nick Fillingham
The characters that it's playing are going to play by the rules that you assume. So, for example, there you say you're a compliance officer. How do you know that the compliance officer that it is going to play is not some fictional villain version that it got from a TV show?
Jonathan Sanger
Thank you for asking that question. That was the exactly right question. And so this goes to one of the most important things we've learned. It turns out it's really easy to build generative AI software and is really hard to test generative AI software. So very easy to write a system that works great in the one or two cases that happen to pop into my mind. And when you give them real outputs, it turns out they do not do what you expected at all. And so the answer is, how do you make sure it does? It is testing, though. In fact, I think one of the most important things you can really be doing is you should be creating for each of these systems, first of all, a bunch of test cases of just its ordinary function. Give it a bunch of inputs that look like real inputs. It's going to get, and by the way, get other people to help you write those Inputs and get AIs to help you brainstorm further ways in which the input could look. Because I can promise you that real users are infinitely weirder than anything you can come up with. And you manually figure out what do you expect to happen in each of these cases. You run this thing through the output, make sure that the outputs look right. Not only that, you can even use another AI and give it a rubric to judge and to sort of do a first pass classification of does this look more or less like what I expected? And then you can actually check the outliers by hand. Building a set of. And then, of course, you could also build a set of Test cases for all of these possible harms, right? What happens if someone puts in this following malicious input? Does it catch it correctly? And this is in fact exactly how we do testing, right? The testing frameworks that we build are all based on exactly this principle. We have a bunch of test cases, we have a rubric that is run by an AI, by yet another AI. And then what you do is you feed the test case into the AI or into the system you're testing, you look at its output, you have another AI following this rubric, looking at the outputs and juvia, and then you have a human look at the overall outputs of all of that and actually sanity check. Because tuning that rubric is just as hard as tuning the original. So you sort of have to keep planning and you have to keep refining the rubric. And what's funny is, again, this is very similar to a pre AI problem. In fact, let me give you a social media example, because that's exactly what this is for. It turns out writing these policies for things like harassment and hate speech and so on is tremendously difficult. Articulating what constitutes hate speech, like, good luck with this. This is a massively difficult problem. And in particular, what you have to do is you have to write a policy that's going to be run by human analysts. At the end of the day, you've got literal people sitting in front of terminals reviewing items to see do they match policy or do they not. You can measure the correctness of this policy in a lot of ways by looking at things like inter rater agreement, send a random subset of all the items to multiple people. Do they get the same answer reliably or not? The answer, by the way, is they don't. Most of the time, it's very hard to write a policy that will cause them to reliably agree. When you're writing these policies, one of the other things you can discover is that what you wrote isn't what you intended. So I'll give one of my favorite examples. This is one that got into the press, which is why you can easily talk about this. This one happened at Facebook, and they had a rule where encouraging violence and demeaning content, et cetera, et cetera, against people based on protected categories was not permission. Right. So you weren't allowed to call freedom of violence based on race or based on gender or something like that. Great. And now what happens if someone combines two attributes? Well, the answer was if you had a statement that's calling for violence based on a combination of attributes where all of the attributes are protected, then that is also forbidden. Okay, I just said something incorrect. What did I say, Mom?
Wendy Zanoni
Oh, man.
Jonathan Sanger
I said it quickly and you probably didn't catch it.
Nick Fillingham
I didn't catch it. Oh my gosh.
Jonathan Sanger
And they didn't catch it either because I said all and I should have said any.
Nick Fillingham
Oh, right.
Jonathan Sanger
As a result, they wrote a policy where. And what's funny is their internal training material which leaked, which is how we know this whole story ended up following what was written in the policy. Men are trash. Canonical violating statements. Kill all the black children. Canonical non violating statement. Because black is race. That is a protected category. But children is not a protected category. And I said all, not any. And so therefore saying go kill all the black children was considered a classic non violating statement because basically a typo in the original rules.
Wendy Zanoni
Oh, man.
Nick Fillingham
Wow.
Jonathan Sanger
I can tell you like we had the same things happen at Google. Google. We had mistakes like this happen there, everywhere. This mistake can happen everywhere. It's really easy. I mean, none of you, it's really easy to miss this thing. How do you prevent a mistake like that? Unit tests. When you're writing a policy and you, and this is not a job for engineers to write, this is policy. When policy people are writing policies, have them write out a list of examples which should be violative and non violative and get them, work with them like you go back and forth and give them, okay, here's a harder example. Here's another example. It's hard in a different way and so on. And you build up this list and every time you change your policy, you update that test list. Same thing with AI. If you're trying to implement a policy like a metacognitive filter, if it's a rubric to evaluate the outputs of tests, something like that, give it a list of test cases, pro and con. And that way also, if you ever have to do something like update your model version or something like that, you can retest and make sure the system is still doing what you think it's doing. Because otherwise, yeah, it can go really, really badly.
Nick Fillingham
Jonathan, first of all, we need to get you back on the podcast for a part 2, 3, 4, 5, into infinity. We are coming up on time here and I wondered if this is a good segue for us to talk about the role of, just very briefly, security researchers. So the Blue Hat podcast, the Blue Hat conference. This is a part of the security researcher community you talked about. A lot of this was about sort of product engineering and safety engineering, which is obviously sort of on the internal side of the development of systems and products, what of that role, that unit testing or the flip side of unit testing can be taken and should be taken by the researcher community? Or how should they start to just sort of think a little bit differently about this space?
Jonathan Sanger
Well, I think there is so much opportunity for the research community to be involved in this. This is potentially real golden era. And one thing I'll point out is back in the early 2000 and tens, when we were creating what we today call privacy engineering, which is a slight misnomer as a discipline, basically the people who were working on exactly these safety problems for social media, when that was the big new problem, that discipline didn't really exist. And who were the people that we were hiring for it? Well, it was people who were good at thinking about how things will go wrong. SRES turned out to be really good at it. Lawyers, journalists, all sorts of people from all sorts of backgrounds, all sorts of walks of life. The common skill that really made people shine in the space was the ability to look at a system and think about what might go wrong. It's doing that very first initial step that's often the hardest thing for people to do. And security researchers are wonderfully suited for exactly this kind of thing. And the biggest difference between safety research and security research is you just zoom out and look at a bigger scope of problems. My rule always for my teams is they ask, well, is this kind of risk in scope? And the answer is, well, does it involve your system? Is it a risk? Congratulations, it's in scope. There's your tip. And I think with security research, we often get a little narrow and we say, oh, well, this is about an access control issue, so that's security, but that's just about a human misusing the system in a way we didn't expect. So that's a product problem. Stop saying all the problems are your problem. And now you ask, well, what can a security researcher do? What can a safety researcher do? And the answer is, it's the stuff that you have been doing all of this time. If you are internal, obviously you don't be part of this whole design process and so on. If you're external to a place, do safety. The same approaches that you take to doing security research, probe systems, look for issues, look for vulnerabilities, think about responsible disclosure. Same kind of approaches, all of the muscles you've built for your security work over the decades, those same muscles apply here perfectly well. Do the exact same kind of thing. And when you were dealing with the disclosure aspects, sometimes it's very, very similar, but you find a system that it turns out you can make it misbehave in a way that people didn't think you could. Treat that like a security vulnerability, disclose it responsibly, publish the results, et cetera, et cetera. Same thing you do. I think you're a little more likely to come across outright bad actors in the world where it's like you've discovered a problem in the system and they say, yes, we know that's intentional. That doesn't happen quite as often in the security world. I think one of the things that you'll encounter more and more in the safety world is really a misalignment of incentives kind of problems, where often you'll have a maker of a system and the user of a system and people affected by a system, and you have all of these different parties, and sometimes you'll have pairs of them whose incentives do not align. And those misaligned incentive moments, those are the places where the biggest problems often show up. Sometimes, by the way, you'll have multiple groups of users whose incentives don't align with each other, right? Most social media problems are not because the company running the social network is evil. It's because one set of users is a problem for another set of users, which is not, by the way, even saying that one set of users is bad. Actors culture clash is a great engine for that kind of thing too. So search for these places where something can go wrong, probe those things, do that research. And firstly, no less important than all that is find ways to fix problems, right? Come up with techniques for mitigation. We are in such an open greenfield space in the world of generative AI. You can go out, discover a new problem and figure out a way to mitigate this problem, to make a whole lot of problems go away. I mean, this is like security research decades ago. This is like these very, very early days where everything you're doing is really not. So security researchers, please get involved, work, actively probe this and just broaden the scope of what you think about from traditional security to safety in the broadest possible sense of the word.
Wendy Zanoni
Just one comment is just that how important I think the security of AI is, because I know that I speak with people that take everything that comes out of AI literal. If ChatGPT says it, then it is.
Jonathan Sanger
You know, I remember back in the 90s, people were very worried, like in the late 90s of like, oh my God, if something wrong shows up in a search result. Google said it it must be true, right? Perhaps the same thing. I can promise you that the output of an AI is no more guaranteed to be true than the output of a search engine or for that matter, the output of a human. And we again, there's a whole hour of conversation we can have about problems of like overreliance, fabrication, the different specific things that can be going wrong. We can talk for hours and hours about all of this and we will.
Nick Fillingham
And I hope we do. Jonathan, thank you so much. We're definitely going to have you back on another episode of the Blue Hat podcast. Just before we left, you go, is there one go? Do you would like our audience to do. Do you want them to read something? Do you want them to go watch something? What should everyone do to take the next step in securing AI?
Jonathan Sanger
You know, I wish that we have already published some book for the public about how to do all of this stuff that I could tell you, go read this thing. But if I were to give people one go do, do it's go back to these projects, these products that you work with every day and do that threat modeling exercise. Do that exercise on every sort of thing you encounter. Think about ways things can go wrong. Get yourself into that mindset, Practice thinking about how things might fail. And with that, I think you will be in a spectacularly better place to really address the real problems that face us in the world.
Nick Fillingham
That's a wonderful end. Jonathan, thank you so much for being on the Blue Hat podcast. I look forward to our next episode. This has been fantastic. Thanks for your time.
Wendy Zanoni
Thank you.
Jonathan Sanger
It is a real pleasure.
Wendy Zanoni
Thank you for joining us for the Blue Hat Podcast.
Nick Fillingham
If you have feedback, topic requests or.
Wendy Zanoni
Questions about this episode, please email us@bluehaticrosoft.com or message us on Twitter SFTBlueHat.
Nick Fillingham
Be sure to subscribe for more conversations and insights from security researchers and responders.
Wendy Zanoni
Across the industry by visiting bluehatpodcast.com or wherever you get your favorite podcasts. This week on NAEP Microsoft Threat Intelligence Podcast, join me with Dr. Jeff Tully and Dr. Christian Demeth to talk about ransomware hitting healthcare. Be sure to listen in and follow us@msthreatintelpodcast.com or wherever you get your favorite podcasts.
CyberWire Daily Podcast Summary
Episode Title: Navigating AI Safety and Security Challenges with Jonathan Sanger [The BlueHat Podcast]
Host/Author: N2K Networks
Release Date: December 30, 2024
In this insightful episode of the CyberWire Daily podcast, hosts Nick Fillingham and Wendy Zanoni engage in a comprehensive discussion with Jonathan Sanger, the Corporate Vice President (CVP) of AI Safety and Security at Microsoft and Deputy Chief Information Security Officer (CISO) for AI. The conversation delves deep into the multifaceted challenges and strategies associated with ensuring the safety and security of Artificial Intelligence (AI) systems.
Jonathan Sanger introduces himself as the CVP of AI Safety and Security at Microsoft, elucidating his role as one focused on anticipating potential failures in AI systems and devising preventive measures. With a background transitioning from theoretical physics to computer science, Sanger highlights his extensive experience in building large-scale infrastructures at Google and leading social technologies. His passion for addressing safety, security, privacy, and abuse challenges in technology underscores his commitment to creating safer AI systems.
Sanger provides a clear distinction between generative AI and predictive AI:
Predictive AI: Traditionally focused on tasks like classification, recommendation, and analysis by training models on specific datasets. These models are typically built and used by the same entity, raising concerns about bias, appropriate training data selection, and nuanced failure modes.
Generative AI: Represents a paradigm shift where a single, generic model is utilized by diverse users for a variety of applications. It excels in two primary functions:
Summarization and Analysis: Ability to digest and distill complex human-generated content, such as text and images, into concise summaries or key insights.
Role-Playing: Facilitates creative interactions by embodying specific personas (e.g., customer service agents, programmers, security experts) to perform tasks like answering questions or analyzing data.
Notable Quote:
Jonathan Sanger at [02:14] explains, "Generative AI is really at the innermost loop, it is a combination analysis and roleplaying action which you can then build up to build all sorts of cool things out of."
Nick Fillingham prompts Sanger to elaborate on AI beyond generative models. Sanger employs an analogy comparing AI to the human vision system:
Predictive AI: Analogous to the lower levels of human vision, processing basic visual inputs like color and shapes to recognize objects.
Generative AI: Mirrors higher-level cognitive functions, such as recognizing and narrating complex scenarios based on the processed information.
He emphasizes that generative and predictive AIs are complementary, not substitutive. Predictive AI handles extensive data analysis, while generative AI focuses on higher-level abstraction and interaction.
Notable Quote:
At [05:46], Sanger states, "These are not replacements at all. What happens is that this predictive AI… tends to be very specific to the problem being solved… whereas generative AI is really good for dealing with the higher level abstractions."
Sanger discusses the dual-edged nature of generative AI applications:
Positive Uses:
Energy Efficiency: Implementing dynamic temperature control in factories and data centers by analyzing sensor data to optimize cooling systems.
Self-Driving Cars: When designed effectively, autonomous vehicles can significantly reduce accidents and save lives.
Personalized Education: Tailoring educational content to individual learners, enabling interactive and accessible teaching methods.
Social Connectivity: Facilitating the formation and maintenance of personal and professional relationships through algorithm-driven recommendations.
Negative Uses:
Weaponization: Misuse of AI to develop harmful technologies or encourage violent actions.
Bias and Discrimination: As illustrated later in the episode, flawed AI systems can perpetuate societal biases, leading to discriminatory practices.
Sanger underscores that the impact of generative AI hinges on its application, highlighting the urgent need for robust safeguards to mitigate misuse.
Notable Quote:
At [08:55], Sanger remarks, "There's tremendous possibility for some really wonderful things to happen here. Another suggestion that I've heard a lot talked about is personalized education as a service."
The conversation shifts to the ethical considerations in AI system design. Sanger advocates for integrating ethics from the outset of engineering processes, distinguishing between product engineering (designing system functionality) and safety engineering (anticipating and preventing failures). He criticizes the siloed approach in modern computer science, where safety is often treated as an afterthought rather than an integral component.
Notable Quote:
Jonathan Sanger at [18:35] emphasizes, "The correct answer to that question is very much dependent on the exact system that you're building... Product engineering is the study of how your systems will work, and safety engineering is the study of how your systems will fail."
Sanger elaborates on the systematic approach to safety engineering, outlining a three-pass method to identify and mitigate potential threats:
System Pass: Analyze each component for possible failures, including errors, malformed inputs, and malicious interactions.
Attacker Pass: Evaluate how adversaries might misuse the system, aiming to achieve unintended objectives.
Target Pass: Consider the broader impact on different user groups, identifying specific vulnerabilities based on diverse experiences.
He highlights the necessity of diversity within teams to effectively foresee a wide range of potential issues, as varied perspectives enhance the ability to anticipate and address multifaceted threats.
Notable Quote:
At [23:26], Sanger shares, "Having a broad team, a team with a really wide range of lived experiences... is critical to actually being able to do this analysis correctly."
Nick Fillingham steers the conversation toward the involvement of the security research community in AI safety. Sanger encourages security researchers to broaden their scope beyond traditional security concerns to encompass a wider array of safety issues inherent in AI systems. He draws parallels between privacy engineering and AI safety, suggesting that the analytical skills developed in security research are invaluable for identifying and mitigating AI-related risks.
Notable Quote:
Jonathan Sanger at [45:54] urges, "Security researchers, please get involved, work, actively probe this and just broaden the scope of what you think about from traditional security to safety in the broadest possible sense of the word."
Addressing the reliability of AI systems, Sanger discusses the importance of rigorous testing and policy formulation. He cites a critical error in Facebook's content moderation policy as an example of how ambiguities in policy wording can lead to unintended and harmful outcomes. This incident, where malicious statements like "Men are trash" or "Kill all the black children" were incorrectly classified due to a policy typo, underscores the necessity for meticulous policy design and continuous testing.
Notable Quote:
At [43:24], Sanger recounts, "They wrote a policy where 'Men are trash.' Canonical violating statements. 'Kill all the black children.' Canonical non-violating statement... Because black is race. Children is not a protected category."
He advocates for the use of unit tests in policy development, creating comprehensive test cases that cover both expected and edge-case scenarios to ensure policies function as intended.
Sanger introduces the concept of metacognition as a pivotal strategy in securing AI systems. Metacognition involves having the AI system evaluate and verify its own outputs against factual data sources. For instance, when generating responses for a customer service chatbot, the AI can role-play an editor to ensure that every statement is grounded in verified information, thereby reducing the incidence of fabrications or "hallucinations."
Notable Quote:
At [34:14], Sanger explains, "This concept of metacognition is really powerful... you can adjust that, train that, tweak that so that you don't have to specify 5,000 rules, you don't have to explicitly specify its ethical code."
He emphasizes the effectiveness of role-playing different personas to imbue AI systems with inherent ethical frameworks, thus enhancing their ability to produce reliable and trustworthy outputs.
As the discussion draws to a close, Jonathan Sanger urges listeners to adopt a proactive mindset towards AI safety:
Notable Quote:
At [50:54], Sanger advises, "Go back to these projects, these products that you work with every day and do that threat modeling exercise. Think about ways things can go wrong. Get yourself into that mindset, Practice thinking about how things might fail."
He underscores the importance of ongoing vigilance and iterative improvement in safeguarding AI systems against evolving threats.
This episode offers a profound exploration of the intricate balance between harnessing the transformative potential of generative AI and mitigating its associated risks. Jonathan Sanger’s expertise provides listeners with a nuanced understanding of AI safety and security, emphasizing the necessity of integrated safety engineering, rigorous testing, and the active involvement of the security research community. As AI continues to permeate various facets of society, the insights shared in this discussion are invaluable for professionals and enthusiasts alike seeking to navigate the complex landscape of AI ethics and safety.
Listeners are encouraged to subscribe to the CyberWire Daily podcast for more in-depth conversations and expert insights into the evolving world of cybersecurity and AI.