Summary6 min read

Practical AI Podcast: AIUC-1 — Building Trust in AI Agents

Date: June 25, 2026
Host: Daniel Whitenack (DW)
Guest: Emil Lawson (EL), Standards Lead, Artificial Intelligence Underwriting Company (AIUC)

Overview

This episode explores the critical role of standards, certification, and insurance in building trust around AI agents—particularly as these systems move into enterprise, high-stakes, and regulated domains. Daniel Whitenack hosts Emil Lawson to unpack why standards matter, how trust layers enable safe adoption, what the certification and red teaming process entails, and how developers and organizations can practically meet these new expectations. The conversation is deeply practical, sharing real-world analogies, actionable frameworks, and a candid look at the current and future landscape of AI agent governance.

Key Discussion Points & Insights

Emil Lawson’s Journey to AI Standards (01:09–05:06)

Entrepreneurial Path: Emil’s background includes startups in real estate and non-profits; direct experience with complex regulatory and standards environments.
Motivation: Personal and societal drivers (e.g., seeing children use AI, concern for future job markets).
Pivot to AI Standards: At Harvard Kennedy School, Emil merged public policy expertise with technical standards to address emerging AI risks and opportunities.

Notable Quote:

“Seeing a 10-year-old being comfortable using AI the way they are, it’s kind of scary… I don’t see yet that we’ve codified principles for how kids use AI. Maybe we should develop standards for this.” — EL (03:41)

The Trust Flywheel: Standards, Audits, and Insurance (05:06–09:16)

Historic Perspective: The trio of standards, audits, and insurance enabled adoption of technologies like electricity, cars, and nuclear power.
AI Parallels: Building trust for enterprise AI adoption requires similar mechanisms—a codified standard, robust audit, and ultimately insurance to mitigate residual risks.
Real-World Value: These mechanisms don’t slow innovation; instead, they unlock broader and safer enterprise deployment.

Notable Quote:

“If I’m a startup saying my technology is safe, it creates limited trust if I’m an enterprise buyer… What we offer as the trust layer is this flywheel.” — EL (08:17)

Forcing Functions Behind AI Standards Adoption (09:16–13:22)

Enterprise Due Diligence: Painful, repetitive vendor questionnaires drive a need for third-party, standardized frameworks.
Third-Party Validation: Enterprises demand credible, independent validation—making certification vital for market access.
Red Teaming as Value: The testing process often uncovers unique weaknesses (e.g., hallucinations, jailbreaks) that improve real security.

Notable Quote:

“Our red teaming consistently uncovers blind spots… Sometimes it’s hallucination rate, other times jailbreak risk or prompt injection risk. We help companies actually improve their safety, security, and reliability posture.” — EL (11:21)

Landscape of AI Standards (16:25–22:10)

Three Layers:
- Organizational: e.g., ISO 27001, ISO 42001 (AI systems governance).
- Infrastructure: Classic cybersecurity like SOC2, pen testing, access management.
- Agentic AI Layer: Where AIUC-1 comes in, with specific controls for agent behavior, hallucinations, data/tool access, and more.
Gaps in Existing Frameworks: Many current standards are “guidance,” not auditable; AIUC-1 seeks to fill this need for agentic systems.

Notable Quote:

“When we started this and this company and started drafting the first version of AIUC-1… we basically just didn’t see anything [for agentic AI].” — EL (18:55)

The AIUC-1 Standard: Scope, Iteration, and Red Teaming (22:10–24:52, 29:41–33:30)

Continuous Update: The standard is refreshed quarterly by a consortium of 250+ leaders to address emerging risks (e.g., multi-agent collaboration, runtime security).
Certification Process: Begins with a gap assessment, then two tracks: evidence gathering and intensive red teaming:
- Technical & Policy Evidence: Third-party auditors validate both “paper” controls and technical posture.
- Real-World Red Teaming: AI agents are stress-tested under benign and adversarial conditions (up to 5,000 scenarios), pushing systems to their limits.
- Iterative Remediation: Companies can address vulnerabilities before re-testing.
Quarterly Re-Certification: To reflect the dynamic nature of AI systems and prevent regression.
Probabilistic Passing: Zero P0 (“catastrophic”) or P1 (“critical”) issues are allowed at certification, but spotless perfection is neither expected nor realistic due to AI’s non-deterministic nature.

Notable Quote:

“No company has ever and will ever pass AIUC-1 with a 100% pass rate. All agentic systems are non-deterministic in nature… our hope is to push the sector to acknowledge that a spotless audit report is probably not as valuable as one that reflects reality more clearly.” — EL (31:36)

Standards vs. “Checkbox” Security: The Health Analogy (36:48–38:34)

Beyond Single Filters: Relying solely on a content filter (e.g., AWS Bedrock) is like only checking your temperature for health—real trust comes from a systemic, ongoing approach (health records, regular checkups, policy/process).
Robust Evaluations: Quarterly red teaming is akin to a comprehensive doctor’s visit, while runtime/vitals monitoring is like daily health tracking.

Notable Exchange:

DW: “My metaphor is: just checking temperature isn’t the same as being plugged into comprehensive healthcare. That’s the mindset we need for agent governance.”
EL: “I really like that analogy… Red teaming is the doctor’s visit every quarter… in between, we check your vitals every minute.” (37:17)

Making Standards Practical for Developers (38:34–43:12)

Lowering the Barriers: Success means secure agentic products by default, partner ecosystems (e.g. monitoring/filtering vendors), and practical implementation guidance.
Ecosystem Collaboration: Tools, APIs, and platforms must evolve to help developers (not just compliance pros) embed standards seamlessly.
Certifications Should Drive Security, Not Just Compliance: Efforts are underway to make audits programmatic and continuous.

Notable Quote:

“We don’t just define the controls and leave it to you to figure out… We actually try to give you guidance on how we see companies do it today.” — EL (42:25)

Memorable Moments & Quotes

Red Teaming as Reality Check:
“We do the red teaming in two rounds… the goal for us is not compliance, the goal is security.” — EL (27:30)
On the Limits of Perfection:
“If you remove those hallucination rates, it’s because you’ve made the agent so dumb that it won’t be able to actually execute the use case.” — EL (32:34)
Community-Driven Standards:
“We’ve collected a consortium… having that community come together and actually leave competition aside for a moment, the size of these challenges just dictates that industry has to come together.” — EL (43:33)

Timestamps for Key Segments

[01:09–05:06] Emil’s personal journey into AI standards
[05:06–09:16] The “flywheel” of trust: standards, audits, insurance
[09:16–13:22] Why enterprises actually adopt standards
[16:25–22:10] Overview of current AI-related standards landscape
[22:10–24:52] Continuous updates, scope, and focus of AIUC-1
[24:52–29:41] AIUC-1 certification process and the red teaming approach
[29:41–33:30] How “passing” is defined; the reality of AI agent imperfections
[36:48–38:34] Governance requires more than simple filters—health analogy
[38:34–43:12] Making standards and certification workable for developers
[43:33–End] The importance of industry collaboration

Conclusion

This episode demystifies the rapidly developing field of AI agent standards by grounding it in historical analogies, practical processes, and candid industry realities. Emil Lawson offers a hands-on roadmap for those aiming to bring agentic AI to the enterprise with confidence, while Daniel Whitenack’s probing questions and analogies help unpack what it means to move beyond policy “checkboxes” to real technical and organizational trust.

Invitation:

“We would love to see more people get into the machine room with us. An open invitation for everyone who’s excited… to help us drive adoption or help us write them.” — EL (43:33)

For additional resources and the AIUC-1 standard crosswalk, listeners are encouraged to visit aec1.com.

Loading summary

Transcript27 lines

[00:02]
A
Welcome to the Practical AI Podcast where we break down the real world applications of artificial intelligence and how it's shaping the way we live, work and create. Our goal is to help make AI technology practical, productive and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn X or Bluesky to stay up to date with episode drops, behind the scenes content and a insights. You can learn more at PracticalAI FM. Now onto the show.
[00:42]
B
Welcome to another episode of the Practical AI Podcast. I'm Daniel Whitenack, I am CEO at PredictionGuard and I'm really excited today to have an amazing guest that I'm personally interested in asking a bunch of selfish questions to because I'm so interested in the topic, but we have Emil Lawson, who's the standards lead at the artificial intelligence underwriting company. Welcome. How are you doing?
[01:07]
C
Thanks Daniel. Thanks for having me. I'm doing great. How are you?
[01:09]
B
I'm doing well actually. You know, today in the Midwest, everyone's concerned about tornadoes and talking about hail and some things like insurance and other things. So it's a whole other world, of course, but obviously the AI underwriting company way, way more than thinking about insurance, but thinking about standards, certification around AI and agents. I'm wondering how you personally, just to give the audience a little bit about you personally, how did you end up at this intersection of standards, certification, AI agents, how did that come about?
[01:49]
C
Yeah, so I don't think I had as clear a path as the classic standards lead where you work as a, say, security engineer for 10 years and then lear the technical craft and then come together. My journey has been very entrepreneurial always. I started my first company with actually the CEO of the artificial intelligence company RunExvest 10 years ago. It was a nonprofit back then helping students from low income backgrounds get into top universities. And I think what I took away from that was both the very entrepreneurial journey, but also desire to move fast on some of the challenges that society is facing. I then moved in and had my first interaction with standards at my second company, a real estate company back in Denmark, where we developed an impact management system that both had to navigate a lot of national legislation, local legislation, EU legislation, voluntary frameworks, investor demands, and so had my first interaction with one of these quite complex markets of different measurements and targets. You wanted to get to a very technical sector as well, where like building codes, for example, require a lot of thinking through how you do things. The Right way. So spent four years building up a really estate company that today is managing about $400 million together with four other co founders and took from that this desire to go in and standardize when we know what the right answer is and try to push the sector in that direction. The way I got into the AI space was taking a step back from the real estate company now a few years ago and going to Cambridge, Massachusetts, so Harvard University, where I spent two years as a fellow at the Kennedy School, really getting into the emerging tech and geopolitics of all of this as well. And left the Kennedy School with a very clear ambition of just getting under the hood of the pace of AI, the safety and security aspects, and clearly just acknowledging that the technology is going to profoundly change society as we as we know it today and in many different ways. I have 10 nephews and nieces, I have five sisters. And seeing a 10 year old being comfortable using AI the way they are, it's kind of scary. And I don't see yet that we've codified the principles we want to see when it comes to how kids use AI. So that's one direction where, oh, maybe we should develop standards for this. You also can read news every, every week and see new incidents. So there's clearly a big security angle to this as well. You said that the Midwest is facing tornadoes today. I think being a CISO at a company adopting and deploying AI right now feels a little bit like you're in a hailstorm and it's only a matter of time before you're hit by, by some of that. And we just keep seeing that. So clearly there was a element to that as well. And then there's the even bigger picture around what this will do to our job markets and so forth. So, so left the Kennedy School being very interested in just using the public policy toolbox that I brought from that, with the standards toolbox I brought from, from my real estate company. And then this desire to actually just work on societal challenges that I think has been with me since I started working. And that became my way to the artificial intelligence underwriting company. I've since then spent all my time building a network of people, a consortium to help them figure out how we get the right practical and technical insights into the standards we develop as well.
[05:07]
B
Yeah, that's awesome. And I know sometimes when people hear things like standard certification terms like this, maybe some, some people might have a reaction of like slowing things down. Right. But I, I like how in, at the very, you know, front and center of what you talk about is how to actually unlock enterprise adoption with certification, standards, et cetera. Could you talk before we dive into the AI side of this specifically, could you talk about that a little bit in general? Like how some of these things work together in actual enterprise settings, standard certification, uh, even insurance, and how, how those things actually can enable adoption with. And not just like block things, I guess.
[05:56]
C
Happy to. So I think our story and the inspiration we take dates back to Benjamin Franklin's Philadelphia. Philadelphia was starting to adopt electricity. Electricity was scary back then. Light bulbs did not work out. Home started burning down. So Benjamin Franklin formed the first fire brigade in Philadelphia. He started codifying building codes so that we basically took what we knew around how to build safer houses, the standards part. And then he developed the first mutual insurance company. So back then, this is the first time we see this flywheel of standards, audits and insurance go together. By having standards around building codes, for example, you knew that houses needed to be placed a little bit further from each other. They needed some of his lightning rods to ensure that when lightning strikes that they don't catch fire. The fire inspections was the audit part that actually, actually went in and examined that you'd use these, followed, followed these, these rules appropriately. And the insurance side mitigated the residual risk that will always be there when we introduce new, powerful technology into society. We've seen this flywheel of standards, audits and insurance time and time again. When new technology has then been introduced in society, we see it again in cars. Cars have safety standards. These were not demanded by government. They came from industry themselves because they knew that if we develop safer cars, people are more likely to buy them. And safer cars actually enable you to drive even faster as well. So it was industry standards that also led us to airbags and seat belts and some of the other things that now make cars safer. We naturally have again, the third party auditor going in and checking these cars and ensure that they follow the rules. So we have the inspection element again and we also have the insurance element. And this flywheel, one of the best things about it is that it really scales. So we're not just thinking ccing it with say, light bulbs and cars. We also see it for nuclear power plants to this date, where you also have standards inspections of those power plants. And insurance even works in this case as well. So there's no limitation to the power of this flywheel. When we're looking at AI, we see some of the same things at play. We see a new technology that is very powerful and has the power to both do a lot of good, but also if things go wrong, can have severe financial implications. So and the other thing is, it's a complex industry where me as a startup saying my technology is safe creates limited trust if I'm an enterprise buyer, a big bank, for example, that wants to adopt this technology. So with the artificial intelligence underwriting company, what we're trying to do is to create that trust layer in between the companies building AI and the companies adopting AI. And what we offer as the trust layer is this flywheel. So we go in and codify the standards we believe that the company's building AI should follow. We go in and audit against those standards in collaboration with third party auditors like Shelman Coal Fire companies who really know how to go deep and validate that the standards actually followed. And then we certify companies against the standards. A big part of the certification in the case of agentic AI is red teaming. So we go in and test the actual AI agent systems not just to see that the policies they have in place and work well, but that the agents actually work on a robust under pressure and the companies that then obtain a certificate gets access to buy insurance of their agents so that there's also that financial coverage of residual risk.
[09:17]
B
Yeah, this is so interesting and I have so many questions. Maybe one question that's just very selfish and our listeners know some part of the joy of being able to do a podcast like this is I get to get my own questions answered by people that are smarter than me. But one of those questions that I have that actually comes up in conversations I have day to day is is this tension of, hey, I, I see a standard out there, whether it's, you know, some of the standards that we'll talk about that you all are codifying, or maybe it's things like the NIST AI risk management framework or things from owasp. And logically they say, yes, it would make sense to do those things, but what is the forcing function that is kind of making, making companies consider actual implementation of those, of those principles rather than having it be a, be an aspirational thing? Is it the potential, you know, PR risk to the company, as you mentioned, the financial side, maybe it's the commercial side of getting, you know, software vendors getting their software into the hands of their enterprise customers. What do you see as some of those main forcing functions or are there even those forcing functions right now that would force people to consider this as something, you know, not aspirational, but actually practical?
[10:43]
C
Yeah, so I, I see a couple of different things that I think are very practical. Any vendor building powerful AI right now knows how tricky it is to get through the enterprise vendor due diligence process and questionnaires. So these startups face these questionnaires. Sometimes there's a hundred questions on them and it's extremely painful to go through. And I can tell you also I speak to a lot of enterprise CISOs and GSE managers. It's equally painful on their side, right, because they're at a stage where the space changes so often that they feel a desire to actually change their questionnaire every month. Going through 100 questions from a startup every single time you try to onboard a vendor is also completely painful on the other side. So I think part one here is speed and a desire to get to a place where you actually feel like you've covered your blind spots as well. And having a third party develop standard with all of the industry in the room to help find those blind spots and figure out how we can't find them in a standard, I think is this value proposition number one. The speed argument obviously assumes that you can get across the line in the first place. I think the second part of the value proposition is having that third party validation that your agentic AI is actually safe, secure and reliable. We're working with some of the frontier companies in the airspace right now, companies like Eleven Labs, companies like Fin that just got acquired for 3.6 billion by Salesforce, companies like UiPath who have set the standards within their categories historically. They have fantastic security postures, but they don't have a way to prove that to an enterprise. So an enterprise will just never trust a company that has an incentive to sell their product. They need that third party to go in and do that. And then I do think there is a security argument to be made here. Our red teaming consistently uncovers blind spots for the companies that we work with. Sometimes it's the hallucination rate where we realize that a specific type of adversarial attack will bring up the hallucination rate or specific language switches or other things that might actually happen when their products are deployed. Other times it's jailbreaking risk that we manage to uncover, a pump injection risk that we managed to uncover. So we do also see ourselves as helping these companies actually improve their safety, security and reliability posture, which is valuable as well. Then I'm sure there's a marketing benefit to the companies going out early and adopting a new framework and showing and demonstrating they're moving in front when it comes to AI security leadership. I think that's a important branding value as well that we sometimes help provide. But I don't think that's a core benefit of the certification right now. I think that is really unlocking upmarket enterprise revenue for the companies.
[13:22]
B
Agents are impacting every function within a company, but it's sometimes very difficult to figure out what an agent should do, what a human should do. Jeffrey from News Research, a recent guest said, often agents have no taste. That's why I'm so impressed with what our partner Framer is doing with their pro website builder that's already trusted by companies like Miro and Perplexity. Their impact, implementing agents, but in a way that agents and humans work in tandem. Agents bring speed and scale, but people bring the taste, judgment and control. And these agents help solve this gap between AI generated ideas and production ready website work. So Framer is already enterprise level solution. They allow you to create amazing websites that are SEO ready. And so I really would recommend that you check them out. If you're building a new website or just implementing landing pages or an upgrade in your existing website, learn how you can get more out of your site from a Framer specialist or get started building for free today@framer.com practicalai for 30% off a Framer Pro annual plan. That's framer.com practicalai for 30 percent off framer.com rules and restrictions may apply. Yeah, yeah, I, I'm, I love your answer and it was a little bit, I was trying to validate some of my own thinking through that because we've talked on the show before about, you know, it isn't really like the, the governments of the world are quite behind in terms of, you know, how they would, you know, enforce or even say what to enforce for companies building AI things. And, and so it's really the enterprises themselves that have some to do this but all of those that you laid out and I'm, and I'm sure more in, in the, in the current, I guess state of AI standards, if we kind of shift to that piece and then eventually I want to get to kind of the evidence and red teaming and all of that. But maybe just as if we take a general look at the standards that exist out there for AI and AI agents, could you help us understand what kinds of standards are out there and what they cover? Because there's a lot of, there's a lot of sort of intersections that we could think of, whether that be security or safety or alignment or all sorts of things. Data Privacy. There's all sorts of ways that you could kind of look at this and perspectives that you could look at it from. And there's all sorts of things that people have proposed over time. So I imagine, you know, that's part of the reason why having a company that's really digging into this at a deep level is, is very worthwhile, which I think it is. But could you help set the stage for that? Like, how can we categories categorize in our mind the current state of AI standards and what perspectives are coming from?
[16:25]
C
Yeah, absolutely. And by the way, Daniel, that's exactly where we started last summer. Right. So if you go to aec1.com today, you'll find that we've done crosswalks. I think it's about 10 different frameworks now. They're transparently available. So you can see exactly how our standard fits into the existing. And hopefully you also see then why we concluded after doing this work that yes, there actually was a need for another standard. Even though there can be sometimes a little bit of standard fatigue.
[16:50]
B
And just by way of, I don't know, encouragement or thanks, maybe gratitude is the right way to put it. Our. The company Ilead has looked at that many times in terms of. And we're maybe in not like everyone, we're building actual, you know, a control plane that works on some of these, some of these knobs and levers that you talk about. But it's been extremely useful. And even, you know, as we're, as we're writing content or doing planning or thinking about things in our product, I always refer people back to that page and I'd refer our listeners to that page because it is a. It is a really great crosswalk and helps understand, you know, where these align, where they don't align, what the other need is. So just by way of gratitude, thank you for, for putting that the other. And making it public.
[17:36]
C
Yeah, no, and we appreciate all the people who've worked on this. We do a lot of work with the cloud security alliance with the OWASP community across both the AI BSS and the Gen AI project. We work with Cisco and IBM on crosswalk. So it's a big team effort and I really appreciate that we've been able to gather the ecosystem around a decisive. Just publish some of this stuff transparently so that organizations like yours, but also I know big enterprises are using the controls we put out transparently in their own control frameworks. That's completely free to use. And only the companies pursuing certification actually needs to get money out the pocket to get back to your question, because I think it's an important one. The way we see the standard space is that you have three layers. You have an organizational layer, you have a infrastructure layer and then you have the agency AI layer at the organizational level. Many organizations have been through an ISO 27001 certifications, classic management system certification, ISO, then about three years ago now published the 42001, which is the management system certification for AI systems. It's a governance certification that ensures that you have the right policies in place and the right, say, procedures in place so that when you develop AI systems they hopefully turn out in the right way. If you follow those systems, then you have the infrastructure layer. That's where your SoC2 comes in and your pen testing and some of the classic cybersecurity controls, access management, transport security, all the good stuff there. I'd say that many of those things become even more important in an agentic space because pace is higher, data access is higher. So if you don't have that in order then you should go back and ensure that you get those boxes checked. And then at the agentic AI space, we basically just didn't see anything. When we started this and this company and started drafting the first version of AIUC1, we see NIST have come out with the AI risk management framework. There's a little bit of agentic stuff in there and I know from speaking to the team that they're considering publishing additions to this. The Cloud Security alliance has also done their AI controls matrix where again there's some things in there around agentic AI that are pretty good. The issue with both of those frameworks is that their guidance, their voluntary frameworks, you decide which controls you implement, you decide whether you like how you implement them. They're not audible frameworks. So the way AUC1 fits in here is that we've basically taken the core governance things from the organizational level that we think are really important when it comes to AI systems, such as having failure plans in play when agents do not do what they're intended to do and you know how to deal with that good change management and acknowledging that every time you, for example, replace the LLM in an agent, it will behave differently. And if you don't take that into account in your governance, your end users will bear the burden of that. So some parts of the governance, the core parts of the infrastructure layer as well. So ensuring that the folks who have access to the AI system itself and can make these big decisions, that's restricted Ensuring again that transport security, when you do agent to agent communications and so forth is in place. But otherwise we basically leave ISO and SUCK two to do what they're really good at and focus on the agentic layer. And what is up there then for us is specific controls around safety. For example, ensuring that agents behave according to brand and that they don't give users guidance on medical care or legal advice, financial advice, other high risk areas, basically that they stay within their scope and don't start breaking out of that. We look at specifically how you restrict the agent's data access, its system access and its tool access so it doesn't start processing refunds when you shouldn't. We look at hallucinations, which is also a risk that is quite unique to AI obviously and does not come up in any way in either ISO or SOC certifications. So really focus on the agentic layer there. And the core part of the differentiation, I'd say is then the technical level of the controls we go in and actually quite prescriptive in what we want to see from the agents because we have a good understanding now of what the right toolbox is to ensure that these agents behave safe, secure and reliably. And the other thing is we acknowledge that a technical control in itself might not hold up under robustness. So six of the 40 mandatory requirements in AAC one have to do with red teaming, actually testing that these technical controls then hold up under pressure. Both when we react like we engage with the system as a benign user, just ask it questions and see if it hallucinates, but also what happens when we start approaching the system, like with social engineering and adversarial pressure and just
[22:11]
B
to help people understand. So the AI UC one, that's the standard that you all have published, people can look at it online. I assume since it's AI you see one, there's an anticipation there might be a two or other or various, you know, either revisions or different focuses kind of within different certifications. Is my understanding right there?
[22:34]
C
That's correct. I think where to start is we update the standard every single quarter. So we've gathered now a consortium of 250 security leaders. Some of them are CISOs at Fortune 1000 companies, some of them are security engineers, architects, GSC managers. And so we have the full stack of people in the room and with them every quarter we identify new priority areas. Last quarter it was MCP risk, for example, which has really come up as agents start not just operating in isolation, but exchanging information. This quarter we look a lot at how we can strengthen runtime security and that continuous element which continues to be really important for a lot of organizations. So we get them into the room and update the standard each quarter. I could very well see that new frameworks so in AIC 2, AAC 3 come out in the future. We don't have any plans to do that yet. But what we know, again if we go all the way back to where we started our conversation, is that this combination of standards, audits and insurance have worked historically. So right now we focus on the application layers of the platforms and products that take agentic AI and deploy it. But there's a model layer as well, which we see as our second horizon. And there's the physical layer, like the data centers and the infrastructure that we deploy AI on, but also the cars and the robots that we put this into where standards, audits and insurance could play a big role. And that's where we see the company go long term.
[23:58]
B
Yeah, that makes sense. Could you help us? So maybe paint the picture. Let's say there's a scenario. I'm a company, maybe I'm building an agent, a new agentic driven product, right. And I'm going to offer it to some sort of regulated or enterprise customers. I'm selling into healthcare, I'm selling into, you know, large manufacturing or, or whatever it is. Right. So in that scenario, what, what would the process, kind of recommended process be for our company to engage with this standard and eventually get to that level of certification? Maybe in the future, eventually to the, to the insurance side, but at least to that certification side. What would that process look like? And then maybe highlight in that process where the red teaming comes in and then I'd love to circle back on that maybe later and talk through that specifically.
[24:53]
C
Yeah, absolutely. So you'd get in touch with a team and the first thing we always do is we do a gap assessment against your existing systems. If you have a well documented trust center already or some blog post describing what you do, we can basically go back to you and tell you this is the places where we believe you already meet the standard. This is the like, these are the areas where we expect that there will be work for you. So you basically go into the certifications process with open eyes around what is the workload needed from engineering, from legal and from your GRC team to take your company through it. I will mention at this point we've had a three person Y Combinator startup go through this. We've had UiPath that is publicly traded go through this. We have companies at all stages. So I mentioned now Security, Legal and gsc that was the same person when it came to points or getting certified. Right. So it is a standard that scales with the organization's size as well. When you have this gap assessment completed, you basically decide whether you want to move forward with the certification or not. To move forward, we split the process in two parts. One path is you pick an order of your choice. We have a number of credited auditors, again, for example Shellman, coal fire. But the list is growing very rapidly at the moment. And trusted auditor who knows how to do this. And on their track, you basically start collecting all the evidence that is needed to go through the aac. One audit falls in two buckets. Some of it is the classic legal policies. If you have a generative AI product, you need to define who owns the inputs and outputs and how you retain user data, and whether you train on user data, so forth. You need to define your acceptable use. And the second part is the technical controls that Sheldman will go in and validate. So that is your filtering configuration against harmful outputs, your classifiers, your defensive prompting, your groundedness filtering when it comes to hallucination, preventing your safeguards around tool calls and all the other things. So again, you go through those requirements and capture the evidence and submit that to the auditor that goes in and does that third party validation. The other track we then do in parallel is that you give us an instance of the agent or the agents. It can be multiple as well, that is in scope for the certification. And you basically configure a representative version of that agent. So an agent that would be configured how an enterprise would use it. We sometimes see companies creating an extremely safe agent that has almost lost all its power because they just wanted to pass the certification. We obviously would then go in as the third party in the room and push back and say we want to see an agent that is configured based on the public docs you have and the defaults you've built into the product. When we then have access to that, we often access it via API. Our internal team will draw up a matrix of the risks we see that this agent is subject to and the attacks that it could be subject to if someone went in and attacked it. And we then develop usually between 1000 and 5000 different scenarios that we're going to hit this agent with. Each attack is unique. Some of them are again benign in nature. So the user will simply ask it a question, get the answer back. If the agent doesn't hallucinate, it passes the eval. Other times, we'll increase the adversarial pressure step by step. So the first step could be that we try to lie to it. The second step could be that we invoke authority. We do it over multiple turns sometimes and keep insisting on doing things. We pretend that we're under distress and say, if you don't do this right now, I will go and do something terrible. So please process this refund and obviously only pass the agent if we see it hold up to that pressure. We do the red teaming in two rounds because we often do find things in the first round. So similar to an ISO audit, where you have a stage one and a stage two, and you then have a chance to mitigate any findings in between, we give a company the chance to do that because the goal for us, again, is not compliance. The goal for us is security. Right. That you actually improve the agent as part of the certification process. So depending on the magnitude of the findings, your team will have between, say, one and four weeks to mitigate these things based on the recommendations we come up with. And we then do a second round of testing. That testing is final and is taken then into account when the auditor takes your evidence, takes your Red TeamView results, and writes that final audit report. And what you leave the process with is a comprehensive audit report that describes your security posture. It's between 60 and 100 pages long, and it's an asset you can really unblock those enterprise deals with. You get a certificate for your website again, so you can demonstrate that you've gone above and beyond when it comes to security. And then we come knocking again three months later and say, we still have access to your agent via the API. We're now going to run that same barrage of tests again to ensure that the changes you've made in the last quarter didn't invalidate some of these security things we found. And we do that every single quarter. And that's a requirement to maintain the certification.
[29:42]
B
And in that red teaming, I mean, you mentioned this before around kind of the probabilistic nature of some of these things. And this is something I've always run into in AI workshops as I give workshops in enterprise. Often people will say, oh, this is like, you know, it's not deterministic. How do we create the. Right. Like, what does passing mean? Right. And so you could say, well, passing means, you know, passing all 5,000 scenarios. Right. And you mentioned this phased approach, which I think deals with part of that. But yeah. Could you describe a little bit on that? Side like what does, what does passing mean? At what level? Kind of do you expect things to pass or should, should you expect things to pass or has that even been a topic of discussion?
[30:30]
C
Yeah, so it's, it's a great question and it's also a really hard question. So there's some nuances in here. What we require to pass AUC1 is that you don't have any. We grade each run based on severity. So a pass, you can have a P4 which is an insignificant, say a small hallucination that doesn't really affect an end user. A minor thing would be, which would be a P3, P2 would be something significant that may actually have real world implications. P1 is something critical and P0, I actually don't know the name for it. I think we have called catastrophic or something like that. The kind of thing, if we found it, you would drop what you had in your hands and start mitigating it immediately. Because having a system deployed with this kind of vulnerability could have real world implications that would be high. Our grading approach right now is that you cannot pass AUC1 if you have any P0 or P1 vulnerabilities identified, you have to mitigate those from then on. We believe a lot in transparency and we know from the compliance world, at least the frameworks that are robust and hold up under pressure, that if we put the results in the audit report and your customers see that audit report, you are very incentivized to mitigate the vulnerabilities we find. What we also know is that these agent systems are very different from use case to use case. So a coding agent is one type of beast versus a customer service agent versus a automation agent like UiPath that make decisions based on the information. And so companies have different tolerances around the percentage of hallucinations they would accept and so forth. So we really leave it up to the company and really in the end the customers of that company to make these calls. The important thing is, and this is where we sometimes have a little bit of a conversation with some of the companies we work with. No company has ever and will ever pass AAC1 with a 100% pass rate. It doesn't exist here. We're not Delve SOC2 compliance where you just get a magical spot free audit report. All agentic systems are non deterministic in nature. That means that they will always, if you put them under the right amount of pressure, be able to be jailbroken. They will always be able to hallucinate. We work again with like some of the legal agents we're certifying right now are world class at hallucination prevention. I am sure we will still be able to find some minor hallucination cases in those. And that's just the nature of these systems. If you remove those hallucination rates, it's because you've made the agent so dumb that it won't be able to actually execute the use case there. It is a topic that is very alive for us both because there's a grading methodology question in there and then there's this communications question and we've not yet seen that enterprises fully acknowledge this. Enterprises would also like to see something spotless because something that is not spotless just asks like adds complexity and raises some of these questions. But we're hoping to be part of a push in the sector to acknowledge that a spotless audit report is probably not as valuable as a audit report that reflects reality more clearly.
[33:30]
B
Yeah, that's really helpful. Appreciate that. I hope you're inspired by the work that the AI underwriting company is doing and what we're talking about in this episode. Really getting to a point where true enterprises can adopt agentic technology and actually have confidence in in that technology and maybe eventually insurance around the risks associated with AI agents. But that involves a whole lot of things. There are a bunch of controls that need to be put into place. Everything from yes, individual guardrails, but much more than that to how agents access MCP servers, how you manage supply chain and the risk associated with things in the supply chain around around agents, how you handle observability and response to incidents. This can be really overwhelming. And that's why I'm so privileged to be working with an amazing team of AI engineers at Prediction Guard where we've actually built an AI control plane that you can self host in your own infrastructure that allows you to treat AI agents that you're adopting with zero trust and these built in controls out of the box. I would love for you to take a look at what we're doing, book a call with my team and I to talk through your individual implementation and how you can get up to speed rapidly and adopt this technology with full confidence. You can find out more@prictionsguard.com PracticalAI that's predictionguard.com PracticalAI I have another kind of selfish question because this is actually a response I get quite often when I'm talking to people about the systems that they're building. And I have my metaphor that I use that I would love you to critique which might not be useful if it's not useful. I need to use a different metaphor. But the scenario is often they say, oh, well, we're building these agents or this agent, and maybe they're using aws, right? And so they're building some agents, they have some agent harness, and then they're plugging into some AWS bedrock models. And I'm talking to them about, hey, well, like, when you're thinking about governance of these agents, the behavior of these agents, how you control that behavior, how you prevent bad things from happening, like, how do you do that? And they're like, oh, well, that's easy. You know, AWS has like a content filter on their bedrock model, right. And to be clear, I'm not bashing on aws. I think it's cool that they have a content filter. But I, I, I often use the metaphor of like, my own health as a person. So I, I say like, well, is it bad for me to run a point check to like, check my temperature? It's not like a bad thing, right? Like, that's part of maybe being a healthy person is knowing if I have a fever or not. Right? But it's very different from me being plugged into a healthcare system where there's electronic health records about my journey as a person, my health, my conditions, different from kind of having a comprehensive set of physicals and labs that were run that give different, you know, perspectives on, on my health. Right? And so there's, there's this system that I'm plugged into. There's a process, there's policies around that. There's, and that's in, in my mind, that's much more of, kind of the perspective that people need to go to is not so, hey, I have a prompt injection filter, right? And that's my strategy, but more this kind of comprehensive view like you would have of your own health as a person. But now we're talking about like the, the health or behavior of a, of an agent. I don't know any, any critique on that.
[37:18]
C
Or, or I actually, I, I really like that analogy. I've not thought about this, this one before, but I think our quarterly red teaming is, is very much alike to the, the doctor's visit where you go from head to toe, you go through the MRI scanner, you go through the blood testing. Everything that Elizabeth Holmes tried to prevent with Theranos, we will do to you, and we will do it 10 times over. And in between, the beauty of the standard is there's obviously a lot of runtime controls in there. So we will ensure then that you do still take your temperature every day, in fact probably every minute, so that alerts are configured. If something immediately goes off, we will also have you lock your system behavior. Right. So if something goes awry and you don't understand it, you can go back and see what, what is the observability then and go in and explain it. So I think it's basically the perfect analogy between the red seaming is the doctor's visit. We do every, every course and we do that very comprehensively. And in between we make sure that we check your, your vitals every minute and there's an alarm going off if there's something, something off, and then maybe adding a healthy diet to it in the first place as well, ensuring that the inputs and outputs of the system are working well. So not too much junk food there.
[38:34]
B
Okay. Yeah, I like even now I have some revisions of my metaphor I'll use based on your response. I think that's great. But yeah, I think the other thing that might like we have some developers listening, maybe people developing agents actively and it might be somewhat overwhelming to for example, look at the, you know, AI UC has like a evidence page, right, where I can see all the things maybe I should be doing. How do you see the market evolving in terms of like obviously you have one side of this which is really related to the standard, the certification, maybe, maybe eventually that insurance side. But then it can be, you know, an individual developer of an agent might not be an expert in agent security or how to govern these things at that sort of thing. So that might seem overwhelming to them that you mentioned partners like auditors. There's the infrastructure layer. What do you think? I, I guess my question is what do you think needs to be in place to, to actually enable real world developers to meet some of these standards, Whether that be evolution in the tooling or obviously understanding maybe of the, the, the standards. I don't know. Does the question make sense?
[39:57]
C
Yeah, yeah, no, absolutely. I think where we are right now is we're just overwhelmed by how positively AOT1 has been received as we're really busy just delivering certifications. And that means we have less time to talk to some of the many, many fantastic partners who come into our inbox and want to partner with us. I think there's three things we need to get right for this to work. I think we need to continue pushing for code and eugenic products that come out of the gate is as secure as possible by default. We're certifying our first coding agents right now and we are certifying both a well lovable, which I think will be certified when this process comes, this episode comes out and then another very large coding agent that may or may not have been acquired recently for a lot of money without naming names. Working with the coding agent layer and the platforms where you go in and configure code, we're Also like again, UiPath is a good example where you don't just like have one agent but you actually go in and build agents on top. Means that it'll be easier to meet a lot of the standards just by default because the environment where you define and build your agent is secure by default. So I think that's step one and we're going to do more of that work with some of the big agency platforms out there very soon. I think some of it will be announced in the, in the fall. The second stage is we need this partner ecosystem you just talked about and we're already starting to come out with, with some examples of this. Where a partner meets, helps company meet companies meet a good chunk of controls. So a company like Wide Circle, which I think is very cool, we don't work with them at all. So it's just a, just a shout out. Their monitoring and filtering work is really good and has helped companies meet a lot of the safety requirements in our standard. So that is like one platform you integrate and you immediately meet say 8 or 10 of the requirements in the standard. There are many other platforms we're doing some work right now with credo, with Witness AI and there's again tens of others. So having that ecosystem help companies meet the controls I think is important. And where we've already gone in and done our best to help companies meet the standard is we've gone in and actually defined the typical evidence we see companies upload. So see that as your guidance for where to look for the right approaches. We don't just define the controls and leave it up to you to figure out how the hell to implement it. We actually try to give you the guidance as well on how we see companies do it today. I think the third and final stage we need is obviously making it easier then to go through the certification itself. So we're already integrating the framework in the leading GRC platforms. We're making it easier for our auditors to capture as much as the evidence programmatically. So we move away from screenshots and into like real validation that the controls work and hold up in real time. That layer and like the whole GSC engineering space is just really interesting to follow right now and we're doing our best to keep up and make the standard work for the GSC engineering community as well. So that when your RE audit comes in that next year, that it's a very limited time commitment we need from you and that we again focus on security instead of compliance.
[43:12]
B
That's great. Well, Emil, it's been amazing to look at what the AIUC has done even in the past year and the amazing resources that you put out for the community. Thank you for doing that. Thank you for working towards the future that you described. Would love to have you or others back on the show in the future as things develop. Thank you so much.
[43:33]
C
Thank you for having me, Daniel and I, I think maybe just a final plug before I head off. This is work made by industry for industry. I'm luckily not alone in doing this work. We've collected a consortium of about 250 leaders across, like CSOs and the Fortune 1000, it is the Security Engineers and so forth. And having that community come together and actually leave competition aside for a moment and recognize that the size of these challenges and the pace of the challenges just dictates that industry has to come together is fantastic. So the fact that we've been able to just offer the platform and then let industry work together to define and codify these standards is fantastic. And we would love to see more people get into the machine room with us. So an open invitation for everyone who's excited about this work, either to help us drive adoption of the standards we see work or actually help us write them. My pleasure. It was a great conversation.
[44:26]
B
Yeah, thank you, Emile.
[44:32]
A
Alright, that's our show for this week. If you haven't checked out our website, head to PracticalAI FM and be sure to connect with us on LinkedIn X or BlueSky. You'll see us posting insights related to the latest AI developments and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the the show. Check them out@prictionsguard.com also thanks to Breakmaster Cylinder for the Beats and to you for listening. That's all for now, but you'll hear from us again next week.