
Formal methods are a branch of mathematics and computer science focused on proving the correctness of systems, and they have long promised a more rigorous foundation for software. However, their complexity has kept them confined to a small community of...
Loading summary
A
Formal methods are a branch of mathematics and computer science focused on proving the correctness of systems, and they have long promised a more rigorous foundation for software. However, their complexity has kept them confined to a small community of specialists. That is now changing. As agentic AI systems take on increasingly autonomous roles. The question of how to define, enforce, and verify what those agents are allowed to do has become urgent, and automated reasoning is emerging as a critical part of the answer. Byron Cook is a VP and Distinguished Scientist at aws, a professor at University College London, and a Program Manager at darpa. He founded the Automated Reasoning Group at AWS over a decade ago, where his team built the foundations behind products like IM Access Analyzer, VPC Reachability Analyzer, and Bedrock Guardrails. In this episode, Byron joins Sean Falconer to discuss how automated reasoning works and why it scales so well with AI the rise of neurosymbolic approaches that combine formal logic with large language models, what it means to formally specify agent behavior using temporal logic, and why the convergence of agentic AI informal methods may represent one of the most significant shifts in how software is built and verified. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
B
Byron, welcome to the show.
C
Thank you very much. Thanks for having me.
B
Yeah, absolutely. Looking forward to this. So we're going to probably cover a lot of ground today. You've had a pretty rich history, but one of the things I wanted to start off with was so you're a VP and distinguished scientist at AWS. You've been there over 11 years. You're also a professor at UCL and a PM for DARPA. So my first question I guess was like, how is all that possible? That seems like a lot of jobs to balance. How do you manage wearing those three different hats simultaneously?
C
Cognitively, they're kind of the same job, right? I think in each role I'm trying to identify really big things that other people can't wrap their head around that maybe they don't think are possible. I'm trying to move the seemingly impossible to the possible and then trying to de risk that. So either trying to establish basis for confidence or prototype stuff. And that often involves getting people in the orbit of the scientific world and the startup world and the university world to find the right people with the right tools to prototype something and then simultaneously write some sort of report or some sort of PowerPoint document or something in the language of the institution to help them appreciate what's possible to give your leaders a lever to pull on to prioritize these kinds of things. And so it's all kind of the same. It's sort of like the difference between Spanish and Portuguese. You're using slightly different nouns and verbs, but largely they're in the Indo European family of languages. And so you're kind of doing the same thing in these large organizations. They all kind of speak the same language. And then even if I'm at Amazon, I at any given time have three or four things going on. So I'm sort of context switching between those things Anyways.
B
Was that something that you were always interested in trying to tackle these types of problems that seem impossible?
C
I saw my father work between institutions and basically see the community as his stakeholders or his customers or the people he was working with and sort of cross organizational boundaries. And then I like in my life on working on cars or doing music or these kinds of things, people sort of are floating in and out of organizations and you're sort of just trying to solve problems together. So I've always kind of seen the organizational boundaries as relatively porous. And then I got really interested in logic. And it's a very small crowd of people. So there's maybe 3,000 people in the world who specialize in this area. And so we're always at different institutions, different places, meeting up with each other. And then it's kind of like bands you tend to like with three or four people realize there's something really cool to go after. You go after it for two or three papers, and then kind of after that, then people often kind of drift off and do the next thing. And so to me, it's very fluid and the. Or the particular details of the organization don't really matter. I'm just usually be in the place that's sort of the best place for me to be in to help the community figure out how to solve problems.
B
Yeah. And one of these things, these. One of the kind of seemingly impossible things that you were able to kind of show was potentially possible was around essentially you built a tool that could automatically prove whether certain types of programs could terminate and making a significant contribution essentially to the halting problem. I guess. What made you believe that something like that was solvable, given that conventional wisdom and even Alan Turing said otherwise.
C
Yeah. So there is a theorem that the problem's undecidable. That's a very meaningful result. That means that you cannot build a tool that can take any program in as an input and always itself terminating with an answer and always giving a right answer. You have to cut a corner somewhere. And so you can cut the corner by saying, we'll relax the restriction that always has to give an answer. We'll be okay with maybe the tool not giving the answer from time to time. And then we'll just say, well, maybe we can get that the occurrences of it not giving an answer down to some level where we're actually still getting value from the tool. And so that's sort of reframing the theoretical result from Girdle and Turing is fundamental and absolutely staggering. But on the other hand, there's lots of things that are unsafe that we do all the time. Driving is unsafe, but we get in the car and if we get to our destination, then we know we got there safely. And so the insider around that area is really interesting. I was actually working in the Windows kernel team, and I was carpooling with a person who's a developer on the Windows kernel team. And when we were talking about things that should hold the kernel, he kept using the word eventually. Like if this happens, that eventually should happen. And I realized that was like, that is the termination question. Termination and the word eventually. There's a theoretical result that sort of connects them saying they're essentially the same thing. And so I realized that was an important problem. And then I began thinking about what tools could we put together from existing pieces. We already had to do something quite quickly for device drivers, and then that was successful. And then that turned into reasoning about biological systems and it turned into reasoning about non blocking concurrency and so on.
B
Yeah. And was the original focus on device drivers, was that partly motivated? Because at that time, device drivers was one of the biggest culprits of essential failures in operating systems.
C
Yeah. So in the 1980s, if your computer crashed, you just like, well, it was sad. You lost your little word processing document or whatever you're working on. But as we went to data centers, the difference between crashing and not crashing was sort of the difference between the customer choosing your product or not. So it became important to root out those crashes. And it turned out that in the early 2000s, 85% of crashes in the operating systems were actually due to device drivers. So it became an important problem to work on.
B
Wow. Yeah. Yeah. Talk about. You saw that as a high roi, I guess, for the business.
C
Right. And they were really neat because the device drivers were not actually that big. You know, they're like 80, 200,000 lines of code or 80,000 lines of code. You know, they'd have like 60 loops. The data structures they were operating over were actually relatively simple. They were cyclic, doubly linked lists. So in many ways it was actually a slightly. It was an easier problem than the more general problem. So then we were able to do some really neat things that kind of were attuned to that domain, which sort of allowed us to bypass some of the. Not theoretically, but we practically bypassed some of those challenges around undecidability.
B
Yeah. So in a lot of ways you're bounding this unbounded problem Right. To a more specific class of the problem.
C
Yeah, you can't solve the problem in general. So you're going to have to identify niche areas where while it's still my timeout from time to time, it's not going to time out enough to cause sufficient pain to the person using it for their niche domain. And the only hope is to get as close as you can to the person with the understanding of the niche problem to kind of understand the dynamics of that problem. So that happened again with reasoning about genetic regulatory pathways, where I was like, oh, I see. So usually they have this shape and usually they're of this size. Usually they're asking this kind of question. So actually we could do something that 95% of the time it's actually going to give them an answer. And maybe 5% of the time it says, sorry, I can't answer the question, but okay.
B
So you've founded the automated reasoning group at AWS like over a decade ago. What was that original pitch to the Amazon leadership for them to be excited and want to believe in this?
C
Let's see. So I, I had kind of exhausted. I mean, I'd written all the research papers, you know, I was like, how much more can I do in this ivory tower research environment that I was in at the time? So I was sort of looking for a place to go where I could maybe have more breathtaking impact with the work I was doing. And so I actually sort of, I went and talked to a bunch of potential employers and kind of figured out which one I would want to work for. And I think that AWS was pretty happy to have me because they, it was pretty risk free. Right. Like I was going to join. They were only hiring one person. I report directly, the ciso. If I think if it didn't work out, then I would exit. I would imagine that would be the discussion that they were having. How risky is this hire? And then what Amazon, I think does a really good job of traditionally is incrementally investing in things that are working. So they look for signs of direction of Change on things that are going well. And they have a whole language around year over year change and so on. So they look for bets that are going well and then they continue to feed the bets and double down on bets. And so it's a whole calculus of what to double down on and how to prioritize things. So once I got in the system, then Amazon just naturally went through its mechanisms of prioritization and it just grew like a weed really, you know, given
B
that like AWS runs critical infrastructure for millions of customers. Like going back to some of your work with like the drivers and sort of proving whether they can run successfully or not. Like, are there similar class of problems around IAM policies and network configs and things where if you mess those up, it can go really bad? As someone operating this type of infrastructure, is there a way to essentially apply some of your work to those areas?
C
Yeah, so that's what we did. So when I first joined, I identified, I went and chatted with a bunch of people and looked at a bunch of documentation and kind of identified the top 20 security concerns or issues or close calls and so on and then prioritize them by what I thought the kinds of things that I had done in the past could help on. Identified four or five of them, hired a few of my friends and then we kind of went after them. And so yeah, you identified two of them. So that one is reasoning about the semantics of AWS policies. Another one is the virtual networks, because if you get those wrong, then you're going to have a really bad day. And then we also worked on like low level code like virtualization and cryptography. And then we also worked on like the distributed systems protocols. So how do you, for example, maintain secrets across a distributed system such that they're highly available and durable and so on. So the thinking around protocols and distributed systems was another area. And then each one we went after with great gusto. And then the idea is that if any one of those projects were successful, then we'd be successful fortunately. But unfortunately they were actually all successful. So then I spent two or three years really holding on to many high voltage wires as I was trying to get each of those projects kind of. They were all quite hungry to move to the next level. And I kind of found myself to be the bottleneck on the communication around how to grow those things and how to get them into the right place. So it took me a couple years to get out of that.
B
Yeah. How's that work? I'm assuming your teams and yourself especially when you were essentially an IC functioning in this role until you hired some of your friends and so forth. But you're sort of doing this as this research project, I assume, and then maybe you're doing the proofs, maybe there's a POC involved. How do you go from that essentially to this thing running in production and serving millions of people?
C
So one piece of advice I got when I first started was the person was like, hey, in the previous employers, you've gone for these moonshots, right? These crazy big things. But what you want to do here is you want to find a moonshot ladder. You want to find every quarter, every six months, you want to deliver something that if people don't have any context on your bigger picture while you're doing it, what's possible? In fact, you probably don't even want to tell them the bigger picture, deliver that. But then you know that each piece, each delivery sort of adds up to the bigger moonshot that you're going after. And so focus on it that way. And Amazon is a very sort of show me kind of place where people kind of believe it when they see it. So rather than talking about what you can do, deliver something maybe a lot smaller than you normally would, build trust through that and then iterate, iterate, iterate. And so the projects that we were very successful on were those that we were able to break up the bigger moonshot into a much smaller set of steps. And for those that we weren't able to, they languished. And it was only once we'd sort of built trust and built momentum on some of these others that we were able to come back to those bigger things and take them on again.
B
That makes sense. Just even my experience, having worked at a variety of different startups, when you kind of start in a new role, a lot of times it's like, how do you build to something big, but also how do you show value as soon as possible? And it's much easier to do that with a smaller, more constrained project. And then as you build trust, people kind of believe in maybe the bigger thing and the longer term. Deliverable.
C
Yeah, exactly.
B
Is there any challenges around, from an engineering perspective, where when you're thinking about formal verification, you're doing that in the lab that doesn't necessarily need to be as fast as it might need to be when you're actually delivering this type of stuff in production, is there ever a challenge there where the formal verification, perhaps in the lab environment is slow? And then how do you sort of engineer it to something that can run in production at scale.
C
There's actually another problem that's bigger. I'll answer your first question, but then I'll address the other question that's actually more on my mind. So in the lab, you're often relatively limited to the resources you have. You have maybe a couple grad students and a few laptops or something like that these days, maybe you have a bit of cloud, but you don't really have the sort of resources you would have. So it turns out there's not been a lot of science, but the science is really interesting in growing on how to distribute formal reasoning. So there's a thing, for example, called mall mono, where they distribute propositional satisfiability solving. It's really powerful. And if you have a cloud scale, like a team to operate a cloud service, and you have the sort of resources you have at AWS to run big machines, then suddenly actually it's much, much faster. So the performance of the reasoning isn't the big gotcha and it's getting faster and faster. And I'm not worried about that. What is a problem though, is scaling the. So whenever you're doing reasoning about systems, you a have to articulate what you want to hold of the system. Like, is it all data at rest is encrypted? Is it we never log credentials? Is it like, what property do you want to hold of the system? And you have to do that in a way where you get the details right and you have to do it in a relatively rigid setting. You have to use a temporal logic or something like that to express it. And there's a very limited number of people who can do that. And then the other pieces that you have to be tasteful about what you're not going to prove. You're not going to go and prove the correctness of the language runtime, the compiler or the microprocessor the program is running on. So you have to basically build assumptions that the reasoning relies on and making those decisions and articulating those. It's a lot about how do you do the encoding such that the tools work. And again, that's a fairly small set of people who can do that. So I'd say the big challenge. And then culturally, your average person who knows about formal reasoning is not going to be very Amazonian, if you will. There's a big cultural gap between people who are relatively rigid logic. When they looked at a lot of opportunities in the world, they said, no, I'm going to do mathematical logic instead. So what interests them and what drives them is really different than what you're going to find with a person who would join Amazon because they're interested in the cloud or other dimensions. So culturally they're a little bit different. So bringing all those pieces together, finding a manager who can run scientists who understands Amazon but also understands the science community, and then another big one is finding people who can drive the tools or who can do the encoding. And thankfully the problem around the encoding with AI tools these days actually is getting dramatically easier. So with the idea of neuro symbolic AI where you combine formal reasoning with the neuro inspired techniques like transformer models, suddenly you have ideas of auto formalization and there's a whole bunch of new building blocks that we can use to really scale this activity out. And so that's actually turned out to be tremendously helpful quite recently.
B
Yeah, I wanted to get into the neural symbolic AI which you mentioned there, but I guess for the audience that's probably not necessarily super familiar with it. What is sort of the definition of that area is the neural essentially just you're leveraging LLMs in this case and that kind of is serving as the neural layer and then you're combining that with just sort of more traditional symbolic reasoning.
C
I mean you're probably going to pull five experts in the area and they're probably going to argue about the definition. But broadly speaking, in my mind it's the combination of a symbolic where you're pushing symbols to represent the infinite. You're like X less than Y is the set of all pairs of integers, if you will, such that the first is less than the second. Right. So that's an infinite set. So when you begin talking about formulae with variables and you're sort of pushing symbolic formula, you're doing symbolic reasoning. So that's the symbolic of neuro symbolic. And then the neuro is these. Yeah, these sort of, I mean LLMs is absolutely the classic. That's the big hot thing right now. Right, for sure, but more statistical or Bayesian thinking. And then when you combine them you can get, depending on how you combine them you get different advantages and disadvantages. And so it kind of depends on you can use formal reasoning to synthesize training data to make your language model, for example, much more powerful. So actually a lot of model providers do that today. They use the Lean theorem prover for example, and synthesize a lot of true and false theorems. And then we'll train over that data and then you'll find that the language model is actually better at like recipe design and so on. And then the other extreme is you could sidecar a formal reasoning tool on the output of a language model and then either ask the language model to express it in the formal language or do some sort of translation. That's kind of the other extreme. And then there's sort of middle grounds where you try and integrate symbolic reasoning into the inference procedure itself and you see a lot more of the two extremes and not as much in the middle.
A
Right now you know Fidelity is a financial services leader, but did you know that inside Fidelity is a community of technologists working together to shape the future of finance and tech. Fidelity is always investing in tomorrow. From emerging tech to cutting edge tools that will transform what comes next. Their technologists are encouraged to keep learning so they can expand their skill sets, explore new ground, and stay ahead of this rapidly evolving industry. And right now, Fidelity is hiring technologists to join their team. Fidelity technologists get the best of both worlds. Startup energy that's grounded in the stability of a financial institution. That means support, resources and amazing benefits. Bring your skills to a culture where you're empowered to dream big and build the tech that drives an organization and makes a real impact on people's lives. Find out more@tech.fidelitycareers.com that's tech.fidelitycareers.com Fidelity is an equal opportunity employer in mobile application security. Good enough is a risk. Guardsquare uses advanced multi layered code hardening techniques and automated runtime application self protection and mobile application security testing combined with real time threat monitoring to deliver the highest level of mobile app security. Discover how Guard Square brings all these together to provide mobile app security for your Android and iOS apps without compromise. At www.guardsquare.com every AI team eventually hits the same wall. The models are solid, the infra is solid, but the data coming in is hours old because the pipeline is batch when it should be streaming and nobody's had time to fix it. That's not a modeling problem, that's a pipeline problem. Estuary gives you CDC batch and streaming in one platform. 200 connectors live in hours, not weeks. Your AI is only as good as your pipeline. Estuary.dev.
B
i mean, from a, I don't know, big picture perspective, if you're trying to achieve AGI or something like that, it seems like you would need at least both sorts of approaches where LLMs are very good at a certain class of problems. And then these more formal reasoning things that you can define as correct are going to be best sort of used for a different class of Problems, they're kind of complementary. So if you can put them together, it feels like that would be a necessary step, reach some sort of more general intelligence across a variety of different problem sets.
C
I'm very much an armchair cognitive scientist. I'm not a practicing one. So I am a specialist in my sort of branch of AI, if you will. So you'll want to take what I say with a grain of salt. But I like your characterization that it's a necessary but not sufficient condition for whatever AGI means. But breathtaking AI, that's hard to know if this is a human or not. I would think you need both logical and the transformer type stuff. The thing I'll say is that for years the propositional satisfiability solving is this classic NP complete problem. The tools for propositional satisfiability solving have far, far, far exceeded human capability, like years and years and years ago. And they get better and better and better. So like you can take the entire design of a nuclear power plant or railway switching or whatever, and with millions of variables, millions of connectives, ands and ors and so on, and show there are no satisfying assignments or find satisfying assignments in those spaces and you could collect the entire humanity and ask them to solve it, and they wouldn't be able to solve it in that time. So it's like in a sense it's far exceeded human capability long ago, but they were extremely brittle to use to get the answer. Required you to understand logic. And then if you go to satisfiability, modular theory, solver, Lean or Isabel, these kinds of tools, you needed even more and more expertise just to ask your question. Yeah, that was typically the real bottleneck for using.
B
Yeah, so does pairing that with an LLM where you can have like a natural language interface and these that I assume significantly sort of democratizes the access to it.
C
Yeah, there's all kinds of tricks you can pull now. Like you can call a language model multiple times to do translation from natural language to logic, and then you can use a theorem prover now to prove equivalents of the different attempts. So the more you do that, the more confident you got the translation right. And then now you can, once you've got that into a formula, you can walk the formula like you can use the propositional structure or the logical structure in the formula to identify interesting corner cases and then articulate them back in natural language. So you can help someone who doesn't understand logic know what the formula is saying. And then you can go from natural language to logic and then now we've got those pieces of equipment. There's just so much you can do around, like if you're agentic workflows, if you're ruling out bad behaviors from agents and then you decide later on you don't like the formula that describes the safety range, you can then modify that formula and then go replay all the agentic actions and see what would have happened and just all kinds of stuff you can do that was impossible before.
B
Yeah, I mean, given that historically the hardest part of some of these formal methods was kind of like the human bottleneck of writing the specs, how is this transforming some of the, even the things where you found success in previously at aws, around the sort of formal verification around policies and things like that. Is it changing that kind of work as well?
C
So one choice we made when I started the Automated Reasoning group, I remember it was like Ken McMillan as another scientist in my space, just before I joined Amazon, I had told him I was going to join Amazon and I was having coffee with him and he joked, man, I'd love to get out of program verification. And so when I actually that idea stuck with me. So when I joined Amazon, the policy reasoning and the network like VPC Reachability Analyzer, that's an aws product, and IM Access Analyzer or S3 block public access, those are all based on these tools that are actually combinatorial reasoning. So they're actually, they are decidable, they're not undecidable. And so there the real trick is to do combinatorial reasoning. And there the tools are like propositional satisfiability or satisfiability modulo theory solvers. And they're just unbelievably fast and no LLM will ever beat them. But what the LLMs do now is they actually call those tools. And so that hasn't really changed. What has changed is rethink about programs. Because now if you have loops and you're now worrying about termination of the loops, or you're worrying about finding an inductive invariant over the loop, or you're worrying about concurrency, so you're looking for rely guarantee constraints. Now the algorithms, the procedures that existed before were pretty brittle and now you can just ask the LLM when the LLM's writing the code, you can just say can you find the inductive invariant for me at the same time? And it does a pretty good job. And then what we've seen with recent advances in the models like in the past year is like really breathtaking. So we whereas before we had five humans who could drive these theorem provers. Those humans are now really excited because they're able to let the models go and do the proof search for agentic tools, go find the proofs for them and they run thousands of these jobs at the same time. So we're getting, it's not only 10x or 100x, it's like 1000x productivity gains from that small seat of individuals who can drive these things. And the amazing thing is because these tools like Lean Theorem Prover, they are yes or no tools, right? You can pass them very complicated things and they will say yes or no. It's like the scaling is unbelievable as opposed to, I don't know, poetry or other areas where the yes, it's not as easy to say yes, no.
B
So it seems like there's a couple different or there's a spectrum of approaches where you could combine like an LLM or reasoning model with these kind of formal methods where it's like verifying AI output or even some of the stuff where the LLM can rely on the formal method and sort of trigger that to handle certain sort of logical reasoning problems or even using AI to make the formal methods easier to use and sort of democratize the assets out of those that spectrum. What part of that I think excites you the most? Where do you see the most opportunity?
C
Well, I mean there's just nothing but open field for the opportunity. Actually all of them. It's just absolutely unbelievable. I just really didn't think it would happen in my lifetime that previously I had like four or five friends of mine who I'd worked with in the past and different institutions and I brought them to Amazon and basically their job was to sit at a terminal and drive these tools and they were the neuro. They had read all the proofs and understand all the tools and they could run these things and suddenly it's just like you get kind of unbelievable capacity from them. So I actually don't see a ceiling for that productivity. The thing that I'm probably most interested in intellectually is agentic safety. I think of agents as allowing us to scale things that where we only had sociotechnical mechanisms before. So now you have a choice of socio technical or these kind of agentic tools and there's a disadvantages and advantages to them. Right.
B
What do you mean by socio technical?
C
So a socio technical mechanism is a mechanism where you are inserting people at certain stages as a protocol to raise the bar or make decisions. Right. So like the police is a Socio technical mechanism. Right, You I don't want to go to jail. And so therefore there are certain things I won't be doing because I don't want to go to jail and. Right, and my employer can call the police. Right? So whereas agents aren't afraid, they don't fear jail whatsoever.
B
That might be the actual responsibility of the engineer. At this point, if the AI is writing the code, you can't send the AI to jail, but you can send the person who sent the instruction to jail.
C
Right? So that's where this formal stuff comes in. Right. So if now what we're doing is if we ask the human to look at the output, if we say, hey, you're responsible for the output always in some sense you haven't really scaled the person and now they're just going to hit cognitive overload. Imagine they have like a thousand agents writing code on their behalf and their entire job is just to read the output of agents. They're going to miss things for sure. So what you really want is you want to actually write down, whereas maybe we weren't before because there wasn't an advantage to doing it. Actually write down what you want to hold. So what is availability? What is confidentiality? What is integrity? What is privacy? What is sovereignty? What is durability? What are these concepts and can you write them down in a linear temporal logic or computational tree ctl? What is the formalism for which you would want to express these things? And then can you use these auto formalization techniques to get them right? And then now can you, in an agentic execution environment, before allowing the commands or programs that have been synthesized by the agent's LLM, allow them to go execute, can you now go statically check that those things won't violate the confidentiality, integrity, durability, you know, availability rules, and then allow it to go. And if you can get that right, then now agents just become unbelievably powerful. For us, it's a very humanistic view because it's still, it's still saying it's a more declarative programming model, right? You get to say, you know, we in natural language, in a room can decide, this is what availability is, this is what integrity is, this is what confidentiality is. And then if things happen in the future that we don't like, then we can refine that definition together. And then one could imagine best practices where they're open source and like, oh, this is what these other enterprise organizations use for confidentiality. We should copy those and then we can for compliance we can now talk to the auditor and say, well, here's our definition of confidentiality. And so then I think really it sort of unlocks a lot of a scaling dimension that's just previously unheard of.
B
So the first problem is that if everybody's running, you know, a ton of different agents, like at some point people are going to miss something because they're context switching all the time. If you're doing this for engineering, you're spitting out a ton of code. No way. You can hand review every single line of code and make sure that it's right. And even if you did, you still might make mistakes. But now you're sort of shifting some of that problem to trying to do like spec this out beforehand in terms around what the expectations are. But are you sort of shifting that problem to. You kind of have to be able to predetermine all the situations where the agent might get itself into trouble. Right. So how do you start to formalize that and also, I guess, bound that problem down to something that doesn't take forever to essentially define.
C
So the answer is yes. And there's a quote. I like this related. So Einstein apparently said everything should be as simple as possible, but no simpler. So, yeah, ultimately we are still having to decide as a society or as a leadership team or as a family, what's allowed and what's not allowed. And actually, organizations at scale tend to articulate these things down right. Amazon, we have the leadership principles. And that really helps Amazon scale a lot. When I'm going through an uncertain time or something's happening and I can't just go call my boss every time I'm confused, how am I making those decisions? I rely on the leadership principles to give me guidance. And in an agentic setting where you don't have human judgment, you will want to be rather more concrete. But then you can use symbolic formulae to describe the infinite set of things you're okay with and not okay with. And then using the formula, you can then find the interesting witnesses of what's allowed and not allowed to show them to a committee, to the society, to leadership team, or to a family. These are the kinds of things that will be allowed or not allowed. How are you feeling about that? And kind of run those things ahead of time. I think you can't make it simpler than that. You ultimately do want humans to set the policy on kind of, what is the context in which we're going to let agents just rip? And what are the boundaries for which we don't want them to Cross for like, you know, we don't want them to send money to certain countries or buy certain kinds of stocks. And I think it's an exciting opportunity. I think it is the sort of scaling that you got glimpses of in really large successful organizations. But it's kind of a way of thinking that a lot of people haven't thought about before. But it's, but it's an opportunity now that for scaling yourself but your family and organizations.
B
So a lot of times, either whether you're a model provider or you're building some sort of agent system, you'll use things like the system prompt to put in certain guardrails around like, hey, this agent should be able to answer certain types of questions. So this sounds like a more sort of formalized version of that where it's not just about the formalization of the spec, but also when the agent does something, being able to go back to the spec and essentially check in a formal fashion whether you're allowed to do that thing or not. Is that a fair characterization?
C
Yeah. So my mental model is definitely answers for which there are right or wrong answers. Right, like on family Medical Leave act or New York City Department of Buildings, there are questions for which like, no, you are not allowed to put the air conditioner there. Or you know, like you in a two family building, you do need sprinklers. But there's all kinds of, like when you look at the codes, it's like the conjunction of codes from different years. And often there's conflict between them. And so often these kind of rule systems that we have in society often conflict. And so now's a really great time to find all those things, identify the conflicts, decide on the answers in a fair and equitable way around what are our answers going to be and then allow agents to take action or humans to take action. But I definitely have in my mental model those questions for which there are right or wrong answers. Like, should we send this money to here? Does this violate your tax strategy? Those are right or wrong answers. And then there's stuff that's more like, is that a bad thing to say? Is that a bad word? Or these things are a little more on the statistical spectrum. And I think that you could probably use some of these same mechanisms. But I have a harder time wrapping my head around is this a good poem?
B
I think that's the hard part is the things that are subjective. Actually, I've written a little bit about this where I think one of the reasons why we've seen such success in engineering Sort of really being the tip of the spear with not only agents but models in general is because when you code you have a lot of built in guardrails in terms of whether something works or not. There's deterministic ways. If you generate code, you could compile it, you can run it against tests. Those are some form of verification of whether there's correctness behind what you generated. But if I'm generating a poem or even a, I don't know, a press release about a product launch, the quality controls are much more subjective and it's hard to essentially put some sort of formal verification around it to determine whether it's good or not. And then the feedback loop in engineering is also very fast where you could compile it, pass the test, even run it, and then if errors happen, feed the error back into the system self correct and generate new code and do that in essentially a loop. But it's hard to do that in a lot of domains because by the time you know something's wrong, the space between generation and knowing it's wrong could be a really long period of time.
C
Yeah, I agree. I think I'm fortunate to be working on the side of the spectrum I'm on because I'm just enjoying huge benefits. Whereas in music or poetry it's harder. I will say that there is a boundary that's not 100% clear and there's probably pretty exciting work there. It'll be more statistical in nature. You'll be like, I'm relatively confident that this discussion, even if it's translated into like some, like Vulcan or you know, some language that's not natural, that's not English or something like the argument that would be made still ultimately roots out to chemistry. And then chemistry is a formal system. So I think in some applications there's, if there's a formalism hiding down somewhere, then there's probably something we can do using this. But yeah, I mean for me it's on mathematics or reasoning about correctness of programs or reasoning about policies or security. It's like we're well on the side of the spectrum that we're happy about.
B
Yeah. And I think a lot of probably B2B type enterprise problems fit into that class structure versus the just open ended kind of you ask anything and get any possible answer.
C
The addressable market for answering questions about correctness according to law and tax strategy across like agentic financial systems is so big that it's worth going after even if we can't answer questions around poem being a good poem.
B
Yeah. Healthcare probably similarly.
C
Yeah.
B
So with an agent there's going to be at least sort of two big safety concerns or trust concerns with those. One is that the agent might take action or access data that shouldn't or is unexpected. And then also it could hallucinate some answer to some question or something like that that also leads to a misleading result. So is what you're talking about possible to kind of potentially address both of those problems or does it bias towards one more than the other?
C
So in the space of chatbots or that kind of thing, we have a product called Automated Reasoning Checks and Bedrock Guardrails. It helps you formalize your domain like Family Medical Leave act or your vacation policy or your travel policy. And then in an inference time setting helps you remove incorrect statements and only produce correct statements and can explain why the answers are correct. And that allows not only for you to remove incorrectness due to hallucination when you're dealing kind of with your customers, but it also allows you to understand your rules because there may be cases where your rules were wrong and the customers like that can't be right. And so you can use the feedback from the customers to identify actually where you we got things wrong. Which is actually fairly common when people kind of bring their policies in from the text file to now being used in an LLM based situation on the other one around. Basically where like we regularly analyze, using automated reasoning tools and entire AWS services, looking at where data flows to and from. So we can, we can track, you know, we don't track the data itself, but we can track the flows being using statically analyzing both code and the infrastructure of AWS to show where data flows to and from. And then that allows us to make pretty high fidelity decisions around risk, around what information we're using to make decisions. And so I think that's an area that the broader IT community can use too, right? If you're building agentic generative AI based solutions that plugs into databases, data lakes, you know, these kinds of things, you can analyze your service and determine where data is flowing to and from. And then if there's certain pieces of information that shouldn't be contributing to the answers, say in a medical domain, like if you don't, if you don't want the agent to leak other customers, pii, for example, then that's the sort of thing you can establish and maintain as an invariant using the automated Reasoning tools.
B
So one of the things that you said earlier was that there's only maybe two or three thousand people that kind of specialize in this area of automated reasoning. And it seems like in recent years automated reasoning has kind of broken somewhat into the mainstream, I guess. Why now? What has changed in particular for this to suddenly have sort of more visibility as an area that's important for, you know, building successful software.
C
I'll give you the short term answer and then as I show a bigger context. So I mean it's agentic AI. Like you know, 10 years ago, if I had described what I did on a podcast like this, I think most people very, a much smaller set of people would track even what we're even talking about. And I think what's really amazing about generative AI is that as a society we're going through an exercise, sort of art of the possible exercise. Right. There were tools that use natural language that were very easy to call. You just went to a website and people are like, oh wow. And they sort of realized that computers could answer questions, very deep questions and reflect and so on. They weren't always right, but they could do it right. And so then it's a small delta from that to like, how do I make sure these answers are correct? Which is kind of what automated reasoning has always been sold.
B
Right? Yeah.
C
But before you were trying to talk about, imagine a computer could tell you and they're like, I don't even know what you're talking about. Right. Like, so we just, out of the gate. The presupposition of the work was just like lost on the vast majority of people. So it just wasn't, it didn't click for people. And so like it was a very small set of people whose brains were sort of like wired to be interested in that problem. And that's. So then people just kind of self selected. So I think that has changed. The bigger picture is that the inventors of computing were all formal methods people. Right. Like Alan Turing wrote a paper called Checking a larger chain which shows you how to prove a correctness of a program. And von Neumann was a proof. All of these people from the 1950s and 40s were mathematicians who knew how to do proofs. And very clearly in their mind they understood that programs mapped to Turing machines or like automata. So like it was abundantly clear. It was like not even worth mentioning.
B
Yeah.
C
And it was kind of only when the IT industry, like when computers became like commercial mainframes and then mini computers and so on, that the branches of the mathematical view of, of programs and how to reason about programs and then the sort of IT industry kind of divided. And so you'd meet people kind of didn't like you. You meet formal reasoning experts, particularly in Europe, who have like never written a computer program in their life. They know how computers work, but they understand the mathematics of Turing machines. And then on the other hand, people who write computer programs all their life but don't even know it's possible to prove correctness of program. So the, the worlds have kind of divided. And then in AI you saw the same thing, right? So AI, like one of the first applications of one of the first IBM commercial computers was actually to automate the reasoning from Principia Mathematica, right? So automated reasoning was one of the first non military applications of computers. And then AI. John McCarthy coined the term AI. He was a formal reasoning person. So in the very beginning AI was sort of the mixture of these Bayesian cognitive, logical mix of those ideas. And people were very excited and it was very heady days. But then you had this sort of series of AI winters where government funding would dry up for these areas. The response was for people to get much rather more specialized. And so what happened is the automated reasoning folks, like the people in the logical view, found very niche applications that they could solve with their kinds of tools and were successful there. And that's kind of how I grew up. And then other people went to, you know, recognizing numbers for depositing checks. And so there was another set of applications for which some of the other techniques were going to work better. And so the areas kind of grew apart. And what's happened with generative AI and now agentic AI, is that the money, but also the, you know, the disruption, the opportunity has really kind of brought all the communities, the communities are kind of coming back together and there are certain members of the, of the, those crowds that are bilingual that can talk between all those communities and they're just worth their weight in gold.
B
So yeah, it's interesting. I think this whole area is bringing back a lot of things that have had periods of time where there was a big research backing behind it. I think ontologies certainly have been something that I did my PhD in the knowledge engineering space with ontologies and worked on ontology mapping problems. And then no one was talking about that for like a decade. And then now it's come back in a big way. There's a bunch of different things and even in distributed computing as well, where if you have agents that are making all kinds of network calls, you're going to run into distributed system problems. And that's really, I think brought to the forefront the idea of like, you know, durable execution and durable functions and all this type of stuff that historically has been a little bit more, you know, niche. Certainly companies invested in that. But it wasn't like everybody's problem and now suddenly it's like everybody's problem at Amazon.
C
It's funny because, you know, like sometimes, you know, very big names will join Amazon and they'll hear about the automated reason and be like, oh, we tried that at my company 20 years ago, it didn't work. And we're like, well, you know, maybe should we have maybe that was 20 years ago. So like things change and I think it's like music. You, you see ideas, like old ideas influence or fashion. Right. Things come back and they get integrated in different ways. And so, yeah, so pendulum swing ideas go out of fashion, come back into fashion. And it's actually frankly good because I think it's good, it's good to get a mixture of ideas and people confused about them and some people really advocating for some and some people advocating for other and then finding combinations and that it gives diversity of thought, which I actually think helps us solve problems in better ways. So it's pretty exciting actually.
B
Yeah, absolutely. I mean, I think sometimes someone with a different perspective can break through in a problem area that in a new way that no one's ever really considered before because the people who work in that area are kind of like locked into a certain way of thinking.
C
Totally. Yeah. If you had put me on that speech to speech translation problem with mathematical logic, I would have never gotten anywhere, but I would have kept insisting I had the right way to do it.
B
Right.
C
I would have like, give me another year, give me another, another year. Absolutely. Would have never solved it where it
B
would have just keep hammering that nail. And it would eventually for software engineers that are listening and interested in starting to apply automated reasoning or learning more about it. Where do they begin? What tools? I don't know, books, practices would you kind of recommend for them?
C
The ground is moving kind of under our feet. So the advice I would have given a few years ago is actually a little bit different than what I'd give right now. We've open sourced something called Strada, which is a intermediate representation for programs that allows you to represent Python and allows you to represent Java and Rust and so on. But it unifies to the logical representation which ultimately the semantics are in something called Lean. So Lean, there's a theorem prover called Lean, there's many theorem prover, interactive theorem prover, so Isabelle Hollite, Lean, so on. But Lean is the one that many mathematicians have sort of clicked to. So that's the becomes the lingua Franco for them and because of language model training. So language model providers discovered this data and began training over it and saw the value of this data. So then they've now begun to synthesize data using Lean. So for example, Deep SEQ did that reinforcement lending together with Lean is very powerful concept. Yeah, Lean has kind of become a relatively standard tool in that space. So I would go use Lean. It's a mathematician's tool and it requires you to traditionally to hold its hand. It's not, it's automated reasoning is a sort of almost a misnomer. Right. It's like very far from automated. I mean, it automates the checking of the argument you found. Right. Whereas these tools, like a propositional satisfiability solver, like catechol, is fully, fully, fully automated, but it's only solving the combinatorial problem, whereas Lean is solving an undecided, is trying to solve an undecidable problem with your assistance. But what's really amazing these days is, you know, like the language models are really good at running these tools. So if you have Lean together with a language model now you're in business now you can do some really interesting things. And so if you want to reason about programs, one way to do that is to translate the programs from Rust into Lean via Strada, which we've open sourced and then to Reason and Lean and then one could imagine, you know, products are coming, there's startups and then aws. We're all kind of building products in this space. So there will be various products and you know, I'd love for people to use the AWS ones, but there's, there's other startups that are building stuff too. So.
B
Yeah, I mean, that's amazing. I think that's one of the things I'm most excited about in terms of like, I don't know, these days that we're living in right now is sort of the accessibility suddenly to all these different areas. You essentially have a way to get access to information that you never had access to. It reminds me of like, you know, when I was a teenager and the Internet became, you know, something suddenly did not have to go to a library to look up, you know, every information and just be able to look that up on the Internet. Even in those early days where there's a lot less websites and stuff like that was just amazing and really opened up the world for me at that time. And I think we're seeing that again.
C
Totally. Yeah. When I look at like, you know, I used to work on Volkswagen motors, like air cooled 1960s when I was in high school, that's what I did. And the things I had to do to learn how to fix problems when I didn't know what I was doing were amazing. And then when I look at my son and how he can use online videos to learn how to use equipment, music equipment, so it's incredible. And then the language models are just like another step forward. It's pretty incredible.
B
Is there anything else you'd like to share?
C
People should follow me on LinkedIn. I try and post if they're interested in stuff they're hearing here. I post, try and post just a few things, but they should be of kind of interest to people who are tracking formal methods. So. I'm Byron Cook. Yeah, I don't know. Probably not. No, I'm just keep doing it.
B
All right, yeah. Byron Cook, I'm going to follow you right now. There we go. Awesome. Well, thank you so much for being here. I really enjoyed this. This was fantastic.
C
Yeah, it was a pleasure. Thanks for your time.
B
Cheer.
A
Sam.
Date: May 19, 2026
Host: Sean Falconer
Guest: Byron Cook (VP & Distinguished Scientist at AWS, Professor at UCL, Program Manager at DARPA)
This episode explores the intersection of formal methods—a rigorous approach from mathematics and computer science for verifying software correctness—and their emergent role as essential guardrails for agentic AI systems. Byron Cook shares his extensive experience leading the Automated Reasoning Group at AWS and discusses how scaling, productivity, and accessibility of formal verification are being transformed, especially by integrating neurosymbolic approaches (combining large language models with formal logic). The conversation also dives into practical implementation challenges, future opportunities, and the societal impact of bringing these advanced techniques into mainstream software engineering.
[01:45–04:26]
Quote:
“Cognitively, they're kind of the same job... trying to move the seemingly impossible to the possible and then trying to de-risk that.”
— Byron Cook [02:14]
[04:26–08:31]
Quote:
"You can't solve the problem in general, so you're going to have to identify niche areas... get as close as you can to the person with the understanding of the niche problem."
— Byron Cook [07:51]
[08:31–13:28]
[13:28–16:43]
Quote:
“With the idea of neuro symbolic AI... suddenly you have ideas of auto formalization and a whole bunch of building blocks we can use to really scale this activity out.”
— Byron Cook [15:52]
[16:43–18:35]
[20:39–24:02]
Quote:
“There’s all kinds of tricks you can pull now... Once you’ve got that into a formula... you can walk the formula to identify interesting corner cases and articulate them back in natural language.”
— Byron Cook [22:44]
[24:02–26:42]
Quote:
“Those humans are now really excited because they're able to let the models go do the proof search... it’s not only 10x or 100x, it’s like 1000x productivity gains.”
— Byron Cook [25:43]
[26:42–30:09]
Quote:
“What you really want is to actually write down... what is availability, what is confidentiality... and then, in an agentic execution environment... go statically check that those things won’t violate the rules.”
— Byron Cook [28:12]
[30:09–34:20]
[34:20–39:29]
[39:29–41:28]
[41:28–45:10]
[45:20–47:51]
Quote:
“If you have Lean together with a language model now you’re in business–now you can do some really interesting things.”
— Byron Cook [47:17]
“Cognitively, they're kind of the same job... trying to move the seemingly impossible to the possible and then trying to de-risk that.”
— Byron Cook [02:14]
“You can't solve the problem in general, so you're going to have to identify niche areas... get as close as you can to the person with the understanding of the niche problem.”
— Byron Cook [07:51]
“With the idea of neuro symbolic AI... suddenly you have ideas of auto formalization and a whole bunch of building blocks we can use to really scale this activity out.”
— Byron Cook [15:52]
“What you really want is to actually write down... what is availability, what is confidentiality... then, in an agentic execution environment... statically check that those things won’t violate the rules.”
— Byron Cook [28:12]
“Those humans are now really excited because they're able to let the models go do the proof search... it’s not only 10x or 100x, it’s like 1000x productivity gains.”
— Byron Cook [25:43]
“If you have Lean together with a language model now you’re in business–now you can do some really interesting things.”
— Byron Cook [47:17]
Byron Cook and Sean Falconer articulate a vision in which formal methods and neurosymbolic AI offer the next leap in reliable, safe, and scalable software engineering—particularly as AI agents take on greater autonomy within critical systems. Recent advances are turning what was once niche, theoretical expertise into accessible, practical tools, empowering more engineers to bring rigorous correctness and safety to the next generation of AI-driven software.
End of Episode Summary