Leo Laporte (159:06)
You are. You're my foil, Leo. Thank you for the question. So Sean, who holds both his master's and his PhD in this area, is in an extremely good position to appreciate the advancement of AI. So he continues writing: Before I get into the technical details, the main takeaway from this post is this. With O3, LLMs have made a leap forward in their ability to reason about code. And this is what I want everybody to listen to: if you work in vulnerability research, you should start paying close attention. Once again, the guy's got his master's and his PhD in this automated vulnerability and exploit domain, and he says: If you work in vulnerability research, you should start paying close attention. If you're an expert level vulnerability researcher or exploit developer, the machines are not about to replace you. In fact, it is quite the opposite. They are now at a stage where they can make you significantly more efficient and effective. If you have a problem that can be represented in fewer than 10,000 lines of code, there is a reasonable chance O3 can either solve it or help you solve it. Okay, now the reason I wanted everyone to understand something about Sean's pedigree was so that we would understand the weight of his statement. He lives and breathes this stuff. He's been experimenting with automated vulnerability discovery for years, and he's telling us to pay attention here because something significant just happened in AI. Again, he writes: Let's first discuss 778, a vulnerability I found manually, which I was using as a benchmark for O3's capabilities when it found the 899 zero day. 778 is a use-after-free vulnerability. The issue occurs during the Kerberos authentication path when handling a session setup request from a remote client to the server. Rather than referring to CVE numbers, he says, I'll refer to this vulnerability as the Kerberos authentication vulnerability.
I'll refer to it as 778. Sean's posting then shows us about 15 lines of code, you know, specifically for this thing that he found, and he explains exactly what's going on there. It's not necessary for us to understand the details for this, but we want to understand its nature, which Sean explains by writing: This vulnerability is a nice benchmark for LLM capabilities because it is interesting by virtue of being part of the remote attack surface of the Linux kernel. Yikes. It's not trivial, and it requires (a) figuring out how to get session state equal to SMB2_SESSION_VALID in order to trigger the free; (b) realizing that there are paths in KSMBD's Kerberos 5 authenticate that do not reinitialize session user, and reasoning about how to trigger those paths; and (c) realizing that there are other parts of the code base that could potentially access session user after it's been freed. He said: While it is not trivial, it is also not insanely complicated. I could walk a colleague through the entire code path in 10 minutes, and you don't really need to understand a lot of auxiliary information about the Linux kernel, the SMB protocol, or the remainder of KSMBD outside of connection handling and session setup code. He said: I calculated how much code you would need to read, at a minimum, if you read every KSMBD function called along the path from the packet, you know, the external attack packet, arriving at the KSMBD module to the vulnerability being triggered, and it works out to about 3,300 lines of code. Okay, so we have the vulnerability we want to use for evaluation. Now what code do we show the LLM to see if it can find it? My goal here is to evaluate how O3 would perform were it the back end for a hypothetical vulnerability detection system, so we need to ensure we have clarity on how such a system would generate queries to the LLM.
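To make the bug shape Sean just described concrete, here is a minimal C sketch of his three trigger conditions, (a) through (c). This is a hypothetical model, not the real KSMBD code: the struct and function names are invented, and an ordinary flag stands in for the allocator actually freeing memory so the sketch can run safely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

enum session_state { SMB2_SESSION_IN_PROGRESS, SMB2_SESSION_VALID };

struct user { int uid; };

struct session {
    enum session_state state;
    struct user *user;     /* left dangling by the buggy path */
    bool user_freed;       /* simulation flag modeling free() */
};

/* Conditions (a) and (b): when the session is already valid, the
 * authenticate path frees the old user, and one branch returns
 * without ever assigning a fresh one. */
void krb5_authenticate_model(struct session *sess, bool reinit_path)
{
    if (sess->state == SMB2_SESSION_VALID) {
        sess->user_freed = true;   /* models freeing sess->user */
        if (!reinit_path)
            return;                /* pointer now dangles */
    }
    sess->user_freed = false;      /* models assigning a new user */
}

/* Condition (c): any later dereference of sess->user is a
 * use-after-free if the dangling branch was taken. */
bool later_access_is_uaf(const struct session *sess)
{
    return sess->user != NULL && sess->user_freed;
}
```

The point of the sketch is that nothing in the type system distinguishes the dangling pointer from a live one, which is exactly why a reader, human or LLM, has to reason across several functions to spot it.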
In other words, it's no good arbitrarily selecting functions to give to the LLM to look at if we can't clearly describe how an automated system would select those functions. The ideal use of an LLM is that we give it all the code from a repository, it ingests it and spits out results. However, due to context window limitations and the regressions in performance that occur as the amount of context increases, this isn't practically possible right now. Instead, I thought one possible way that an automated tool could generate context for the LLM was through expansion of each SMB command handler individually. So I gave the LLM the code for the session setup command handler, including the code for all functions it calls, and so on, up to a call depth of three, this being the depth required to include all the code necessary to reason about the vulnerability, he said. I also include all the code for the functions that read data off the wire, parse an incoming request, select the command handler to run, and then tear down the connection after the handler has completed. Without this, the LLM would have to guess at how various data structures were set up, and that would lead to more false positives. In the end, this comes out at about 3,300 lines of code, and, he says, around 27,000 tokens, and gives us a benchmark we can use to contrast O3 with prior models. If you're interested, the code to be analyzed is available here as a single file created with the files-to-prompt tool. Everything, by the way, that he's talking about is on GitHub for anybody who wants to play. The final decision is what prompt to use. You can find the system prompt and the other information I provided to the LLM in the .prompt files in the provided GitHub repository. The main points to note are: First, I told the LLM to look for use-after-free vulnerabilities. So Leo, essentially what you are suggesting. Second, I gave it a brief high level overview of what KSMBD is.
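The context-expansion idea Sean describes, start from a command handler and pull in everything it calls down to a fixed depth, is essentially a bounded breadth-first walk of the call graph. Here's a small illustrative sketch; the toy adjacency matrix stands in for what a real tool would build with a parser or a cscope-style index, and all names are invented.

```c
#include <string.h>
#include <assert.h>

#define MAX_FUNCS 16

struct call_graph {
    int callee[MAX_FUNCS][MAX_FUNCS]; /* callee[i][j]==1: i calls j */
    int nfuncs;
};

/* Mark every function reachable from `root` within `max_depth` call
 * hops; returns how many functions' source would go into the prompt. */
int expand_context(const struct call_graph *g, int root, int max_depth,
                   int selected[MAX_FUNCS])
{
    int frontier[MAX_FUNCS], next[MAX_FUNCS];
    int nfront = 1, count = 1;

    memset(selected, 0, sizeof(int) * MAX_FUNCS);
    frontier[0] = root;
    selected[root] = 1;

    for (int depth = 0; depth < max_depth; depth++) {
        int nnext = 0;
        for (int f = 0; f < nfront; f++)
            for (int j = 0; j < g->nfuncs; j++)
                if (g->callee[frontier[f]][j] && !selected[j]) {
                    selected[j] = 1;      /* include this function */
                    next[nnext++] = j;
                    count++;
                }
        memcpy(frontier, next, sizeof(int) * nnext);
        nfront = nnext;
    }
    return count;
}
```

With the handler as the root and depth three, everything the handler can reach in three hops gets included, which is how Sean's 3,300 lines of context came about.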
That is, its architecture and what its threat model is. And third, I tried to strongly guide it not to report false positives and to favor not reporting any bugs over reporting false positives. He said: I have no idea if this helps, but I'd like it to help. So here we are. He said: My entire system prompt is speculative in that I haven't run a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer rather than anything resembling science or engineering. Once I run those evaluations, I'll let you know. My experiment harness executes the system prompt N times, and, he said, N equals 100 for this particular experiment, and saves the results. It's worth noting, if you rerun this you may not get identical results from me, as between running the original experiment and writing this blog post I had removed the file containing the code to be analyzed and had to regenerate it. I believe it is effectively identical, but have not rerun the experiment. Okay, here are his results. O3 finds the Kerberos authentication vulnerability, that is, the thing he found manually initially in the benchmark, in 8 of the 100 runs. In another 66 of the runs, O3 concludes there's no bug present in the code, thus a false negative, and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it in 3 out of 100 runs, and Claude Sonnet 3.5 does not find it in 100 runs at all. So on this benchmark at least we have a 2x to 3x improvement in O3 over Claude Sonnet 3.7, he said. For the curious, I've uploaded a sample report from O3 and Sonnet 3.7. One aspect I found interesting is their presentation of results. With O3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought or a work log. There are pros and cons to both. O3's output is typically easier to follow due to its structure and focus.
On the other hand, sometimes it's too brief and clarity suffers. Okay, so far we have Sean using a previously known zero day to test various models' ability to independently rediscover the vulnerability that he already knows exists. And OpenAI's O3 model does this better than either Claude Sonnet 3.5 or 3.7. But even so, the O3 model only detects the vulnerability in 8 out of 100 tries. It misses it 66 times and cries wolf about the presence of non-existent vulnerabilities 28 times. So what about O3's actual true discovery of that previously unknown vulnerability? Sean writes: Having confirmed that O3 can find the 778 Kerberos authentication vulnerability when given the code for the session setup command handler, I wanted to see if it could find it if I gave it the code for all the command handlers. This is a harder problem, as the command handlers are all found in the source code file smb2pdu.c, which is around 9,000 lines of code. However, if O3 can still find vulnerabilities when given all of the handlers in one go, then it suggests we can build a more straightforward wrapper for O3 that simply hands it entire files covering a variety of functionality, rather than going handler by handler, one at a time. Combining the code for all the handlers with the connection setup and teardown code, as well as the command handler dispatch routines, ends up at about 12,000 lines of code, which is 100k input tokens. And as before, I ran the experiment 100 times. O3 finds the original 778 Kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so we see a clear drop in performance. But it does still find it. More interestingly, however, in the output from the other runs I found a report for a similar but novel vulnerability that I did not previously know about. There it is.
More interestingly, however, he said, in the output from the other 99 runs I found a report for a similar but novel vulnerability I did not previously know about. This vulnerability is also due to a free of session user, but this time in the session logoff handler. He said: I'll let O3 explain the issue. So here's O3 speaking now: While one KSMBD worker thread is still executing requests that use session user, another thread that processes an SMB2 logoff for the same session frees that structure. No synchronization protects the pointer, so the first thread dereferences freed memory, a classic use-after-free that leads to kernel memory corruption and arbitrary code execution in kernel context, which, you know, would chill the blood of any Linux kernel developer. The O3 model labels that as the short description, which it then follows with a totally useful and detailed breakdown and description of the problem that it detected. After showing us this in his posting, Sean continues writing, and here it is: Reading this report, I felt my expectations shift on how helpful AI tools are going to be in vulnerability research. If we were to never progress beyond what O3 can do right now, it would still make sense for everyone working in vulnerability research to figure out what parts of their workflow will benefit from it, and to build the tooling to wire it in. Of course, part of that wiring will be figuring out how to deal with the extreme signal to noise ratio of around 1 to 50 in this case, but that's something we are already making progress with. One other interesting point of note is that when I found the Kerberos authentication vulnerability, I proposed an initial fix. But when I read O3's bug report above, I realized this was insufficient.
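The race O3 described, a worker thread using session user while a logoff on the same session frees it, can be sketched deterministically, without real threads, by modeling the interleaving as three ordered steps. Again, this is an invented model with stand-in names, and a flag simulates the free so nothing actually touches dead memory.

```c
#include <stddef.h>
#include <assert.h>

struct user { int uid; };

struct session {
    struct user *user;
    int user_freed;   /* stands in for the allocator freeing it */
};

/* "Thread A": the worker grabs the pointer at the start of request
 * handling -- no lock taken, no reference counted. */
struct user *worker_snapshot(struct session *sess)
{
    return sess->user;
}

/* "Thread B": the logoff handler frees the user and clears the
 * session's pointer. */
void session_logoff(struct session *sess)
{
    sess->user_freed = 1;   /* models freeing sess->user */
    sess->user = NULL;
}

/* If logoff ran between snapshot and use, the worker's private copy
 * dangles even though sess->user itself is now NULL. */
int worker_use_is_uaf(const struct session *sess, const struct user *snap)
{
    return snap != NULL && sess->user_freed;
}
```

The detail worth noticing is that clearing the session's own pointer does nothing for the copy the worker already holds, which is precisely the gap Sean goes on to discuss.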
The logoff handler already sets session user to null, but it is still vulnerable, as the SMB protocol allows two different connections to bind to the same session, and there is nothing on the Kerberos authentication path to prevent another thread making use of session user in the short window after it has been freed and before it has been set to null. I had already made use of this property to hit a prior vulnerability in KSMBD, but I didn't think of it when considering the Kerberos authentication vulnerability. So he actually got a hint from the way O3 analyzed the other problem, he said. Having realized this, I went again through O3's results from searching for the Kerberos authentication vulnerability and noticed that in some of its reports it had made the same error as me. In others it had not, and it had realized, and again, I hate that word, but okay, that setting session user to null was insufficient to fix the issue due to the possibilities offered by session binding. That is quite cool, as it means that had I used O3 to find and fix the original vulnerability, I would have in theory done a better job than without it. I say in theory because right now the false positive to true positive ratio is probably too high to say definitively that I would have gone through each report from O3 with the diligence required to spot its solution. Still, he says, that ratio is only going to get better with time. Sean then finishes by offering up his conclusions, writing: LLMs exist at a point in the capability space of program analysis techniques that is far closer to humans than anything else we have seen. Speaking of OpenAI's O3, he said: Considering the attributes of creativity, flexibility, and generality, LLMs are far more similar to a human code auditor than they are to symbolic execution, abstract interpretation, or fuzzing.
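One common way to close the kind of window Sean describes, where nulling the pointer after the free is too late for anyone who already read it, is to reference-count the object so each concurrent user only ever drops its own reference, and the memory is released when the last reference goes away. This is an illustrative sketch of that general pattern, not the actual KSMBD patch; in the kernel this role is played by atomic helpers like kref.

```c
#include <stdlib.h>
#include <assert.h>

struct user {
    int refcount;   /* would be an atomic/kref in real kernel code */
    int uid;
};

/* Take a reference before using the pointer. */
struct user *user_get(struct user *u)
{
    if (u)
        u->refcount++;
    return u;
}

/* Drop a reference; returns 1 only when this put freed the object. */
int user_put(struct user *u)
{
    if (u && --u->refcount == 0) {
        free(u);
        return 1;
    }
    return 0;
}
```

With this shape, a logoff racing against a worker can only drop the session's reference; the worker's reference keeps the object alive until the worker is done, so the "freed but not yet nulled" window never exists.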
Ever since GPT4, there have been hints of the potential for LLMs in vulnerability research, but the results on real problems have never quite lived up to the hope or the hype. That has changed with O3, and we have a model that can do well enough at code reasoning, Q&A, programming, and problem solving that it can genuinely enhance human performance at vulnerability research. O3 is not infallible. Far from it. There's still a substantial chance it will generate nonsensical results and frustrate you. What is different is that for the first time, the chance of getting correct results is sufficiently high that it is worth your time and your effort to try to use it on real problems. So I have a link at the end of the show notes for anyone who wishes to see all of Sean's posting and even to replicate and duplicate his work. He's provided everything required to do that. As Sean observed, GPT4 was an ineffectual tease for this level of, dare I say, code comprehension. But his experiments showed that O3 has come a long way from GPT4. Imagine what we'll have in another couple of years. Some slowing of progress was inevitable, but there's no doubt that significant advancements are still being made. And I will assert again that it only makes sense that AI ought to be eventually able to do a perfect job at pre-release code verification. Once we're able to release vulnerability-free code, it won't matter whether the bad guys also had the ability to use AI for vulnerability discovery, because there won't be any vulnerabilities left for them to discover and exploit. You know, we're not there yet. But as the Magic 8 Ball said, signs point to yes.