Podcast Summary: How I AI

Episode Title: How Claude Mythos found a 15-year-old bug in Mozilla Firefox | Brian Grinstead
Host: Claire Vo
Guest: Brian Grinstead, Distinguished Engineer at Mozilla
Date: June 22, 2026

Episode Overview

In this engaging episode, Claire Vo interviews Brian Grinstead, a key engineering leader at Mozilla Firefox, about how his team harnessed AI agents—specifically Anthropic's new Mythos model—to not only uncover but help fix nearly 500 security bugs in Firefox, including issues that had gone undetected for over 15 years. The conversation offers a mix of technical insights, practical workflows, and real-world results, aiming to demystify how AI-powered agentic workflows are transforming large-scale software maintenance and security.

Key Discussion Points and Insights

1. The Scale and Challenge of Firefox Security

Scope of Firefox: Firefox is an immensely complex browser, with tens of millions of lines of code and responsibility to support web applications built across decades ([03:03]).
- “We have to render the entire web when the site was built in 2000 or just pushed up yesterday.” — Brian ([03:06])
Security Complexity: Prior to their most recent breakthroughs, bug reports—especially those powered by earlier AI—were often noisy, unactionable, and burdensome for maintainers ([04:28]).
- "It's cheap to just paste in some C code into a chatbot and get back something that's wrong. But there was no way to actually verify whether that was true. And that all changed..." — Brian ([04:37])

2. From Basic Chatbots to Powerful Harnesses

Defining the "Harness": The harness is the custom system or workflow that allows an LLM to interact with code, run scripts, test hypotheses, and verify results. It's not just an LLM chat—it's an AI plugged into real tools and infrastructure ([05:58]).
- “The harness is a way to give an LLM tools to achieve some goal... you just need to give it access to the right tools for the job.” — Brian ([05:58])
Key Tools Used:
- Bash/file access
- Browser testing
- Fuzzing infrastructure
- Analysis & verification subagents
- Patch generation and validation
  (All orchestrated via a fairly simple cloud orchestration system)
Simplicity is Key:
- Despite diagrams and academic flourishes, the essential flow is straightforward; it's analogous to human bug-hunting, just automated and tireless ([13:00]).
- “It goes through a verifier subagent... It has access to... probably like a dozen tools... then it goes into a very classic, like, bug fix pipeline.” — Claire ([13:00])
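
The idea of a harness as "an LLM plus tools" can be sketched in a few lines of Python. This is a minimal illustration and not Mozilla's implementation: `Harness`, `ToolCall`, `scripted_model`, and the file path are hypothetical names, and the scripted model stands in for a real LLM that would choose tool calls itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ToolCall:
    """A request from the model to run one named tool (hypothetical shape)."""
    name: str
    argument: str


# A "model" here is any callable that maps the latest observation to the
# next tool call, or None when it decides it is done.
Model = Callable[[str], Optional[ToolCall]]


@dataclass
class Harness:
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn

    def run(self, model: Model, goal: str, max_steps: int = 20) -> str:
        observation = goal
        for _ in range(max_steps):
            call = model(observation)
            if call is None:  # the model has nothing more to try
                break
            tool = self.tools.get(call.name)
            if tool is None:
                observation = f"error: unknown tool {call.name!r}"
            else:
                observation = tool(call.argument)
            # Keep a trace so humans (or another LLM) can review the run later.
            self.trace.append(f"{call.name}({call.argument!r}) -> {observation}")
        return observation


def scripted_model(calls: List[ToolCall]) -> Model:
    """Stand-in for an LLM: replays a fixed list of tool calls."""
    remaining = iter(calls)
    return lambda _observation: next(remaining, None)


if __name__ == "__main__":
    harness = Harness()
    harness.register("read_file", lambda path: f"<contents of {path}>")
    harness.register("run_test", lambda case: "no crash")
    model = scripted_model([
        ToolCall("read_file", "dom/Example.cpp"),  # hypothetical path
        ToolCall("run_test", "testcase-1.html"),
    ])
    print(harness.run(model, "find a bug in dom/Example.cpp"))
```

Keeping the trace as plain strings is a deliberate simplification; it mirrors the point made in the episode that agent logs are later reviewed to tune prompts.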

3. Harness in Action: Finding and Fixing Bugs

Relentless Looping and Verification:
- Agents are instructed to iterate on possible exploits or bugs within a narrowly defined scope until results are achieved or exhausted ([07:13], [10:23]).
- "The ability to take an agent and give it a very constrained problem surface area and say exhaust every attempt at this is really powerful." — Claire ([10:28])
Guardrails are Critical:
- Verification subagents and human input loop back to prevent the AI from introducing or fabricating vulnerabilities ([12:13]).
- "You need to give it some grounding to say don't go off the rails in this particular way, because it absolutely will." — Brian ([12:13])
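
The loop-then-verify pattern described here can be sketched as follows. This is an assumption-laden illustration, not the Firefox pipeline: `Attempt`, `verifier_rejects`, and `hunt` are hypothetical names, the crash oracle is passed in as a plain function, and the two rejection rules simply encode the two failure modes Brian mentions (testing-only prefs and agents editing the code under test).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Attempt:
    testcase: str                      # e.g. an HTML page the agent wrote
    sets_test_only_pref: bool = False  # relies on a pref no user would set
    modified_source: bool = False      # agent edited the code under test


def verifier_rejects(attempt: Attempt) -> Optional[str]:
    """Return a rejection reason, or None if the finding looks legitimate."""
    if attempt.modified_source:
        return "agent introduced the vulnerability itself"
    if attempt.sets_test_only_pref:
        return "depends on a testing-only preference"
    return None


def hunt(
    attempts: Iterable[Attempt],
    crashes: Callable[[str], bool],
    max_attempts: int = 50,
) -> Tuple[Optional[Attempt], List[str]]:
    """Try attempts until one both crashes and passes verification."""
    feedback: List[str] = []
    for number, attempt in enumerate(attempts, start=1):
        if number > max_attempts:
            feedback.append("budget exhausted")
            break
        if not crashes(attempt.testcase):
            feedback.append(f"attempt {number}: no crash")
            continue
        reason = verifier_rejects(attempt)
        if reason is not None:
            feedback.append(f"attempt {number}: rejected, {reason}")
            continue
        return attempt, feedback
    return None, feedback


if __name__ == "__main__":
    tries = [
        Attempt("a.html"),
        Attempt("b.html", modified_source=True),
        Attempt("c.html"),
    ]
    found, notes = hunt(tries, crashes=lambda t: t in {"b.html", "c.html"})
    print(found, notes)
```

The feedback list is what would be handed back to the analyzer so that the next attempt does not repeat a rejected trick.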
Real Results:
- The system uncovered ancient bugs, such as a 15-year-old XSLT bug, by using the agent like a code archaeologist—tracing file renames and code movements over decades ([19:00]).
- "I asked Claude Code, go figure out, like, semantically when this bug was introduced. And I was, like, watching it do git commands I didn't even know existed." — Brian ([19:07])
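
The episode does not say which git commands the agent ran. One standard technique for this kind of archaeology is git's "pickaxe" search (`git log -S`), which lists commits that changed how often a string occurs; run without a path filter, it is not thrown off by file renames or code moving between files. The sketch below is an illustration under that assumption, and `pickaxe_command` and `first_appearance` are hypothetical helper names.

```python
import subprocess
from typing import List


def pickaxe_command(snippet: str) -> List[str]:
    """Build a `git log -S` command listing, oldest first, every commit that
    changed how many times `snippet` occurs anywhere in the repository."""
    return [
        "git", "log",
        "--reverse",            # oldest commit first
        "--date=short",
        "--format=%h %ad %s",   # abbreviated hash, author date, subject
        f"-S{snippet}",         # pickaxe: occurrence count changed
    ]


def first_appearance(repo_dir: str, snippet: str) -> str:
    """Return the oldest matching commit line, or '' if there is none."""
    result = subprocess.run(
        pickaxe_command(snippet),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    # Print the command for a made-up snippet rather than running it.
    print(" ".join(pickaxe_command("ReleaseNode(aNode)")))
```

A literal-string search only finds the commit where that exact text first appeared; if the code was reformatted or rewritten along the way, a person or agent still has to follow the trail further back, which is the tedious part the quote describes.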

4. Workflow & Replication: Building Your Own Harness

Not Model-Specific:
- Firefox’s harness started with the Claude Agent SDK, but is expanding to support others like OpenAI’s SDK and third-party model-agnostic frameworks ([14:58]).
- "My intuition... the vendor provided harnesses as the underlying infrastructure is probably the best way to go... But you also want to make sure that you're running against a variety of... models, harness techniques and prompting.” — Brian ([14:58])
Demystifying Complexity:
- Even V1 can be built in an hour by just wrapping Claude Code with a prompt; V2 involves using agent SDKs with basic tools and verification ([35:43]).
- There’s heavy reuse of existing infrastructure (e.g., bug pipelines, fuzzing, artifact storage).
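
A "V1" of this kind mostly comes down to a targeted prompt plus a structured report that later pipeline stages can consume. The sketch below shows one way that might look. The prompt wording paraphrases the approach described in the episode, while the JSON keys, `Finding`, `build_prompt`, and `parse_finding` are assumptions made for illustration; sending the prompt to an actual agent CLI or SDK is left out on purpose.

```python
import json
from dataclasses import dataclass

PROMPT_TEMPLATE = (
    "You are working in a checkout of the project at {checkout}.\n"
    "Target file: {target}\n"
    "There is a security bug in this file. Find a way to reach it from web "
    "content and produce a test case that triggers it.\n"
    "Reply with a single JSON object with the keys: "
    "target, bug_class, testcase, explanation."
)

REQUIRED_KEYS = ("target", "bug_class", "testcase", "explanation")


@dataclass(frozen=True)
class Finding:
    target: str
    bug_class: str
    testcase: str
    explanation: str


def build_prompt(checkout: str, target: str) -> str:
    """Fill in the per-file prompt handed to the agent."""
    return PROMPT_TEMPLATE.format(checkout=checkout, target=target)


def parse_finding(raw: str) -> Finding:
    """Validate the agent's JSON report before passing it downstream."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"report is missing keys: {', '.join(missing)}")
    return Finding(**{key: str(data[key]) for key in REQUIRED_KEYS})


if __name__ == "__main__":
    print(build_prompt("/src/project", "dom/Example.cpp"))  # hypothetical paths
```

Rejecting malformed reports at this boundary keeps unverifiable free-text output from reaching the bug tracker.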

5. Prioritization: Where and How to Point the AI

Scoring for Impact:
- Simple LLM-based scoring ranks files by likelihood of containing security vulnerabilities and accessibility from web content ([36:18]).
- "You could build a very similar heuristic scoring mechanism where you say, go take all my components... and give me a prioritized list of components to improve..." — Claire ([38:01])
Manual & Heuristic Feedback Loops:
- Frequent analysis of agent logs (with LLMs and humans) helps tune prompts and exclude fruitless patterns ([13:45], [12:13], [41:03]).
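
The scoring step can be as simple as ranking files on two estimates. The sketch below assumes each file already has an LLM-produced likelihood and reachability score between 0 and 1. Multiplying the two is an assumption made for illustration (the episode does not describe the exact formula), and `FileScore`, `prioritize`, and the paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileScore:
    path: str
    vuln_likelihood: float    # 0.0-1.0, e.g. an LLM's estimate
    web_reachability: float   # 0.0-1.0, how directly web content reaches it

    @property
    def priority(self) -> float:
        # Assumption: multiply, so a file must score on both axes to rank high.
        return self.vuln_likelihood * self.web_reachability


def prioritize(scores: Iterable[FileScore], top_n: int) -> List[FileScore]:
    """Return the top_n files, highest priority first (ties broken by path)."""
    ranked = sorted(scores, key=lambda s: (-s.priority, s.path))
    return ranked[:top_n]


if __name__ == "__main__":
    candidates = [
        FileScore("tools/build_helper.cpp", 0.9, 0.1),  # hypothetical paths
        FileScore("dom/Example.cpp", 0.6, 0.8),
        FileScore("docs/readme_gen.cpp", 0.2, 0.2),
    ]
    for score in prioritize(candidates, top_n=2):
        print(f"{score.priority:.2f} {score.path}")
```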

6. Human in the Loop: Collaboration and Review

Patch Quality:
- AI often proposes hyper-specific patches; expert human engineers generalize and ensure architectural cleanliness ([28:56]).
- "You wouldn't expect the complexity of [open source] to change just by nature of you introducing agents.” — Claire ([30:55], [31:39])
Mobilizing the Team:
- The scale of actionable reports enabled rapid cross-functional mobilization; over 100 engineers contributed fixes during the incident ([41:03]).
- "It did require some reprioritizing. Everybody was very tired as we've sort of gone through this, but also really motivated and mobilized, in particular because you were getting these very actionable reports." — Brian ([41:39])

Notable Quotes & Memorable Moments

On Agent Exhaustiveness:
- "The ability to take an agent and give it a very constrained problem surface area and say exhaust every attempt at this is really powerful... our cognitive energy declines over time in a way that agents don't."
  — Claire ([10:28])
On Code Archaeology:
- "I asked Claude Code, go figure out, like, semantically when this bug was introduced. And I was, like, watching it do git commands I didn't even know existed."
  — Brian ([19:07])
On Guardrails and Verification:
- "We would tweak the prompts on the analyzer agent... and that would come both from our own analysis of the agent trace, but also feedback from the engineering teams."
  — Brian ([12:13])
On Human Value:
- "Humans are not out of the loop. Humans' lives are just a lot better. Even though the volume was very high, the quality was also higher and the actionability was higher."
  — Claire ([41:39])
On the Model vs. Harness Debate:
- "Of course, of course it's both [the model and the harness]... I think there's so much to still innovate on the harness side and the pipeline side."
  — Brian ([42:53])
On Security Doom or Hope:
- "I'm cautiously optimistic... our goal is not to have a bunch of bugs that are hard to find. Our goal is to have zero bugs. And so I think that these tools... actually get us closer to that world."
  — Brian ([43:59])

Important Timestamps

Scope of Firefox & Early AI Struggles: [03:03] – [05:32]
What is a Harness: [05:58] – [06:59]
Harness Workflow & Tools: [07:13] – [13:45]
Model/Harness Tech Details & Agnostic Build: [14:58] – [17:13]
Breakthrough Bug Discovery Stories: [18:49] – [23:01]
Open Source & Human Review Loop: [25:13] – [31:39]
Real-World Workflows (VS Code Demo): [32:39] – [35:43]
Prioritization Scoring for Huge Codebases: [36:18] – [41:03]
Mobilizing the Organization: [41:03]
Model vs. Harness Influence: [42:53]
Optimism vs. Doom in AI Security: [43:59]

Techniques & Workflows You Can Apply

Build a Simple Harness: Start with an agent SDK (e.g., Claude or OpenAI) and basic tool integration for your codebase.
Introduce Verification Loops: Add subagents to check for false positives and reduce noise in results.
Use LLMs for Prioritization: Score files or components by risk/impact before unleashing agents.
Leverage Existing Infrastructure: Plug your harness into current bug pipelines, fuzzing, and artifact tracking.
Continuously Tune with Feedback: Optimize your prompts, workflows, and verification steps based on logs and team input.

Final Reflections and Takeaways

AI coding agents can tirelessly and systematically uncover deep, hidden issues in complex codebases when guided with thoughtfully constructed harnesses and verification mechanisms.
The interplay between agentic automation and human expertise yields significantly better outcomes—volume and quality both rise.
The harness/agent infrastructure is not as daunting as it appears; with minimal tools and orchestration, even small teams can dramatically improve software reliability.
Open source teams, in particular, need to understand and adapt these workflows due to scale, diversity, and supply chain risk.
Defining clear success/failure cases is critical for any domain applying agentic AI—not just security.
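
For Firefox, the clear signal comes from a fuzzing build that uses AddressSanitizer. A minimal sketch of turning such a log into a yes-or-no answer is shown below. It relies only on the standard first line of an AddressSanitizer report (`==PID==ERROR: AddressSanitizer: <bug class> ...`); `asan_bug_class` and `verdict` are hypothetical helper names, and the sample log line is made up for illustration.

```python
import re
from typing import Optional

# AddressSanitizer reports begin with a line such as:
#   ==12345==ERROR: AddressSanitizer: heap-use-after-free on address ...
ASAN_ERROR = re.compile(r"==\d+==ERROR: AddressSanitizer: ([A-Za-z0-9_-]+)")


def asan_bug_class(log: str) -> Optional[str]:
    """Return the reported bug class, or None if the log has no ASan error."""
    match = ASAN_ERROR.search(log)
    return match.group(1) if match else None


def verdict(log: str) -> str:
    """Collapse a test run's log into an unambiguous signal."""
    bug = asan_bug_class(log)
    return f"crash: {bug}" if bug else "no crash"


if __name__ == "__main__":
    sample = (  # illustrative log line, not from a real run
        "==4242==ERROR: AddressSanitizer: heap-use-after-free "
        "on address 0x602000000010 at pc 0x000000400b2c\n"
    )
    print(verdict(sample))
    print(verdict("all tests passed"))
```

Projects without a sanitizer-style oracle need an equivalent check of their own, such as a failing test or a violated invariant, before this kind of loop produces trustworthy reports.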

Further Information

Brian's Advice:
- Get started with a basic agentic harness—you likely already have the pieces in your DevOps pipeline.
Try Firefox:
- "We're doing really great on fundamentals, you know, things that people care about, performance. We're talking about security today, doing a lot of new feature work and in particular for this audience, I think have sort of an independent mindset around AI..." — Brian ([46:35])
Open-source tooling and workflows referenced are being published for public experimentation.

For more episodes and resources, visit: howiaipod.com

Transcript

[00:00]
A
Firefox has tens of thousands of source code files and tens of millions of lines of code. It's not possible to say one shot, go find all the potential bugs in this project. It's way too much context for the model.
[00:12]
B
I think people really underappreciate the relentless tedium that an agent will go through.
[00:17]
A
Anybody who's done this kind of what I call archaeology, it's really hard to do. And this is something that the coding agents are great at. I asked Claude Code, go figure out semantically when this bug was introduced. I was, like, watching it do git commands I didn't even know existed.
[00:31]
B
And the ability to take an agent and give it a very constrained problem surface area and say, exhaust every attempt at this, is really powerful. Again, not because human intelligence couldn't identify similar issues, but actually our, like, cognitive energy declines over time in a way that agents don't.
[00:52]
A
Our goal is not to have a bunch of bugs that are hard to find. Our goal is to have zero bugs. And so I think that these tools, as us and other defenders are starting to apply them, actually get us closer to that world.
[01:06]
B
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today I have Brian Grinstead, distinguished engineer at Mozilla Firefox, who's going to take us behind the scenes with their experience with Anthropic's new but not yet fully released model, Mythos, and how they solved almost 500 security bugs by rolling their own harness, which isn't as complicated as you think. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch. These tools only work well when they have deep access to company systems. Your copilot needs to see your entire codebase. Your chatbot needs to search across internal docs. And for enterprise buyers, that raises serious security concerns. That's why these apps face intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch? It's a massive lift. That's where WorkOS comes in. WorkOS gets you drop-in APIs for enterprise features so your app can become enterprise ready and scale upmarket faster. Think of it like Stripe for enterprise features. OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at workos.com. Start building today. Brian, welcome to How I AI. I'm really excited about this episode because I think you have one of the most impactful stories in AI engineering right now. So tell me first, let's take a step back for people. What is the scope of the product you're working on and how challenging is what you pulled off?
[03:03]
A
Yeah, thanks for having me. So I work on Firefox, which is a production web browser. It's very large and very complex. We have to render the entire web when the site was built in 2000 or just pushed up yesterday. And so there's all sorts of performance, security and complexity that we deal with. And so we recently in the last few months have been using new agentic scanning techniques to find a deluge of security bugs inside of the product, as have many teams as we've started to get better harnesses, models and techniques for actually getting those bugs fixed.
[03:44]
B
Yeah, and what caused me to reach out and ask you to be on the podcast was this chart. Everybody saw this, maybe on X around the timeline, where the Firefox security bug fixes by month just spiked in multiples in April. And the headline of a lot of posts around this was that Mythos, the definitely real model that other people have access to, just not us normies, has unlocked an incredible amount of discovery and fixes on edge case or complex security bugs. But I think the story behind the story is actually a little bit different. Can you tell us about kind of how you got here and how much of this was model and how much of this was work you did internally?
[04:28]
A
Yeah, that's right. So I think, like a lot of open source projects, through 2025 we had been dealing with almost like unwanted AI bug reports. And you know the shape, you've seen it: you get a doc and you can just tell it's from an AI. It looks very nice and professional, but you would get halfway through and the engineer is looking at this and saying, that's wrong. Right. And so there are sort of asymmetric costs on project maintainers to receive a thing. It's cheap to just paste in some C code into a chatbot and get back something that's wrong. But there was no way to actually verify whether that was true. And that all changed, I would say, in February of 2026. And I think a big part of that story is the harnesses themselves just getting better. Definitely improving and upgrading the model helps in a couple ways: it has better hypotheses about where to start with the bug, and it does a better job of making test cases. But I think the harness itself and plugging it into a pipeline of getting the bugs fixed is actually a little bit the story behind the story.
[05:32]
B
Yep. So for folks that have heard this word over and over again, I know the engineers that are watching probably know what you're talking about, but just define for us or show us what you mean by harness and how you built something custom to your product, your team that unlocks your ability to get actual throughput on security bug fixes, not just a bunch of unactionable kind of slop reports or things you couldn't validate.
[05:59]
A
The harness is a way to give an LLM tools to achieve some goal. And so if you think back to chatbots before they had any custom tools or anything like this, it's almost a brain in a jar. You know, you're just chatting with it and it's chatting back. There's almost no harness around that. These days, the chatbots are sort of blurring in a little bit with the coding CLIs in terms of what they can do. But this is giving access to tools like running bash scripts, opening a browser, measuring whether you were able to create a security issue, and so on. And so this is the actual mechanism that we use and have customized to find issues in Firefox. This is largely, like, you could almost imagine just Claude Code is a harness. Right. And so you build custom prompting and some custom orchestration around it to plug in with your particular systems, but it's actually a reasonably simple wrapper around it. You just need to give it access to the right tools for the job.
[06:59]
B
So show us what tools you decided were necessary in your harness and walk us through where you felt there was real leverage in building something custom versus using off-the-shelf Claude Code or off-the-shelf Codex.
[07:14]
A
So here's an example that's sort of a flowchart, a little bit of how our custom harness works. And I'll say up front, like, often when I see these flowcharts, you see them in academic papers too, it makes it look really complicated. There's all of these boxes and arrows, and I think really it's simpler than it looks. I would say Firefox has tens of thousands of source code files and tens of millions of lines of code. And so it's not possible to say one shot, go find all the potential bugs in this project. It's way too much context for the model. And so we have to do some initial sort of scoring to indicate which files we actually want to point this thing at. We can talk more about that later. But it eventually kind of comes up with some prioritized list of which files to target, or even functions in certain cases. And so that goes into what we sort of call a main agentic loop. But you can almost think about this as like a Claude Code session or a Codex session, where you have some custom prompt that says, here's a checkout of Firefox. Here's some tools to kind of look around the codebase. Here's your target file. Within that file, we kind of lie and we say, we know there's a security bug in this file. You have to go find it, basically. And it will just start working its way from the code to reason about, how do I get into this code from a web page? So it's some evil web page. How could it actually call this line of code? And it's interesting to watch. It'll kind of think and move its way around, but ultimately it will come up with HTML test cases, basically. And we plug this in to our existing tooling infrastructure that we've had for decades to do fuzzing, for example, where you can pass a test case and get back a report as to whether there's a potential memory safety issue. It will then get feedback from that tool on whether it succeeded or not. And if it didn't, it goes back and starts again. And it can start many, many times and for a very long time.
Sometimes it will end up and say, nope, couldn't find anything. Other times it will say that it found something, and we ask for it to come out in a very structured format so that we can pass it on to the next phase, which is verification. And so we've already kind of verified that there is a crash because we got the signal out of our fuzzing build. But sometimes the agents do just wonky things. For example, it might set a pref that was only ever meant for testing and that no user ever sets. Or I've even seen cases where the agent changes the code to introduce a vulnerability so that it can exploit it and achieve its goal. And so we have another agent that's kind of looking at it and saying, like, does this look right? That usually approves it. Sometimes it does reject it, and it kind of sends it back to do more work. But by the time that this happens, we have almost no false positives on the system, which is fixing that kind of slop problem that we talked about at the start. And it's very well prepared to go into the rest of our bug pipeline. As we continued to work, we added a patching agent, which is meant to kind of generate a plausible fix and verify that that fix has resolved the security issue. And all of that just gets written into a pretty simple cloud orchestration system that writes it out to a storage bucket for consuming later in the rest of our pipeline.
[10:23]
B
And so I just want to take a step back because I very recently did an episode on Goal, these sort of, like, goal and outcome loops that you can put in harnesses. And I did an example using Codex, but they're available all over the place, you know, Ralph loops, all this sort of thing. And I think people really underappreciate the relentless tedium that an agent will go through. Right. And the ability to take an agent and give it a very constrained problem surface area and say, exhaust every attempt at this, is really powerful. Again, not because human intelligence couldn't identify similar issues, but actually our, like, cognitive energy declines over time in a way that agents don't. Right. If you asked a human to, say, try 150 different things against this and look at it every single time and make an evaluation, it would be both very time inefficient and exhausting. And so I just love these ideas of these, like, relentless loops on agents. The other thing that I want to call out, and again I did this recent episode on Goal, is that putting a guardrail, whether that's a verifier subagent or a constraint, on these goal-style loops is really important because of exactly what you said. I gave this example of, let's say you're trying to reduce P95 latency on a page. Well, you could remove every latency-introducing feature from that page. You could actually, like, take it away and the agent would be like, look, I made the goal, it's much faster now. But you don't have that guardrail of, like, nothing from a product perspective can change. You can introduce new code. And so I was curious, as you went through this verifier subagent loop, did you then feed that back into the prompting of the analyzer agent to guardrail against common patterns that you saw?
[12:13]
A
100%. I think that you need to give it some grounding to say don't go off the rails in this particular way, because it absolutely will. And I think we have done that, I would say, from analyzing the logs, both sort of manually, as I or our team would see them go by, but also using LLMs to analyze the logs and say, what are some common patterns and problems that we're seeing come out of this thing? And then we would tweak the prompts on the analyzer agent sort of after the fact to improve that. And that would come both from our own analysis of the agent trace, but also feedback from the engineering teams. And they would say, oh, this was a terrible bug. I didn't like this. We were in a really tight loop with the team that's working on the product to make sure that all the threat model and the way that we were giving this back to the team was useful.
[13:01]
B
Yeah. And then I want to call out for folks, because again, while there's a bunch of boxes on the screen, it's actually not that complicated. It's an analyzer loop. It goes through a verifier subagent. It has access to, it looks like eight, you know, probably like a dozen tools. Right. Key tools that are really important: file search, how to build the package, bug tools. And then it goes into a very classic, like, bug fix pipeline, which is, you know, generate the fix and then put it through a verification pipeline and ship it. So this actually isn't terribly complicated, and this is probably how your human system would work in an ideal world. You've just been able to encode it in this harness.
[13:46]
A
Yeah. We're kind of fortunate because we've been running you know, a browser for a long time. We have a bug bounty system. We're used to receiving external reports. We have an internal fuzzing team who's always finding bugs either from manual inspection or other automated tools. And so that's another thing that has really helped is if you have the existing pipeline in place and you're letting the agent plug in almost as if it was a person doing it. You're not inventing many things at once.
[14:13]
B
Yeah.
[14:13]
A
And it can focus on the one thing that you've told it to do and to do relentlessly.
[14:17]
B
Yeah. And I tell this to folks all the time: it's like the revenge of the devex team, which is, teams that have already invested in developer tooling and automations are just so much further ahead because all those tools can be leveraged at much higher velocity by these agents. And so I'm going company to company and telling people, like, please, please, if you haven't already, now is the time to invest in developer tooling because what's good for the agents is very good for humans as well, and vice versa. I'm curious, is this loop, again, like, is it model agnostic? Did you all use a specific SDK? Did you, like, kind of artisanally craft the whole thing? How did you actually build this? People are always curious.
[14:59]
A
It's a very open space and there's a lot of options. So it's a good question. And I think it's also moving very quickly. So the initial version used the Claude Agent SDK, which is essentially a wrapper of the Claude Code CLI, where it runs it in a special mode where it's streaming out JSON, but it gives you nice programmatic hooks for, like, a Python or TypeScript project. We have been exploring the best option for adding Codex support, and you can do that in a couple different ways. One is you could have the Codex CLI, you could have the OpenAI Agents SDK, or you can move to, like, a third-party harness that's meant to be model agnostic. And my intuition on all of this, based on some initial testing, is that the vendor-provided harnesses as the underlying infrastructure is probably the best way to go. And they're probably doing post-training and other things using those harnesses to make their models work best in them. But you also want to make sure that you're running against a variety of models, harness techniques and prompting because, as defenders, you need to be sort of scanning the landscape, because any one attacker might be trying to do something weird or with a different model, and it's going to actually just find a different bug.
[16:12]
B
Yep, yep, great. So that's helpful. So, yes, again, repeating for people, this was my intuition of what you were going to say, which is the Claude Agent SDK or the OpenAI Agents SDK. There are some third-party frameworks or harnesses you can use. Pi is one I hear people are loving a bunch right now. But I think your intuition matches my intuition, which is, because these model provider harnesses are so tuned to their particular models, you actually have to run both, especially in a security environment, because that is exactly what your attackers are going to be around. And they do spike on strengths, both from a model perspective and a harness perspective, on different things, and will very likely identify and fix different things. So, okay, you've sold me. We should all build our custom harness, you know, not just for security issues. Again, this is, like, a very particular use case, but there are all sorts of use cases where a custom harness would be very effective inside, in particular, large codebases. Let's talk about how one of these actually runs.
[17:14]
A
Ultimately, the sort of infrastructure to plug all this stuff together is very shared across many needs, from, like, triage to bug detection to bug fixing. You're sort of standing this thing up in its own environment. You're giving it some goal, and it needs to be some grounded goal. It is going and running, and then it's giving you some artifacts to plug in further down your pipeline, whether that's an issue tracker or a pull request and so on. And so it's interesting how much overlap I think there is with this. And I've seen, you know, projects that are designed more for bug fixing that look very similar to this. So this is your standard kind of vibe-coded dashboard here that shows basically a bunch of runs. And note, this is mostly fabricated data. This isn't the real, you know, Firefox runs. But what we have done here is, we send them off in batches, so sort of sets of files that are related or that we want to do some evaluations on and so on. And so what I have done here is we're actually taking 10 real bugs, and these are bugs that we opened earlier than we normally would have, security bugs that had recently shipped. And we did that, like, for exactly the thing that we're doing now, which is to sort of help make people aware of how this works, how you can apply it as defenders, and sort of help understand that this is real. So what we've done is we've taken those exact bugs and sort of pulled the actual traces for them to dig into here on the show. So a couple that I think were pretty interesting. We start with this legend element. So this is an HTML element that you can use for organizing forms.
[18:50]
B
And I have to go back really quickly to the blog post because this one caught my eye. Is this the one that was, like, 15 or 20 years old?
[19:00]
A
Yes, I think that was, that was an XSLT bug.
[19:04]
B
And we found that number two, though, that's a 15-year-old bug.
[19:07]
A
Yes, exactly. So if you notice, our bug IDs for these new bugs are 2,025,977. So there's many, many bugs in Bugzilla. If you find a bug that is in the six digits, it's a very old bug. And so it was kind of funny for this exact XSLT one. I wanted to say, like, when was this bug introduced? And anybody who's done this kind of what I call archaeology, it's really hard to do. And this is something that the coding agents are great at. So I would say, when was this bug introduced? Well, the file got renamed three years ago and so you can't just do a git diff. And then actually this blob moved to that file. It's very annoying work. And I asked Claude Code, go figure out, like, semantically when this bug was introduced. And I was, like, watching it do git commands I didn't even know existed, kind of taking notes as it was doing it, what it was doing. So really interesting. That's how we had gotten that 20-year-old number. And so a lot of these have been around a long time. We have a bug bounty program that would pay people to find these bugs, and they're very hard to discover. And that's part of what makes it so notable. And so if we look at, like, this legend one, the tool that it uses to do browser evaluation, it tried 14 times. And so kind of logistically what this looks like is it says, okay, I'm looking at this element, but because WebIDL is, like, a description, I need to go find the C implementation. And it just works through like you would see Claude Code or Codex do. And it would come up with some theory, it would look at some function and say, huh, I think you've told me that there's a bug in this. Similar to what you said, come up with a hundred variations of it. Maybe it's this problem or that problem, and it will try it and it will keep trying it. And it tries 14 times, or 13 times, and it fails. And then finally the 14th time it hits and it found it.
And the great thing is not only does it come up with this sort of analysis, which I would love to go spend a couple hours on each of these and do a deep dive on how exactly this works and why it matters. But this is like the shape of the reports that I was complaining about us getting in 2025. The thing that makes this different is that we have this. And so this is, like, a really kind of complicated HTML page. This is what browsers have to deal with, people making pages like this. And they're, like, creating the element, they're setting what's called an expando property on the DOM node, which is like an attribute but not an attribute. It removes the element, it does some cycle collection, blah, blah, blah, and at the end it creates a heap use-after-free, which is exactly the sort of shape of a bug report that we send on to our engineering team.
[21:48]
B
I want to go back to the very beginning when we were reflecting on the complexity of this product. And I was just thinking, we've decided to rewatch Silicon Valley in my house. So I'm watching these and Gilfoyle had an HTML 5.5 shirt on. And I was just reflecting on what you said. Like, you have to render the web, whether it was written 25 years ago or yesterday by an agent. And so the breadth of vulnerabilities that can be introduced in an HTML page in JavaScript seems almost insurmountable. And so I want to reflect for people who are maybe not watching: you have this agent, this harness, that not only can come up with hypotheses on what could create a vulnerability in a file or a subset of a file, and you not only get a document of, this is where I think the vulnerability comes from, but you actually get a repro, a test HTML file that then replicates that bug in production, where you can prove it actually creates the issue. Have I tied that all together correctly?
[22:56]
A
That is exactly the thing, and that is the thing that makes this approach different from previous attempts.
[23:01]
B
Yeah. And so I just want, you know, engineering leaders, engineers out there, senior engineers out there to just think about this process, which is, you know, kind of incepting your agent to believe that there is something wrong with your code, whether it's a security bug or a functional bug. Right. Like, I know something's broken here. Or, I know this is suboptimal from a performance perspective, maybe, is another example of this. Being able to run a loop of hypotheses on it and then actually create a test or recreation artifact is a really powerful loop. I don't want people to miss that as they zoom out into the agent.
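The loop Claire summarizes here (propose a hypothesis, try to reproduce it, stop on the first verified hit) can be sketched in a few lines. This is an illustration only: `search_until_verified` is a hypothetical name, the hypotheses are plain strings standing in for agent-generated theories, and `verify` stands in for whatever pass/fail check a project has. The default budget of 14 simply echoes the 14 attempts mentioned in the episode.

```python
from typing import Callable, Iterable, Optional, Tuple


def search_until_verified(
    hypotheses: Iterable[str],
    verify: Callable[[str], bool],
    max_attempts: int = 14,
) -> Tuple[Optional[str], int]:
    """Try hypotheses in order until one is verified or the budget runs out.

    Returns (winning_hypothesis_or_None, attempts_used).
    """
    attempts = 0
    for hypothesis in hypotheses:
        if attempts >= max_attempts:
            break
        attempts += 1
        if verify(hypothesis):
            return hypothesis, attempts
    return None, attempts


# Stand-in data: twenty candidate theories, of which only the 14th reproduces.
candidates = [f"theory-{i}" for i in range(1, 21)]
winner, used = search_until_verified(candidates, lambda h: h == "theory-14")
```

The attempt budget matters as much as the loop itself, since each failed attempt costs compute and, further down the pipeline, reviewer time.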
[23:33]
A
Yeah. And if I could just add one thing on that for your projects. I think that one difference is... so we've actually open sourced the sort of tooling that we use for Firefox, I think just yesterday, for some of this, for security researchers who want to test it. In our case, we have what we call a very crystal-clear task verification signal. And so we have this fuzzing build that uses AddressSanitizer. And it's like, you win or you lose: you pass in the file and we can tell you. Often, if you have a web app or a distributed system of some kind, it may not be so crystal clear. And so you need to think really hard for your project about your threat model, and then how you would like to verify whether it's true or not. It could be, like, a test case. I think that's actually something to think about as you're applying this to your own project. That is a really important aspect of it.
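One way to picture the "you win or you lose" signal Brian describes is a function that turns a sanitizer log into a yes/no verdict. This is a minimal sketch, not Mozilla's tooling: `asan_verdict` is a hypothetical name, and the only assumption about AddressSanitizer is that its reports contain a line of the form `ERROR: AddressSanitizer: <kind>`, as in `heap-use-after-free`.

```python
import re
from typing import Optional

# AddressSanitizer reports include a line such as
# "==12345==ERROR: AddressSanitizer: heap-use-after-free on address ...".
ASAN_ERROR = re.compile(r"ERROR: AddressSanitizer: ([A-Za-z0-9-]+)")


def asan_verdict(log_text: str) -> Optional[str]:
    """Return the sanitizer error kind if the run crashed, else None."""
    match = ASAN_ERROR.search(log_text)
    return match.group(1) if match else None
```

A project without a sanitizer build would need its own equivalent, for example a failing test case or an assertion in a staging environment. What matters is that the check is mechanical and does not rely on the model's own opinion of whether it succeeded.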
[24:22]
B
And what I think is most people just haven't built the muscle memory of how to articulate success cases and failure cases so crisply. And you're really benefiting from this prior art here, which is you've done the work up front. And so you have that already. And I, you know, just got off a call with somebody in, this is not engineering, design. And I said, this is the moment where you're actually going to have to write down what good design is and how you might quantitatively evaluate that, or qualitatively evaluate it. And so I think the skill of being able to crisply articulate, test, and measure outcomes, whether they are security outcomes, quality outcomes, or softer outcomes, is becoming a hard skill people have to develop.
[25:13]
A
Yeah. And I think, to a point you've made previously, it's actually just great for the whole project to have that defined.
[25:20]
B
Yep. Yep. This episode is brought to you by Metaview. Because who says hiring has to be fair? Every founder, hiring manager, and recruiter I speak with feels the same pressure: hire the right people as fast as possible. But recruiting is brutally time-consuming. Alignment is hard, and the competition for great talent keeps getting tougher. That's why teams like Riot Games, Brex, GitLab, and Replit, plus 5,000 other organizations, use Metaview, the agentic recruiting platform giving high-performance teams an unfair advantage in hiring. It works by giving you a suite of AI agents that behave like recruiting coworkers: finding candidates based on your exact criteria, taking interview notes, reviewing every inbound application, gathering insights across your hiring process, and helping you identify the best candidates in your pipeline. Don't let your competitors out-hire you. Metaview customers close roles 30% faster. Get started with Metaview today and get your first 100 candidates sourced for free at metaview.ai/howiai.
Okay, love this. We'll share this MCP in the show notes. If you want to help the project, please, please dive in. So I'm looking at this example. You know, you're seeing 9 turns, 10 turns, 14 turns, finding the result and getting that whole package. And then does that run through a, you know, sort of bug fix pipeline?
[26:49]
A
Right. So originally it did not. And so actually many of these were early finds and they don't. But one of these is an interesting one to look at there, which is this NSool contents. So this is a pretty complex bug where it found... we have an in-process sandbox technology called RLBox that's meant to help us kind of shrink-wrap third-party dependencies, so that if there's a vulnerability in that code, it can't leak out to Firefox. And this was a really complicated find, and it has tons of artifacts showing how it came up with it. But the interesting thing is that the proposed fix is very simple. It just said, oh, you were asserting this, you should have been asserting that, in terms of input validation. We did start to basically have this patching agent run on every fix, and the cool part is you're in the loop, so you can actually just apply the patch, build Firefox, and confirm that that same test doesn't crash anymore. And so that's great. And so if we go look at the bug, this is basically a dump of that bucket that we were just looking at, all those files. And once it's received by the team, there's some discussion, and then we have sort of like, yep, this looks like a real issue. The fix looks good, but actually we should check in a few other places. So one of the things you'll see with agents, and I'm sure there are harness techniques and the models will get better, is they get laser-focused on the task you've given them. And so if we go look at the actual bug fix that landed here, it's pretty much what they said. We're checking sort of this, you know, this, but also we're checking the same thing in like three other places.
And so that's where the expert engineers come in. In every single subcomponent, whether that's JavaScript, media, DOM, layout, or graphics, we have people who are, like, world-class browser engineers working on this stuff, and they'll look at the fix and say, oh, that looks pretty good, or, oh, this is completely wrong. And of course we use that feedback to try to improve the patching system. But I think we're pretty far off from having a kind of magic button that produces landable patches.
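The patch check Brian describes (apply the patch, build, confirm the same test no longer crashes) can be sketched with the three steps passed in as plain functions. This is a hedged illustration: `PatchResult` and `check_patch` are hypothetical names, and in a real setup each callable would shell out to the project's own patch, build, and test commands. Passing this check only qualifies a patch for human review, in line with the point above about engineers catching fixes that are too narrow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PatchResult:
    applied: bool
    built: bool
    still_crashes: bool

    @property
    def accepted(self) -> bool:
        # A candidate only moves on to human review if it applied, built,
        # and the original reproduction no longer crashes.
        return self.applied and self.built and not self.still_crashes


def check_patch(
    apply_patch: Callable[[], bool],
    build: Callable[[], bool],
    reproduce_crash: Callable[[], bool],
) -> PatchResult:
    """Run the three steps in order, stopping at the first failure."""
    # When a step fails, later steps are skipped and the crash is
    # conservatively recorded as still present.
    if not apply_patch():
        return PatchResult(applied=False, built=False, still_crashes=True)
    if not build():
        return PatchResult(applied=True, built=False, still_crashes=True)
    return PatchResult(applied=True, built=True, still_crashes=reproduce_crash())
```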
[28:56]
B
Yeah, fair, fair warning, because I have used the Codex Security product, which I actually think is quite good. It helps you develop a threat model. It goes through and scans your code and comes up with issues and patches. The problem that I found was exactly this. And I do not have millions and millions of lines of code. I have hundreds of thousands of lines of code. Which is, it will get laser-focused on the specific patch. It'll say, for this bug, this is the patch, but it doesn't do the next level of, like, go categorically find similar issues across the code base and then come up with an architecturally clean global fix for this class of, in this instance, security bugs. And so I have found that that's a piece missing in the existing security tooling. And it does take an engineer that kind of knows the code base to some extent, or knows the structure of the code base, to identify some of those. And so I do think this is the next step in some of these harnesses, which is, for any one fix, taking the loop and saying: we've identified this issue, go into similar parts of the code base and identify if we have this issue systematically, then zoom back out and articulate what the fix is overall, as opposed to the point fix, and then ship that and then close all the related issues. That is a path that I've been doing manually, and now, seeing this, I'm just thinking about how to do that more systematically.
[30:21]
A
Yeah, I think that as you said earlier, a lot of these are kind of converging in terms of like the needs, whether it's detection, patching, and obviously I'm expecting the harnesses and models to get better at this. I do think we're pretty far out from a web browser scale and complexity project being able to be sort of autonomously developed and we actually have requirements for having people who write the code and review the code, but we're able to use these tools to help accelerate that quite a bit.
[30:55]
B
Well, and I mean if we just take this to the meta level and part of having an open source project like this is it is very large to maintain, it requires, you know, the community to maintain something like this and you wouldn't expect the complexity of that to change just by nature of you introducing agents. I think in particular open source projects, we'll have to think about how we integrate agents into it, how that intersects with the community. And I do think they are the most complex and often longest standing, you know, code bases that we have out there. And so it's interesting to hear you say, you know, I don't think overnight we're just going to turn this, this repo over to the agents and we're all, we're all happy either on the security side or on the product side, right?
[31:40]
A
Yeah. And I think on the security side, open source supply chain is such an interesting and important topic around this. I think you have to work with every project. You know, there are a lot of important projects. Firefox depends on many, many pieces of just core Internet infrastructure in its supply chain. And every project has different needs and preferences and threat models and things that they care about: the way they want the bugs, where they work. And there's a sort of human connection and network problem involved there. We found many bugs in the supply chain, and we have personal connections with a lot of those projects. And so you're kind of working your way in, in a way that is, I think, less automatable than many people would hope. But I think it's the reality of how this is going to have to get deployed across the industry.
[32:27]
B
Okay, this is amazing. I think you and I could talk about this all day, but let's show what this looks like at the individual engineer's, you know, desktop. How would you actually interface with this as an engineer?
[32:39]
A
We'll pull up VS Code here. There's a couple aspects of the harness here that I wanted to show off and make really concrete, to bring home the point that it's not too complicated. So as part of the demo here, I have a patch applied to a local build of Firefox in a Docker container that introduces a really obvious memory safety issue, if you're a C developer. And so with this patch I wanted to show a couple different approaches: where we started and where we got to. Inside of this Docker environment, we have a simple script here with some prompt that says, you're looking for a memory safety issue, read the file and analyze it. And so we're not giving it access to any tools. We just say, look at the file and find the problem. You can run this with both Claude and Codex pretty easily. There are command line arguments, like the -p you've maybe seen with Claude, for when it's basically designed to be run by another program, not a human. So if we said in this case, run with Claude, you're going to see this kind of ugly JSON streaming out. But this is actually what's happening under the hood when you have an interactive Claude Code session. It's reading a bunch of code. It'll probably pretty quickly converge on the problem in this file, and it's going to write essentially a Markdown report. And so this is just the very basic primitive building block. Like, you could build this and run this yourself in an hour with Claude. You can also run it with Codex; there's a codex exec command. Similarly, have it output JSON, and you can have another program that's consuming that. This is sort of an alternative to using an agent SDK. So just another way to do it. And so that's a very simple kind of, how do you just run this thing and find a problem in source code?
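The "another program that's consuming that" pattern can be sketched as a small reader for newline-delimited JSON on a child process's stdout. This is an illustration under stated assumptions: `stream_json_events` is a hypothetical helper, the event fields (`type`, `text`) are made up, and a stand-in Python child process plays the role of the agent CLI so the sketch runs without any agent installed. For a real run, the command list would be the headless invocation Brian mentions (`claude -p ...` or `codex exec ...`), with the JSON output flags taken from each tool's own help text rather than from this sketch.

```python
import json
import subprocess
import sys
from typing import Dict, Iterator, List


def stream_json_events(command: List[str]) -> Iterator[Dict]:
    """Run `command` and yield one dict per JSON line it prints to stdout."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip anything that is not a JSON line
            if isinstance(event, dict):
                yield event


# Stand-in for the real CLI so the sketch runs anywhere: a child Python
# process that prints two JSON lines, the way a headless agent run might.
FAKE_AGENT = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'type': 'note', 'text': 'reading file'})); "
    "print(json.dumps({'type': 'result', 'text': 'possible overflow'}))",
]

events = list(stream_json_events(FAKE_AGENT))
```

Reading line by line means the consuming program can react to events as they arrive, for example by logging tool calls, without waiting for the whole run to finish.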
[34:35]
B
Fine.
[34:36]
A
I think as we were running this on less trivial security issues, we found that we needed the actual harness that we were describing earlier. And so we'll have an example running that here, where it will basically do the same loop. It's using an agent SDK, but it has access to all of these tools, and so it's going to go through and read. This will take a while, so I'll switch over to a completed job here. But it'll basically read, it'll do some tests, it'll run this HTML page as a test case, and then it'll say, yep, looks good. This is actually the verifier sub-agent, which has now returned some structured JSON saying: yep, I approve it. Here's why this is a problem. Here's sort of exactly how you would exploit it. Here are the steps to reproduce, and here's the security impact. At that point you have the results, you have the bucket. We could go and put this in our bug tracker system and give it to an engineer. It does spin off a separate agent now to actually go and fix the bug. And so it will go on and create a patch, build Firefox, and verify that the crash went away.
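The structured JSON returned by the verifier sub-agent can be pictured as a small schema that the harness validates before anything reaches a bug tracker. The field names below (`approved`, `reason`, `steps_to_reproduce`, `security_impact`) are assumptions chosen to mirror the items Brian lists; the actual format Mozilla uses is not given in the episode.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class VerifierReport:
    approved: bool
    reason: str
    steps_to_reproduce: List[str]
    security_impact: str


def parse_verifier_report(raw: str) -> VerifierReport:
    """Parse and validate the verifier's JSON; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    required = ("approved", "reason", "steps_to_reproduce", "security_impact")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["approved"], bool):
        raise ValueError("'approved' must be true or false")
    if not isinstance(data["steps_to_reproduce"], list):
        raise ValueError("'steps_to_reproduce' must be a list")
    return VerifierReport(
        approved=data["approved"],
        reason=str(data["reason"]),
        steps_to_reproduce=[str(step) for step in data["steps_to_reproduce"]],
        security_impact=str(data["security_impact"]),
    )
```

Rejecting malformed output outright, rather than guessing at missing fields, keeps the noisy, unverifiable style of report discussed earlier in the episode out of the engineers' queue.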
[35:43]
B
I think this is very straightforward. Again, I think what you're demystifying for folks is that you can build this. I mean, V1 is literally just running Claude Code with a prompt. Right. It's not very fancy. And then, you know, V2 is running an agent SDK with a set of very useful tools and a sub-agent that runs a verification loop at the end. You know, my question for you is, with all these files, with all these lines of code, how are you prioritizing where you point the agent? Because as you said, you can't just go, like, here's my code base, find the security issues, we know they're there. How are you actually prioritizing where to look?
[36:19]
A
Yeah, and that's one of the things we've run into with skills. Some of the sort of prepackaged skills and workflows assume that you can canvass the entire repository at once and find all the issues. And with Firefox you just can't. I would say with a small project, I think that's plausible and probably a simple way to start. So what we are doing is a really simple sort of LLM judge here, where we say: you're a security expert. Here are the different kinds of files we're looking at: C files, IPDL files, WebIDL files, with a little bit of detail about each. We sort of copied out some of the details that we have in our existing security bug classification program. And basically, give me two scores. So one score is how likely do you think there's a memory safety issue? And another is how easily could you access this from a web page? Because we have a lot of code that is never running in the content process at all. It's doing operating system integration things. And so we can just run that. You know, it was very, very simple. You could come up with something yourself on your own project very easily. We just go and run that. That's going to generate basically a scores report, and we have it write a Markdown file, and it'll say, okay, like, Document.cpp. That is a huge file. It is directly accessible by web content. That is a very high score. You should definitely run that. We'll then plug it in with different signals. Like, how many times have we run this file before? Did it find duplicates? Was it able to find issues? But really, really simple heuristics. And this is an area where we're actually actively working to improve this, but it's enough to get you kind of started.
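The two-score prioritization Brian describes can be sketched as a small ranking function. Everything numeric here is an assumption made for illustration: the 0 to 10 scale, multiplying the two scores, and discounting by the number of prior runs are not taken from the episode, and the file paths are made up. In practice the two scores would come from the LLM judge rather than being typed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileScore:
    path: str
    memory_risk: float       # judge's 0-10 guess that a memory safety bug exists
    web_reachability: float  # judge's 0-10 guess at how reachable it is from a page
    prior_runs: int = 0      # how many times the harness already scanned this file


def priority(score: FileScore) -> float:
    """Combine the two judge scores, then discount files scanned before."""
    base = score.memory_risk * score.web_reachability  # ranges from 0 to 100
    return base / (1 + score.prior_runs)


def rank(scores: List[FileScore]) -> List[FileScore]:
    """Highest-priority files first."""
    return sorted(scores, key=priority, reverse=True)


# Hypothetical inputs: a large web-facing file, an already-scanned parser,
# and OS integration code that web content can barely reach.
queue = rank([
    FileScore("os/integration.cpp", memory_risk=9, web_reachability=1),
    FileScore("dom/BigDocument.cpp", memory_risk=8, web_reachability=10),
    FileScore("parser/Tokenizer.cpp", memory_risk=9, web_reachability=9, prior_runs=2),
])
```

Multiplying rather than adding means a file that scores near zero on either axis drops to the bottom, which matches the point that risky code outside the content process is a lower priority.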
[38:01]
B
I just think this is so, so clever, very smart. And even for folks that don't have a code base at the scale of Firefox, or maybe don't have the same threat model or vulnerability surface area as your product does, I do think this idea applies. I hear all the time, like, how do I attack tech debt in my monorepo? How do I prioritize these things? I can't just say, like, fix tech debt. I think the ability to go through your code base and prioritize areas for an agent to triage and fix, whether that is security or performance, is valuable. Honestly, when I was thinking about this, I was thinking for product managers and designers, you could build a very similar heuristic scoring mechanism where you say, go take all my front-end components in my web app and my product analytics and give me a prioritized list of components to improve from a user experience or conversion rate perspective. And then go apply best practices on design, on conversion rate. So there are just so many ways you can take this, like, LLM scoring to prioritize your code and then apply a very specific level of fix to it, versus saying, like, go all over my code base and make it convert better. And so I want folks to really think about how to come up with a score to prioritize things, especially if you're working with a large monorepo. Because there are so many ways that this very specific tactic is useful for folks.
[39:24]
A
Yeah, it took sort of longer than I would have liked to put this in place. It was sort of like, well, I think these files might be good ones to do. And then I was like, oh, duh, we should have these things get scored. We've also seen, like, you could imagine doing this with commit scanning. Right. So if you have a newer project with not much code, instead of scanning the existing file structure, you actually want to look at individual commits and score those commits and then run them through a pipeline. Or we have active work on performance as well, where of course you have a performance benchmark, it gives you a score, and you tell the agent your job is to go make that number go down, and it'll come up with all kinds of performance optimizations. And it's actually the same idea. It comes up with some proposals. You have a kind of verifier or judge that checks them, and that feeds into a pipeline that the engineering teams can look at and prioritize. And it's a pattern that I'm seeing repeated across many domains.
[40:19]
B
Yeah. And the other thing: I think people, you know, kind of tell themselves that AI bug fixes, AI code, is almost limitless and free, and therefore you can cover the earth with AI code. But one, budgets shall not allow. Two, there is actually a time cost to shipping, reviewing, and verifying AI code. And so you cannot go completely prioritization-free, especially when you're looking at the kinds of fixes you need to verify, and they're taking 14 loops to even get to a yes or no. I do think this pre-prioritization is a very clever use, so you can allocate compute appropriately to the highest-impact things.
[41:04]
A
Yeah, just to give a sense of it, you know, we showed the graph earlier. This was a sort of incident-response-level event within Mozilla, where we had a Slack channel with, you know, almost 100 people. I think we had 100 engineers land fixes as part of this initiative. And so it would be, hey, we found 60 new bugs. Let's pull in this team and that team and the other. And there's, I think, a lot of work. It did require some reprioritizing. Everybody was very tired as we've sort of gone through this, but also really motivated and mobilized, in particular because you were getting these very actionable reports.
[41:40]
B
Amazing. Well, for people that have made it this far, I just want to repeat what we've gone through so far. You, you know, shipped almost 500 security fixes in one month. A lot of that may have been the model. A bunch of that was the harness. The harness is not that complicated. Anybody can replicate it. It's really a goal, or like a Ralph-style loop, against a presumed problem, a verification sub-loop, and a bunch of tools. It can be run directly by an engineer. You're reusing a lot of the SDKs provided by these model providers, and you're prioritizing the files you go after so that what you're looking at is a high priority to fix. And then you are mobilizing a team around this new way to work. And humans are not out of the loop. Humans' lives are just a lot better. Even though the volume was very high, the quality was also higher and the actionability was higher. This is so generous of you all to share. I think so many engineering teams in particular are going to get a lot out of this work. Before we let you go, I'm doing a quick lightning round, and then we'll get you back to all these AI bug reports. My first question, inquiring minds have to know: is it model or is it harness? Was it Mythos or, you know, if you had to do a split, where do you think this huge unlock, this magic graph, came from?
[42:54]
A
Yeah, of course it's both. I think the split percentage is a tough question. We have seen examples of being able to point the harness at many models, even not the latest frontier ones, and being able to find bugs. And so that makes me think, you know, take the cheap answer and say sort of 50/50. Like, I think there's so much still to innovate on the harness side and the pipeline side. There's this feeling of having sort of 30 ideas for every one thing you did, and then after that you had 30 more based on what you tried. And that is, to me, a signal that there's just a ton to do here.
[43:33]
B
Still amazing. And then my second question, and I feel a lot of anxiety around this, just given our experience with, like, the Internet supply chain over even the last three months. As somebody who is helping steward one of the largest open source projects, are you a security doomer in the age of AI, or how do you feel? How should we feel about this? Should we be scared or should we feel hopeful?
[43:59]
A
You know, I'm cautiously optimistic, which compared to a baseline is probably much more optimistic than many people. I think that the reality is these are bugs that have existed for a very long time and what was gated before was just on discovery. It's really hard to find these bugs, but our goal is not to have a bunch of bugs that are hard to find. Our goal is to have zero bugs. And so I think that these tools, as us and other defenders are starting to apply them, actually get us closer to that world. And it is going to be a bumpy road, I think, for, for some time as these are getting adopted and we, you know, different projects are going to have different depths of bugs as well. So it definitely there, there is reason to be concerned and nervous. But I think also I would say I'm, I'm generally optimistic about how this could turn out.
[44:51]
B
I love you. You've given me confidence. You've made me, well, less nervous. Also, you've given me more tools. So I feel like I am empowered to go solve some of these problems myself with similar frameworks. The last question I ask everybody: when AI is not doing what you want, when it's just not giving, what is your prompting technique, and how do you bend AI to your will?
[45:18]
A
I would say for like pure chatbots, I'm a very boring user and sort of, I'm, you know, pasting docs in and saying, give me feedback and then I'm manually porting that feedback back into the doc. I just like to have control over the process and use it as my own exploration and learning. I think there's a lot out of that from just a writing standpoint for coding, I am like, it depends on what I'm doing. I found, and I think this was somewhat subconscious, but like, if I'm doing something creative, maybe I'm building like the dashboard I was showing. I'm Much more positive. And I said, you know, oh, this is so great. Like, let's try three other ideas, you know, and then if I'm doing like a system administrative thing, you know, figure out why this VM died and it's not doing what I want. I'm like, come on. Like, I've also found, like, sometimes on the code, if it puts something really silly, I will just copy that block, paste it back in with the word really at the bottom and it'll figure it out.
[46:13]
B
I have found myself lately, honestly, like steering in codex and being like, no, what you are doing is crazy. Like, please, please stop. This is not good. I love this. Well, Brian, you and the whole team have been so generous sharing this publicly, you know, showing us behind the scenes how you got this work done. Where can we find you and how can we be helpful?
[46:35]
A
I'm not on much online. I'm on LinkedIn and I have a website that I occasionally post to. I do a lot of work, you know, in open source projects, and so I'm active on GitHub. I think people should use Firefox. I think a lot of people probably switched to Chrome when it came out. And honestly, for good reason. Like, it was a better product at the time, and it took some time for us to catch up. But we're doing really great on fundamentals, you know, things that people care about: performance, and we're talking about security today. We're doing a lot of new feature work, and in particular for this audience, I think we have sort of an independent mindset around AI, so you're choosing whatever provider you want. You're not sort of combining your browser vendor with your, you know, AI model provider. It's an open source project. We have a great team behind it. So I'd just say give it a try, and I think you'll be happy with it.
[47:28]
B
And I will hype you up, and I'll give people who are listening to this podcast a reason why, which is: I guarantee you, for nine out of 10 people listening to this podcast, their browser is their number one or number two memory suck on their laptop right now, because it is being used by humans and it is being used by agents. It's worth doing a little comparison shopping and seeing what your experience looks like, since we're all so dependent on the browser for not just our web work, but also our AI work. Brian, this has been so great. Thank you for joining How I AI.
[47:58]
A
Thanks, Claire.
[48:00]
B
Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.