
A
AI coding tools have dramatically accelerated the pace of development, and the bottleneck in the software development lifecycle has shifted to code validation and testing. However, the conventional tools and workflows that QA teams have relied on were not designed for a world where a single engineer can generate thousands of lines of code in a day. SmartBear is a software quality platform spanning test automation, API lifecycle management and observability. The company recently launched an AI-native QA platform called BearQ, which deploys autonomous agents that explore web applications, learn their structure and behavior, and author and maintain test cases continuously. Fitz Nolan is the VP of AI and Architecture at SmartBear and the co-founder of Reflect, a web testing platform acquired by SmartBear in 2024. In this episode, Fitz joins Kevin Ball to discuss why web UI testing is uniquely challenging, how BearQ's multi-agent architecture coordinates exploration and testing, why test data management becomes a hard distributed systems problem at scale, and what agentic development means for the future of QA. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn, or visit his website, Kball LLC.
B
Fitz, welcome to the show.
C
Kevin, thank you so much for having me. Glad to be here.
B
Yeah, I'm excited to have this conversation. Let's start with you. So can you give us a little bit of your background and how you ended up at SmartBear, what you do there?
C
Yeah, sure. So I was a CS undergrad, and then I went right from there to a PhD program for computer science at Yale, where I focused on distributed systems and networking. And while I was there I did some internships at Big Tech, focused pretty specifically on networking protocols and low-latency networking. When I graduated there in 2014, I was very excited to take a startup job offer at a place called Curalate in Philadelphia, which is the area where I grew up. So I went there and worked there for about five years and met my co-founder Todd McNeil, who I then went and started Reflect with. And Reflect is an end-to-end web testing platform. Now we support mobile as well, and it's basically record and playback: it allows your QA teams to build out test suites and then run them automatically without needing to know how to code. It was in 2019 that we started that, and then we were acquired by SmartBear in 2024, and nearly all of the Reflect team is still there at SmartBear today. And Reflect is just one of the products in SmartBear's broad portfolio focused on software quality and testing. So now, post-acquisition, I've extended beyond Reflect, and now I work across the product portfolio at SmartBear in the VP of AI and Architecture role, where I look to bring AI features and agentic AI workflows into the different products at SmartBear.
B
That is a great role for the times that we're in right now.
C
Yeah, yeah. There's a lot of focus on AI obviously in the last couple years. So it kind of started as bringing AI into the different products and now it's actually accelerated into even releasing AI native products as well, which we'll talk more about later, I'm sure.
B
Absolutely. Before we get into that, let's orient just quickly about SmartBear in general. So as a company focused on software quality or how would you describe the company?
C
Good question. SmartBear helps organizations ensure application integrity across modern tech stacks. So we're trying to ensure that organizations' software is working as intended at speed and scale, and that's ever more important in the age of AI. But you can imagine that our platform combines test automation, API lifecycle management and observability, integrated across the SDLC to ensure software quality. We also just released a new standalone product called BearQ, which is one of the products I was mentioning. That's an AI-native product which extends these capabilities across our portfolio to help teams move faster with confidence, more or less. We have a big install base, 16 million users, 32,000 organizations, some big names, Adobe, JetBlue, Microsoft, and then we also have some open source components as well in the form of Swagger. And we also have some legacy products that are quite successful in compliance and governance applications, like TestComplete and Zephyr. So that's the overall spiel for where SmartBear is and where your listeners might see our products.
B
Yeah, I didn't realize that Swagger was you guys. I've been using that for years. That's a good.
C
Yeah, yeah, the open source and then there's an on prem version, there's a cloud version and that's obviously the industry standard for API lifecycle management.
B
I feel like quality is a really key question right now. It's something that is getting both talked about a lot as people talk about challenges in the state of software right now. But then it's also just getting pushed on from so many dimensions with the changes going on with AI coding. So let's maybe start. So if I understand properly, BearQ is focused on web, so maybe we start with web a little bit.
C
Yeah.
B
As somebody who's like living and breathing this stuff, how do you think about the challenges of testing in web?
C
It's a rapidly changing landscape. On the one hand, you have a bound, kind of, on all the things that are possible in web by virtue of the web browser and the deep hooks that you have into the application running in the web browser. Maybe not to the extent that you would have had on the OS level for a desktop application. Mobile is probably pretty similar too, right? Like, the platform there was built in an age where the notions of debugging and exposing this detailed information were kind of already apparent. So it's a good domain to tackle because of that tight bound on what's possible and the control you have over the different accesses to the different parts of the operating system. So the microphone, that type of stuff. On the other hand, there's just so much explosion of new applications and people pushing web applications to do totally different things. And then there's local storage and you can run local apps. And so it's a very creative and almost infinite world of applications. So it's very exciting in that sense. Kind of a roundabout introduction, I think. You know, the big thing with web apps is that they're always connected to the network. So you get a little bit of leeway there on the network latency and the notion of a cloud application and not needing to download new software to get the upgrades. So the always-on, always-connected nature of it makes it a very interesting beast. Yeah, I kind of paused there. What part should we explore next, I guess.
B
Yeah, so one of the things that is always interesting to me is looking at how you validate things across the stack, because as you say, some parts of the web platform are very bounded. Right. You could do very focused front-end tests. You can wrap them around really carefully and you can be pretty confident in what you're doing. But then you start expanding out and you say, okay, well actually there's more browsers out there. I need to think about how does this behave, not just logically, but from a UI perspective on a mobile device versus on a laptop, versus there are still some people running Firefox, and that behaves slightly differently. All of those different factors multiply into a combinatorial explosion. That really leads me down the path.
C
That's a great point. The other one that comes to mind too is just all the state on the backend that you don't have visibility into, except for the final output result. Right? The output token, well, token's an overloaded term. Just the result, right? You submitted a form, the backend does all this churning and all this work, and then you get the okay, it worked. And so your test has no way to validate that backend experience unless you're going to do a full-stack integration. Do you have runtime visibility, monitoring, that type of stuff? Can you access the database directly to confirm that that row is in fact written in the way that it appeared to be? So it's within that context that BearQ was created. And our thought with BearQ was, with the velocity of software development teams 10xing or 100xing with these AI coding agents, where is the complementary AI-scale solution for quality on the output? Right? If you think about the whole SDLC, there's design, there's functional spec, there's coding, there's review, there's testing, and then there's deployment, there's another batch of testing there and review to make sure things are working. And then maybe there's live site monitoring at the end of the SDLC there. And then of course it's a circle and it always feeds back in on itself. So you pick one of those sections, maybe the first two. Design is very much going to be influenced by AI, and then coding obviously is very influenced by AI, but we weren't seeing the same attention being paid to quality. So that was our attempt to insert ourselves into the AI-native SDLC with BearQ.
B
It's a huge challenge. The first parts of the cycle are speeding up dramatically, which is just building pressure and pressure on the back end. And I've been seeing that starting at code review. Everybody's trying to figure out, what do I do with code review? And there are agentic tools and ways that people are using AI there, but then the steps after that, right? Okay, how do I validate from a UI perspective? How do I validate that, when I connect all the pieces, this is actually working? Exactly this quality question. Everybody's overloaded. So what do you do? How do you do it?
C
I think there are two things that come to mind. One is it has to be at AI scale. It can't be human-in-the-loop for the inner loop. Obviously you want human judgment and human oversight, but the core unit of work has to be AI native, has to be AI driven, because the core unit of work of, say, writing a function has now been taken over by the AI coding assistant on the development side. So you fight fire with fire, so to speak. Right? You have to match that velocity. So it has to be AI native as part one. The second thing that comes to mind is trust, the connection to reality, we'll say, the accuracy. And how you get trust is something we've chewed on a ton, and we're still working on it. It's not a solved problem by any means. But the way that we've approached it is we're trying to basically take a multi-pronged approach where we use the application to develop an understanding of the content of the application. And so our BearQ agents go out, they use the application, they point, they click, they use computer vision and vision LLMs to take in screenshots, click on things, enter content, you know, manipulate the application to learn about it from scratch. Simultaneously, we want to take in context from your Jira stories, from your GitHub pull requests, from your code base, from Linear, whatever you use, wherever your source of truth for your designs and your functional specifications is, those come into context. And then we want to basically tell you how closely these two things match. Right? What's the drift or the gap between what you intended and the reality of your application's user experience? And then we help you drive that gap to zero, basically. So that's the two sides of the coin, I think.
B
Yeah. So that's interesting. And there's a lot of different pieces to pull on there. I want to actually start with this question of having the human out of the inner loop, because I think this is one of the things that we've definitely seen on the agentic development side. Right. The more you can give the agent an internal feedback loop, whether it's through tests or a CLI or some other way that it can do some work or do some sort of coding or some sort of thing, test that work, get feedback on what's there and what's not, and do this inner loop at AI speed, the better off you are. So what does that inner loop look like when you're talking about a QA context?
C
So if you break an application down to a set of screens, say, and a set of elements. So for example, the menu navigation bar, or the profile icon in the top right corner of an application, where you click it, you've got Manage My Account, Settings, Integrations, stuff like that. That's sort of a component that's going to appear many times throughout your application. And so if you can have AI recognize that element many times throughout many different screens in your application, well, now you can correlate failures, and tests, like working tests, across your screens with that same component. So you can deduplicate a little bit, like a human would. Right? They might test the functionality of the profile icon in one specific test case, but they won't necessarily have to interact with it 100 times if you have 100 pages in your application.
B
Right.
C
Maybe in certain cases you would want that. Right. But in most cases you're going to say, okay, look, if it's working in one or two of these pages, let's assume it's working in all of them, because it's the same component. So that's a place where, if you can have programmatic intelligence to do that deduplication for you, you can avoid the brute force amount of work that you would otherwise do. And let's imagine now that you're producing new pages at a 10x velocity in your application. If you don't have that programmatic intelligence to do that deduplication in an automated way, well, then you're going to have a human-in-the-loop problem. So that's an example of an inner-loop unit of work where the LLM or the AI allows you to achieve a speedup on par with the human performance without needing a human in the loop. So that's just one of the examples that we see in trying to build out BearQ, where we need to have an AI driving the process, but then of course producing auditable results, so that at any given moment the human reviewer can jump in there and confirm that things are working as they expect, but still be able to do it at a scale and a velocity to match the software development.
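The component-level deduplication described here can be sketched in a few lines of Python. This is an illustrative example, not BearQ's actual implementation; all names are made up. The idea is to group element sightings by a component fingerprint and test one representative page per component instead of every page.

```python
# Hypothetical sketch of component-level test deduplication: instead of
# testing a repeated UI component (e.g. the profile menu) on every page
# it appears on, test it once on a representative page.
from collections import defaultdict

def dedupe_test_targets(sightings):
    """sightings: list of (page, component_id) pairs observed by an
    exploration agent. Returns one representative page per component."""
    by_component = defaultdict(list)
    for page, component_id in sightings:
        by_component[component_id].append(page)
    # Pick the first page each component was seen on as its test target.
    return {cid: pages[0] for cid, pages in by_component.items()}

sightings = [
    ("/home", "profile-menu"),
    ("/settings", "profile-menu"),
    ("/orders", "profile-menu"),
    ("/orders", "checkout-form"),
]
targets = dedupe_test_targets(sightings)
# Four sightings collapse to two tests: one per distinct component.
```

In a real system the component fingerprint would come from vision-model clustering rather than a hand-assigned ID, but the deduplication step itself stays this simple.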
B
Well, and those durable, auditable, understandable results are also really helpful for the agents to help them drive and things around that.
C
Right, exactly. Yeah, there's like, my output is my input, that type of thing, where it can build on itself. Now we do see, and this is a very recent thing, we've seen cases where the feedback loop can actually be pathological, where the AI is taking it in, and it's consuming more, and saying, I can still make progress here, I can still make progress. And then it sees all the work it tried to do and it's like, let me try that again. You know, I don't want to give up. My theory here is that the LLM providers have probably conditioned their models to never give up, to use more tokens and consume as much AI as you can.
B
Well, it's funny because like a year ago the problem was you tell them, keep going, keep going, don't stop. But now everybody shift to token based pricing. And now it's like, stop, please.
C
Yeah, yeah, that bill is huge. So you have to have guardrails in place. And again, I don't think that you can have a human guardrail in place to check that loop. Right. It has to be an automated check. Maybe there's heuristics, maybe there's a static check you could do, or: execute this function and this will tell you if you've made progress. And that might be checking my prior results, or even doing heuristics, things like, if you see the same output multiple times in a row, I'm going to proactively cut off the loop with static code, as opposed to the AI-driven version of that. So you have to be careful on the AI output feeding the AI input. But to your point, those auditable logs are really helpful for creating more context for tracking and execution over time in an agentic system.
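The static guardrail mentioned here, cutting off an agent loop that keeps emitting the same output, might look something like the following. This is a hypothetical sketch; BearQ's real heuristics are not public.

```python
# Hypothetical guardrail: stop an agent loop with plain static code once
# it emits the same output N times in a row, rather than trusting the
# LLM to decide to give up on its own.
from collections import deque

class RepeatGuard:
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        # Only keep the last N outputs; older ones fall off automatically.
        self.recent = deque(maxlen=max_repeats)

    def should_stop(self, output: str) -> bool:
        self.recent.append(output)
        # Trip only when the window is full and every entry is identical.
        return (len(self.recent) == self.max_repeats
                and len(set(self.recent)) == 1)

guard = RepeatGuard(max_repeats=3)
results = [guard.should_stop(o) for o in ["a", "b", "b", "b"]]
# The third consecutive "b" trips the guard: [False, False, False, True]
```

The key property is that the check is deterministic: no model call is involved in deciding when to break the loop.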
B
So thinking about that point that you just raised. So one of the things that's very interesting, if I watch humans who are effective at using coding agents: they will use the agent to extract more durable tools. They'll use it to create its own CLI tools that they can use for better exploration. Or they'll create skills, which are just kind of reusable context chunks usually, but may also involve more deterministic code. So to what extent are you, with BearQ, trying to extract out patterns that can then be run deterministically, without having to have an AI agent doing the driving?
C
Yeah, this is such an important point to getting us beyond the feeling, which I think a lot of developers, myself included, have had at times using AI to build software, that you're building on sand, where you haven't quite gotten in touch with reality. Everything you're building was AI generated below it. You know, that feeling of stability, of having built on a known solid foundation of best practices, so to speak. Or you have best practices, but they're ill applied. So to that end, where we see the durable, reusable components in BearQ are things like identifying that profile component and then presenting to our user: look, we've used your application quite a bit. We've observed your pull requests, we've observed your Jira. We think we've identified maybe 20 to 30 or more of these reusable components. This form for creating a new opportunity in your Salesforce app, or this form for modifying your profile, or this form for creating a record in your car sales dealer database, something like that. This is a reusable component we see in several places. This is what we know about it. This is what we think to be true. These are valid inputs here. These are example inputs that we use. And this is context that you've given us for how we interact with this element. Is this correct? Do you approve this? And then we can build up reusable components that we've gotten back in touch with reality on through the form of user input. And that's where this gets back to our position, really, which is that the QA team doesn't go away, just like the software engineers don't go away. They just develop slightly different skills, and they're managing more of these automated, repeatable tasks, and they're thinking in a slightly higher order of abstraction.
And so rather than thinking in the form of 100 pages that you must interact with, think in terms of these 30 to 40 components and how they interact, and what's their sequential relationship or their mutual exclusivity relationship, and presenting that to the user to get that feedback, to get confirmation that we're on the right track, that we've identified truth.
B
Yeah. So when you're doing this discovery process, is that black box discovery, where you're not looking at the underlying code that's generating this? Or is that white box, where you can maybe correlate with, if it's a React app, the React components or things like that?
C
It starts as black box. So when you sign up for BearQ, the first thing we do is we kick off these agents, and they go into the user app with no prior knowledge, other than maybe a little bit of tech context that you provide during signup. So, like, don't ever check out on the e-commerce page for whatever reason, or if you're going to check out, use this credit card number. Stuff like that, little tidbits that you can give us. As soon as we've kicked off those agents, though, you can then attach your context, and so you can integrate with Jira, you can connect GitHub. And so those two pieces of context then come together pretty soon thereafter. So we don't play them off each other or anything like that. It's more just we want to be able to do this from first principles, from nothing. But of course, we don't want to rob ourselves of rich, valuable knowledge that you can provide. So I would say they're pretty much equal in our estimation.
B
Yeah, it just got me thinking, right? In a well factored code base, those components that you're identifying from the outside black box style should map to components that you can see in the code base. Now they may not. And in fact there's kind of an interesting extension there where you could say, hey, we've identified 30 components, you've got 400. Maybe you need a little bit of refactoring going on here.
C
Exactly. And this is again where like you wouldn't probably expect your QA team to say, these components here are logically quite similar. Are they the same in the back end code? If not, let me open a ticket to get the dev team to make them the same. Whereas with AI building the components, I know for sure it's not going to do a great job at factoring those things. It's going to build those 400 components.
B
But that's why I had to go in that direction because like back in the day it might have been they had too few, but with AI, it's going to be an order of magnitude too many.
C
Yeah, and that still continues to be the major weakness that I see on the AI coding front is just its inability to properly factor an application, certainly from scratch, but 100% for an existing code base. It doesn't necessarily know how to fit in and reuse all the things that it should unless it gets a very strict, like very targeted input. So yeah, I think that's where on the QA side the AI can really help to try to analyze. You know, let me bring this back to the code and let you know if we're straying from our intent here of a well factored code base.
B
So let's get a little bit more granular in detail about how some of this stuff works. So you talked about, you go in, you explore the application from some sort of web browser style context and maybe we can talk about what actually is that. Is it Chrome with computer vision versus it's a headless browser versus what? But you go through that and then you're validating against some context about what correct looks like. But how do those different pieces actually work under the hood?
C
Yeah. So, plain and simple, when you're in BearQ and you start up a session, we have a couple different types of tasks. We have an exploration session. That's where we have a browser attached. We have a test run session, which also has a browser attached, but it's trying to execute to a specific test description. So a test case of X number of steps that have to be performed in that browser. And then the third type, our most general type, is the QA lead type. And the QA lead agent doesn't have a browser attached to it by default. It can use the browser of a test runner agent or an exploration agent if it's spun up in the context of helping one of those two agents achieve their goal. So the idea here is to narrowly scope the exploration agent and the test runner agent. Test runner agent: just do the steps here in this test case, and if you can't, ask for help. When you ask for help, the QA lead, who has access to more context, richer models or more robust models, that type of thing, the QA lead can help you get to done. But we don't want the test runner to be so complex that it's a QA lead itself. We try to break it down into smaller things, because the hope is, for the majority of tests that are going to run and pass, we want to be able to do that faster and a little cheaper than having to invoke the big, expensive QA lead. So we spin up a browser for those two types of sessions. We have basically an agent running in memory in our system, and it's driving that browser to perform those steps. It takes constant screenshots to determine whether it's progressing through its goals. If it fails, it'll bring in a QA lead to try to figure that out. It'll also look at things like context for creating valid random inputs. So let's say that you have a registration test. I want to verify that my new account registration is working.
You're going to need to use a different email address each time to make sure that you don't collide with an account that already exists. So we'll generate that email address randomly, and the agents can do that kind of logic. So we spin up the browser, we control the browser with the agent in memory, and then when it's done with its task, or it's achieved its goal, then we'll shut down the browser and write the results to disk, so to speak. And then the agent's done. And so we run thousands of tasks a day across our customers, and any individual account might have hundreds of regression tests that we're running each day. And I guess the last piece that's worth mentioning just for this segment here is the exploration agent. As it acquires knowledge, we kick off a process to try to author new test cases. So the idea is that you're constantly in a state of...
B
I was going to ask about that.
C
Yeah, adding test cases, and then archiving test cases which have become stale, or which are unreliable or flaky. And then we look to replace them with more reliable tests, more accurate tests. So your account state is always in flux, obviously. We hope that it's centered around this core understanding of what your application does.
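The session lifecycle described above, where browser ownership is tied to the session type, can be sketched roughly as follows. The names here are illustrative assumptions, not BearQ's API: exploration and test-run sessions own a browser for their lifetime, while the QA lead gets none by default.

```python
# Illustrative sketch: only exploration and test-run sessions get a
# browser; it is spun up at the start and always torn down at the end,
# even if the task fails partway through.

class Browser:
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

def run_session(kind, work):
    """Spin up a browser only for the session types that need one,
    run the task, then tear the browser down and return the result."""
    browser = Browser() if kind in ("exploration", "test_run") else None
    try:
        return work(browser)
    finally:
        if browser:
            browser.close()  # results would be written to disk here

result = run_session(
    "test_run",
    lambda b: {"passed": True, "had_browser": b is not None},
)
# A QA-lead session receives no browser of its own:
lead_has_no_browser = run_session("qa_lead", lambda b: b is None)
```

The `try`/`finally` mirrors the "shut down the browser and write the results to disk" step: teardown happens whether or not the agent achieved its goal.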
B
Now, let's say I, as a human want to add some test cases. Do I just talk to an LLM, be like, hey, I want you to test this, this and that.
C
Totally, yeah. So it's as simple as you just pull up a QA lead and you say, hey, I want you to create a couple of tests related to the checkout functionality or a couple of tests related to the Create an opportunity functionality in my Salesforce app, something like that.
B
So you have now this multi agent coordination system that's going on. How are you coordinating that? Is the QA lead doing the orchestration or you have another orchestrator or like, how does all those pieces get wired up?
C
So we have a very async system, basically, built where a new task is identified and we queue that up in our system, and then there's a pool of agent workers that grab the task. They identify which type of agent they need, and then they instantiate themselves into memory, and then they start off running. They'll spin up a browser if they need to perform the actions, and then tear that browser down. The QA lead? The system itself kind of is that orchestrator. In other words, anyone's able to create a task, whether it's an end user, or a schedule firing, or an API call from a CI/CD system to say, hey, create a task to run this test case, and then it gets spun up. So anything can create a task, but once it's created, then the system will create an agent to execute the task. And then again, certain tasks have the ability to fork or spawn another task.
B
Can you create subtasks and...
C
Exactly. And then we try to link all that up in the UI. So if you spin up a QA lead task to help with a test running task, then that QA lead task will be linked to the original task it was spun up within.
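The orchestration pattern described here, a queue that anything can enqueue into, workers that pull tasks and instantiate the right agent, and spawned child tasks linked back to their parent, could be sketched like this. It is a hypothetical, in-memory stand-in for what would really be a distributed queue and database.

```python
# Rough sketch of the async task system: anything can enqueue a task;
# a worker pops the next task, and a task may spawn a child task that
# stays linked to its parent (so the UI can show the relationship).
import itertools
from collections import deque

class TaskSystem:
    def __init__(self):
        self.queue = deque()
        self.ids = itertools.count(1)
        self.tasks = {}

    def enqueue(self, kind, parent=None):
        task_id = next(self.ids)
        self.tasks[task_id] = {"kind": kind, "parent": parent, "done": False}
        self.queue.append(task_id)
        return task_id

    def run_next(self):
        task_id = self.queue.popleft()
        task = self.tasks[task_id]
        # Illustrative rule: a failing test run spawns a linked QA-lead task.
        if task["kind"] == "test_run_failing":
            self.enqueue("qa_lead", parent=task_id)
        task["done"] = True
        return task_id

system = TaskSystem()
root = system.enqueue("test_run_failing")  # e.g. queued by a CI/CD webhook
system.run_next()
child = system.queue[0]
# The spawned QA-lead task points back at the test run it is helping.
```

The parent link is what lets the UI attach a helper QA-lead task to the original test-run task, as described above.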
B
Got it. And then I'm particularly interested in this multitask coordination. What interface do you pass context around through for that? Are you just giving them a starter place? Do you have a tool available for them to fetch the original? Like, how does that end up working?
C
Yeah, yeah, this is cool, because this is one of the funnest parts to work on as an engineer in the system. So the test runner spins up, it's got a browser attached to it, it has access to an LLM. Let's say it hits a failing step. It spins up a QA lead, and it tells the QA lead, say, here's a screenshot of the browser where I am, I have the functionality to do anything in the browser, so click, input, hover, scroll, keys, whatever you want. Tell me what I should do. And so then the QA lead has access to an LLM as well. And so the QA lead follows this predefined loop, you might call it a skill, where it collects context. So it takes a screenshot, it looks at recent runs for the current test case that we're executing to see what they looked like at this particular step. So it collects extra context, and it also has access to basically all of the data in the account. So that includes things like custom-defined resources or context provided by our end user. So this is where that case comes in of, if you're ever executing a test and need a credit card, this is the test credit card that you should use, something like that. So the QA lead can assemble that context. It can actually manipulate the browser in a non-destructive way, in a non-side-effecting way. So for example, scrolls. Other than infinite scroll, where that actually fetches more data, scrolls are generally non-side-effecting, meaning that you can perform actions to learn more, to capture more screenshots about the state of the app, without actually changing any of the state. Again, within reason; infinite scroll is an edge case there. So the QA lead can collect more context, it can query the LLM, it can take actions, and then it can send back to the test runner and say, look, I did this, this and this, and it looks like you've gotten further in your test case. So it's not a true failure. You just needed to word your test case steps a little differently.
Here's the updated wording for your test steps, for your definition. It's been self-healed, you're good to go. The QA lead shuts down, and the test runner continues on its way. So they communicate with each other over a predefined interface, sort of using a relay through the database, like a pub/sub.
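That relay pattern, two agents exchanging messages through a shared channel rather than calling each other directly, reduces to something like the following minimal sketch. A dict of queues stands in for the database-backed pub/sub; the message shapes are our own illustration.

```python
# Minimal sketch of the database-relay pattern: the test runner and the
# QA lead never hold references to each other; each publishes to the
# other's channel and polls its own.
from collections import defaultdict, deque

class Relay:
    def __init__(self):
        self.channels = defaultdict(deque)

    def publish(self, channel, message):
        self.channels[channel].append(message)

    def poll(self, channel):
        """Return the next message on the channel, or None if empty."""
        return self.channels[channel].popleft() if self.channels[channel] else None

relay = Relay()
# Test runner asks for help, attaching a reference to its screenshot.
relay.publish("qa_lead", {"from": "test_runner_7",
                          "failing_step": 4,
                          "screenshot_ref": "blob-key-123"})
# QA lead replies with an action for the runner to perform in its browser.
relay.publish("test_runner_7", {"action": "click", "target": "Submit"})

help_request = relay.poll("qa_lead")
instruction = relay.poll("test_runner_7")
```

In production the channels would be database rows or a message broker, but the decoupling, and the ability to audit every exchanged message, is the same.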
B
Okay, got it. Interesting. So just I'm going to like echo back to make sure I understand. So you have tester agent is running and it says, oh shoot, I'm stuck, let me create a new task which is going to spin up the QA lead, it has sufficient context that the QA lead then knows my ID so it can pub sub to me, have some shared pub sub it has info about, like what test case and all these different things. Now, a QA lead has access to a whole bunch of other tools that it can use to. And based on what I'm hearing, you're using this agentic pull approach where it's basically going and fetching. It says, okay, let me fetch this from this, not preloaded with a bunch of context, but let me fetch this account info. Let me fetch anything I need. So actually one thing that, that leads me to. So if it's wanting to manipulate the browser, it's wanting to test a thing, does it do that by sending pub subs to the tester agent, having the tester agent do it, or does it have a direct route to that browser entity?
C
Yes, that's a great question. I'm loving that you're getting to this level of detail. So it goes back to the test runner, and the test runner says, I have these tools, these faculties, these capabilities that I can perform in the browser. Let me know if you want me to do something. So the QA lead sends a message to the tester agent. So the tester agent remains the source of truth for manipulating the browser. The QA lead is the thinker, or the visionary, but the manual tester agent remains the owner of the relationship to the browser. Now, conversely, the manual tester does not have access to all the account data, right? That tester is just doing what it was narrowly charged with doing. And then if it fails, it can speak up and pull in a more authorized agent to then go and access more of the account. And that agent is tasked with assembling the correct context. Now, one piece I left out is we do often pass information indirectly through blob storage. So rather than the manual tester agent sending 100 megabytes or 30 megabytes of images and data back to the QA lead, it puts that into S3, you know, blob storage, and then it gives an identifier to the QA lead to pull that down. So that's just sort of to minimize copying and sending. Basically, that data lives in S3 once the manual tester agent puts it there, and the QA lead fetches it by that identifier. So that's kind of how that works.
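The blob-indirection trick, storing large payloads once and passing only a reference through the message channel, is a common pattern; here is a small illustrative sketch with an in-memory stand-in for S3.

```python
# Sketch of blob indirection: instead of shipping megabytes of
# screenshots through the message relay, store them once and pass a key.
import hashlib

class BlobStore:
    """In-memory stand-in for S3-style blob storage."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Content-addressed key: identical data always maps to one key.
        key = hashlib.sha256(data).hexdigest()[:12]
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = BlobStore()
screenshot = b"\x89PNG fake image bytes " * 1000  # pretend big screenshot
ref = store.put(screenshot)                        # tester uploads once
message = {"task": "diagnose_failure", "screenshot_ref": ref}  # tiny message
fetched = store.get(message["screenshot_ref"])     # QA lead downloads
```

Content-addressing is our own embellishment here; the point from the conversation is simply that the message carries an identifier, not the data.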
B
That makes sense. That makes sense. So I'm going to still just keep digging into the details if you don't mind, because I geek out about this stuff entirely. So when you have the QA lead, I'm going to say, remotely manipulating the tester agent, right, it's like asking for things. Is it directly accessing the browser tools so that there's not an LLM in the loop on the tester agent side? Right. So that it's accessing through that container but not through the LLM, so the tester agent doesn't have the history context? Or is it sending a request and the LLM is doing a tool call and kind of building up this conversation of, oh, my lead asked me for this, I'm going to do this work and then send it back?
C
Yeah, very much the former. Yep. Another great question. Makes total sense. The QA lead has that LLM agentic loop. It basically has a manifest of tools, a contract that the manual tester agent is willing to fulfill. The QA lead manipulates that as if it were its own set of tools. At the conclusion, though, the QA lead does provide sort of a summary, a roll-up if you will, of: this is what I did, this is what you need to know, these are the updates you need to make to your test case when you're done. But it does not share the entire conversation history. It summarizes that and gives that back to the manual tester.
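The manifest-as-contract pattern he outlines, where the QA lead wraps the tester agent's advertised tools as if they were its own, might look roughly like this. A minimal sketch: the class and tool names are hypothetical, and the real system routes invocations over pub/sub rather than a direct method call:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class TesterAgent:
    """Owns the browser and executes tool calls directly; no LLM loop
    runs on this side of the boundary."""
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def manifest(self):
        # The advertised contract: which tools the tester will fulfill.
        return sorted(self.tools)

    def invoke(self, name, **kwargs):
        return self.tools[name](**kwargs)

def proxy_tools(tester: TesterAgent):
    """QA lead side: build wrappers so its agentic loop can treat the
    tester's tools as if they were its own set of tools."""
    def make_proxy(name):
        # In the real system this call would be a pub/sub round trip.
        return lambda **kwargs: tester.invoke(name, **kwargs)
    return {name: make_proxy(name) for name in tester.manifest()}

tester = TesterAgent(tools={
    "click": lambda selector: f"clicked {selector}",
    "fill": lambda selector, value: f"filled {selector} with {value}",
})
lead_tools = proxy_tools(tester)
assert lead_tools["click"](selector="#submit") == "clicked #submit"
```

The design point is the one Fitz makes: the lead's LLM never touches the browser, it only sees callables, while the tester executes them with no LLM in that path.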
B
That makes sense. Right. So the programmatic ownership of the browser stays in the process that is running the tester agent, but the reasoning runs off in this LLM loop in the QA lead agent, and it's only giving a digest at the end of: here's what changed, here's what I did, all these things.
C
Yeah. And the thinking there is that keeps the separation of concerns a little bit intact. If you have a browser session, so we have all different services for spinning up these different parts of the thing. We have one service that does the browsers. We want that to be owned by one sort of place in memory, one worker agent, one instantiation. So that's the test runner agent. If you find that test runner agent, you can find everything that we told it to do. We also want the expensive, you know, the full-on account access loop to live in a separate QA lead process. That's always the thing responsible, that has the authority, that has the expensive models. So that's always there. And they communicate over this pub/sub. So yeah, it's intentional to do it that way. Really the reason is, if the manual tester starts being able to access everything in the account, everything's a QA lead, right? You have one agent now, and it's just doing all these things. And maybe we get there eventually, but the benefit now is it's a little safer in these early innings of the AI world, where we don't want the manual tester screwing up and doing something in the account that we didn't intend.
B
Yeah, no, that makes sense. And it also, to your point, allows you to use much cheaper models. The smaller the model, the more it will get confused if you give it lots and lots of tools, all those different things, so you can keep it scoped down.
C
Exactly.
B
This does lead to the question of what happens when something breaks. And I'm going to put a particular case on that. I'm going to say, okay, let's say your tester agent has delegated to a QA lead agent. The QA lead agent is manipulating your browser.
C
Yep.
B
And now, for whatever reason, the browser crashes. What happens?
C
Yeah. So in that case, again, and this is where we want the more complex, we'll say, construction of prompts, if you will, the chain of thought, all these different prompts that are coming through in that agentic loop, we want that in one place. So in the QA lead loop, it'll do things at a couple different levels of abstraction. At the smallest level, it's thinking, I want to try to get this form submitted. What's it going to take to submit this form? But it's also operating within the larger context of a test case, and the goal of that test case or the description of that test case. So as it's doing things, it needs to keep periodically checking back into the context of the test case to make sure, have I strayed too far from my goal or my description? And then the outermost loop is just, is this app even up and working? Is the browser still up? Is it too slow to do anything? Is there an external issue? And periodically, we need to rise to that level of abstraction. It's kind of like the human mind, how at any given moment you can focus intensely on one thing, but you're doing that, and you're also breathing, and you're also, you know, feeding yourself throughout the day, and, you know, is there a threat to your safety? You're kind of operating at different levels of abstraction. So the QA lead has to do that. And so if the app crashes, the QA lead will rise to that higher level of abstraction and basically say, I need to stop. The test case is not working, the application has crashed. We'll pause here and we won't do any further work. And so it sends that back to the tester agent and it says, terminate your session, there's nothing else to be done, right now the application's crashed.
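The layered checking he describes, from environment health down through the test-case goal to the individual step, can be sketched as a small escalation function where the priority ordering is the point. Names are illustrative, not BearQ internals:

```python
from enum import Enum, auto

class Verdict(Enum):
    CONTINUE = auto()   # keep working the current step
    REFOCUS = auto()    # strayed from the test case goal: re-anchor
    ABORT = auto()      # environment is down: terminate the session

def evaluate_levels(browser_alive: bool, app_responsive: bool,
                    on_goal: bool) -> Verdict:
    """Check the outermost concerns first, mirroring the levels of
    abstraction described above: environment health outranks the
    test-case goal, which outranks the individual step."""
    if not browser_alive or not app_responsive:
        return Verdict.ABORT
    if not on_goal:
        return Verdict.REFOCUS
    return Verdict.CONTINUE

# A crashed app wins over everything else, even if the step was on track.
assert evaluate_levels(browser_alive=True, app_responsive=False,
                       on_goal=True) is Verdict.ABORT
```

An ABORT verdict corresponds to the message back to the tester agent to terminate the session; REFOCUS corresponds to the periodic check against the test case description.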
B
Are there any particularly challenging error states or error cases to manage? Especially, I keep coming back to: anytime you have multiple things orchestrating, where are the edge cases that you run into?
C
The really challenging thing that we're dealing with a lot now is that test data problem. And it's not so much within a single browser session that we have this issue. It's the interplay of multiple sessions running at once. So an example here, right? Let's say you say, I've got this QA team and I want to spin them up. I want to hire 100 people because I have 100 test cases. I want them each to do one thing and I want to be done in three minutes; every test case is three to five minutes long. You can do that, but you probably would give each of them their own QA testing account, so they wouldn't be conflicting with each other. For any one person, the most recently created record in one of those accounts is the record that person just created earlier in this test case, right? But now in the world of AI, at least today in BearQ, we don't take 100 QA accounts, we use the same account, but we spin up a lot of agents at once. And so there are concurrent writes, or not even necessarily concurrent, they might be sequential. But if one test case adds something to the cart but leaves it there, and the next test case goes to the cart thinking there should be nothing in it, and it sees something in there, that throws up a problem. So I actually think there's a startup-worthy problem around test data management of applications. And it's probably also related to something that we do in BearQ a lot, which is trying to build up sort of a holistic vision of the application and the organization of it. The screens, the content, the functionality, the user experience, how you interact with it. We try to build up an application knowledge graph, basically, for your application. Test data has to fill a role in there.
And that's one of the challenges that we're going to have to solve in what we call the application model, that object we construct that describes your application. Test data, and the data relationships generally. Again, in lots of apps, you figure you can't take one action without having created a whole bunch of records in the database that do XYZ, you know. And so managing that state is incredibly difficult.
B
I'm imagining, right, because your agents probably try to do some amount of self-correction in different places. I go to the cart, I see, oh, there's something in here. Well, let me clear that out so I can do my test. And then the other agent, which has been waiting on an LLM, comes back and it's like, where's my thing?
C
Right, right, exactly. You're totally right. And so I think this is where people think, oh, it's AI, it's computer-based, it's not human-based, so just spin up a hundred agents simultaneously. And it's like, we can do that; from an infrastructure perspective, that's no problem. But the data coordination isn't there. And so I'm kind of solutioning this with you right now. I think the solution is: we'll run as many parallel agents as you want, if you give us that many test accounts in your application. Then we won't be conflicting, and we'll be sure that we can run at full parallelism.
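The account-per-agent solution he sketches amounts to a bounded pool: parallelism is capped by the number of isolated test accounts, and agents block until one is free. A minimal sketch under that assumption, with a list append standing in for the actual test run:

```python
import threading
from queue import Queue

class AccountPool:
    """Hand out isolated test accounts; Queue.get blocks when none are free,
    so concurrency never exceeds the number of accounts."""
    def __init__(self, account_ids):
        self._q = Queue()
        for account in account_ids:
            self._q.put(account)

    def run_test(self, test_case, results):
        account = self._q.get()              # blocks until an account is free
        try:
            results.append((test_case, account))  # stand-in for the real run
        finally:
            self._q.put(account)             # release for the next agent

pool = AccountPool(["qa-acct-1", "qa-acct-2"])
results = []
threads = [threading.Thread(target=pool.run_test, args=(f"tc-{i}", results))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(results) == 5   # all five test cases ran against just two accounts
```

Because every agent works inside an account no other agent holds at the same time, the cart-left-full conflict between sessions goes away within an account's checkout window.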
B
Depending on what you end up testing, there still end up being challenges, right? If you're testing something like an admin account, it's going to see different records popping in and out and other things.
C
Right.
B
I wonder if there's a world in which you start building up a sort of data model, right? You mentioned you build a component model, essentially, where you're inferring a component model. But do you start building up a data model, or at least dependencies between the different data types, and between data types and components? Right. I add something to this form and something shows up over here. Okay, there's a linkage here. And then now you have an orchestration problem instead of a data problem.
C
Yeah, you're spot on. You're spot on. I absolutely think there's a startup-worthy problem there. Just basically looking at applications, it's effectively the data model in the underlying database, or, you know, if there's a NoSQL store, blob storage, whatever it is, just figuring out those relationships between fields in different data objects, and how changing one has to have an equal reaction in the other.
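One toy way to represent the inferred relationships discussed here is a directed graph where each observed cross-surface effect becomes an edge. The surface names are invented for illustration; a real system would derive them from the application knowledge graph:

```python
from collections import defaultdict

class AppDataModel:
    """Record which screens' data change when you act on another screen,
    building up the dependency graph from observed behavior."""
    def __init__(self):
        self.edges = defaultdict(set)

    def observe(self, action_surface, affected_surface):
        # Acting on one surface was seen to change data shown on another.
        self.edges[action_surface].add(affected_surface)

    def dependents(self, surface):
        # Everything known to be affected by actions on this surface.
        return set(self.edges.get(surface, set()))

model = AppDataModel()
model.observe("checkout_form", "orders_table")
model.observe("checkout_form", "cart_badge")
assert model.dependents("checkout_form") == {"orders_table", "cart_badge"}
```

With a graph like this, the test data problem becomes the orchestration problem K. Ball describes: two tests that touch surfaces sharing an edge should not run against the same account at the same time.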
B
What's interesting there, coming back to the AI coding side of things, is that you may be able to build up a better understanding than the engineers have. I mean, not in a well-run engineering org, I don't think, right? Because in theory we still are all understanding and reviewing things. But there's a whole lot of code that's getting vibe coded out there that nobody knows how it's working.
C
Yeah. And I think the key point you're making too is the relationship between those two data types may not be apparent at the code level, but it may be apparent at the application experience level. And so if you submit a form in one place and that brings up a row in a table somewhere else, those two may be pulling from potentially different places in the code through vibe coding error, right? You know, lack of proper factoring. And could we uncover that, and could we actually have almost like a source of truth for the data relationships inferred or gleaned from using the app? It's an interesting problem and I think it's a new frontier, right, that we're all approaching here. Can you understand the design of an application better from using it than from authoring it?
B
It gets to this question of how we build our understandings of what our things do. Right. This has been one of the challenges that I've been talking about with all sorts of people around vibe coding, because back when we were hand coding everything, coding itself was serving multiple purposes. One of those purposes was creating an executable artifact. Another purpose was helping us build up a mental model of a system, and also a mental model of our user and our user's problems. And if you're delegating it all to a machine, where do you get those mental models? Well, maybe you get it from something like your tool.
C
I totally agree with you there. I also think that we're not quite at the point yet, at least on our team, where people can author code without owning it. We make the argument that the LLM may still type the individual lines of code, so to speak, but each line of code you commit, you own. And so we're still at that point. I don't know if you saw, but Bryan Cantrill at Oxide Computer wrote this post about proper use of LLMs for software development. I've shared that with my team, and I really believe in a lot of the things he's calling out there. Basically everyone's using AI to author their code, but the expectation is that you know how it's working, and the data that gets spit out from one function, how it's used in the next function, you know that relationship. Certainly there's lots of pressure to go faster and maybe relinquish some of that control from the engineers. But we still own the code that we author on my team. And I think that's helping to push back a little bit on this loss of deep knowledge about how applications are built, how the data is flowing. But you're totally right that a big part of writing the code actually was getting you to keep it in your head, and helping you to make sure that these relationships that you had, or these invariants that you demanded, were in fact being upheld and were still as you thought they were.
B
And it's not even necessarily that. I think having humans have ownership in the end is really valuable and really important, and there is a whole can of worms to dig into if you start unlocking and/or removing that linkage. But it is not 100% clear to me that reading code is as effective for updating our mental models as writing it is. Certainly I see people with different amounts of challenge when they're reviewing code and trying to update things. And so I was wondering, yeah, you have this outside-in approach of, let me infer what the components are, let me infer maybe what the data model is, if not yet then maybe sometime soon. Is that exploration approach, maybe with a tool in the loop, a better way for us to build those mental models?
C
I think it's possible. But even if we infer a proper relationship graph, say, between the data, and we can explain that, that alone would still be text, and probably pretty dense text. Or maybe it's a visual, you know, if you generate a system diagram or something. So that to me is going to be consumed with the same efficacy as reading source code would be. I also strongly agree that the writing of the code is like taking notes in school; writing the note down probably helps you remember the thing better than just hearing it or seeing it on the blackboard. So I think there's definitely something to the act of writing lines of code yourself and how that contributes to your understanding. So it is possible that we'll need new tools in the future world. You may own it, you may read it, but are you really going to grok it? You may not. Is there some way to have the external result read back to you by an agent that helps you to understand what it's doing better? At the end of the day, that'll be the new normal, right? It doesn't really matter so much what the code says. What matters is, does the application work as intended? And the common retort to that is, what if it fails under scale? Well, then we'll just define that into "does it work as intended": to work as intended, it must now support 100,000 concurrent users. And then that goes and manipulates the code in certain respects. You could imagine the code really doesn't matter as long as you've adequately, in painstaking detail, described all of the requirements of your application. Of course, at that point, that requirements doc ends up looking like source code. And so would it be more efficient to just write the code yourself? Maybe. That remains to be seen.
B
So we're kind of moving now into this concept of how humans fit into this world again. And I'd like to bring that back to BearQ. We talked a lot about the inner loop and things like this. What is the outer loop? What does the human in the loop do at this point in a QA process with BearQ?
C
So it's validating that we have in fact identified components that are important and worth reusing in your application. It's validating that the test cases we've written are accurate and important. We kind of have a common-sense grounding there, but there's lots of nuance for every individual application, and so we take human input for that: are these valuable test cases, et cetera. We also take human input on the report side of things. So we make suggestions or recommendations for how we want to update your account. And you can kind of imagine, in this QA team analogy, we try to maintain this image of a human user. Whether they're on the QA team, or it's a smaller organization and maybe they're the VP of Engineering at a five-person startup, they poke their head through the door and they bark out an order, and they can get a response or an action taken in response to their command by this fleet of agents. And so the things that a director of engineering or the QA manager or the head of quality at various organizations may care about are the things that we hope to surface in our reporting feature. So we'll do things like: this is how many test cases we ran. This is a failure we saw across 13 of your test cases, for example; we think it's the same underlying issue, so we've suspended further tests that are going to manipulate that component, because we think it's an external failure and we won't waste any more time. We did this, we did that. Performance is slower today compared to the last three days, or performance is right in line as things have been, so you're good. And then the human user can say things like, well, we're doing a big push on the, I always use the profile settings as my example, we're doing a big push on the settings page. Go and create more tests there, or run the tests there again, but do it with a higher level of debug logging so I can then mine those logs later.
So we see the human basically as an orchestrator, a director, an approver, that type of thing, of these QA agents.
B
What do you have in terms of guardrails? Let's say I want to use. I mean, I think we've talked a lot about test accounts in places, but let's say I want to use you on my production application, but there may be pieces of it where I'm like, oh, that's risky, I don't want to actually put an agent in the loop there, or things like that. What can I do to control it? Are there ways I can disable destructive behavior? What are the knobs I have?
C
Yeah, so what we call resources or context is the open-form input you can give us. It includes Jira, GitHub, et cetera. It's also just open-ended instructions. So you can tell us things to do or not to do, and we adhere to those instructions throughout all of our actions. So those are always in context. You can kind of think of those as human-level instructions. So things like: don't ever visit this page, or if you're on this page, immediately navigate away. That's an instruction that's always in context for all of our LLM interactions.
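Keeping those instructions "always in context" can be as simple as prepending them to every prompt the agents construct. A hypothetical sketch; the function name, wording, and real prompt assembly are all illustrative:

```python
def build_prompt(standing_instructions, task):
    """Prepend the customer's standing instructions to every LLM call so
    rules like 'never visit this page' are restated on each interaction."""
    rules = "\n".join(f"- {rule}" for rule in standing_instructions)
    return (
        "Standing instructions (always obey, on every step):\n"
        f"{rules}\n\n"
        f"Current task: {task}"
    )

prompt = build_prompt(
    ["Don't ever visit the /billing page",
     "If you land on /billing, immediately navigate away"],
    task="Verify the profile settings form saves correctly",
)
assert "Don't ever visit the /billing page" in prompt
```

The key property is that the rules survive any summarization of conversation history, because they are re-injected on every call rather than stated once.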
B
Okay, but I'm a paranoid VP of engineering. I don't want to trust the LLM with this. Are there any hard lines I can put in?
C
Yeah, exactly. So we don't do this yet today; we're still in the early stages, having just released. But what we envision is something along the lines of what the session replay tools do, where you can annotate individual elements or pages in your application. I'm thinking like LogRocket, where you can tag certain fields, certain elements, as just no-record, or whatever, and then LogRocket just doesn't even pull them into its DOM map. And so we would do kind of the same thing there. That would kind of get you back into reality, right? Getting back to this notion of getting out of building the house on sand and instead being in touch with reality. That would be something where it's like, no, you're in the code, right? As long as your engineering team has those annotations in your application, we'll honor them and we won't even see them. So, you know, that would be a way to get you back in touch.
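A sketch of how such annotations might be honored: elements carrying an opt-out attribute are filtered out before the DOM map ever reaches an agent. The `data-bearq-ignore` name is invented for illustration (BearQ doesn't ship this yet, as Fitz notes), and elements are plain dicts here where a real implementation would walk the DOM tree:

```python
def visible_to_agent(elements):
    """Drop anything annotated as off-limits before building the DOM map
    the agents reason over; the agent never even sees these elements."""
    return [
        el for el in elements
        if not el.get("attrs", {}).get("data-bearq-ignore")
    ]

elements = [
    {"tag": "button", "attrs": {"id": "save"}},
    {"tag": "button",
     "attrs": {"id": "delete-account", "data-bearq-ignore": "true"}},
]
assert [el["attrs"]["id"] for el in visible_to_agent(elements)] == ["save"]
```

Filtering before the LLM sees the DOM is the hard line the question asks for: unlike a prompt instruction, an element that was never presented cannot be clicked by a confused model.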
B
Got it. That makes sense. And so that's kind of at the UI level. Can I put in any sort of browser-level blocks, where I'm like, okay, any path starting with this, just don't even hit it?
C
We don't support that today, but there's absolutely no reason we couldn't. You know, we're spinning up these browsers, we're setting all sorts of flags and capturing the HAR file and the video and all that type of stuff. So that's a great feature, actually; I'll take a note of that after the call here. I think it's a great idea, basically just to intercept that call and block it. Yeah, absolutely, there's no reason why we couldn't support something like that. For example, we already support setting custom cookies or setting custom local storage values in the browser. So all of that stuff is configurable.
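A hard browser-level block like the one discussed could hang off a URL predicate wired into the browser's request-interception hook (for example, Playwright's `page.route` handler calling `route.abort()` whenever the predicate matches). The predicate itself is easy to sketch; the patterns here are illustrative and would come from the customer's configuration:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Illustrative deny-list of path patterns; shell-style globs.
BLOCKED_PATH_PATTERNS = ["/admin/*", "/billing/*"]

def is_blocked(url):
    """Decide whether an outgoing request should be aborted. In a real
    harness this predicate would run inside the browser's request
    interception hook, before the request ever leaves the machine."""
    path = urlparse(url).path
    return any(fnmatch(path, pattern) for pattern in BLOCKED_PATH_PATTERNS)

assert is_blocked("https://app.example.com/admin/users")
assert not is_blocked("https://app.example.com/profile/settings")
```

Because the check runs at the network layer rather than in the prompt, it holds even if an agent reasons its way onto a forbidden path.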
B
That's awesome. So this kind of brings us into this sort of future looking conversation. Where do you see autonomous testing and just AI related quality work going over the next couple years?
C
You know, I think the trust question is still the biggest unknown, and I think that's where the bulk of the work has to come, really. Actually, I think it's the same on the coding side of things as well as the quality side of things. What's possible now is so much more than what was before, and I think we're still kind of towards the tail end of discovering the bounds of what is now possible. I'm selfishly kind of hoping the velocity of change slows down a little bit here, because the models still mess up.
B
That would be nice, wouldn't it?
C
It's so much to digest, you know; every week it's something new. So I'm hoping, I think, we're maybe near the tail end of this wave here, call it the last year and a half or two years. So now, take a breath, see what's possible, and now we set to the work, the engineering work, of building on top of this new capability. Hopefully that will be satisfying in a different way. There's been so much curiosity and, like, whoa, mind-blowing stuff in the last couple years. But I think it'll be really satisfying from an engineering perspective to get out of that building on sand and into building on rock, and to start establishing patterns and bounds and scope and best practices with these tools, and around how to build software with them. That's kind of what I predict. And so that engineering work is where you'll start to build up trust that, okay, we have a new way to build a bridge now; there are new materials, they're hyper-resilient, whatever, and they don't break. So now we have the engineering task of figuring out the right way to lay the bricks to build this bridge, or to connect these pieces here. I think that's really what the focus will be for the next couple of years. And so QA will be impacted by that, and software engineering will be impacted by that. I think it's all about building up the trust that the software is going to work with 99.99% reliability the way that the pre-AI software did. I'm not speaking about bugs; I'm talking about, like, gamma rays coming in and flipping a bit to break your non-AI software. We want to try to get to that level of error.
B
That makes a ton of sense. Related to that, everybody in the software engineering world right now is kind of grappling with: what does my role look like now? What am I still doing? What is still important? What is more important now? What things am I getting rid of? What does that look like for QA folks?
C
It's a great question. I think it's really similar. I probably shouldn't be testing the basic create, update, delete record flow in my application anymore. At a minimum, even if I don't use BearQ, I should probably be using an agent to perform these three to five steps that I do every day as a smoke test in my application. The QA person should probably think in terms of, how can I use an AI agent to do these mundane tasks, at a minimum, that I'm doing: these repetitive, well-known, well-understood, we'll say high-value, in the fact that they're working, but low-complexity tasks. That's what they should be focusing on, getting out of that loop. And then I think after that they can start to think about, maybe, and we haven't talked about this, but a lot of people talk about, the merging or the blending of roles at organizations. You kind of start to see the QA person become almost like a junior PM, or maybe a regular PM. And the PM who was doing some QA work, maybe now they have agents doing the QA work, and so there's less of a need for them to perform in that role. You know, at organizations of all different sizes, you have engineers doing QA, you've got engineers doing customer support like I did for years at my startup, you've got marketing folks who are doubling as customer success, and you've got other types of role boundaries getting blurred. So I think AI maybe accelerates that. And so in some ways it's a risk: is someone else going to perform the QA role that you used to perform? In other ways it's an opportunity: can you now contribute in other, actually more valuable, ways to your organization by using AI properly? So I think it's a raising of the level of abstraction. People are not going to operate at that low level anymore.
B
Yeah, I've seen it described as we're moving from T-shaped people, right, where you're deep in one thing and shallow in a lot, to plus-shaped people, where you're still deep in one, but with the AI you can be reasonably good at a lot of different things.
C
Yeah, absolutely. I think it's an amplifier. If you have talent and skill, you can contribute in more ways to an organization through the use of AI. One thing I actually wanted to mention: if you think of the golden era of SaaS, say the CRUD app, where I built this custom app for barbers, it's really great software that was custom-built for their domain. Maybe that's under attack a little bit in the next five to ten years, because AI can build a lot of it, and so the basic CRUD app is under threat. What that means, then, is if you're building software, you probably want to be building something more complex and harder. And I think that's where you'll see a lot of innovation now: in bringing software into the physical world in the form of drones or robots or whatever. So it's a raising of that complexity factor. Or, the other thing I was going to say, is dropping down a little lower. If you're working on something like a compiler, that's something where you probably still need a human in the loop for a lot of those decisions, maybe not for the individual function writing, you can still have AI for that, but there's a lot of complexity that the AI is not necessarily going to nail out of the gate. Some of those lower-level software engineering tasks might increase in importance as well. To make that kind of concrete, let's imagine that LLMs have taken over all the building of the CRUD software, but they're all making one common, quote unquote, best-practice mistake where they do something very inefficient. If you're looking, say, at the lower level and you're observing how memory is allocated in these apps, or you're working, say, in the JVM or something like that, there could be a real outsized impact on the 10x amount of software that is now being shipped and released on top of the JVM.
That one compute cycle that you're saving for every loop iteration might actually be, you know, millions of dollars in savings. So that type of software is worth a second look as well, I think. Basically, go higher or go lower, but don't stay in the CRUD app.
B
Yeah, for sure. Where's SmartBear looking next? You're talking about embedded; I have this little embedded device I'm playing with. Any QA harnesses for embedded coming down the line, or other things?
C
Nothing in embedded. The BearQ approach is visual by nature, so we could support something like a kiosk-based app, you know, something like an ATM. I know historically there was a whole market for testing specifically kiosk and tablet focused apps that was sort of underserved, or maybe there were one or two big, very legacy players. Opportunity for some disruption. We don't have anything in the embedded space, but we would hope to extend BearQ to desktop, maybe to kiosk, but certainly to mobile, and obviously we hope to dominate the web. So that's kind of where our head is right now: BearQ and visual processing. Embedded would be very cool, but that kind of aligns with my point. I think there will be a whole market as more software gets pushed into the physical world in the form of robots and toys and drones and whatever machines. I think there will then be opportunities for testing harnesses for those machines.
B
Yeah, well, and given a visual, outside-in, black-box approach, y'all will be pretty well positioned to start exploring that too.
C
Yeah. If you can tell me from the exterior that things are working, that I can do these 50,000 things, it's probably the case that the software is working as intended, as long as those 50,000 things are an exhaustive list of what you wanted it to do. Right? You can kind of always prove that it does everything I want. And if it doesn't, oh, it doesn't do X? Well, X wasn't in the list. So let's add it to the list and let's make sure it does it.
B
Yeah, I love it. Well, we're just about at the end of our time. Is there anything we haven't talked about that you would like to discuss before we wrap?
C
This was really interesting. I'm really glad we got into the nitty-gritty details of the multi-agent message passing. I think it's a fun problem to work on and I hope it resonates with your audience. So yeah, I think we've covered all the stuff on BearQ and all my views on AI and the velocity of change, that type of stuff. So yeah, all good on my side.
B
Cool.
This episode dives deep into the evolving landscape of software quality assurance (QA) in an era of rapid AI-driven software development. Host Kevin Ball (“K. Ball”) speaks with Fitz Nolan, VP of AI and Architecture at SmartBear (and co-founder of Reflect), about the challenges posed by unprecedented development speed, and how SmartBear’s new AI-native QA platform, BearQ, leverages autonomous multi-agent systems to explore, learn, test, and maintain web applications. The discussion spans multi-agent technical architecture, the shifting role of human QA, and the future of trust, orchestration, and data management in automated software quality.
"AI coding tools have dramatically accelerated the pace of development and the bottleneck ... has shifted to code validation and testing." – (A, 00:00)
"SmartBear helps organizations ensure application integrity across modern tech stacks ... our platform combines test automation, API lifecycle management and observability integrated across the SDLC to ensure software quality." – Fitz (C, 03:43)
"We also just released a new standalone product called BearQ, which is ... an AI native product which extends these capabilities ... to help teams move faster with confidence." – Fitz (C, 03:43)
"It's a rapidly changing landscape. ... there's just so much explosion of new applications and people pushing web applications to do totally different things." – Fitz (C, 05:24)
"You say, okay, there's more browsers out there ... all of those different multi combinatorial explosion factors." – K. Ball (B, 07:24)
"The core unit of work has to be AI native ... you fight fire with fire ... to match that velocity." – Fitz (C, 09:21)
"If you can have AI recognize [a] component many times ... you can deduplicate a little bit like a human would." – Fitz (C, 12:24)
Agent Types:
"The QA lead, the system itself kind of is that orchestrator." – Fitz (C, 23:22)
Coordination Model:
"The tester agent remains the source of truth for manipulating the browser. The QA lead's the thinker or visionary..." – Fitz (C, 27:22)
"My theory here is that the LLM providers have probably conditioned their models: don't ever give up, use more tokens ..." – Fitz (C, 13:33)
"[What] we're dealing with a lot now is that test data problem. ... The data coordination isn't there." – Fitz (C, 33:12, 35:23)
"We see the human basically as an orchestrator, director, an approver, that type of thing." – Fitz (C, 42:25)
"I think the trust question is still the biggest unknown and I think that's where the bulk of the work has to come..." – Fitz (C, 46:58)
"It's an amplifier. If you have talent and skill, you can contribute in more ways to an organization through the use of AI." – Fitz (C, 51:10)
"[The] BearQ approach is visual by nature. ... We hope to extend BearQ to desktop, maybe to kiosk, but certainly to mobile, and obviously hope to dominate the web." – Fitz (C, 53:19)
"Where is the complementary AI scale solution for quality on the output?" – Fitz (C, 07:24)
"At any given moment the human can jump in ... and confirm that things are working as they expect, but still be able to do it at a scale ... to match the software development." – Fitz (C, 13:26)
"It can build on itself. ... We've seen cases where the feedback loop can actually be pathological." – Fitz (C, 13:33)
"I think the trust question is still the biggest unknown ... same in the software on the coding side of things as well as the quality side of things." – Fitz (C, 46:58)
"The QA person should probably think in terms of how can I use an AI agent to do these mundane tasks ... And then ... they can start to think about maybe ... a blending of roles." – Fitz (C, 49:10)
"It's a raising of the level of abstraction. People are not going to operate at that low level anymore." – Fitz (C, 50:57)
"Go higher or go lower, but don't stay in the crud app." – Fitz (C, 52:56)
"I'm selfishly kind of hoping the rate of velocity of change slows down a little bit here ... every week it's like something new." – Fitz (C, 47:29)