
A
Enterprise IT systems have grown into sprawling, highly distributed environments spanning cloud infrastructure, applications, data platforms, and increasingly AI-driven workloads. Observability tools have made it easier to collect metrics, logs, and traces, but understanding why systems fail and responding quickly remains a persistent challenge. As complexity continues to rise, the industry is looking beyond dashboards and alerts towards agentic AI systems that can reason about operational data, reduce toil, and take action when things go wrong. SolarWinds offers solutions to monitor, understand, and remediate issues across complex distributed systems. The company began as a leader in network and infrastructure monitoring and has evolved to support modern applications, cloud environments, containers, and AI workloads, with a growing focus on reducing operational toil. Krishna Sai is the Chief Technology Officer at SolarWinds. He joins the show with Sean Falconer to discuss how SolarWinds is rethinking observability in the age of AI, what it means to design agentic systems for mission-critical environments, how AI-assisted programming is reshaping engineering workflows, and why the future of operations depends on building platforms where humans and autonomous agents work together. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
B
Sai, welcome to the show.
C
Thanks Sean, it's great to see you and meet you. Big fan of the show, so thanks for having me here.
B
Oh well, thank you so much. That's nice to hear. Yeah, I'm looking forward to this as well. So I wanted to start off talking a little bit about SolarWinds and kind of set the stage there, because I think a lot of people know SolarWinds maybe as a single tool that they used years ago, but you guys do a lot of different things. So given where you are today, how would you describe what SolarWinds actually is to someone who hasn't looked at the space in a while?
C
No, absolutely. Let's take a step back and look at how IT and ops teams, who typically use SolarWinds products, have been using SolarWinds for the past 25 years or so. Our product portfolio broadly spans three domains: observability, incident response, and service management. And to put it simply, IT and ops teams use us to help detect and remediate issues across a variety of workloads in their environments: network and infrastructure, which is where we started and have been a leader for a very long time, but also applications, databases, containers, ML workloads, et cetera. Our solutions cover this from a horizontal perspective, meaning they give you the ability to look at the general basic health of the typical workloads (compute, storage, network, et cetera), but also vertical cross-cutting concerns like performance, reliability, cost, security, and so on. Right. And what happens is typically IT and ops teams are accountable for SLAs and SLOs. Right. And that kind of drives your day-to-day behavior. More mature teams, of course, manage error budgets at scale, and they have nuances of that same dimension. But all of this is much more simply said than done. I was talking to a CIO on a customer call recently, at a major system integrator responsible for running big global managed services for an organization. And he said it well. He said, "I'm responsible for SLAs, but honestly I can't tell you everything that contributes to an SLA." Right. Which is a statement of the complexity in these environments. Increasingly these teams have to deal with large microservices, distributed systems, et cetera. And so complexity is very real. And so what we target, especially in the context of AI, is to reduce toil. We've all been there: waking up at 3am, alert storms, and getting into a war room.
And the problem with that is that even today a lot of the tools just ingest a whole lot of data and show you a lot of dashboards with red lights and so on. But finding out why something is red is still a big challenge. So we've been thinking about this challenge. When we think about AI assisting with this, traditionally we've gone from statistical approaches, things like anomaly detection, machine learning, basic stuff, to now a very clear shift to agentic AI, not just in our industry, but across the board. And so that's something that we want to focus on and increasingly index on. And the way we talk about that is we just call it SolarWinds AI more broadly, but in particular the agentic portion of it we call SolarWinds AI Agent. It often gets confused with AI observability, which is something that comes up a lot. The way we think about AI observability is as more of a vertical use case, right, rather than a horizontal thing. But that's also something that we're starting to do here.
B
Yeah, so I think there's quite a bit to sort of unpack there. I definitely want to talk a lot about how the use of AI agents is starting to impact the types of use cases that your customers are generally interested in. But one place I wanted to start off with, because I think a lot of people listening to the show, and certainly a lot of businesses that I talk to, are all interested in understanding what some of the leading technology companies are doing when it comes to leveraging AI. So I wanted to talk briefly about AI-assisted programming. So start off there. Like, how heavily has SolarWinds invested in various AI-powered programming tools? Is that something where every engineer has now got a junior engineer in their pocket, where they're leveraging something like Claude Code? Where are you with that?
C
No, we are, we are investing very heavily. All our engineers use AI assisted coding, actually. Super interesting because as yesterday we were kind of reviewing the progress that we've been making through the year and all engineers have enabled using AI assisted, both in terms of copilots as well as agents and so on. And what we're broadly seeing is that in general we're seeing of course, increasingly, you know, increasing code being generated with AI. For sure, the percentages vary from organization to organization and how we measure it and so on. But we're seeing increased commit velocity. You know, our commit request velocity has gone up like 25, 30% north of it sometimes. But part of it also what we're seeing is for deployment, you know, frequency has gone up as an example. So we're seeing some improvements in lead times. But what we're also seeing is that tools are maturing, which means that both the acceptance rate as an example of generated code is significantly improved over the last year as both the models have improved and the agents have improved. But the shift is of course now more on the code review side. Right. Like a lot of a code review portion is still the bottleneck has moved there. So that's something that we are starting to address and look at how do we make it better and so on. But broadly, I would say the signals are positive. Of course, the concerns are the usual concerns around code quality and flaky test generation and security guidelines, et cetera. Like those things still are very much front and center. But you know, one maybe this is kind of where you are going is that when we think about agent, I mean, coding is actually a very good baseline because we've all been working as an industry for a long time and the nuances of okay, how does the application or the use of coding agents now extend to broader enterprise like software use cases is something that's front and center. And that's where we've also been. 
We've been playing both sides of it: using coding agents and building our muscle around how agents work, while at the same time offering them through our products to our customers. Right. So there's that context as well. So one of the things I think about is this notion of how these agents apply in enterprise software use cases. Take a coding example. If you ask a coding agent to go add OAuth support to my function, or refactor this module to be async, then instead of taking a single code snippet, the coding agent breaks the task down, reads the repo, and inspects dependencies. There's a lot going on underneath: editing multiple lines, running tests, so on and so forth. So this notion of setting the intent and then the system deciding what actions it needs to take to drive towards that goal turns out to be a very good mental model to baseline on in terms of how we think about agents in the context of enterprise software.
B
Yeah, I think that's a very astute point. I've been thinking about this a lot recently; I actually wrote an article about it. I think one of the reasons why programming has been such a tip of the spear is, like you said, there's all this history associated with it, but it's also a hard-truth environment, I would say, where even though the output might be non-deterministic, there are deterministic ways of checking the correctness, and then that becomes almost like a reinforcement learning cycle, because I can compile the code, I can run it against unit tests, and most environments don't have that. So I think for other non-coding environments to be successful at the level of coding, you need to be able to create similar deterministic guardrails where you can actually evaluate the outputs in some reasonable way, so you have confidence that it's actually generating something correct.
C
No, absolutely. And I think that's why that analogy makes a lot of sense to me. When we try to internalize it, it's like: what is the intent, and how do you measure it? Which is why in operational practices things like SLOs and SLAs come in super handy, because at the end of the day what you're really trying to drive towards is a certain healthy operational state, which is actually well defined in the non-agentic, very human-driven world as well. Take incident response as an example. In the traditional setup, a threshold fires, a page goes out, an engineer wakes up, and the engineer does a series of things: checks dashboards, pulls logs, inspects traces. It's a very, very disciplined type of approach, and the industry itself has matured along those lines. But for an agentic system to then, let's say, mimic it, so to speak, and to make it a lot more effective, efficient, and later on autonomous, one approach is to ask: how do you not remove that logic or that set of practices, but have an agentic system absorb it, so to speak? Right. And that's where I think a lot of the implementation and design challenges come in. So the way we think about a typical agent, as an example: if you have an SLO, a system could be observing rising error rates. It notices that, okay, this is isolated to a specific service, and then it correlates that to a deployment that happened 10 minutes earlier. It observes a trace pattern that happened during a previous incident and concludes that this is a bad config change or whatever. For all of these steps, I would say there's a lot of historical knowledge and actions that an agent can learn from. Right.
And I think that's where I think the analogy with a lot of what happens or how do coding agents work in a coding use case versus an operational agent working in an operational use case? There's a lot of similarities there.
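The triage steps described in this answer (an SLO breach on one service, then a look back at recent deployments to that service) can be sketched roughly as follows. This is an illustrative sketch only, not SolarWinds code: the `Deployment` record, the `triage` function, and the 30-minute lookback window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    """Hypothetical record of a deployment event."""
    service: str
    version: str
    deployed_at: datetime


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def suspect_deployments(service, breach_time, deployments,
                        lookback=timedelta(minutes=30)):
    """Deployments to the same service shortly before the breach, newest first."""
    window_start = breach_time - lookback
    candidates = [
        d for d in deployments
        if d.service == service and window_start <= d.deployed_at <= breach_time
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


def triage(service, errors, requests, slo_error_rate, breach_time, deployments):
    """Compare the observed error rate to the SLO and list likely causes."""
    rate = error_rate(errors, requests)
    if rate <= slo_error_rate:
        return {"service": service, "status": "healthy", "error_rate": rate}
    suspects = suspect_deployments(service, breach_time, deployments)
    return {
        "service": service,
        "status": "slo_breach",
        "error_rate": rate,
        "suspects": [d.version for d in suspects],
    }
```

A real agent would also weigh trace patterns and past incidents, as the answer notes; the point here is only that the SLO gives the agent a measurable target to reason against.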
B
Yeah, absolutely. And you have a tremendous amount of experience in, I guess, traditional enterprise infrastructure, having worked at a number of large, successful organizations. How much do you see building agentic systems as something that's brand new versus, I don't know, a rebranding of some of the typical things that we would do with any software application?
C
Yeah, no, that's a great question, actually. You know, and if you think about this evolution of operational services systems, if you think about it like traditionally they were monitoring systems and then monitoring just there were things that were polling and observing the state, so to speak, manual thresholds. And you act like in the first generation, so to speak. And then they evolved into things like observability, where the system expressed its state through multiple ways. And then there was a way to correlate them across these multiple signals. And then there was this concept of AIOps essentially, which is essentially using these signals and coming up with ways to correlate those signals and making decisions still very, I would say, very early on, but that's where it started. Then we had started to see cases of copilots emerging where there's assistive tooling that is sitting next to all the human decision making that's happening. And now we're starting to see the early green shoots of agentic AI, where there are agents that can actually act even autonomously at times, within boundaries and so on. So there's an evolution of how this industry has gone through all of that. And some of that is, I would say, the natural push and pull of technological evolution. But also a lot of what has made that almost an existential need is just the sheer exploding complexity and tool sprawl and data. And it's just at some point we all like realize that it's this just a human is not going to scale in terms of maintaining the health of these complex environments. Right. Like, that's the evolution that I've been seeing.
B
Yeah, I mean, we saw a similar evolution in the world of biology too. Like, there's just the sheer amount of data that exists in, in biology. Like, people started sequencing the, you know, DNA and stuff like that 30, 40 years ago at this point. So technology has served a massive role there. And a lot of people say that the 21st century is going to be the age of sort of biology because of the fact that now we have powerful enough computers, we have these really powerful models to kind of assist in the data crunching involved in essentially evolving that science. Because no one human, no matter how gifted you are, could possibly, you know, keep all those things in your head. And I think we're seeing a similar evolution in technology because there's just any complex enterprise environment, there's literally thousands of different data systems that might be sending important signals all over the organization. It's like, how do you sort of start to be able to parse that? And most businesses are sitting on these terabytes or heaps of unused data where they hold onto it because they might be useful some day, but they don't have sort of a way to unlock that use.
C
That's right. That's right. And if you extend that and think it through in an operational context, this was an aha moment that we had probably a couple of years ago. We've always kind of loosely had this strategy, but it came more front and center a couple of years ago. If you think through how operational systems have evolved in digital environments, there was a monitoring and observability industry which was only focused on getting signals, showing dashboards, and giving alerts. And then you had incident response, or more DevOps-y types of environments, where you really saw all those signals, but adapted to how teams were operating with incident response: being able to observe, maintain SLOs of services, manage error budgets, and so on. And then you had the IT systems off on the side, with very ITIL-driven, service-delivery-driven, shall we say, structured processes that enabled enterprises to scale these practices. But what all of that did was create silos in terms of IT operations management, IT service management, et cetera. And the silos happened in organizations, but they also happened with data. And to your point, when you have all of this massive data ingestion, which is something that we've mastered very well now, it ended up creating all these massive data silos. So what happens is when you want an agentic system or any kind of AI system to work on top of data silos, that becomes incredibly complex and expensive. You end up with this third wheel of separate AI systems that ingest all the data and then have to process data on top of that. And then you have multiple dashboards and consoles.
And so when you're dealing with operational situations and war rooms, there are just so many different consoles spread around everyone's desktop. With all of these challenges, it really became critical that we take another look at how these things work. Internally in SolarWinds, as an example, in our design discussions we use the left brain, right brain analogy, right? Talking about biological systems, the human brain is the most wonderful biological system for observability ever created. We walk around the face of the earth ingesting all kinds of signals through our five senses. And you can be in a crowded mall with a lot of noise, and someone says "Sean," and you're instantly paying attention to that, right, with your name. And the subconscious seems to have this phenomenal mechanism of being able to process all those signals, decide what's important and what's not, and surface the, shall we say, actionable signals to the conscious, where you as Sean can say, "Hey, is that really meant for me? Oh, maybe that's a different Sean. I can go about walking around getting cookies in my mall," so on and so forth, right? So you can do all those things. So when we extend that analogy to an observability use case, you have this system of observability on the one side, which is optimized for your mean time to detect, so to speak. And then you have these right-brain systems, which were conscious systems in the past, where there were actions and runbook automations and workflows, et cetera, all optimized around remediation. And these two come together as a unified system. And then you have the analogy of the subconscious and the conscious, where in the subconscious there's all this processing that's happening, and the conscious is where the human element comes in.
And increasingly the conscious actions which are the actionable things are also becoming dimensions or various dimensions of autonomy. Right. So this analogy of the human brain is something that we talk about a lot internally when we think about these systems.
B
How does that impact the way that you think about using agents for observability? I think traditionally, as we've talked a little bit about, observability is a lot of dashboards, it's metrics, it's maybe some level of statistical ML to highlight certain issues like anomalies and so forth, but it was really about giving humans visibility into the systems. But now with AI agents potentially playing a role there, it's really about giving inputs to machines, not necessarily dashboards for people. So how does that change how you would build these systems, and how is SolarWinds approaching this problem?
C
No, I agree. It may be useful to baseline on a couple of examples of how we've seen this evolution take place, and then I'll go into more of the design choices that we've had to make. So if you think about this evolution of how gen AI started to help out with problem solving, diagnostics, resolution, and so forth, we saw this copilot phase where gen AI essentially became a very capable interface layer sitting next to the system, not necessarily inside it. So when you're debugging a production issue, the copilot helps you summarize things, explain an error pattern, and write a query across multiple metrics and traces and so on. Still very useful, but fundamentally still very reactive. We have a couple of examples to illustrate this. We have the Configuration Agent as an example, right? And the Configuration Agent is interesting because configuration is one of those highest-leverage, highest-risk surfaces in modern systems. If you think about the number of outages that have been caused by poor configurations, it's pretty massive. And the other thing that's super interesting is that if you think about DNS misconfigs, which we hear about every other week these days, or certificate expirations and overly permissive security rules and so on, a lot of them don't necessarily result in a crash. So you don't get an exception that you can go look at, but there's a subtle degradation of service behavior. And so you don't realize that the outage actually happened, because nothing specifically crashed and the system was just executing to your configuration. What typically happens in a copilot world is that the configuration failure would be discovered after the fact.
An engineer would get paged and the copilot would summarize all of that. But what we're increasingly doing is changing that behavior to where a config agent is continually looking at how the service itself is degraded. And then when there's a service degradation, it has all the information required to go and correlate that to a config change, and then make a very effective choice about whether to do a rollback or something else, like bringing a human in the loop, and so forth. So we're starting to, I would say, see these types of use cases increasingly. Now, when we think about engineering for a lot of this, there are some very important considerations. Right. And the hardest problems actually tend to be architectural in nature. And that happens especially when you're dealing with production systems, mission-critical systems. How do you think about building AI software that can act in real-world environments and not just be a copilot? How can you start to build that autonomy? And that becomes a very, very important design choice. Internally, we think of this as AI by design, which is not AI first or AI everywhere, which tends to get misunderstood a lot. It's: how do you design a platform from the beginning with the assumption that AI-driven components will exist, will evolve, and will eventually operate autonomously? Right. And I think the mistake that a lot of teams end up making is to treat agents like a feature that you bolt on after the fact. And then what you end up with is these powerful models, which people have spent billions of dollars building, sitting behind super brittle guardrails. And then you have this problem where, hey, I have the best model. Why is it not giving me the results that I want?
Then you realize that you don't have the basic system that is really designed for these things to act. So we have this statement that we use: the model can propose, but the platform must dispose. Right? Meaning treat the model for what it is. It's a great reasoning component, but make sure that the platform is the safety boundary. And so when you start to build out these types of systems, you have to have specific architectural platform components in place, and these become very concrete design choices. LLM gateways are a great example. Early on when we started experimenting, teams were wiring logs directly to an LLM. It works great in a demo, and executives are super impressed. But it immediately runs into problems the first time you try to put anything close to it in production, because costs spike unpredictably, sensitive data gets into prompts, different teams hard-code different models, and then suddenly you can't change providers, enforce policy, and so on. One of the initial design choices that we had to make was to bring all of that together in a platform service, an LLM gateway, which handles everything from model selection and abstraction to PII masking, rate limiting, auditability, and so forth. So a lot of those shared concerns across your expanding AI use cases are isolated in one place. So that's a very good kind of choice. The other one for us is that you just can't throw MELT data at LLMs. Right. Like the logs thing that I mentioned: it's one thing to generate a whole lot of logs, but logs tend to be super noisy. And so we have to think about how you feed logs to an LLM or a model for decision making. You need to really think about how you are going to compress them, de-duplicate them, and summarize them before you ever expose them to a reasoning layer in your platform.
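To make the gateway idea from this answer concrete, here is a minimal sketch of a shared entry point that routes by use case, masks obvious PII, rate-limits per team, and keeps an audit trail. Everything in it is an assumption for illustration: the `LLMGateway` class, the route and model names, and the two masking patterns are hypothetical, and a production gateway would need far more thorough PII detection.

```python
import re
import time
from collections import defaultdict, deque

# Deliberately simple patterns; real PII detection needs much more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    return IPV4.sub("<IP>", EMAIL.sub("<EMAIL>", text))


class LLMGateway:
    """Hypothetical gateway: callers name a use case, never a model."""

    def __init__(self, routes, backends, max_calls_per_minute=60,
                 clock=time.monotonic):
        self.routes = routes        # use case -> model name (needs "default")
        self.backends = backends    # model name -> callable(prompt) -> str
        self.max_calls = max_calls_per_minute
        self.clock = clock
        self.calls = defaultdict(deque)  # team -> timestamps of recent calls
        self.audit_log = []

    def _allow(self, team: str) -> bool:
        """Sliding one-minute window rate limit per team."""
        now = self.clock()
        window = self.calls[team]
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.max_calls:
            return False
        window.append(now)
        return True

    def complete(self, team: str, use_case: str, prompt: str) -> str:
        if not self._allow(team):
            raise RuntimeError(f"rate limit exceeded for team {team!r}")
        model = self.routes.get(use_case, self.routes["default"])
        safe_prompt = mask_pii(prompt)
        response = self.backends[model](safe_prompt)
        # Only the masked prompt is stored, so the audit trail holds no raw PII.
        self.audit_log.append({"team": team, "use_case": use_case,
                               "model": model, "prompt": safe_prompt})
        return response
```

Because the model is chosen inside `complete`, swapping providers or sending a narrow use case to a smaller model is a change to `routes` rather than to every calling team's code.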
So instead of asking a model to read like a half million lines of log lines, you have a compact representation of what changed, what's anomalous, what's new. And then that's a kind of a classic systems approach that we've applied in other use cases. But then you need to bring that into something like an AI system. The same thing kind of also. Yeah, yeah.
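One common way to get that compact representation is to collapse log lines into templates and count them, so the reasoning layer sees what is frequent and what is new rather than every raw line. The sketch below is an assumption about how this could be done, not a description of SolarWinds internals; the placeholder tokens and the `compact` function are invented for the example.

```python
import re
from collections import Counter

# Order matters: timestamps first, then long hex ids, then remaining numbers.
_VARIABLE_PARTS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TS>"),
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),
    (re.compile(r"\d+"), "<N>"),
]


def to_template(line: str) -> str:
    """Replace the variable parts of a log line with placeholder tokens."""
    for pattern, token in _VARIABLE_PARTS:
        line = pattern.sub(token, line)
    return line.strip()


def compact(lines, known_templates=frozenset(), top=5):
    """Summarize raw log lines as template counts plus never-seen templates."""
    counts = Counter(to_template(line) for line in lines if line.strip())
    return {
        "total_lines": sum(counts.values()),
        "distinct_templates": len(counts),
        "top": counts.most_common(top),
        "new": sorted(t for t in counts if t not in known_templates),
    }
```

The `known_templates` set stands in for whatever baseline the platform keeps from earlier windows; anything outside it is what "new" means here.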
B
So I wanted to go back to a couple of things that you said there at the beginning. You talked about these copilot experiences where, even with a really great copilot, you're still finding out after the fact that there's some sort of problem, and it has to essentially wait for a user to prompt it. And it sounds like you're trying to move to a world where you have these more ambient agents that are always on. And I'm a big proponent of this. I think that's going to be the next evolution of the use of agents. There's been a lot of success in the B2C world with AI, especially with ChatGPT, and I think that has been fantastic. But it's also locked a lot of companies into thinking that the only way these interfaces work is as a chatbot. It's some sort of copilot, maybe there's an agent behind it, but it's still just a chatbot waiting for a user to ask a question. But in these operational use cases, I don't want to have to ask if there's a problem. I want the system to know that there's a problem, do some work on my behalf, and then loop me in when human decision making is required. So it kind of sounds like you're thinking about that in a similar way. And I guess one of the questions I had is that I think there is in a lot of ways a substantial difference between chatbots in the B2C world and the B2B world of solving specific operational use cases. If I want to constrain this to how do I figure out whether we have some sort of DNS config issue or a certificate has expired, the shape of that problem is very different from a chat interface where anybody can ask any unbounded, unlimited set of questions. So I'm curious, how does that shape your thinking when it comes to the types of models that you might need to use, or even the way that you're architecting systems?
Do you need trillion parameter models in that world or can you get away with something that's more like an SLM that's tuned to the particular problem at hand?
C
Right, right. No, I agree. I think the model selection is definitely very, very important, and we do that. And that's why the example of the LLM gateway is a very good one, where depending on the type of use case and the consumption, you can actually make that choice at the gateway level rather than the downstream consumer having to make that choice. And you're absolutely right. What we initially saw is that LLMs, as they get bigger and better, are able to handle a lot of the generic use cases pretty well, right? And then you marry that with a RAG system as an example. And even there, when you think about RAG systems, there are a lot of design choices that you have to make. But marrying that with a system that needs to be able to work in the background, all of those things need to tie together. Which is why initially bolting a copilot on top of a set of existing APIs exposed via MCP, as an example, will get you going with very basic assistive tool calling. And when you wake up at 3am dealing with alert storms, you have no idea where to start. Even having a copilot that gives you some contextual prompts based on the problem or alerts you're looking at makes a lot of sense. But immediately the problem shifts to: okay, what now, what next? And how much of this can I do? And that is where I think a lot of these ambient agents, as you call them, come into the picture. And you're absolutely right. There are use cases where you do need, let's say, the heavier models. For example, we use Claude models a lot. We use other models as well, but Claude models a lot. And we use the Haiku models for some very specific use cases. For example, one of the agents that we have in our service management (ITSM) product is for when a ticket comes in.
Typically, it gets assigned to a human agent, and the human agent has to go review the ticket. And oftentimes a ticket gets forwarded 15 times before it ends up with the right person. So when an agent gets a ticket, there's all this history behind it and a lot of cognitive load. One of the agents that we have is one that'll go look at your incident data and previous incident data, go through that whole process, and generate a context summary. So when a human agent comes in, they see not only what's currently going on, but also: this is the customer you're dealing with, this is the sensitivity, and here's the summary of everyone who has touched it. Sean's looked at it, Sai's looked at it before, and here's the summary of what they thought. And by the way, here's a suggested response, and you can quickly edit it as a human in the loop and respond. The beautiful thing about that example is there are so many things that come into the picture. You have your entire system working in the background. You have all the architectural choices you had to make around LLM gateways and model choices, and RAG systems that have to deal with chunking, similarity measures, embedding dimensions, and so on. It has a user experience dimension of extending an existing use case so that a human doesn't have to deal with completely new workflows. And it's an agent-first example: the agent has done all the work for you and is looping in a human right when you need it, at the point where the highest-stakes decision is required. A lot of these design choices come together in a very, very elegant use case, I would say.
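The retrieval step behind a ticket-context summary (chunking text, then ranking past tickets by a similarity measure) can be illustrated with a toy example. This sketch swaps in a plain bag-of-words vector where a real RAG system would call an embedding model, and the `chunk`, `embed`, and `most_similar` helpers are hypothetical names chosen for the illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10):
    """Split text into overlapping word windows (requires size > overlap)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query: str, documents, k: int = 2):
    """Return up to k documents with non-zero similarity, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return [doc for score, doc in ranked if score > 0]
```

The retrieved tickets would then be handed to the model as context for the summary and the suggested response, with the human agent editing the result.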
B
Yeah, I think it's a massive sort of operational efficiency win for the company too, where you don't have to spend sort of human cycles routing tickets around, which is probably nobody's favorite job to do.
C
It's super. I mean, this was one of those initial use cases that we rolled out. But the amazing thing was the take rate on something like this was instantaneous. Right. And the MTTR goes up by 30 to 50% just overnight using a very effectively well thought out use case that comes into your natural workflow. Right. And I think that's why I'm, that's why I'm bullish on a lot of these things. Because if you do take the care to have a systems approach to this problem, rather than bolting on flaky agents on an unstable kind of an environment, we really design it, engineer it, build a platform, build these platform primitives and take a user experience first approach. Then you actually can leverage a lot of this. And talking about ROI and so on, you can see instant ROI in these types of agentic use cases right up front, while still maintaining all the concerns around boundaries and choices and so on.
B
Yeah, it's also easier to bound the problem when it's sort of more use case specific. You know, if you're looking for config issues, for example, then you have more of a bounded input data set to build guardrails around, versus just like, hey, someone could ask for the weather report for my upcoming family vacation as well as do log analysis on our system logs. That's hard to build test cases for, essentially.
C
That's right.
B
One of the things you also mentioned there earlier was this idea that you don't necessarily want to just take your raw logs and firehose them into a model. It's going to be hard for a model to interpret them. It's probably also going to explode your token cost to some degree, because you're putting in a lot of input tokens and there's just going to be a lot of noise in that data set. And I think where companies kind of struggle sometimes is, because you have data all over the place, sort of the easiest thing to do is try to attach the agent so it can pull in the raw data from all those different systems. And that might work okay in the demo, but realistically you don't want the raw data. You need like a refined data set. Essentially the data set needs to be massaged into something that's kind of purpose built for the AI system. Just like, you know, you don't take a bunch of raw ingredients and call it a meal. You refine your spices and your salt and pepper, you know, your raw ingredients, and you cook them together and you make a meal. And you call that a meal. I think you need a meal for your data sets as well, for these AI systems to be, you know, successful. So what are some of the things you're thinking about there, and how is that being reflected in some of the tools that, you know, SolarWinds is building?
C
No, I agree with that. And I think, you know, in terms of recipes, thinking about that, right, you need to have a system in place. And this is where I think, going back to the platform choices and decisions that we had to make, especially for a data heavy system like ours, very, very broadly, we think about our platform in three different planes, and this is not unusual. There's a data plane, which is where we deal with a lot of what we call MELT data (metrics, events, logs, traces), topology, ingestion and normalization. A lot of that happens in the data platform. And we have the idea of a control plane, which of course deals with all your policies and actions. And then we think about what we call a reasoning plane, which is where the intelligence and the agents tend to operate. And that separation I think is very intentional because it mirrors, number one, how big systems like Kubernetes, service meshes, et cetera are already built. But it also goes back to what you were talking about, which is you're dealing with a lot of very, very operational concerns like distributed state, partial failures, components that should never have implicit attack authority, and so on and so forth. A lot of those types of things come up, which is why having an explicitly permissioned-by-design reasoning plane is a first class design choice that we had to make, right? And that distinction actually matters a lot because the aha moment for us was you want models and the reasoning plane to do what they're good at, which is really be expressive but not be dangerous, which is the key distinction, right? You want your models to do what they're really good at.
So by actually separating the concern and saying let the reasoning plane go do what it's really good at, explore hypotheses, propose actions, but the execution of those actions, et cetera, when things have required mutation or action, they always happen through a control interface that enforces things like least privilege, et cetera, right? Like that's a very, very critical, I would say, design choice that we had to make. The second one was that when we think about autonomy, don't think of it as flipping a switch, think of it like a tiered capability, meaning that having an autonomy level into every type of action is something that needs to be baked in and built in. So you can think of it as an autonomy level. You could start with something like recommend only, no mutations, right? And then you could evolve to something like execute with approval. And then later on you can go to execute autonomously, but with certain constraints, right? For certain low risk, high performance use cases. And then you can also tune it based on action type, by environment, by service, by team. So having all these types of bells and whistles as you build autonomy into the reasoning plane for your system, I think is super important. And then the other one that is very important in terms of actually realizing these things in production is to make sure that you have first class transaction traceability for all your agentic actions. Because things will go wrong and when they go wrong, you need to have the ability to go back and look at the why and what. Under whose authority did an agent do certain things? And these things are important. You don't realize this up front, but these things scale very fast. I mean, the moment you give your engineering teams the ability to say, go use these LLM gateways and use these agentic frameworks for your use cases, overnight you will see 20, 50, 100 different agents being built and contributed. 
And so having these baseline things in place is super important. Then last but not least, being an observability vendor ourselves, we always keep reinforcing the fact that you have to make observability for your autonomy a first class concern. And there's a lot of activity happening in forums like OpenTelemetry, et cetera, now to support this as well. So these are, I would say, some of those things that cross the boundary of not just being guardrails, but also being good, sound platform decisions that help you achieve these things at scale.
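The tiered autonomy and traceability ideas described in this answer can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, action types, environments and log fields are all hypothetical, and actual execution is left out. The point is that the reasoning plane only proposes, the control plane decides based on a per-action, per-environment autonomy level, and every decision is recorded with the authority it was made under.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Autonomy(IntEnum):
    """Tiered autonomy levels, from least to most permissive."""
    RECOMMEND_ONLY = 0
    EXECUTE_WITH_APPROVAL = 1
    EXECUTE_AUTONOMOUSLY = 2


@dataclass(frozen=True)
class ProposedAction:
    """What the reasoning plane hands over; it never executes anything itself."""
    agent: str
    action_type: str
    environment: str
    on_behalf_of: str


@dataclass
class ControlPlane:
    """Hypothetical control interface that enforces policy and records an audit trail."""
    policy: dict  # maps (action_type, environment) -> Autonomy
    audit_log: list = field(default_factory=list)

    def submit(self, action: ProposedAction, approved_by: Optional[str] = None) -> str:
        # Anything not explicitly permitted falls back to the safest level.
        level = self.policy.get(
            (action.action_type, action.environment), Autonomy.RECOMMEND_ONLY
        )
        if level == Autonomy.EXECUTE_AUTONOMOUSLY:
            decision = "executed"
        elif level == Autonomy.EXECUTE_WITH_APPROVAL:
            decision = "executed" if approved_by else "pending_approval"
        else:
            decision = "recommended"
        # Record what happened, why, and under whose authority.
        self.audit_log.append({
            "agent": action.agent,
            "action_type": action.action_type,
            "environment": action.environment,
            "on_behalf_of": action.on_behalf_of,
            "approved_by": approved_by,
            "autonomy_level": level.name,
            "decision": decision,
        })
        return decision
```

Keying the policy on action type and environment is one simple way to express "tune it by action type, by environment"; a real system would likely add service and team dimensions as well.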
B
Yeah, I think it's really important to have kind of that running decision log from these systems. And like you mentioned, when you turn these things on, suddenly you're going to have, I think, a huge volume of usage. I made this analogy recently, and I think it kind of rings true: it's like when banks suddenly had mobile apps, and they went from a world where in order to check the balance in your bank account you had to go to an ATM and pull it up, where maybe somebody's doing that once a month if they're really motivated. And then suddenly it's like I can do that 100 times a day. And their APIs are just getting hammered as a result. And they had to scale those systems massively. I think it's similar when you start to turn these things on: the usage is just going to really escalate. You have to think about that as well from a systems design perspective.
C
Exactly. No, I agree with that. And that's where I think things like token usage, et cetera, a lot of those experiences, compressing logs, these types of things come in. Having good design decisions and deciding how you're going to distribute the design choices when you're building large scale systems greatly helps with how you think about all of those things.
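As one illustration of compressing logs before they reach a model, the sketch below collapses raw lines into templates by masking numbers and counting repeats, so the prompt carries a handful of patterns instead of every line. The function name and masking rule are assumptions for illustration; production systems typically use more careful log parsing.

```python
import re
from collections import Counter

# Mask runs of digits so lines that differ only in IDs or timings collapse together.
_NUMBER = re.compile(r"\d+")


def compress_logs(lines, top_n=5):
    """Return the most frequent log templates with their occurrence counts."""
    templates = Counter(
        _NUMBER.sub("<n>", line.strip()) for line in lines if line.strip()
    )
    return [f"{count}x {template}" for template, count in templates.most_common(top_n)]
```

Sending the compressed view cuts input tokens and noise, at the cost of losing the exact values that were masked, so the raw lines should stay retrievable for follow-up queries.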
B
Looking sort of longer term, how do you see the human role evolving in the space? Are they going to be supervisors or auditors? Collaborators with agents?
C
It's an excellent question. It's front and center for all of us. The way we think about I think a lot of even as engineers, as an example, like how does an engineer's role evolve?
B
Right.
C
As an example in a lot of this, the way we internalize it is what we're seeing is more than just a tool shift, like a tool used to be able to do XYZ and it's now becoming this. What we're seeing is that there's a responsibility shift, meaning the role of an engineer or an operations person is clearly changing, or will change, from having to be the author of the logic to how are you going to engineer the context? So the shift from writing logic to engineering a context is a very, very important shift. And we're starting to see that. And that's why things like coding agents are super useful, because you're already starting to do a lot of that in your day to day, I would say, as an example, right. In the traditional way, responsibility was super deterministic. You wrote the code, something broke and you are responsible. But with agentic systems, that responsibility actually moves earlier and becomes more probabilistic. And engineers, it's just our nature, we are super uncomfortable with that. Right. We want everything to be very, very deterministic and want to encode logic. And I think learning to do that is a very subtle, nuanced shift that will happen across the board. And so think about overcoming that uncomfortable learning that engineers have to go through, and thinking about agentic systems, which force you to think in terms of probabilistic engineering. That's a very big shift that I think we all need to go through. But I think the shifting of that responsibility from building business logic to engineering context is a very good way to frame how I think our responsibility shifts in this process.
B
Yeah, I think that is probably the biggest fundamental shift that we're seeing, sort of what the role of software engineering is. Historically, it really is about writing these precise rules and logic to be able to represent the business logic. And now essentially your program in many ways almost becomes like your data pipeline. It's that context engineering funnel into the model. And the business logic is essentially a byproduct of the execution of the model. And so part of that is non-deterministic, which is a hard thing to sort of, you know, wrap your mind around and get comfortable with. And then it introduces these new things, like we've been talking about, that become front and center: observability, tracing, decision logs, evals replacing unit tests, and things like that.
C
And you know, the other thing that, as you were talking about, the other thing that occurred to me was one of the things that we don't talk about enough is having a sense of like emotional resilience when you're dealing with these systems.
B
Right.
C
Because these systems can be super frustrating. There's that initial aha, like I have the superpower. And then soon they start to get very frustrating because they hallucinate, they make surprising choices. And we've seen that the engineers who actually succeed are the ones who are able to treat that as feedback rather than failure. Right. And it's like that classic engineering thing. You have to kind of go back to the core of your engineering-first mindset. And we see that, for example, with engineers and teams that are doing this better. A lot of teams will look at those initial results and say, this is garbage, I'll just write the script myself, rather than saying, hey, that's interesting. I wonder what context is missing. Right. And what constraint did I forget to encode in my prompts or in my system? Right. So building that mindset I think is very critical, because agentic systems improve through iteration and they're not like one-off correctness types of systems, you know?
B
Yeah, I think that goes beyond just engineering as well. I think that's true of people who are using LLMs for generating marketing material and things like that. Like, I've certainly seen my fair share of people in that world who are more resistant to it. They put one prompt in and they're like, oh, it didn't work for me. I didn't get, you know, the perfect press release on the first attempt. And they don't know how to kind of take that signal as feedback and figure out how to provide the right context.
C
Right. Or you end up dealing with AI slop. Right. Which is essentially the difference is that different systems have different challenges. Writing some content for SEO is one thing where you can probably get away with not getting the exact thing right, but dealing with mission critical systems is a whole other thing. So there's a big spectrum there and I think going back to how we used to think about even levels of engineers will change a lot. It's no more about you knowing the syntax and you knowing about styling guides and effective principles of basic writing software, but it's more about how you're able to architect the systems and how you're able to work the systems to do what you want to do, which is really build great products that people love to use.
B
Yeah, exactly. I think there's going to be a shift in terms of where we've historically rewarded really deep expertise in a particular domain. And now, at least for certain classes of problems, I think you can kind of get away with having more of a generalist sort of wide view of how to build things, because the LLMs are so good at having that deep expertise in like the syntax of Rust or whatever programming language you happen to be working in. Actually, one thing along these lines is, what are your thoughts on this kind of, I think, perceived risk of what this does to the junior developer, because we are working at a higher level of abstraction now. And I think that initially with a lot of the copilots, the perception was these are great for junior devs. Like this kind of levels them up a little bit. But now I think with the agentic systems and this change to context engineering and really needing to understand how the pieces of the puzzle go together, that's something that generally more senior people in the organization are good at, and they are now getting a lot of value out of these models. So of course the risk, the challenge is how do you get the junior people to become senior people if suddenly the junior person isn't getting sort of that hands-on training? And I guess what are your thoughts on that? And is that a problem that people should be concerned about, or are we just kind of barking up the wrong tree?
C
What we're seeing is, of course this is all super early and we're learning a lot as we roll out these systems and so on, but what we're seeing is both are true, meaning the junior developers tend to be a lot better at adapting and learning how these systems work and being able to get the system to do what you want. So they get really good at that. The more senior devs are really better at understanding, hey, agentic systems don't replace your basic engineering discipline. They actually expose and amplify where a mess exists already, right. So teams, for example, which have clear ownership, strong data and good operational fundamentals are then able to use agents as force multipliers, and the teams without those foundations tend to struggle and tend to experience agents as unpredictable and challenging and so on. So one of the things that we have done internally, which is actually a pretty good practice, is to build these AI communities and have a very, very active, vibrant community where you have senior devs and junior devs actively engaged in sharing knowledge, because there's a lot of learning that's happening on a daily basis, and being able to share that knowledge and be able to articulate, hey, here's what worked, here's what didn't work. And by the way, as we are going through this, here are some architectural foundations and best practices that we need to put in place on which we can build greater production systems and so on. I think doing some of these things tends to help greatly, is what I've seen.
B
Yeah, absolutely. I think that one of the misses companies sometimes make is that they buy these tools and then they think that overnight there's going to be, you know, a 100% efficiency gain from it, and they forget essentially that it's a new skill set that you have to give people time to sort of train up on. And that's a really important part of the process.
C
The other thing which is interesting that you mentioned, because this is the other challenge for most companies that are adopting these, is having to justify the ROI for spending on a lot of these tools. Yes. And that's a common challenge, and the way we try to balance that is, number one, focus on the cases where ROI is very quickly, shall we say, well established. For example, if you take customer service, right, they're highly metrics driven, they know what works for them and they know what doesn't work for them. So you can actually measure your ticket deflection rate or your time to first response or time to resolution, and these things can actually be measured. Engineering productivity, on the other hand, as you know, is not at all well defined. There's a lot of, we talk about
B
number of lines of code written.
C
Yeah. Lead times and commit velocities and all of that, which is good. I mean, we try to, it's somewhat early, but what we are really poor at is being able to tie a lot of that to business outcomes. So again, going back there, for example, we had to do a lot of work, and it's early, in terms of being able to quantify certain metrics: are your lead times improving, as an example, are your change failure rates decreasing, et cetera, a lot of those types of things. Having that in place also tends to help. So you know that you're trying something and you're learning, but over a period of time you're starting to get those signals or green shoots of, hey, this is actually helping me do things better. Work better.
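The two delivery metrics named here, lead time and change failure rate, are straightforward to compute once deployments are recorded. The sketch below assumes a hypothetical `Deployment` record with commit time, deploy time and a failure flag; how a team defines "caused a failure" is its own decision and is not specified in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one change reaching production."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool


def median_lead_time_hours(deployments):
    """Median time from commit to production deploy, in hours."""
    return median(
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    )


def change_failure_rate(deployments):
    """Fraction of deployments that led to a failure in production."""
    return sum(d.caused_failure for d in deployments) / len(deployments)
```

Tracking both over the same window, before and after adopting agentic tooling, gives a trend to compare rather than a single number to defend.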
B
Yeah. I mean, I think the reality is it doesn't matter how powerful the models are, you still need to do the work essentially around putting the right metrics in place to measure what success looks like. I think sales, similar to customer service, is another place like that: if you're a sales rep, you're going to use whatever tool allows you to hit quota more often. So if you can unlock that, there's, you know, clear ROI for the business. Well, Sai, I want to thank you so much for being here. I really enjoyed the conversation, and the last thing here, is there anything else you'd like to share?
C
No, it's just thank you again for the opportunity and we're super excited about where all of this is going and we're taking a very comprehensive approach to how we look at this, and we're definitely excited about moving the needle, so to speak, in terms of how a lot of these things continue to add value for us even as we adopt and learn. So thanks again for the opportunity.
B
Great. Well, cheers. Thanks.
C
Cheers. Thanks, Sean.
Podcast Summary: Software Engineering Daily
Episode: Engineering AI Systems for Autonomy and Resilience with Krishna Sai (SolarWinds CTO)
Date: February 24, 2026
Host: Shawn Falconer
This episode features Krishna Sai, CTO of SolarWinds, discussing the evolution of engineering autonomous and resilient AI systems in enterprise IT—particularly within the context of observability, incident response, and service management. Sai and host Shawn Falconer explore how SolarWinds has adapted to the rise of distributed, cloud, and AI-driven environments, the challenges of operational complexity, the shift to agentic (AI-agent-based) architectures, and the ongoing transformation of engineering roles and workflows.
"The problem with that is that even today a lot of the tools just ingest a whole lot of data and show you a lot of dashboards with red lights and so on. But still finding out why something is red is still a big challenge."
— Krishna Sai [04:23]
"This notion of setting the intent and then the system deciding what actions it needs to take to drive towards that goal turns out to be a very good mental model to baseline on in terms of how we think about agents in the context of enterprise software."
— Krishna Sai [08:47]
"At some point we all like realize that just a human is not going to scale in terms of maintaining the health of these complex environments."
— Krishna Sai [14:22]
"We talk about this internally: the human brain is the most wonderful biological system for observability ever created... In extending that analogy to observability use cases... these two come together as a unified system."
— Krishna Sai [18:58]
"When you start to build out these types of systems, then you have to have specific architectural platform components... the model can propose, but the platform must dispose."
— Krishna Sai [24:31]
"Models... to do what they're good at—really be expressive but not be dangerous... execution of those actions always happen through a control interface that enforces things like least privilege."
— Krishna Sai [37:42]
"The amazing thing was the take rate on something like this was instantaneous... the MTTR goes up by 30 to 50% just overnight."
— Krishna Sai [33:19]
"The shift from writing logic to engineering a context is a very, very important shift... responsibility actually moves earlier and becomes more probabilistic."
— Krishna Sai [42:06]
"[Agentic systems] actually expose and amplify where a mess exists... Teams with clear ownership, strong data, and good fundamentals use agents as force multipliers..."
— Krishna Sai [48:56]
"Engineering productivity... is not at all well defined... We talk about number of lines of code written... but what we are really poor at is being able to tie a lot of that to business outcomes."
— Krishna Sai [51:17]