Podcast Summary: Software Engineering Daily
Episode: Engineering AI Systems for Autonomy and Resilience with Krishna Sai (SolarWinds CTO)
Date: February 24, 2026
Host: Shawn Falconer
Episode Overview
This episode features Krishna Sai, CTO of SolarWinds, discussing the evolution of engineering autonomous and resilient AI systems in enterprise IT—particularly within the context of observability, incident response, and service management. Sai and host Shawn Falconer explore how SolarWinds has adapted to the rise of distributed, cloud, and AI-driven environments, the challenges of operational complexity, the shift to agentic (AI-agent-based) architectures, and the ongoing transformation of engineering roles and workflows.
Key Discussion Points & Insights
1. The Evolution of SolarWinds and Observability (02:16–05:33)
- Expansion of Offerings: SolarWinds has evolved beyond classic network monitoring to cover observability, incident response, and service management for modern workloads, including cloud, containers, and AI systems.
- Horizontal & Vertical Concerns: The platform covers both the horizontal view of infrastructure (compute/storage/network) and vertical concerns like performance, security, and cost.
- Operational Complexity: SLAs remain central, but in distributed, microservices-based architectures it is increasingly unclear how each system element contributes to them.
- Industry Pain Point: Traditional tools provide abundant data and dashboards, but identifying why failures occur remains highly challenging.
"The problem with that is that even today a lot of the tools just ingest a whole lot of data and show you a lot of dashboards with red lights and so on. But still finding out why something is red is still a big challenge."
— Krishna Sai [04:23]
2. AI-Assisted Programming at SolarWinds (06:20–09:38)
- Universal Use: All engineers at SolarWinds now use AI-assisted coding (copilots, code agents), resulting in a measurable boost in commit velocity (up 25–30%) and deployment frequency.
- Tool Maturation: Acceptance rates for AI-generated code are rising as models improve, shifting bottlenecks to code review.
- AI Agents as a Foundation: The real opportunity is translating lessons from programming agents to broader enterprise automation use cases.
- Agentic Mental Model: Task decomposition, intent specification, and autonomous action are central to both AI-based coding and operational agents.
"This notion of setting the intent and then the system deciding what actions it needs to take to drive towards that goal turns out to be a very good mental model to baseline on in terms of how we think about agents in the context of enterprise software."
— Krishna Sai [08:47]
3. From Statistical AI to Agentic AI in Operations (10:27–14:54)
- Historical Evolution: Monitoring → Observability → AIOps → Assistive Copilots → Agentic AI
- Driving Force: Exponential complexity in IT environments necessitates true autonomy—humans alone can't maintain the health of today's sprawling systems.
- Analogy with Biology: Just as biology needed computational help to make sense of burgeoning data, enterprise IT needs AI agents to parse and operationalize vast data landscapes.
"At some point we all like realize that just a human is not going to scale in terms of maintaining the health of these complex environments."
— Krishna Sai [14:22]
4. Breaking Down Silos & Building for Whole-System Intelligence (15:56–20:12)
- Past Siloing: Monitoring, incident response, and IT service management developed as separate silos, both organizationally and in data.
- Biological Analogy: SolarWinds uses a “left brain/right brain” system analogy: the “subconscious” is real-time observability processing; the “conscious” is decision-making and remediation actions—these need to be unified for effective system health.
"We talk about this internally: the human brain is the most wonderful biological system for observability ever created... In extending that analogy to observability use cases... these two come together as a unified system."
— Krishna Sai [18:58]
5. Embedding Autonomy: From Copilot to Always-On Agents (20:45–27:17)
- Copilot vs. Autonomous Agents:
  - Copilot Model: Still reactive; summarizes issues after the fact.
  - Agentic Model: Continuous, ambient agents detect, reason about, and act on issues, even before humans are paged.
- Real-World Example (Config Agent): Proactively monitors config changes, correlates degradations, suggests or initiates rollbacks.
- AI by Design: Autonomous components must be architected in from the start—not as bolt-ons.
- Key Platform Decision: "The model can propose, but the platform must dispose"—models reason and suggest, but safety/execution boundaries are enforced by platform.
"When you start to build out these types of systems, then you have to have specific architectural platform components... the model can propose, but the platform must dispose."
— Krishna Sai [24:31]
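The config agent example above (watch config changes, correlate them with degradations, suggest rollbacks) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SolarWinds' implementation: the `ConfigChange` and `Degradation` types, the 10-minute correlation window, and the `suspect_changes` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConfigChange:
    change_id: str
    service: str
    applied_at: datetime


@dataclass(frozen=True)
class Degradation:
    service: str
    detected_at: datetime


def suspect_changes(changes, degradations, window=timedelta(minutes=10)):
    """Pair each degradation with config changes to the same service that were
    applied at most `window` before it was detected (hypothetical heuristic)."""
    suspects = []
    for d in degradations:
        for c in changes:
            lag = d.detected_at - c.applied_at
            if c.service == d.service and timedelta(0) <= lag <= window:
                suspects.append((d, c))
    return suspects


# Illustrative data only.
t0 = datetime(2026, 1, 1, 12, 0)
changes = [
    ConfigChange("chg-1", "checkout", t0),
    ConfigChange("chg-2", "search", t0 + timedelta(minutes=1)),
    ConfigChange("chg-3", "checkout", t0 - timedelta(hours=2)),
]
degradations = [Degradation("checkout", t0 + timedelta(minutes=4))]

# Change ids the agent would propose for rollback; it proposes, it does not execute.
rollback_candidates = [c.change_id for _, c in suspect_changes(changes, degradations)]
print(rollback_candidates)
```

The sketch only produces rollback candidates. In keeping with "the model can propose, but the platform must dispose," whether a rollback actually runs would be decided elsewhere in the platform.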
6. System Architecture for AI Autonomy (Data, Control, Reasoning Planes) (36:06–40:33)
- Three-Plane Platform:
  - Data Plane: Ingests and normalizes metric/log/event/topology data.
  - Control Plane: Executes actions, enforces policy.
  - Reasoning Plane: Where AI/agents operate; they propose but cannot directly mutate state.
- Granular Autonomy Levels: Actions may be "recommend only," "execute with approval," or "autonomous," depending on context/risk.
- Mandatory Traceability: All agentic actions need transaction traceability and auditable decision logs to ensure operational transparency and safety.
"Models... to do what they're good at—really be expressive but not be dangerous... execution of those actions always happen through a control interface that enforces things like least privilege."
— Krishna Sai [37:42]
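One way to picture the split between the reasoning and control planes is the sketch below. It is an illustration under assumptions, not the SolarWinds design: the `Autonomy`, `Proposal`, and `ControlPlane` names, the policy table, and the audit record fields are hypothetical. Only the three autonomy levels and the rule that proposals never mutate state directly come from the episode.

```python
from dataclasses import dataclass, field
from enum import Enum


class Autonomy(Enum):
    RECOMMEND_ONLY = "recommend only"
    EXECUTE_WITH_APPROVAL = "execute with approval"
    AUTONOMOUS = "autonomous"


@dataclass(frozen=True)
class Proposal:
    """What the reasoning plane emits: a suggestion, never a state change."""
    action: str
    target: str
    rationale: str


@dataclass
class ControlPlane:
    # Hypothetical policy table: action name -> permitted autonomy level.
    policy: dict
    audit_log: list = field(default_factory=list)

    def handle(self, proposal, approved_by=None):
        # Actions missing from the policy fall back to the most restrictive level.
        level = self.policy.get(proposal.action, Autonomy.RECOMMEND_ONLY)
        if level is Autonomy.AUTONOMOUS:
            outcome = "executed"  # a real system would invoke the action here
        elif level is Autonomy.EXECUTE_WITH_APPROVAL and approved_by:
            outcome = "executed"
        elif level is Autonomy.EXECUTE_WITH_APPROVAL:
            outcome = "pending approval"
        else:
            outcome = "recommended"
        # Every decision is recorded, whether or not anything ran.
        self.audit_log.append({
            "action": proposal.action,
            "target": proposal.target,
            "rationale": proposal.rationale,
            "level": level.value,
            "approved_by": approved_by,
            "outcome": outcome,
        })
        return outcome


plane = ControlPlane(policy={
    "restart_pod": Autonomy.AUTONOMOUS,
    "rollback_config": Autonomy.EXECUTE_WITH_APPROVAL,
})
rollback = Proposal("rollback_config", "checkout", "latency spike after change")
results = [
    plane.handle(Proposal("restart_pod", "web-1", "memory leak pattern")),
    plane.handle(rollback),
    plane.handle(rollback, approved_by="oncall"),
    plane.handle(Proposal("drop_table", "orders", "unvetted suggestion")),
]
print(results)
```

Defaulting unknown actions to "recommend only" is a design choice made for this sketch: anything the policy does not explicitly allow is surfaced to a human rather than run.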
7. Model Orchestration and Scalability (29:13–34:22)
- LLM Gateways: Centralized routing, masking, abstraction, and policy for using different models in the pipeline; avoids pitfalls of “wiring logs directly to an LLM.”
- Right-Sizing Models: Not every use case requires massive LLMs; small language models (SLMs) or task-tuned models are often more efficient for operational agents.
- Use Case Example: Ticket handling/recommendation agent in ITSM improves MTTR by 30–50% (that is, tickets are resolved faster) and removes tedious tasks like ticket triage.
"The amazing thing was the take rate on something like this was instantaneous... the MTTR goes up by 30 to 50% just overnight."
— Krishna Sai [33:19]
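The gateway idea (central routing and masking instead of wiring logs straight to an LLM) might look something like the sketch below. Everything specific here is an assumption for illustration: the model names are placeholders, the routing table is invented, and the two masking patterns are far narrower than what a production gateway would need.

```python
import re

# Hypothetical routing table: task type -> model name (placeholder names).
ROUTES = {
    "ticket_triage": "small-task-tuned-model",
    "root_cause_analysis": "large-general-model",
}
DEFAULT_MODEL = "small-task-tuned-model"

# Deliberately simple masking patterns, for illustration only.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask(text):
    """Replace IPv4 addresses and email addresses before text leaves the platform."""
    text = IPV4.sub("<ip>", text)
    return EMAIL.sub("<email>", text)


def route(task, text):
    """Return the (model, masked prompt) pair the gateway would forward."""
    return ROUTES.get(task, DEFAULT_MODEL), mask(text)


model, prompt = route("ticket_triage", "User bob@example.com cannot reach 10.0.0.12")
print(model, "|", prompt)
```

Routing unknown task types to the smaller model reflects the right-sizing point above; a real gateway could just as reasonably reject them.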
8. Data and AI: Preprocessing, Guardrails, and Feedback Loops (34:51–36:06)
- Data Preprocessing Is Essential: Ingesting raw logs inflates cost and reduces efficacy; data needs deduplication, summarization, and refinement before it reaches reasoning engines.
- Analogy: "You don't take a bunch of raw ingredients and call it a meal."
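As a rough illustration of the deduplication and summarization step, the sketch below collapses repeated log lines into templates with counts before anything is sent to a reasoning engine. The templating rule (replace numbers and hex ids with `<n>`) and the function names are assumptions chosen for this example, not a method described in the episode.

```python
import re
from collections import Counter

# Replace numbers and hex-like ids so that repeated messages share one template.
VARIABLE = re.compile(r"\b(?:0x[0-9a-f]+|\d+)\b", re.IGNORECASE)


def template(line):
    return VARIABLE.sub("<n>", line.strip())


def summarize(lines, top=3):
    """Collapse raw log lines into (template, count) pairs, most frequent first."""
    counts = Counter(template(line) for line in lines if line.strip())
    return counts.most_common(top)


# Illustrative log lines only.
raw = [
    "timeout after 3000 ms on conn 17",
    "timeout after 3000 ms on conn 42",
    "timeout after 5000 ms on conn 42",
    "disk usage at 91 percent",
    "",
]
summary = summarize(raw)
print(summary)
```

Four raw lines become two summary entries here, which is the kind of reduction that keeps token cost down and gives the model a clearer signal.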
9. Humans and Agents: Evolved Roles & Collaboration (41:39–47:13)
- Role Transformation: Engineers shift from writing business logic to “engineering the context” in which agents act—responsibility moves from explicit code to curating data and constraints.
- Probabilistic Mindset: Embracing probabilistic rather than deterministic outcomes; agentic systems improve via iteration, not one-time correctness.
- Emotional Resilience: Success with agentic systems requires treating “failure” as feedback—iteration, not frustration, is key.
"The shift from writing logic to engineering a context is a very, very important shift... responsibility actually moves earlier and becomes more probabilistic."
— Krishna Sai [42:06]
- Junior vs. Senior Engineers: Juniors often adapt rapidly to agentic workflows; seniors better parse engineering trade-offs and foundational practices. Building vibrant cross-level AI communities inside organizations accelerates collective learning.
"[Agentic systems] actually expose and amplify where a mess exists... Teams with clear ownership, strong data, and good fundamentals use agents as force multipliers..."
— Krishna Sai [48:56]
10. Measuring ROI and Adoption (50:24–52:05)
- ROI Justification: It is easy to measure value in metrics-rich domains (e.g., customer support, sales) but harder in engineering productivity.
- Approach: Start with use cases where impact is immediate/measurable (e.g., ticket resolution) and use clear metrics like MTTR, ticket deflection, lead time, change failure rates.
"Engineering productivity ... is not at all well defined ... We talk about number of lines of code written. ... but what we are really poor at is being able to tie a lot of that to business outcomes."
— Krishna Sai [51:17]
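Two of the metrics named above, MTTR and change failure rate, are simple to compute once incidents and deployments are recorded. The sketch below shows the arithmetic. The incident timings and deployment outcomes are made-up illustrative numbers, not figures from the episode, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolution over (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta(0)) / len(durations)


def change_failure_rate(deployments):
    """Fraction of deployments flagged as having caused a failure."""
    return sum(1 for failed in deployments if failed) / len(deployments)


# Illustrative data only: two incidents before and two after adopting an agent.
t = datetime(2026, 1, 1, 9, 0)
before = [(t, t + timedelta(minutes=60)), (t, t + timedelta(minutes=100))]
after = [(t, t + timedelta(minutes=40)), (t, t + timedelta(minutes=56))]

# Relative reduction in MTTR; a positive value means faster resolution.
improvement = 1 - mttr(after) / mttr(before)
print(mttr(before), mttr(after), f"{improvement:.0%}")
print(change_failure_rate([False, True, False, False]))
```

Expressing the change as a relative reduction makes the direction explicit: a lower MTTR is the improvement.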
Notable Quotes & Memorable Moments
- "A human is not going to scale in terms of maintaining the health of these complex environments." — Krishna Sai [14:22]
- "You want your models to do what they're really good at—which is really be expressive but not be dangerous." — Krishna Sai [37:42]
- "The model can propose, but the platform must dispose." — Krishna Sai [24:31]
- "The shift from writing logic to engineering a context is a very, very important shift." — Krishna Sai [42:06]
- "Agentic systems improve through iteration and they're not like one-off correctness type of systems." — Krishna Sai [44:57]
Timestamps for Key Segments
- [02:16] SolarWinds' expanding role & complexity in enterprise IT
- [06:20] Ubiquitous AI-assisted coding at SolarWinds; impact on workflows
- [10:27] Transition from monitoring to fully agentic AI systems
- [15:56] Data silos & whole-system intelligence; biological analogies
- [20:45] Copilot vs. ambient agentic approaches; autonomous actions
- [24:31] "Model can propose, platform must dispose"—ensuring safety
- [29:13] Model selection, LLM gateways, ITSM agent example
- [36:06] Data, control, reasoning plane architecture; guardrails for autonomy
- [41:39] Human–AI collaboration, evolution of the engineering role
- [47:13] Junior vs. senior engineers, creating internal AI communities
- [50:24] ROI thinking and measuring agent adoption success
Summary Takeaways
- The future of enterprise IT operations is orchestrated through a fusion of human expertise and autonomous, agentic AI systems.
- Building resilient and autonomous platforms requires deep architectural changes, not just feature additions. Critical platform choices—like separation of data, control, and reasoning—enable safety, scalability, and traceability.
- The engineer’s job is evolving from authoring explicit logic to architecting the contexts and constraints that guide AI agents—a shift as much about mindset as about tooling.
- Bounded, use-case-specific applications (e.g., config management, ticket triage) unlock the most value today while containing risk.
- Effective adoption of autonomous agents depends on preprocessed data, clear operating guardrails, thoughtful metrics, and a robust, shared engineering culture.
