
A
And then fast track to today. Basically, for about 10 years we have been building this product to help customers detect those IT outages and then even predict and fix them automatically using AI technology.
B
Welcome to Embracing Digital Transformation, where we explore how people, process, policy, and technology drive effective change. This is Dr. Darren, chief enterprise architect, educator, author, and most importantly your host. On this episode: From Mars to the Data Center, AI That Prevents Cloud Outages, with Dr. Helen Gu, professor at North Carolina State University and founder and CEO at InsightFinder. Helen, welcome to the show.
A
Thank you for inviting me.
B
I'm really interested about this topic because so far in the AI realm and AI agent realm, it's been kind of the wild west. And I haven't seen really good software best practices applied to it like we do in other systems. But you got something unique that we're going to talk about. But before we do, everyone that listens to my show knows that I only have superheroes on the show. And every superhero has a background story, an origin story. So Helen, what's your background story? What's your origin story? Where does Helen come from?
A
Yeah, so the story goes back almost 30 years. I actually started doing research on how to enable robust distributed systems using AI technology almost 30 years ago. Can you believe it?
B
Wait, wait, wait, 30 years ago? You can't do AI 30 years ago. The AI was created by OpenAI, right?
A
No, so OpenAI actually leveraged technology like neural networks, right? But neural networks were invented several decades ago. And so the idea of using neural networks to extract patterns and perform predictions and classifications is actually a long-established research topic. So about 30 years ago, my starting point is that I was trying to actually enable robust video streaming from Mars to Earth because my research was funded by a NASA PI funding project. And so the idea is that you send the rover, right? So believe it or not, there's one vehicle we send to Mars, and then basically the rover will collect the scenes on Mars and then will send the video back to Earth. And so my research is to make sure this video streaming is reliable, right? And so you can imagine there's all kinds of potential issues from Mars to Earth, like glitches or hardware failures or delays, et cetera. So my research is to use neural networks to predict the resource usage based on the video content. And so at that time I built basically a 3-tier, 13-neuron network with maybe about 10 or so parameters. That's basically the starting point, right? So fast forward to today, and everybody knows ChatGPT and they know this is AI. But my research is about how to use AI technology to make our systems more stable so you can actually enjoy, you know, all the IT services. Like you can enjoy your movies on the Internet smoothly. And so that's basically my research. Along the way I got my PhD from the University of Illinois Urbana-Champaign. And then after that I joined IBM Research as a researcher, and I joined a very interesting project. That was actually around 2004, so I started my career as a researcher studying how to enable robust real-time data stream processing.
So that was actually way before what you have probably heard about today, Spark or Hadoop. And so even before that there was a thing on virtualization or cloud, right? So at that time, IBM was trying to build this robust distributed streaming infrastructure that can process video data, audio data, and text data in real time and provide the insights, right? So we detect, for example, fraudulent transactions. So my research was to work with those AI experts and to make sure those AI models can run smoothly with the right infrastructure support. So starting there, basically I realized that there are a lot of AI technologies invented to analyze text data, video data, or image data. But very few technologies were invented to analyze very noisy machine data, for example, system logs, application logs,
B
telemetry, telemetry coming out of the CPU, all that stuff, right?
A
Yeah, exactly. I started to think, okay, if we can use those AI technology to extract the patterns for those text or video, why can we not use AI technology to extract patterns for our machines? So I started to dive into that field and then developed a set of prediction algorithms to predict, for example, disk hardware failures. It was actually pretty successful at the beginning. The AI algorithms we invented were more in the supervised learning space. And then I quickly realized that to apply this kind of supervised machine learning, it's very hard to gather the training labels needed to build quality models. So I basically changed my direction to laser-focus on unsupervised machine learning, where you can ask the AI model to learn automatically without any human guidance.
B
Well, wait just a second. On the unsupervised, because a lot of people, I understand it a little bit, but a lot of the people might be confused on unsupervised. That means there's no human there telling it yes or no, that's the right direction or not. Isn't there a fear? Isn't there a fear that now the AI is going to do the wrong things or not adhere to some kind of guidelines? Like I'm thinking Asimov's Laws for robots. Right. I don't want an AI to go off and decide that we, we have to destroy the human race. Right. Unsupervised is what's it going to do. So can you explain a little bit more on the unsupervised what that really means? Because it does mean no human interaction. But there are guardrails, right?
A
Yeah. So the unsupervised learning here is not learning human behavior or human language. Right. That is what ChatGPT is about, right? Those LLMs are called large language models because they are basically predicting human language. Right. So here I was talking about using unsupervised learning to understand machine behavior. The good news for machine behavior is that we are not at risk of the machine actually lying or misleading; there's no intention for doing that. And so the idea is that you can have those AI algorithms effectively extract patterns, because there's so much data you need to analyze. And so basically I developed a set of unsupervised learning algorithms. But it's a very challenging task, because without any labeled training data it's extremely hard to make it accurate. So that's basically how I started my career as a faculty member at NC State, and I started to laser-focus on this research area. We were funded by the National Science Foundation and Google and IBM, a set of companies. After that we published a set of research papers, and then companies started to reach out to me saying, can we use it in a commercial environment? One of the companies was actually Google; I had been collaborating with Google for a long time before that outreach. But then I realized, okay, there's an opportunity for me to really apply the technology in the real world. So I went to Google as a data scientist to work with their SREs side by side to see how we can use our algorithms to predict or detect anomalies in Google Cloud. And so about 11 months later, we had used over 20 real cloud outages at Google Cloud to show that our algorithms can accurately predict all kinds of problems in real production environments with much better accuracy, efficiency, and scalability.
So Google actually licensed the technology, which is also a very unusual thing. And then basically I left Google, launched InsightFinder, and then fast track to today. Basically, for about 10 years we have been building this product to help customers detect those IT outages and then even predict and fix them automatically using AI technology.
B
So I was going to ask you about that right now. I mean the first step was about detection. Right?
A
Right.
B
I can see this pattern. I know when I've seen this pattern before, I had an outage. Right. Now you can do prediction. You can say, hey, we're headed down a path we've seen before where an outage might be coming. So you've moved from, you know, detection to prediction. And now you said we've also moved into auto correction as well.
A
Yeah.
B
So do you think with the auto correction stuff that these systems will become even more and more dynamic in that they're going to reroute the way things are done kind of dynamically instead of it being so fixed and static? Because I'm just thinking when I set a microservice architecture up together and I connect all the endpoints together through an API gateway or whatever, that is very static. Right. And I could have failures, but with your stuff here, I could automatically reroute all that stuff dynamically without having to write a line of code or have a human in the middle. Is that correct?
A
Yeah. So rerouting requests is just one of the actions you can take. There are a lot of other actions you can take that are more non-intrusive, for example resource scaling, so you can add more resources, or you can adjust certain parameters. So there are different ways to fix the problem. But the majority of problems are actually a mixture of different factors. Outages in real production environments typically don't just happen all at once. Right. It always starts from some minor issues that propagate. So the idea is that you want to catch those minor issues as early as possible. And typically those minor issues are not noticed by the user. If you can fix those problems early on, most of the issues are pretty easy. For example, a lot of production outages are caused by a certain resource depletion bug. You run out of disk space, right, on one of, for example, your proxies or your app servers. And if you run out of disk space, your whole service is down. Right. But if you can capture this kind of depletion trend early on, you can stop the bleeding and figure out, okay, what kind of request is causing this resource depletion, and you can stop it. In this way you prevent the whole system outage. Right. So you localize the impact, and you can isolate the root cause and then fix it easily. Yeah.
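[Illustration, not from the episode] The idea of catching a disk depletion trend early can be sketched with a plain least-squares line over recent usage samples. This is only a minimal sketch under stated assumptions: the function name `hours_until_full`, the sample values, and the 24-hour warning window are hypothetical, and nothing here reflects how InsightFinder's product actually works.

```python
from typing import Optional, Sequence


def hours_until_full(
    hours: Sequence[float], used_pct: Sequence[float], capacity_pct: float = 100.0
) -> Optional[float]:
    """Fit a least-squares line to (hour, used %) samples and extrapolate.

    Returns the estimated hours from the last sample until usage reaches
    capacity_pct, or None if usage is flat or shrinking.
    """
    n = len(hours)
    if n < 2 or n != len(used_pct):
        raise ValueError("need at least two paired samples")
    mean_t = sum(hours) / n
    mean_u = sum(used_pct) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(hours, used_pct)) / var_t
    if slope <= 0:
        return None  # no depletion trend in this window
    intercept = mean_u - slope * mean_t
    t_full = (capacity_pct - intercept) / slope
    return max(0.0, t_full - hours[-1])


# Hypothetical samples: disk usage growing 2 percentage points per hour.
samples_t = [0, 1, 2, 3, 4]
samples_u = [60.0, 62.0, 64.0, 66.0, 68.0]
eta = hours_until_full(samples_t, samples_u)
if eta is not None and eta < 24:
    print(f"warning: disk projected full in about {eta:.1f} hours")
```

A straight-line fit is the simplest possible choice; a real system would likely need something more robust to bursts and log rotation, but the early-warning principle is the same.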
B
You know the first thing that pops into my mind, why haven't we done this so much earlier? Because we have all this data and it's, I would think it's highly deterministic. Is it not deterministic? I mean, because a lot of people say, oh, computers, right. They only do what I tell them to do. But why is this such a problem?
A
Yeah, so several things, right? One thing is that if you look at a computer system, one difference is that there are so many parameters and configurations you need to understand. Just a simple example, right? A web server can have thousands of parameters you need to tune in order to make it, you know, function. So it's a very high-dimensional space problem. And that's one of the differences for humans, right? Our brain is trained to make decisions when we are given a limited number of signals, right? We can make a very good decision if I tell you clearly, oh, this server has a full disk; you will probably fix that. But what you see is basically thousands of nodes, and each node has hundreds of thousands of metrics, all dynamic. And so it's very hard for human to try to narrow down or localize the problem. Right? So it's always easy when you see the problem. But the problem is how do you find that needle in the stack?
B
Because there's too much going on is what you're saying, right? Because especially in the cloud, on one server, I could have a hundred different applications running on one server, right?
A
Right, exactly. And they interact with each other, they sometimes interfere with each other. There's all kinds of highly fluctuating behavior. So the system is actually fluctuating and dynamic by nature. Right. So it's not like traditional IT systems, which are very predictable, very stable. So that's easy. Right.
B
But clouds, clouds are very complex because you don't know what's going to run on them ahead of time.
A
It's very generic. Yeah. And now the AI system make that even more complicated.
B
Oh yeah, yeah, exactly. So where do you see this technology moving in the future? Right? Because it sounds like it could be very powerful if I could do predictive analysis. And I'm assuming with your algorithms you're still using a neural network, you're still using, you know, prediction and things like that. Where do you see it going in the future? Do you see these architectures, I don't know what to call it, a cloud operating system, a cloud monitoring system, do you see them becoming even more dynamic and even more alive, self-healing, than what we have today? And what does that look like in the future?
A
Yeah, I think the future is basically now. What we saw is that you have different elements introduced to the system, right? In the early days you only had web services, and now you have the microservices architecture. Then you start to have highly dynamic environments. Now we have LLMs, and we also have agents, AI agents. So the interaction is not only between different IT services or between software and hardware; it is also between different models, between different agents, and between the agents and the infrastructure, right? So the interaction becomes a lot more complicated. And so the analysis you need to do also needs to expand to cover the whole space, right? In order to accurately detect or predict problems or do root cause analysis, you need to mind your models, data, and infrastructure all together, because you don't know where the problem starts from. And the other thing that is also very different is that we need to perform this kind of analysis in real time. Most of the time, if you look at machine data, you have pretty clear data patterns, right? And now if you look at AI agents, because they interact with humans, there's a lot more, you know, you have unpredictable data
B
being pushed into the system because humans are involved.
A
Yeah, exactly. And also there's more context data you need to actually take into account in order to actually really analyze the health condition of those systems. And so that's basically where we have been working on is to actually how to do this real time evaluation, real time analysis with this rich context and also how to attribute the problems back to different agents, different models, different data sets, right? So you need to do a very fine grained sort of attribution now because before it was pretty clear, okay, there's a CPU problem, there's network problem, there's
B
this problem or a disk problem. Yeah, yeah, yeah. So this is, it really becomes much more complex. I'm going to throw even more complexity in there. I see more and more architectures moving to edge and edge to cloud, where that boundary between the cloud and the data center and the edge is completely shattered. And now I have applications that are running across all three domains. Do you, I mean this tool would have to be able to extend beyond the cloud too, right? Because now you have a lot of Influence on applications outside of your control or outside of your traditional cloud boundary.
A
Yeah, so definitely. We actually work not just with the cloud service providers or the applications running inside the cloud. We work with end-device users as well, or customers who actually provide end devices. Right. And so right now, for example, if you use ChatGPT on your computer, a lot of times, like a lot of people didn't realize, AI models will have hallucinations inevitably because the AI model is built on top of statistical learning. And statistical learning is basically based on, you know, the patterns and the statistics, not based on consciousness or on real logic. Right. And some, for example, like a very simple example, if you ask certain questions, for example, you ask what a certain Windows error code means, and you ask ChatGPT or Anthropic or Gemini, they all give you some answer, but all of them are wrong. And the reason is that those AI models won't tell you, "I don't know."
B
No, they don't. No. They make stuff up if they don't know.
A
Right, exactly. Because the algorithm is using statistical path inference and so they always find the path, right. No matter whether this path makes sense or not.
B
That's, that's not, There'll always go, there'll always be a path. Right?
A
Yeah, exactly.
B
But in your models you can't afford to do that, right? Because you could be shutting something down. If you're just using statistical analysis, won't you possibly shut something down that's not bad?
A
Yeah. So for us, we basically deal with true positives and false positives. Right? Of course, for any AI model, you need to balance the trade-off. Right. And that's the reason we believe no AI model is perfect, right? Just like no human is perfect. We will make mistakes. And so the key point is that it's not that you don't trust the AI models or that you trust the AI models 100%. Both are wrong. Right? So that's the reason it is very important to have a kind of monitor in place to help you identify, okay, this could potentially be a problem from the AI models, right? And then you can collect that as feedback and continuously improve your model. Right. So what we developed in our product is this feedback-driven closed-loop analysis for our customers. They have full visibility into those model outputs and behaviors, and they can give feedback saying, oh, this is a good prediction, this is
B
bad prediction, this is good, this is bad. So that's kind of like a, it's a reinforcement learning model, right? Where I have unsupervised, but I can inject supervision in the model through reinforcement learning. Is that kind of the pattern?
A
Yes, yes.
B
That's very slick.
A
The key here is to be able to do this automatically, right? So with minimum effort from the user.
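[Illustration, not from the episode] The good-prediction/bad-prediction feedback described above can be pictured as a loop that nudges a detector's sensitivity after each piece of user feedback. The sketch below is a deliberately simple stand-in: the class `FeedbackTunedDetector`, its fields, and the fixed step size are all hypothetical, and the episode does not say how InsightFinder actually incorporates feedback.

```python
from dataclasses import dataclass


@dataclass
class FeedbackTunedDetector:
    """Flags scores above a threshold and nudges the threshold using feedback."""

    threshold: float = 3.0
    step: float = 0.1
    min_threshold: float = 1.0
    max_threshold: float = 10.0

    def is_anomaly(self, score: float) -> bool:
        return score > self.threshold

    def record_feedback(self, score: float, was_real_problem: bool) -> None:
        """Apply one piece of user feedback about a scored event."""
        flagged = self.is_anomaly(score)
        if flagged and not was_real_problem:
            # False positive: become less sensitive.
            self.threshold = min(self.max_threshold, self.threshold + self.step)
        elif not flagged and was_real_problem:
            # Missed problem: become more sensitive.
            self.threshold = max(self.min_threshold, self.threshold - self.step)
        # Confirmed alerts and confirmed non-alerts leave the threshold alone.


detector = FeedbackTunedDetector()
detector.record_feedback(score=3.5, was_real_problem=False)  # "bad prediction"
detector.record_feedback(score=2.0, was_real_problem=True)   # missed problem
detector.record_feedback(score=2.0, was_real_problem=True)   # missed again
print(round(detector.threshold, 2))
```

Raising the threshold trades fewer false positives for more missed problems, and lowering it does the opposite, which is the true positive versus false positive balance discussed in the conversation.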
B
Do you think these models that you have, do you think they can be applied in other places or just in the cloud? Are they so specialized that they really can only help in cloud architectures? And do they have to be trained specifically for my type of environment? Do you know where I'm going with that? Or kind of a general purpose one that I can start with that can be used in and become smarter because the longer, the longer it runs, it's going to get better and better, right? For my, for my environment.
A
Yeah. So first of all, we developed this online learning model, right? So the model is unlike ChatGPT or, you know, Anthropic, those large language models; they typically employ a kind of offline learning model, right? Because their training is so expensive, you cannot afford continuous training. But our model focuses more on runtime behaviors, and we developed this technology we call composite AI, because we do not employ just one AI technology. We have basically an ensemble of AI technologies, including causal inference, predictive AI, unsupervised behavior learning, and SLMs, right, small language models. We combine them together, and all those models can perform online learning. And so we make this model automatically adapt to different environments, right? The idea is that those models have this kind of self-learning capability when we deliver them to our customers, and those models are trained and tuned for the customer environment specifically. Our belief is that you can have some foundation for those models, some built-in logic and inference logic. But customer environments are very different, right? They have different applications, different criteria, and different network topologies.
B
There's a whole bunch of.
A
Yeah, right, exactly. So we basically allow our models to be easily customized for different environments. And in this way we can make the AI algorithm highly accurate, because accuracy is extremely important for our domain.
B
So how long would it normally take? Let's say that I have a brand new company, right? I'm doing cloud technology, a brand new regional cloud provider. If I want to deploy your models, how long does it take me to deploy a model and say, hey, this is now useful, and it is now in that automated, train-by-using type of mode? How long does that take in order to deploy your system?
A
Yes. So surprisingly this model can actually start to produce insights pretty quickly. So one example I can tell you is that when we deploy our models in one of the Fortune 50 companies and so we start to actually generate very useful root cause analysis results during the deployment roll out. So one thing, like the reason why our model can be quickly set up is that we are learning over machine data and machine data is just so it's very fast.
B
Yeah.
A
And the other thing is that you can actually bootstrap those models with historical data. Generally speaking, when we actually observe the system's behavior, we start to actually build up this kind of weekly time based patterns. Right. So because like, you know, most of production environments has very clear patterns, you know, monthly or weekly.
B
Oh, weekly. Yeah, yeah, yeah.
A
So we can bootstrap the model with that data, and then when they deploy they can immediately start. Or when we start with some model and with zero data, then typically we see pretty good results after one week of learning, and then the model will continue to improve. Yeah.
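[Illustration, not from the episode] Bootstrapping from historical data and learning weekly patterns can be sketched as a per-hour-of-week baseline that scores new values against what was seen at the same hour in earlier weeks. Everything named here (`WeeklyBaseline`, `hour_of_week`, the sample values, the three-sample minimum) is a hypothetical choice for illustration, using a median and median absolute deviation rather than whatever models the product actually uses.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


def hour_of_week(ts: datetime) -> int:
    """Map a timestamp to one of 168 weekly slots (Monday 00:00 is slot 0)."""
    return ts.weekday() * 24 + ts.hour


class WeeklyBaseline:
    """Learns a per-hour-of-week baseline for one metric from timestamped values."""

    def __init__(self) -> None:
        self._history = defaultdict(list)

    def learn(self, ts: datetime, value: float) -> None:
        self._history[hour_of_week(ts)].append(value)

    def deviation(self, ts: datetime, value: float) -> Optional[float]:
        """Robust distance from the usual value for this slot, in MAD units.

        Returns None when fewer than three past samples exist for the slot.
        """
        past = self._history.get(hour_of_week(ts), [])
        if len(past) < 3:
            return None
        center = median(past)
        mad = median(abs(v - center) for v in past)
        return abs(value - center) / (mad or 1.0)


# Bootstrap with four weeks of hypothetical history for Mondays at 09:00.
baseline = WeeklyBaseline()
start = datetime(2024, 1, 1, 9)  # a Monday
for week, value in enumerate([100.0, 102.0, 98.0, 101.0]):
    baseline.learn(start + timedelta(weeks=week), value)

score = baseline.deviation(start + timedelta(weeks=4), 180.0)
print(score)
```

No threshold on the raw metric is predefined here; the notion of "normal" for each slot comes entirely from the data, and slots without enough history simply return no score until more is observed.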
B
Can, can I take your composite model idea and use it for other things besides just admin type work? Let's say that I, let's say that. What's a good example? A fighter jet. Let's say I want to because I don't know if you know, but the fighter jets today have about 30 servers in them, like a rack of machines. Right. With the network and all. That's complex stuff.
A
Yes.
B
That's not cloud computing. Right. How hard is it for me to take that model and maybe put it into another environment that might not fit? Or. And let's say not a fighter jet. Let's say that I just have it at a power plant, right. That hey, I'm going to monitor not system logs, but I'm going to monitor power logs now instead. Right. Because I can monitor fluctuations in current and voltage and all that stuff. Can I use your guys's stuff right away to start looking at different types of data or has it been so tuned to system administration type work that it's not really useful outside of that domain?
A
So in fact, absolutely, you can use that, because the AI model we design is data agnostic.
B
Oh, very cool.
A
Yeah. So the key thing, that's also coming back to unsupervised learning. We are not relying on rules, we are not relying on predefined thresholds to detect the problems. Everything is learned from the data. And as long as the data has basically, you know, we look at the numerical data. Right. We look at, you know, log data. So pretty much all the kinds of modality we can cover. So.
B
So it's very structured, very structured data, right?
A
Yeah. So log data could be unstructured so we can analyze.
B
There's always, there's a time stamp there. There's always some semi structured maybe is the right word.
A
Yeah, that's right. You are right, yeah.
B
You're not reading PDF files and things like that.
A
Yeah, that's right. That's right. So as long as it has a timestamp, it's basically what we need. And the key thing here, coming back to your example: some of my research is funded by the Army. And when I work with those government agencies, a lot of times what we learn is that even in those machines they are running Kubernetes.
B
Right.
A
So like little Kubernetes clusters. So, absolutely. I think a lot of the problems can actually be similar. Right. This kind of predictive prevention is critical because any glitches that happen in those environments are even more, you know, crucial if you.
B
Oh yeah, yeah. Especially in those environments when you're dealing with the physical world. Right. Like let's say it's at a, it's at a dam or water treatment plant. I mean these are critical systems that we need for our society. So this is great stuff. Hey Helen, this is, this has been awesome. I don't get to go deep into technical stuff very often. So this is a lot of fun, especially talking to someone that's had code on Mars. I don't talk to people that have code on Mars very often. Right. Which is. And it's still there. Right. The rover's still there on Mars.
A
So yeah, I don't know. Like, you know, I left the project a long time ago.
B
Yeah. But it's still, it's still there. It might not be transmitting, but it's still. So you have code on another planet. That is super cool. That is super cool. So. But Helen, thanks for coming on the show. If people want to learn more, where can they go to find out more about your guys technology and what you've done.
A
Yeah, absolutely. So you can come to our company website, insightfinder.com, and feel free to reach out to us at info@insightfinder.com. We also have social pages on LinkedIn and YouTube, so feel free to follow us and reach out to us.
B
All right, Helen, thanks again for coming on the show. This was wonderful.
A
Thank you. Yeah.
B
Thanks for listening to Embracing Digital Transformation. If you enjoyed today's conversation, give us five stars on your favorite podcasting app or on YouTube. It really helps others discover the show. If you want to go deeper, join our exclusive community at patreon.com/embracingdigital, where we share bonus content and you can connect with other change makers like yourself. You can always find more resources at Embracing Digital. Until next time, keep embracing the digital transformation.
Episode #355: From Mars to Data Centers—AI that Prevents Cloud Outages
Host: Dr. Darren Pulsipher
Guest: Dr. Helen Gu, Professor at North Carolina State University and Founder/CEO at InsightFinder
Date: June 4, 2026
This episode explores the journey and real-world application of AI technologies that autonomously prevent IT and cloud outages. Dr. Darren Pulsipher interviews Dr. Helen Gu, whose background ranges from using neural networks for NASA's Mars projects to founding InsightFinder—a company focused on using AI, particularly unsupervised learning, for outage prediction, detection, and self-healing in IT systems. The discussion touches on the evolution of distributed AI systems, the challenges of handling massive, noisy datasets, and the future of dynamic, autonomous cloud and edge infrastructures.
On Early AI:
“The story goes back almost 30 years… My starting point is that I was trying to actually enable robust video streaming from Mars to Earth because my research was funded by a NASA PI project.”
—Dr. Helen Gu (01:33)
On Unsupervised Learning:
“If we can use those AI technology to extract the patterns for those text or video, why can we not use AI technology to extract patterns for our machines?”
—Dr. Helen Gu (05:53)
On Human Limits and System Complexity:
“It’s very hard for human to try to narrow down or localize the problem. It’s always easy when you see the problem. But the problem is how do you find that needle in the stack?”
—Dr. Helen Gu (14:24)
On AI Hallucination vs. Reliability:
“A lot of people didn’t realize, AI models will have hallucinations inevitably… because the AI model is built on top of statistical learning… and some, for example, like a very simple example… they all give you some answer, but all of them are wrong.”
—Dr. Helen Gu (20:29)
On Customization and Agnostic Design:
“The AI model we design is data agnostic... as long as it has a timestamp, it’s basically what we need.”
—Dr. Helen Gu (30:02)
Fun Fact:
“I don’t talk to people that have code on Mars very often... So you have code on another planet. That is super cool.”
—Dr. Darren Pulsipher (32:35)
For more information, visit InsightFinder.com or connect via LinkedIn/YouTube.