
A
And then fast-forward to today: for about 10 years we have been building this product to help customers detect those IT outages, and even predict and fix them automatically, using AI technology.
B
Welcome to Embracing Digital Transformation, where we explore how people, process, policy, and technology drive effective change. This is Dr. Darren, Chief Enterprise Architect, educator, author, and most importantly your host. On this episode: From Mars to the Data Center: AI That Prevents Cloud Outages, with Dr. Helen Gu, professor at North Carolina State University and founder and CEO at InsightFinder. Helen, welcome to the show.
A
Thank you for inviting me.
B
I'm really interested in this topic because so far in the AI realm and AI agent realm, it's been kind of the Wild West. And I haven't really seen good software best practices applied to it like we do in other systems. But you've got something unique that we're going to talk about. Before we do, everyone that listens to my show knows that I only have superheroes on the show. And every superhero has a background story, an origin story. So Helen, what's your background story? What's your origin story? Where does Helen come from?
A
Yeah, so the story goes back almost 30 years. I actually started doing research on how to enable robust distributed systems using AI technology almost 30 years ago. Can you believe it?
B
Wait, wait, wait. 30 years ago? You can't do AI 30 years ago. The AI was created by OpenAI, right?
A
No, so OpenAI actually leveraged technology like neural networks, but neural networks were invented several decades ago. The idea of using neural networks to extract patterns and perform predictions and classifications is a long-established research topic. So about 30 years ago, my starting point was that I was trying to enable robust video streaming from Mars to Earth, because my research was funded by a NASA PI funding project. The idea is that you send the rover, right? Believe it or not, there's a vehicle that we send to Mars, and the rover collects the scenes on Mars and then sends the video back to Earth. My research was to make this video streaming reliable. You can imagine there are all kinds of potential issues from Mars to Earth: glitches, hardware failures, delays, et cetera. So my research was to use neural networks to predict the resource usage based on the video content. At that time I built basically a three-tier, 13-neuron network, with maybe about 10 or so parameters. That's basically the starting point, right? Fast-forward to today, and everybody knows ChatGPT, and they know this is AI. But my research is about how to use AI technology to make our systems more stable, so you can enjoy all the IT services, like enjoying your movies on the Internet smoothly. Along the way I got my PhD from the University of Illinois Urbana-Champaign, and after that I joined IBM Research as a researcher. I joined a very interesting project, and that was around 2004.
So I started my career as a researcher studying how to enable robust real-time data stream processing. That was way before the things you've probably heard about today, Spark or Hadoop, and even before there was such a thing as virtualization or cloud, right? At that time, IBM was trying to build this robust distributed streaming infrastructure that could process video data, audio data, and text data in real time and provide insights, right? So we would detect, for example, fraudulent transactions. My research was to work with those AI experts and make sure those AI models could run smoothly with the right infrastructure support. Starting there, I realized that a lot of AI technologies had been invented to analyze text data, video data, or image data, but very few technologies had been invented to analyze very noisy machine data, for example, system logs, application logs,
B
telemetry, telemetry coming out of the CPU, all that stuff, right?
A
Yeah, exactly. And so I started to think: OK, if we can use AI technology to extract patterns from text or video, why can't we use AI technology to extract patterns from our machines? So I started to dabble in that field and developed a set of prediction algorithms for predicting, for example, disk hardware failures. It was actually pretty successful at the beginning. The AI algorithms we invented were more in the supervised learning space. But I soon realized that to apply this kind of supervised machine learning, it's very hard to gather the training labels needed to build quality models. So I changed my direction to laser-focus on unsupervised machine learning, where you can ask the AI model to learn automatically without any human guidance.
B
Well, wait just a second on the unsupervised part. I understand it a little bit, but a lot of people might be confused by "unsupervised." That means there's no human there telling it yes or no, that's the right direction or not. Isn't there a fear that now the AI is going to do the wrong things, or not adhere to some kind of guidelines? I'm thinking of Asimov's laws for robots, right? I don't want an AI to go off and decide that we have to destroy the human race. Unsupervised: what's it going to do? So can you explain a little bit more about what unsupervised really means? Because it does mean no human interaction. But there are guardrails, right?
A
Yeah. So the unsupervised learning here is not learning human behavior or human language. That's what ChatGPT is about, right? Those LLMs are called large language models because they are basically predicting human language. Here I was talking about using unsupervised learning to understand machine behavior. The good news with machine behavior is that we are not at risk of the machine lying or being mistrained, because there's no intention behind it. The idea is that you can have those AI algorithms effectively extract patterns, because there's so much data you need to analyze. So I developed a set of unsupervised learning algorithms. But it's a very challenging task, because without any labeled training data, it's extremely hard to make it accurate. That's where I started my career as a faculty member at NC State, and I laser-focused on this research area. We were funded by the National Science Foundation, Google, IBM, and a set of other companies. After that, we published a set of research papers, and then companies started to reach out to me saying, can we use it in a commercial environment? One of the companies was actually Google; I had been collaborating with Google for a long time before that outreach. But then I realized, OK, there's an opportunity for me to really apply the technology in the real world. So I went to Google as a data scientist to work with their SREs side by side and see how we could use our algorithms to predict or detect anomalies in Google Cloud. About 11 months later, we had used over 20 real cloud outages at Google Cloud to show that our algorithms could accurately predict all kinds of problems in real production environments, with much better accuracy, efficiency, and scalability.
So Google actually licensed the technology, which is also a very rare thing. Then I left Google and launched InsightFinder. Fast-forward to today: for about 10 years we have been building this product to help customers detect those IT outages, and even predict and fix them automatically, using AI technology.
B
So I was going to ask you about that right now. I mean the first step was about detection.
A
Right, Right.
B
I can see this pattern. I know that when I've seen this pattern before, I had an outage.
A
Yeah.
B
Right. Now you can do prediction. You can say, hey, we're headed down a path we've seen before where an outage might be coming. So you've moved from detection to prediction. And now you said we've also moved into auto-correction as well.
A
Yeah.
B
So do you think with the auto correction stuff that these systems will become even more and more dynamic in that they're going to reroute the way things are done kind of dynamically instead of it being so fixed and static? Because I'm just thinking when I set a microservice architecture up together and I connect all the endpoints together through an API gateway or whatever, that is very static. Right. And I could have failures, but with your stuff here, I could automatically reroute all that stuff dynamically without having to write a line of code or have a human in the middle. Is that correct?
A
Yeah. So rerouting requests is just one of the actions you can take. There are a lot of other, less intrusive actions you can take, for example resource scaling, where you add more resources, or you can adjust certain parameters. So there are different ways to fix the problem. But the majority of problems are actually a mixture of different factors. In real production environments, an outage typically doesn't just happen all at once. It always starts from some minor issues that propagate. So the idea is that you want to catch those minor issues as early as possible, and typically those minor issues are not noticed by the user. If you can fix those problems early on, most of the issues are pretty easy. For example, a lot of production outages are caused by some resource depletion bug, where you run out of disk space on, say, one of your proxies or your app servers. And if you run out of disk space, your whole service is down, right? But if you can capture this kind of depletion trend early on, you can stop the bleeding by figuring out what kind of request is causing the resource depletion and stopping it. In this way you prevent the whole system outage. You localize the impact, so you can isolate the root cause and then fix it easily.
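To illustrate the "capture the depletion trend early" idea from this answer, here is a minimal Python sketch. It is not InsightFinder's algorithm: it simply assumes disk usage grows roughly linearly, fits a least-squares line to recent samples, and extrapolates to the disk's capacity. The function name `hours_until_full` and its parameters are hypothetical, chosen only for this example.

```python
from typing import Optional, Sequence


def hours_until_full(
    hours: Sequence[float],
    used_gb: Sequence[float],
    capacity_gb: float,
) -> Optional[float]:
    """Estimate hours remaining before a disk fills up.

    Fits an ordinary least-squares line to (time, usage) samples and
    extrapolates it to capacity. Returns None when usage is flat or
    shrinking, because no depletion is projected in that case.
    Assumption: growth is approximately linear over the sampled window.
    """
    n = len(hours)
    if n < 2 or n != len(used_gb):
        raise ValueError("need at least two paired samples")

    mean_t = sum(hours) / n
    mean_u = sum(used_gb) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")

    # Slope is the estimated growth rate in GB per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(hours, used_gb)) / var_t
    if slope <= 0:
        return None

    intercept = mean_u - slope * mean_t
    time_full = (capacity_gb - intercept) / slope
    # Report time remaining relative to the most recent sample.
    return max(0.0, time_full - hours[-1])


if __name__ == "__main__":
    remaining = hours_until_full([0, 1, 2, 3], [50, 52, 54, 56], capacity_gb=100)
    print(f"Projected hours until disk is full: {remaining}")
```

A monitor built on this could raise an alert when the projection drops below some lead time, long before users notice anything. Real usage curves are often bursty, so a production system would need something more robust than a straight line.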
B
The first thing that pops into my mind, why haven't we done this so much earlier? Because we have all this data and I would think it's highly deterministic. Is it not deterministic? I mean, because a lot of people say, oh, computers, right. They only do what I tell them to do. But why is this such a problem?
A
Yeah, so several things, right? One thing is that if you look at a computer system, one difference is that it has so many parameters and configurations you need to understand. Just a simple example: a web server can have thousands of parameters you need to tune to make it function. So it's a very high-dimensional problem space. And that's one of the differences for humans. Our brain is trained to make decisions when we are given a limited number of signals. We can make a very good decision if I tell you clearly, oh, this server has a full disk; you will probably fix that. But what you see is thousands of nodes, and each node has hundreds of thousands of metrics, all dynamic. So it's very hard for a human to narrow down or localize the problem. It's always easy once you see the problem, but the problem is how do you find that needle?
B
Because there's too much going on is what you're saying, right? Especially in the cloud, I could have a hundred different applications running on one server.
A
Right, exactly. And they interact with each other, and they sometimes interfere with each other. There's all kinds of highly fluctuating behavior. The system is actually fluctuating and dynamic by nature. It's not like traditional IT systems, which you think of as very predictable and very stable. That's easy, right?
B
But clouds, clouds are very complex because, because you don't know what's going to run on them ahead of time. It's very generic.
A
Yeah. And now AI systems make that even more complicated.
B
Oh yeah, yeah, exactly. So where do you see this technology moving in the future? All right, because it sounds like it could be very powerful if I could do predictive analysis. And I'm assuming with your algorithms you're still using a neural network, you're still using prediction and things like that. Where do you see it going in the future. Do you see these architectures, I don't know what to call it, a cloud operating system, a cloud monitoring system. Do you see them becoming even more dynamic and even more alive, self healing than what we have today? And what does that look like in the future?
A
Yeah, I think the future is basically now. What we see is that you have different elements introduced to the system, right? In the early days you only had web services, and now you have the microservices architecture. Then you start to have highly dynamic environments. Now we have LLMs, and we also have agents, AI agents. So the interaction is not only between different IT services or between software and hardware; it's also between different models, between different agents, and between the agents and the infrastructure. The interaction becomes a lot more complicated, and the analysis you need to do also needs to expand to cover the whole space. In order to accurately detect or predict problems or do root cause analysis, you need to monitor models, data, and infrastructure all together, because you don't know where the problem starts. The other thing that's very different is that we need to perform this kind of analysis in real time. Most of the time, if you look at machine data, you have pretty clear data patterns, right? But now if you look at AI agents, they interact with humans. And so there's a lot more, you know, when you have unpredictable
B
data being pushed into the system because humans are involved.
A
Yeah, exactly. And there's also more context data you need to take into account in order to really analyze the health of those systems. So that's what we have been working on: how to do this real-time evaluation and real-time analysis with this rich context, and also how to attribute the problems back to different agents, different models, different data sets. You need to do very fine-grained attribution. Before, it was pretty clear: OK, there's a CPU problem, there's a network problem,
B
there's this or this problem. Yeah, yeah, yeah. So this really becomes much more complex. I'm going to throw even more complexity in there. I see more and more architectures moving to edge and edge-to-cloud, where that boundary between the cloud and the data center and the edge is completely shattered. And now I have applications that are running across all three domains. I mean, this tool would have to be able to extend beyond the cloud too, right? Because now you have a lot of influence on applications outside of your control or outside of your traditional cloud boundary.
A
Yeah, so definitely. We actually work not just with cloud service providers or with applications running inside the cloud. We work with end-device users as well, or customers who actually provide end devices. Right now, for example, if you use ChatGPT on your computer, a lot of people don't realize that AI models will inevitably have hallucinations, because the AI model is built on top of statistical learning. And statistical learning is based on patterns and statistics, not on consciousness or real logic. A very simple example: you ask a question like, what is the Windows error code for this, and what does this code mean? You ask ChatGPT or Anthropic or Gemini, and they all give you some answer, but all of them are wrong. The reason is that those AI models won't tell you, "I don't know."
B
No, they don't. No. They make stuff up if they don't know.
A
Right, exactly. Because the algorithm is using statistical path inference. And so they always find a path, right. No matter whether this path makes sense or not.
B
That's not, they'll always go, there'll always be a path. Right?
A
Yeah, exactly.
B
But in your models you can't afford to do that, right? Because you could be shutting something down. If you're just using statistical analysis, won't you possibly shut down something that's not bad?
A
Yeah. So for us, we basically deal with true positives and false positives, right? Of course, for any model you need to balance the trade-off. And that's what we believe: no AI model is perfect, just like no human is perfect. We will make mistakes. The key point is that it's not that you don't trust the AI models, or that you trust the AI models 100%. Both are wrong. That's the reason it's very important to have a monitor in place to help you identify, OK, this could potentially be a problem from the AI models. Then you can collect that as feedback and continuously improve your model. That's what we developed in our product: this feedback-driven, closed-loop analysis for our customers. They have full visibility into the model output and behaviors, and they can give feedback saying, oh, this is a good prediction, this
B
is a bad prediction, this is good, this is bad. So that's kind of like a reinforcement learning model, right? Where I have unsupervised learning, but I can inject supervision into the model through reinforcement learning. Is that kind of the pattern?
A
Yes, yes. What's very slick here is being able to do this automatically, with minimum effort from the user.
B
Do you think these models that you have can be applied in other places, or just in the cloud? Are they so specialized that they really can only help in cloud architectures? And do they have to be trained specifically for my type of environment? Do you know where I'm going with that? Or is there kind of a general-purpose one that I can start with, that becomes smarter because the longer it runs, the better and better it's going to get for my environment?
A
Yeah. So first of all, we developed this online learning model, right? The model is not like ChatGPT or Anthropic. Those large language models typically employ a kind of offline learning model, because their training is so expensive that you cannot afford to train continuously. Our model focuses more on runtime behaviors, and we developed this technology we call composite AI, because we don't employ just one AI technology. We have an ensemble of AI technologies, including causal inference, predictive AI, unsupervised behavior learning, and SLMs, small language models. We combine them together, and all those models can perform online learning. We make the model automatically adapt to different environments. The idea is that those models have this kind of self-learning capability. When we deliver those models to our customers, they are trained and tuned specifically for the customer's environment. Our belief is that you can have some foundation for those models, with some built-in inference logic. But customer environments are very different, right? They have different kinds of applications, different criteria, and different network topologies.
B
There's a whole bunch of.
A
Yeah, right, exactly. So we allow our models to be easily customized for different environments. In this way we can make the AI algorithms highly accurate, because accuracy is extremely important in our domain.
B
So how long would it normally take? Let's say that I have a brand new company, I'm doing cloud technology, a brand new regional cloud provider. If I want to deploy your models, how long does it take before I can say, hey, this is now useful, and it's now in the automated, train-by-using type of mode? How long does it take to deploy your system?
A
Yes. So surprisingly, this model can start to produce insights pretty quickly. One example I can tell you is that when we deployed our models at one of the Fortune 50 companies, we started to generate very useful root cause analysis results during the deployment rollout. One reason why our model can be set up quickly is that we are learning over machine data, and machine data is just so.
B
It's very fast. Yeah.
A
And the other thing is that you can bootstrap those models with historical data. Generally speaking, when we observe the system's behavior, we start to build up these weekly time-based patterns, because most production environments have very clear patterns. Monthly or weekly?
B
Weekly, yeah.
A
So we can bootstrap the model with that data, and then when they deploy, they can start immediately. Or, when we start with a model and zero data, we typically see pretty good results after one week of learning, and then the model will continue to improve.
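The weekly-pattern bootstrapping described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not InsightFinder's method: it learns a per-slot baseline (one slot per weekday and hour) from historical samples, then flags a new value that deviates from that slot's learned mean by more than `k` standard deviations. The class name `WeeklyBaseline` and the parameters `min_samples`, `k`, and `min_spread` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


class WeeklyBaseline:
    """Learn what is normal for each (weekday, hour) slot from data alone."""

    def __init__(self, min_samples: int = 3, k: float = 3.0, min_spread: float = 1e-6):
        self.min_samples = min_samples  # history needed before judging a slot
        self.k = k                      # allowed deviation, in standard deviations
        self.min_spread = min_spread    # floor so a constant history has nonzero spread
        self._slots = defaultdict(list)

    @staticmethod
    def _slot(ts: datetime) -> tuple:
        return (ts.weekday(), ts.hour)

    def learn(self, ts: datetime, value: float) -> None:
        """Add one observation; works for historical bootstrap and live data."""
        self._slots[self._slot(ts)].append(value)

    def is_anomalous(self, ts: datetime, value: float) -> bool:
        """Compare a value against the learned baseline for its weekly slot."""
        history = self._slots.get(self._slot(ts), [])
        if len(history) < self.min_samples:
            return False  # not enough data yet to call anything abnormal
        spread = max(pstdev(history), self.min_spread)
        return abs(value - mean(history)) > self.k * spread


if __name__ == "__main__":
    baseline = WeeklyBaseline()
    monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 was a Monday
    # Bootstrap with four weeks of historical readings for the same slot.
    for week, cpu in enumerate([100.0, 102.0, 98.0, 100.0]):
        baseline.learn(monday_9am + timedelta(weeks=week), cpu)

    next_monday = monday_9am + timedelta(weeks=4)
    print(baseline.is_anomalous(next_monday, 101.0))
    print(baseline.is_anomalous(next_monday, 150.0))
```

Because `learn` accepts any timestamped value, the same object can be seeded from historical data before deployment or start empty and fill in as live data arrives, which mirrors the two start-up paths mentioned in the conversation.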
B
Can I take your composite model idea and use it for other things besides just sysadmin type work? Let's say that. What's a good example? A fighter jet. Let's say I want to, because I don't know if you know, but the fighter jets today have about 30 servers in them. Like a rack of machines. Right. With the network and all. That's complex stuff.
A
Yes.
B
That's not cloud computing, right? How hard is it for me to take that model and put it into another environment that might not fit? And let's say not a fighter jet. Let's say that I just have it at a power plant, and I'm going to monitor not system logs, but power logs instead, because I can monitor fluctuations in current and voltage and all that stuff. Can I use your guys' stuff right away to start looking at different types of data? Or has it been so tuned to system administration type work that it's not really useful outside of that domain?
A
So in fact, absolutely you can use that because the AI model we design is data agnostic.
B
Very cool.
A
Yeah. So the key thing, and that's also coming back to unsupervised learning, is that we are not relying on rules and we are not relying on predefined thresholds to detect problems. Everything is learned from the data. We look at numerical data, we look at log data, so we can cover pretty much every kind of modality. So.
B
So it's very structured, very structured data, right?
A
Yeah. So log data could be unstructured so we can add.
B
Yeah, but there's a timestamp. There's always some. Semi-structured, maybe, is the right word.
A
Yeah, that's right. You are right.
B
Yeah. You're not reading PDF files and things like that.
A
Yeah, that's right. That's right. As long as it has a timestamp, that's basically what we need. And coming back to your example, some of my research is funded by the Army. When I work with those government agencies, a lot of the time what we learn is that even on those machines, they are running Kubernetes.
B
Right.
A
So, like, little Kubernetes clusters. So absolutely, I think a lot of the problems can actually be similar, right? This kind of predictive prevention is critical, because any glitches that happen in those environments are even more crucial if you.
B
Oh yeah, yeah. Especially in those environments when you're dealing with the physical world. Right. Like let's say it's at a, it's at a dam or water treatment plant. I mean these are critical systems that we need for our society. So this is great stuff. Hey Helen, this is, this has been awesome. I don't get to go deep into technical stuff very often. So this is a lot of fun, especially talking to someone that's had code on Mars. I don't talk to people that have code on Mars very often. Right. Which is. And it's still there. Right? The rover's still there on Mars.
A
So yeah, I don't know. Like you know, I left the project a long time ago.
B
Yeah, but it's still, it's still there. It might not be transmitting but it's still. So you have code on another planet. That is super cool. That is super cool. So. But Helen, thanks for coming on the show. If people want to learn more, where can they go to find out more about your guys technology and what you've done?
A
Yeah, absolutely. So you can come to our company website, insightfinder.com, and feel free to reach out to us at info@insightfinder.com. We also have social pages on LinkedIn and YouTube, so feel free to follow us and reach out to us.
B
All right, Helen, thanks again for coming on the show. This was wonderful.
A
Thank you. Yeah,
B
Thanks for listening to Embracing Digital Transformation. If you enjoyed today's conversation, give us five stars on your favorite podcasting app or on YouTube. It really helps others discover the show. If you want to go deeper, join our exclusive community at patreon.com/embracingdigital, where we share bonus content and you can connect with other change makers like yourself. You can always find more resources at Embracing. Until next time, keep embracing the digital transformation.
Host: Dr. Darren Pulsipher
Guest: Dr. Helen Gu (Professor at NC State University; Founder/CEO, InsightFinder)
Date: May 28, 2026
This episode dives into cutting-edge advancements in AI-driven IT operations—how unsupervised machine learning, developed by Dr. Helen Gu, prevents, predicts, and automatically fixes cloud outages. Dr. Gu shares her unique journey, from NASA-funded Mars research in the 1990s to founding InsightFinder and pioneering self-healing data center technologies. The discussion ranges from AI best practices to the expanding applicability of her models far beyond traditional cloud environments.
On the Mars Code Origin:
“I don’t talk to people that have code on Mars very often. ...And it’s still there.”
— Dr. Darren Pulsipher [32:40]
On Feedback in AI Models:
“No AI model is perfect—just like no human is perfect. ...The key point is not like you don’t trust the AI models, or you trust the AI models 100%. Both are wrong.”
— Helen Gu [22:15]
On Enabling Real-Time Detection and Self-Healing:
“The idea is you want to catch those minor issues as early as possible... if you can fix those problems early on, you prevent the whole system outage.”
— Helen Gu [12:00]
On Living, Adaptive AI:
“Our model can be quickly set up... when we started with some model and with zero data, then typically we saw pretty good results after one week of learning, and the model will continue to improve.”
— Helen Gu [28:20]