
Narrator
A distributed system is a network of independent services that work together to achieve a common goal. Unlike a monolithic system, a distributed system has no central point of control, meaning it must handle challenges like data consistency, network latency, and system failures. Debugging distributed systems is conventionally considered challenging because modern architectures consist of numerous microservices communicating across networks, making failures difficult to isolate. The challenges and maintenance burdens can magnify as systems grow in size and complexity. Julia Blaise is a product manager at Chronosphere, where she works on features to help developers troubleshoot distributed systems more efficiently, including differential diagnosis, or DDX. DDX provides tooling to troubleshoot distributed systems and emphasizes automation and developer experience. In this episode, Julia joins Shawn Falconer to talk about the challenges and emerging strategies to troubleshoot distributed systems. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find him.
Shawn Falconer
Julia, welcome to the show.
Julia Blaise
Thanks, Shawn. So nice to be here today.
Shawn Falconer
Yeah, absolutely. So I wanted to start off digging into your background a little bit. Can you talk a little bit about your journey into the world of microservices observability, what led you to Chronosphere and why you're interested in sort of these issues around troubleshooting?
Julia Blaise
Yeah, absolutely. Well, I started out as a librarian. Actually, maybe not the most traditional career path into tech. I worked at the Library of Congress. I got a fellowship there. I went from Library of Congress to actually working at the Smithsonian. I will say it was less like what you think of as traditional librarianship, kind of written word librarianship, and a little more digitally focused librarianship. I was working with scientists and researchers and I was helping them store and organize their data so that they could ask questions of it and get the answers they needed. You know, going from information to insight, as I used to say. And then I did, you know, I was working in D.C. at the time, and I did eventually move over to work at a company called Palantir. It was maybe less well known in 2014 than it is today, but the reason I moved over is at least at the time, you know, they really talked about their software as a fundamental tool to help the government do something very similar to what I had been doing as a librarian, right? That is like understand, organize, analyze their data in a central location with a central toolkit. And I think government agencies faced really similar challenges to those faced by the scientists I had been working with, which is the data was stored in silos and each silo is organized differently. And you had different tools to work with each silo. And very few people really had been putting in the manual effort to understand how to work with all those different data silos and get that data together to provide insight. So you can probably see kind of where the through line is. My like information to insight role as an individual contributor to going to work at a company where that seemed to be their whole purpose. And it was really exciting, you know, and I really kind of enjoyed that path from librarianship into tech. 
While I was at Palantir, I started out, again, on that government-facing side of the business in a customer-facing role. So the first time I actually engaged with observability, I was what we would call high side. I was in a customer's secure computing facility. I had been on call, it was late at night. And, you know, our developers for that software weren't always able to get on those government sites, right? They weren't always able to come out there and actually get hands on keyboard to see what was happening when something went wrong. So they would rely on people like me to kind of sit at the computer, be on the one phone line that could connect to the outside world, and be their hands and follow their instructions. So I think my first engagement was, hey, I need you to grep for something that looks like this. And I was like, cool, what is grep? I don't know. So they really walked me through, you know, what it means to SSH in somewhere, what it means to grep, what a log is, what a metric is, how to describe what's on a metric dashboard so that they could kind of guide me through what else to look for to help them diagnose the problem. And it was really interesting. I really enjoyed engaging with that side of software, and it really demystified software a lot for me, which I appreciated. So as I spent time at Palantir, and as I grew and the company grew, I actually moved into their product org, and that's where I started learning about the difference between monolith and microservices, and on-prem infrastructure and containerized infrastructure, because I was working with teams that were doing both. Some had started out natively building their services in K8s, and others had built services in a monolith and were trying to migrate them over to work in a more containerized environment and split them up into microservices.
And it was really challenging, you know, and there were challenges on both sides, and I enjoyed helping people with those and kind of working on those challenges, which brought me to Palantir's central observability team, which at the time was called their signals team. And that was the whole purpose of that team. So on that team, our challenge was to take all that telemetry data from all of the software, whether it was on-prem or commercial cloud or govcloud or secure cloud, monolith, microservice, whatever it was, and kind of develop tools and methods to bring that data into a central place where they could use it to troubleshoot issues. Of course, that did bring me to Chronosphere, I think, pretty naturally. We actually interviewed Chronosphere as a vendor at one point when I was at Palantir in that role. And they were just, honestly, some of the most transparent, expert, nicest vendors I had ever interviewed. And I was just like, this company gets me. They really understand my problems. They understand my engineers' problems. At that point I had been at Palantir for six years, and it had gone public. I felt like I had kind of reached the end of what I had wanted to do at that company. I was looking for a new role, and Chronosphere seemed like a natural fit.
Shawn Falconer
Awesome. Yeah. I mean, I actually think that, you know, the path from librarian to product management, you're working in data, working in, you know, the world of microservices, stuff like that, is not necessarily that, like, crazy a path, because if you think about some of the work that, you know, has come from sort of organizing books and structure taxonomies, that's probably the original inspiration for, like, ontologies and things like that. Which also then leads you to Palantir, where they're, you know, big proponents of ontologies and done a lot of work in that space as well. This kind of, a lot of the, like, foundations of what we think about databases and stuff probably came from being inspired from the way that we thought about organizing books in libraries.
Julia Blaise
Absolutely. I like to say librarians have been organizing data for 2,000 years. Right. Like, these are not new human problems. These are problems that we've always had and we've always tried to develop tools to try and fix. And data is just another form of information. Right. It's different in size, it's maybe different in complexity. It's different in kind of the pace of change. But, yeah, you're using some of the same foundational principles, like how to store it. What's the best way to store this for access? What is the best ontology and way to organize it? What metadata do we need to index? Right. Like, the Dewey Decimal System is a way of indexing metadata so that you can find things quickly. Right. It's all related. Yeah.
Shawn Falconer
And for our younger listeners, you'll have to go and do a Wikipedia search on what the Dewey Decimal System is. But in terms of these problems around data silos, it's kind of interesting. We've been working on that problem for probably as long as humans have been writing down and storing information. And that problem doesn't seem to be going away. If anything, it actually seems to be getting worse, because we have more and more data to manage. What are your thoughts on taking a step back and just looking at that challenge? How do we make progress on breaking down some of these silos?
Julia Blaise
Yeah, I think it really is about understanding the needs that we have from that data. So like what questions do we want to ask? Who's asking those questions? How fast do they want to get answers? Because if we start there, from kind of a need-based approach to how we want to organize this data and reference across this data, I think we're going to be able to organize and break down those silos faster for the right people. That's probably also the product side of me talking. Right. I'm always talking about what's the problem we're trying to solve. If the problem you're trying to solve is organize the world's information and put it into a single place so that everyone can use it, that's massive. Right. Like you'll be working on that forever. You will never be done. There will always be something.
Shawn Falconer
30 years.
Julia Blaise
Right. That was their original charter. Organize the world's information. Do you go to Google and always find exactly what you need right away? Maybe, maybe not. Right. It's a hard, hard problem. So I think you really have to start with like what's the outcome we want and how do we build tailored solutions to work across the relevant data for that outcome? Less of the full scale approach. We're always going to have all the data in one place. We won't, but we can find better ways to let the right people access it in the right way.
Shawn Falconer
What are your thoughts on some of the challenges that things like microservices introduce? We had the monolith, the three-tier architecture. There's a certain simplicity with that. But of course we run into some scalability issues, even just from an engineering perspective. If everybody's working on sort of the same code base, and it's all part of this large piece of software that gets deployed somewhere, it can really slow things down. We break it apart, and that makes us more agile, more nimble. But then, as we break it apart more and more, we potentially introduce a lot of challenges from a distributed systems perspective. Just, like, brittle infrastructure for requests and responses: one part of this complex mixture of dependencies goes down, and the whole thing goes down. What are your thoughts, having worked in that space for a while?
Julia Blaise
Yeah, a couple of things. The first one you mentioned is, right, with microservice, I think one of the biggest problems they introduce is where is the problem coming from? I'm going to reveal my age again here, but I think in the past, right, if you heard a phone ringing, it was like, well, it's one of three phones in the house because they're all connected by wires to the wall. And that was kind of your monolith focused troubleshooting. You kind of knew where the problems were likely to occur. You had a pretty good understanding of the system. Now if you have a problem in a distributed system, it's like when I lose my cell phone, right? And I ask the Google to ring my phone to help me find it. And now I have to figure out where that ring is coming from. And it could be anywhere. It could be the kitchen, it could be in the refrigerator. True story, I've done that. It could be in the car, it could be in a friend's house where I was two weeks ago, right? Like microservices introduce the problem of is it my microservice? Is it your microservice? Is it my dependency on some other internal team? Is it my dependency on my cloud provider for some kind of auto scaling infrastructure that I don't even understand how I interact with it. I just send the request across the API and hope it works. You just have so many more places where that could be coming from. And I think that is a huge problem with microservices is it introduces a first order problem of where did it actually happen? Where did it actually start? And I think to add on to that, not only is that the problem, but you're having to dig through so much more data to find the answer, right? Because you're running containerized infrastructure. The whole point of that is so that you can auto scale, right? I can scale up. When I have more load, I can scale down. Should save me money, should make things easier to deploy, to run, to change over time. 
That also means that all those deploys and scales and runs and changes are introducing more data, more data volume, more data complexity. Right. You're going to have higher cardinality, more interesting facets of data that you're having to kind of rule in or rule out when you're trying to navigate to find out where things are coming from. And so you have that kind of compounding problem: it could be coming from a thousand more places than it used to, and I have a lot more data about each of those places to dig through than I used to have. And I would almost say, and this is kind of my own conclusion, but looking at those two things, the other problem I think microservices introduce is that we rely on fewer and fewer people at our organizations to really understand and be able to solve that problem of where the trouble is coming from. So I think you get this hero, right? And I saw this at Palantir, and I still see it at my current employer, although we're really doing our best to get out of this with the different tools we're developing. And I think we're making progress. But yeah, you have your organizational heroes. And for someone listening, you know, if you're working at one of these companies, think in your head: who do I call when the incident's really bad? Probably four names on that list. Right. And that's a problem. And I think that's really the ultimate problem that microservices have introduced: you're over-reliant on having the right people in the right incident room at the right time to fix a problem. And that's extremely brittle. What if those three people win the lottery? Right? Like now you're out of luck and you don't have anyone who can solve that problem. And I think that's a huge risk in companies that are running these microservices today.
Shawn Falconer
Yeah. This idea of a hero, for instance, has come up before in other interviews I've had with people who work in the space. So I'm not surprised to hear you talk about that. That is a huge, huge challenge. Essentially, with an incident, you have a bus factor of one, maybe three. So how do you fix that? What can organizations do to help solve that problem?
Julia Blaise
Yeah, I think the first thing you have to do is you have to know what data you need and what data you don't.
Shawn Falconer
Right.
Julia Blaise
Because we talked about that sort of data explosion just now. Not all of that data is necessary. And especially working at the company I work for now, I see customers consistently reduce the data they actually store, the data they have on hand to solve an incident, by 60%. That's a huge number. And that vastly simplifies the troubleshooting process, if you can just get rid of some of that noise so that every time you're facing an incident, you're working with a much smaller, more relevant data set. The other thing organizations can do, I think, is to make that data accessible without requiring expertise in the tool. I was actually talking to a prospect who said he used to work at Google, and he was like, the tools that our SREs built to dig through our data were terrifying, right? Like you build super complex things that only a few people understand. Or, you know, maybe if you don't build things in house, you purchase something from a vendor and it requires you to learn a new query language, right? It doesn't have to be in house to be complicated to learn. If you're having to learn new query languages, and you have some tools in house and some things you purchased from a vendor, and now you have to learn each one of those tools, you're just going to keep running into this problem. So I think it's understanding what data you need and don't need, and trimming it down to just what you need so you have a strong starting point. It's reducing your tool sprawl, or focusing on tools, maybe you do still have two or three, that are really easy to walk up and use, so you don't have to spend a lot of time learning how to work with the tool, right? The tool is kind of walk-up friendly, or built for your NOC or your novice user. And then the other thing is you just have to have tools that are also built to handle change, because your data is going to be changing all the time. And that's where the expert is often relied on.
Because no one remembers what happened two weeks ago. No one knows how that dependency arose or like why it's calling this thing anymore. So you have to have tools that can help you kind of understand the history and the context of where you are right now so that anyone can walk in and help fix. I don't think heroes are bad. I just think I'd rather invest in heroes as people to really dig into the root cause after the fire is put out, if that makes sense, right? Like let's make it really easy to put out the fire, get back to normal functioning, then bring in those experts when they have a little more time to go deeper, let them dig in. Like, let them ask complex questions, but use them in that way where it's less brittle and less something that you're depending on for your entire business revenue to keep working. Right?
Shawn Falconer
Yeah. I mean, I think the challenge with the hero approach is that it's just difficult to scale. Right?
Julia Blaise
Yeah.
Shawn Falconer
If someone's unavailable or whatever it is, you don't want to essentially be in a situation where you can't solve an incident because someone's on vacation. Right. I feel like in a lot of ways, because of these challenges that microservices and distributed systems bring in, there's been a lot of brain power put into how we improve on grep and logs. Like, how do we basically put a nicer software layer over the top of being able to do distributed grep?
Julia Blaise
And I think, you know, you're totally right. The hero problem, not only is it hard to scale, it burns people out. So, you know, that's another thing. In general, it's just a human thing: I don't want to burn out my friends, regardless of whether or not they scale. I don't want to put that pressure on them. But yeah, nicer tools are great. Tools that don't require you to learn a query language are great. I think there's also something here about learning to use different types of insights together to give you a holistic view, where you can kind of lean in and say, what can I learn from my metrics? What can I learn from my traces? What can I learn from my logs? How do I use those three things together? Instead of always going straight to grep for logs, can I actually use my metrics or traces to find a signal earlier, locate the place where it is, and then when I grep for logs, I'm grepping through a much smaller portion. Right. Like, I know kind of exactly what I'm looking for at that point, so it's easier to find. So we think a lot about what the right purposes of each of these tools are, and how we can combine those purposes to give you a faster-to-learn, easier-to-use experience.
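To make the "metrics first, then grep" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the service names, the per-minute error rates, the log records, and the helper names `worst_window` and `grep_window` are all hypothetical and do not come from any particular observability tool.

```python
import re

# Hypothetical per-minute error rates, keyed by (service, minute).
error_rates = {
    ("checkout", "12:01"): 0.01,
    ("checkout", "12:02"): 0.42,
    ("search", "12:01"): 0.02,
    ("search", "12:02"): 0.03,
}

# Hypothetical structured log lines.
logs = [
    {"service": "checkout", "minute": "12:01", "message": "payment ok"},
    {"service": "checkout", "minute": "12:02", "message": "ERROR upstream timeout"},
    {"service": "checkout", "minute": "12:02", "message": "payment ok"},
    {"service": "search", "minute": "12:02", "message": "ERROR cache miss storm"},
]


def worst_window(rates):
    """Use the metric to locate the (service, minute) with the highest error rate."""
    return max(rates, key=rates.get)


def grep_window(log_lines, service, minute, pattern):
    """Search only the log lines from the service and minute the metric pointed at."""
    rx = re.compile(pattern)
    return [
        line
        for line in log_lines
        if line["service"] == service
        and line["minute"] == minute
        and rx.search(line["message"])
    ]


service, minute = worst_window(error_rates)
matches = grep_window(logs, service, minute, r"ERROR")
for line in matches:
    print(service, minute, line["message"])
```

The point of the sketch is the ordering: the cheap metric lookup narrows the search to one service and one time window, so the log search runs over a much smaller slice and skips unrelated errors from other services.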
Shawn Falconer
So you talked a little bit about this idea of essentially cutting down the amount of data that you're storing, so that if you can cut things down by 60%, then it's just easier to deal with that volume of data, because you're probably going to have a better signal-to-noise ratio. How do you do that? Like, how do you determine how to cut out 60% of this data? Because I feel like when it comes to things like telemetry, logging, monitoring, people are just like, more data is better. I don't know what I'm going to need, so I'll just keep everything, essentially.
Julia Blaise
Yeah. The packrat problem is something that we hear customers talk about. Everyone wants to keep everything. There are a couple of ways we think about this. First of all, we think about what you're already using: what is in a monitor, what is in a dashboard, what are people searching for today with the query tools they have, what are your service accounts calling for repeatedly to do whatever financial analysis they might be doing on the back end. That's a great indicator of what data you need. And honestly, that's a small fraction of your data. It's shocking how small a fraction of your data that is. You can also kind of look at that alongside best practices. So, like, dashboard templates that exist out in the market for monitoring K8s, those are pretty well understood problems, right? You know you're going to need your container metrics, and you know you're going to need them to be at this kind of interval and in this kind of summarized way. It's pretty easy to look at that, look at what your people are using, and combine the two to get a really good understanding of, hey, what do we know we need because people are looking for it? What do we think we should have because this is what the market is saying? Let's put that together, let's see how much of my data fits one of those definitions. And everything that's not in that definition, let's throw it away, because there's no utility for it today. Now, I think the other piece of this is, I mentioned data change. What you need might change over time. So you also need tools that let you change what you're collecting or what you're storing very dynamically. It shouldn't take reinstrumentation and a redeploy to change what you're collecting. It should be something that you can do from a central location when you start to see something new. So you have a customer, they have a metric, no one's ever used the metric. They drop the metric.
Next week, they see a ton of queries against this metric and someone setting up a dashboard and those users are reaching out saying, hey, I can't find this metric. And you're like, oh, I didn't know you need it. Now if I have a tool that can just say like switch it on, you have it tomorrow. People are going to be a lot less worried about dropping data today when they know they can get it back as soon as they need it again. And I Think that's kind of the maybe two sided way of having this problem is first you look at what people need, trim down to just that, and then as those needs change, have dynamic tooling that lets you sort of adjust that collection to meet what people need.
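The two-sided approach described above (keep what is demonstrably used, and make dropping reversible) can be sketched in a few lines of Python. This is an illustrative sketch only, not how any vendor implements it: the metric names, the `plan_retention` helper, and the `CollectionPolicy` class are hypothetical.

```python
def plan_retention(ingested, dashboards, monitors, recent_queries, recommended):
    """Split ingested metrics into those with a known use and those with none.

    A metric is kept if a dashboard, a monitor, or a recent query references it,
    or if it appears in a recommended baseline (for example, a common template).
    """
    used = set(dashboards) | set(monitors) | set(recent_queries) | set(recommended)
    ingested = set(ingested)
    return ingested & used, ingested - used


class CollectionPolicy:
    """Central drop list that can be changed without redeploying services."""

    def __init__(self, dropped):
        self.dropped = set(dropped)

    def should_store(self, metric):
        return metric not in self.dropped

    def restore(self, metric):
        """Switch a previously dropped metric back on."""
        self.dropped.discard(metric)


# Hypothetical inventory of metrics and where they are referenced.
ingested = [
    "container_cpu_usage",
    "container_memory_usage",
    "http_requests_total",
    "debug_cache_probe",
    "legacy_queue_depth",
]
keep, drop = plan_retention(
    ingested,
    dashboards={"container_cpu_usage"},
    monitors={"http_requests_total"},
    recent_queries=set(),
    recommended={"container_memory_usage"},
)

policy = CollectionPolicy(drop)
stored_before = policy.should_store("legacy_queue_depth")
policy.restore("legacy_queue_depth")  # someone starts asking for it
stored_after = policy.should_store("legacy_queue_depth")
print(sorted(keep), sorted(drop), stored_before, stored_after)
```

Keeping the drop list in one central object, rather than in each service's instrumentation, is what makes the "switch it on, you have it tomorrow" workflow possible in this sketch.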
Shawn Falconer
Right. Now, can you talk a little bit about Chronosphere's differential diagnosis and how that potentially relates to this problem that we're talking about?
Julia Blaise
Absolutely. So differential diagnosis is inspired actually by what we saw these heroes doing when we went to talk to our customers and said, what's your process when you call in the hero? What do they do? And it's based on their diagnostic process. So DDX basically does what they do with one click. It takes all of the data about the thing that's having the problem, right? So I've narrowed it down to this particular endpoint and this service and it says, cool, what data do we have about that endpoint in that service? What are all the facets of that data? Let's take those and let's split them up into piles. Let's look at things that are bad and things that are good and let's compare those two things. So bad, good, you've got errors, you've got successes, you've got really high P99s versus really low P50s. DDX sort of does all that split up and dimensionality analysis for you and just presents you with the results so that you can start to find outliers. Because that's what these heroes are doing is seeing what's unusual about the things that look bad, right? How does that compare to the things that look good? What can I change? How do I make that bad data look like the good data? And DDX does all that for you in one click. I think really the power of this and how this works with all we've just talked about is it's doing that analysis on a really good set of data, right? Like we know this data is relevant, we know these facets are valuable. We're not doing it on a big pile of who knows what. That's probably full of noise. We're doing it on data that's hopefully already high signal so that you can trust the results. I'll also say one thing here, we haven't talked about this yet, but everything we do with DDX is transparent. So we kind of present you with user results. You can choose to act on them very quickly, right? Like the machine does the pattern analysis on the high scale data. You bring your human knowledge and context to that and say, yep, that Build version. 
I did forget to deploy that to Japan. Real life example. I need to go do that deploy, and everything will go away. But you also have to trust the system. So anytime you're doing kind of a human machine partnership like that, you really want everything the machine is doing to be incredibly transparent and verifiable by the human. Because I think we as humans, the first time a system says something looks funny, we say, are you sure? Let me check. You know, I don't know. I don't trust this. So the other way I think DDX works with all that we've talked about is it makes all of its results verifiable by the humans that are actually working with it so that you can trust it over time and learn to maybe like, let yourself lean on it a little bit more. Let your novice users bring you conclusions from that. Because, you know, as the expert, you can always go in and verify their results if you're suspicious and it lets you leave the room, it lets you leave the room and kind of leave that diagnostic work to maybe people who are less familiar with the system.
Shawn Falconer
So you talked about how some of the inspiration for this kind of comes from, like, looking at what the hero is doing. Is there also an inspiration from, like, the world of medicine? You know, if you look at differential diagnosis there, like, that's about distinguishing diseases from others that have, like, similar symptoms. There's some inspiration that came from that world as well.
Julia Blaise
Oh, absolutely. I should have said that earlier. I'm glad you brought that up. Yeah. We looked at this and we were like, what are people doing? They're dividing, you know, symptoms into piles. They're saying, what do these symptoms tell me? What do those symptoms tell me? How do I look at the difference between those to identify a most likely root cause? I happen to have, you know, family in the medical industry. I think a lot of us do. And we said, hey, that really sounds like differential diagnosis. And I have to give credit to the TV show Dr. House, another old TV show. Right. But that's what he was famous for: let me look at these things, let me compare them to each other, and I'm going to use that to get to a diagnosis much more quickly than I could by just looking at one symptom and seeing what that can tell me. So being able to do that kind of clustered, comparative analysis is what differential diagnosis is about. Absolutely, we drew inspiration from the medical world. Those people are some of the best diagnosticians that we have in the human population. So I love drawing inspiration from other fields where it makes sense.
Shawn Falconer
Yeah. And they've been dealing with like, big data problems before there was even the term, like, big data.
Julia Blaise
Oh, my gosh. Yes.
Shawn Falconer
Yeah. Like, I think the human genome was mapped back in the 90s or something like that, but.
Julia Blaise
And we still don't understand half of what it told us. Right.
Shawn Falconer
So in terms of DDX, like, how does this, like, work behind the scenes? Like, how is it figuring out how to point you in the right direction?
Julia Blaise
Yeah, like I said, it takes all those facets or dimensions. So when I say facet or dimension on your data, we're saying, let's look at all the labels, let's look at all the tags, let's look at all the values for each of those labels or tags, and let's look at the recurrence of them on the things that look bad. So very simply, let's say you have a pile of data. You've got a hundred requests, and 30% of them had an error and 70% of them did not. I don't even need to say percentages, right? Out of 100, that's 30 and 70: 30 had an error, 70 did not. We take those, we divide them up, then we say, okay, let's look at all the facets, so all the tag-value pairs on each of those piles, and let's see which ones recur the most frequently in each pile. So on 100% of the errors, do we see this build version? On 100% of the successes, do we see this cloud region? Right. And we go across those and we kind of order them from most prevalent to least prevalent in terms of tag-value pairs in each pile. And that's how we start to present you with those results: errors seem to have this build version, this cloud region, this user token in common. Right? Like those three things are on greater than 95% of your error requests. Then we go look at your successful requests and we do the same thing. Maybe on your successful requests, we see that it's a completely different build version. Maybe it's the same cloud region, so maybe that one's a red herring, right? Because it's the same prevalence on errors and successes. And maybe successes are then equally spread out across all your user tokens. So maybe it's something about this user's requests with this build version. Did the build version introduce some new validation, and that user happened to be used to sending in something that's no longer valid? Right.
Like you can start kind of correlating those two things together when you see that those are common in errors. And you can also take out the noise of what's common in both by doing that side-by-side analysis. In DDX you can also rank by different things. So you can rank all your results based on what you see in errors, so that you can see whether the things that are highly prevalent in errors are prevalent or not in successes. Because if it's the same prevalence, if the same tag is in 90% of your errors and 90% of your successes, you can rule it out. And I think that's the other part of this differential diagnosis: being able to build a hypothesis, I think it's this, and then prove or disprove it by iterating with the tool and being able to say, let me rule this out, let me rule this in. Let me continue to iterate and find the things that are the biggest outliers in my things that look bad so that I can fix them.
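The count-and-compare procedure described above can be sketched in Python. This is a toy illustration under stated assumptions, not Chronosphere's implementation: the request records, the tag names, and the `prevalence` and `rank_facets` helpers are hypothetical, and the scoring (error prevalence minus success prevalence) is just one simple way to surface tag-value pairs that are common in errors but not in successes.

```python
from collections import Counter


def prevalence(requests):
    """Map each (tag, value) pair to the fraction of requests that carry it."""
    counts = Counter()
    for req in requests:
        for pair in req["tags"].items():
            counts[pair] += 1
    total = len(requests)
    return {pair: n / total for pair, n in counts.items()} if total else {}


def rank_facets(requests):
    """Split requests into error and success piles, then rank tag-value pairs.

    Each row is (pair, error_prevalence, success_prevalence, difference).
    Pairs equally common in both piles score near zero and can be ruled out.
    """
    errors = [r for r in requests if r["error"]]
    successes = [r for r in requests if not r["error"]]
    err_prev = prevalence(errors)
    ok_prev = prevalence(successes)
    rows = [
        (pair, e, ok_prev.get(pair, 0.0), e - ok_prev.get(pair, 0.0))
        for pair, e in err_prev.items()
    ]
    rows.sort(key=lambda row: (-row[3], -row[1], row[0]))
    return rows


# Hypothetical data: 100 requests, 30 with an error and 70 without.
requests = []
for _ in range(30):
    requests.append(
        {"error": True, "tags": {"build": "v2", "region": "us-east", "user": "u1"}}
    )
for i in range(70):
    requests.append(
        {"error": False, "tags": {"build": "v1", "region": "us-east", "user": f"u{i % 7}"}}
    )

ranked = rank_facets(requests)
for pair, err, ok, diff in ranked:
    print(f"{pair}: errors={err:.2f} successes={ok:.2f} diff={diff:.2f}")
```

In this made-up data set, the build version ranks first because it appears on every error and no success, the user token ranks second, and the cloud region lands at the bottom with a difference of zero, which is the "red herring" case: equally prevalent in both piles, so it can be ruled out.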
Shawn Falconer
So in order to do some of this like pattern recognition behind the scenes, are you using some form of clustering algorithms to group these based on the different features essentially of the error, like the region that the deployment took place in?
Julia Blaise
Yeah, I think it's even simpler than that. We're looking at counts and then we're ranking them. Right. It's something that we are able to do, I think because of the fact that we can do these things at scale. And Sean, honestly that's as much as I know about this answer. So at that point that's where my knowledge ends. No problem. Yeah.
Shawn Falconer
Look, a lot of I think incidents tend to be related to like change management. Someone makes a change and then of course that results in some sort of like outage or spike in latency or whatever it is. Like, why is it that these kind of like user impacting incidents tend to be related to change management versus some other type of issue?
Julia Blaise
Yeah, I think we see that, absolutely. Like, the first step you take when you see an incident is: when was the last deployment, what was in the last deploy that could have caused this? Anytime you're changing code and kind of the code meets the road, you're going to open yourself up to pushing code paths that could not be covered in testing. Right. So when you're developing software, and I say this as a product manager, not a software developer, so apologies for any inaccuracies to my software developer friends who are listening, but what you do, right, is you feel responsible for evolving the code of your own service. So you say, I want to make this change. I think this change is going to make things faster. It's going to handle new data types. It's going to fuel this new feature that my product manager really wants to get out to customers. Great. I've written the code to make the change. What's my next step? Well, hopefully your next step is writing tests, right? You're writing unit tests, you're writing integration tests, you're writing end-to-end tests. You can't write tests for every possible facet and permutation of how this code will interact with all of those dependencies up and down the chain. Right? We talked about microservices, but it's not even just about microservices. It's about what it's going to encounter in the wild. Right? Muhammad Ali: "Everyone has a plan until they're punched in the face." Every code looks good until it runs in production. So you write your tests to cover as many reasonable, happy paths as you think will be tested in production. Then you release, then you hit production data, then you hit some weird configuration in one tenant that you didn't know would exist. There are always unknowns that happen when you reach that production release point that couldn't be covered in tests. And that's why the first thing people ask when they're trying to troubleshoot an issue is: what changed? 
Because probably the fastest way I can get out of the issue is to correlate it with a deploy and roll that deploy back. You know, if every incident could be fixed by rolling back a deploy or turning off a feature flag, people would be so much happier, because then you're out of the fire. Then you can bring in that hero to actually work on root-causing why that deploy had that problem, what kind of workload it ran into that it did not expect, and how to fix that. I think DDX also kind of brings that into play. Right? We can sort of show you your deploys in your system and let you kind of do that DDX analysis for things before and things after, to help you understand whether or not the deploy was the root cause, or what in the deploy changed. That can point you to what you need to go and fix before you roll it out again. But that's, I think, maybe a long-winded answer to your question of why change events matter, why deploys matter: they're the first point where the rubber meets the road, and the road always has turns that you don't expect.
Shawn Falconer
We started off by talking about, like, how do you essentially get away from having to depend on like, this, like, heroism that tends to happen within organizations to deal with these types of incidents. And DDX is attempting to do that. But what knowledge does a developer who's using DDX actually need in order to navigate and troubleshoot the services? Like, is there a heavy investment that they have to, you know, make in terms of like, okay, well, now there's this like, new tool that I need to understand and use in order to just like, sort of debug these types of issues?
Julia Blaise
Yeah, no, actually that was really important for us when launching DDX. You could go into your observability system because a monitor fired, right? Say it said, like, high error rate at this endpoint in my service. You could go directly from that to your service page. And if you sort of looked at that service page and said, yep, that looks like my problem, you could just click a button that said differential diagnosis and get to those results. You didn't have to learn anything else about the tool. That was really, really important to us. You didn't have to learn a query language. You didn't have to learn how to navigate a bunch of things that didn't feel familiar. Monitors feel familiar. Clicking a button that says diagnose this problem is a really easy single step to learn. It doesn't require you to learn anything about the underlying data. It doesn't require you to learn a new query language. It just gets you some results. Now, if you do want to go and sort of understand what's behind those results and look at all the data, we present it to you. We try to give you a UI to make it really understandable. At that point, maybe when you're first learning the tool, you want to go talk to someone and say, oh, what does this log mean? Right. Like, maybe I'm less familiar with that, but hopefully that's more about understanding your system and using the expert to help you build the context on your system, and not having to use your expert to help you use the tool or learn the tool. So making it something that was kind of a single button that started at something that was already really familiar, like a monitor, was really important to us in building DDX.
Shawn Falconer
Can you talk a little bit about this concept around hypothesis driven troubleshooting and like, why that's important?
Julia Blaise
Yeah, it gets back to kind of what those doctors are doing when they're doing differential diagnosis. They're looking at the symptoms and they say, okay, based on the symptoms that I see right now, I think it is Cushing's disease, or in this case, I think it is a cloud region. The next thing they do is go and try to prove or disprove that. Because the easiest way to get to a fix is to say, hey, based on the data I have, I think it's X. Now that I think that, what other data can I collect to prove myself right or prove myself wrong? Usually people walk into an incident and these heroes kind of have a few hypotheses off the top of their head, right? They're like, it's probably that we just spun up a new region. It's probably that we just did a deploy. It's probably that we just onboarded a new tenant and we don't know what their traffic looks like, and it's doing something funny with our APIs. So we see people coming in with those hypotheses. They needed a tool to be able to kind of filter down to the relevant data for that hypothesis and put it to the test to see if it correlated or not. This is about more than DDX, though. And I think, like, this hypothesis-driven testing and troubleshooting is something that we want to continue to bring into the Chronosphere experience as a whole. Because I think it's just so much easier to learn, it feels natural, it feels like what people are already doing. So much easier to learn that than to learn to say, well, if you write this Prometheus query and then do this kind of summary and then do a rate over this window, you'll be able to find out the answer, right? Like, that sounds a lot harder than: if you see the data and you think it's X, push this button to get the information about X to tell you whether or not that's true, right? And it's about doing troubleshooting based on probabilities across high-scale data, instead of trying to do troubleshooting by writing pinpoint-accurate Prometheus queries to try and give you a specific answer. 
I think also speaking of giving you a specific answer, one more thing I'll say here is hypothesis driven troubleshooting is about being honest with yourself about what the data is showing you. And I think often if we're trying to find an answer to the problem, it's really easy to go down the garden path and kind of give in to confirmation bias and find information that supports the thing you're already thinking about. So this whole idea of hypothesis testing is we'll give you data that hopefully lets you see really upfront whether you're likely to be right or whether you're likely to be wrong. Because we don't want you to go down the garden path, tell the incident room it's definitely this. Shut off all traffic to that cloud region, and then realize it was never that in the first place and you missed something else, and now you're still in that incident fire room. Plus you look bad and have egg on your face. The philosophy here is, you know, help people do what feels natural, help people stay away from accidental confirmation bias, and hopefully by building those kinds of interactions into the product, help them fix problems faster.
Shawn Falconer
Are those patterns for those different types of hypotheses, like, are they common enough across organizations that you could sort of encode them automatically into the product? So I can essentially say, like, okay, well, it's most likely one of these five things. And essentially, you know, I can click a button, some magic happens, and it tells me whether that is the case or not.
Julia Blaise
Yeah, it depends. You know, I wish I could give you a better answer than that. I would say there are some typical classes that often, you know, come up over and over. Imbalances in traffic between regions. That's why I keep saying cloud region. Right. Imbalances between tenants, so a specific tenant configuration. If you're someone who's, like, B2B and you're working with businesses, that can be a common cause. Problems across different environments, right, or K8s namespaces. Those typically come up over and over as easy ways to start identifying what the problem is, right, and why it's happening. That said, we give our customers the ability to kind of decide what they want out of the box, because each customer is a little bit different. Not every customer is B2B SaaS who wants to track things by tenant out of the box. Some people serve the general public. And doing this analysis out of the box based on customer ID is just going to be extremely noisy, because they've got, you know, 90,000 customers for their software, right? Like, that's just not going to be useful. So we do give them the ability to kind of tailor the experience for their organization. We also talk to them about adding instrumentation. So if they do want to add more custom tags or custom labels that this tool can then work with to help give them that out-of-the-box analysis, we'll talk to them about that and have them add that and then kind of use that in the software. So maybe some things are common, a lot of things are unique. We try to give people the ability to tailor the tool results based on what their organization sees in their incidents over and over again.
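One way to picture why a dimension like customer ID is noisy while region or tenant is useful is to count distinct values per tag. The sketch below is an assumption for illustration only: the conversation says customers choose which dimensions to analyze, not that DDX applies a cardinality threshold. The function name `usable_tags`, the `max_distinct` parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict


def usable_tags(requests, max_distinct=50):
    """Return tags whose number of distinct values stays at or below
    max_distinct (a hypothetical cutoff, not a documented DDX setting)."""
    seen = defaultdict(set)
    for req in requests:
        for tag, value in req["tags"].items():
            seen[tag].add(value)
    return sorted(tag for tag, values in seen.items() if len(values) <= max_distinct)


# Made-up traffic: every request has a unique customer_id, but only
# 3 regions and 10 tenants appear across the whole sample.
requests = [
    {"tags": {"customer_id": f"cust-{i}",
              "region": f"region-{i % 3}",
              "tenant": f"tenant-{i % 10}"}}
    for i in range(1000)
]

usable = usable_tags(requests, max_distinct=50)
print(usable)
```

With a unique `customer_id` on every request, no single value can be prevalent in either pile, so ranking on it tells you little; low-cardinality tags such as region or tenant are the ones where a prevalence gap is meaningful.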
Shawn Falconer
I think there's a lot of interest in a lot of companies now that are looking at how do you use AI, especially newer techniques around generative AI, to automate a lot of the things with troubleshooting standard SRE tasks. First of all, what are your thoughts on the likelihood that we'd be able to automate a lot of this stuff away? And how far away do you think any of that is actually from ever happening?
Julia Blaise
I'd love to say tomorrow. Right. Like that would be great. At the end of the day, my friends are engineers. They want to write code, they don't want to troubleshoot problems.
Shawn Falconer
No one loves on call.
Julia Blaise
Yeah, no one says, "Yes! I'm on the platform on-call rotation this weekend." So I'd love to say they're around the corner. I think a couple of thoughts I have at the more general level. AI, LLM, machine learning, call it what you want. All of these things rely on good data to work from. It's really easy for these things to hallucinate or to start presenting you results that, yeah, they look funny, but when you look at it as a human with your contextual knowledge about the problem, you're like, that's nothing, that's just noise. Right. So I think a big problem that we need to solve in order to make those kinds of tools effective in observability is solving the problem of the data they're working with and making sure that data is really high quality, so that we can start to trust their insights a little bit more. I think that trust is another piece of it. I think for these tools to work, they have to be able to tell you what they're doing so that you can verify it. Like I said, every time I work with a customer and they say, oh, I have an anomaly, I say, cool, what do you do when you see that anomaly? And they're like, well, I go see if it's right. Right. Like, I don't really trust the system. Right. And I need to be able to understand what's in that black box. I don't want any black boxes, actually, let me just put it that way. No black boxes. When it comes to AI, you need to be able to tell me how you came to that conclusion. You need to be able to replicate it, essentially, so that I can watch what happened and trust it. And we need to build up that trust over time so that these don't become systems that are just sort of training your developers to not look at them, if that makes sense. Right. So I think we need to solve those two problems for them to be really transformationally effective. I think there's potential, you know, I think, like, we can certainly progress down that path. It's a place where we as a company would like to invest. 
I think in some ways we have an advantage because we're starting with a really good dataset for all of our customers, because of that trimming that we talked about earlier. But we're really concerned about being able to be transparent, being able to take out that noise, building a system that people will trust out of the gate, and building a system kind of with that caution in mind. I certainly don't want to build something that goes all the way into automated rollbacks and then no one can get a deploy out. And I'm in a sev because no one can do a deploy because the system keeps rolling it back. But I can't see in the box why it's deciding to roll things back. And now I'm just like, ah, turn off the AI. Right. I'm so frustrated. I don't want to get into that scenario. And I think that's also a potential if we're not careful with how we develop these tools for observability.
Shawn Falconer
Yeah, I mean, I think that, you know, when it comes to, like, leveraging AI for things like writing code, there's some advantages there that maybe don't exist as much when it comes to observability types of tasks. Because one, there's a massive amount of code that you can use to sort of, like, train this stuff on. But additionally, even though most people using your GitHub Copilots of the world sort of inherently don't necessarily trust the outputs, there's a lot of checks and balances, essentially, between, like, copying a piece of code from wherever to it actually hitting production, because it's probably going to go through integration tests, like, CI/CD, there's a couple of compilation processes. So ideally, if there is some major, you know, obvious mistake, it would be caught in there. I think that's a little bit more challenging when you start to get into sort of these largely human-driven processes of sort of, like, debugging and trying to figure out what's going on, and it might actually be a fairly complex set of things that you need to adjust in order to solve some multi-tenant error or, like, new region deployment issue.
Julia Blaise
Right. Human-driven processes with human-driven problems behind them. Right. Like, humans are the hardest thing for an AI to figure out because we're always changing. We are the most confusing creatures on the planet. Therefore, the kinds of problems we can introduce to a system are tremendous. So yeah, I think you're right. That's really hard to tackle with AI. Whereas with something like writing code, you've got so much testing, like you said, built in to validate that. You've also probably got some knowledge about what looks good and what looks bad. I think there are other use cases too where it makes sense in this kind of situation, like explaining what you see. Right. Like, if you are looking at a dashboard and a PromQL query and you're like, I don't understand what this query is trying to do, that's a really good place to put an AI to help you with that human language translation, for something where, again, you have a ton of data available on what PromQL queries mean out on the Internet. Right. Like, look at Stack Overflow, you probably have great data there. So there are places for it. I think sort of solving the troubleshooting problem is a really hard problem that'll take us some time to get to.
Shawn Falconer
Yeah. So I guess, like, I think the take here would be there will probably be some assistive technologies, like, built in to maybe help make people more efficient, but you're not going to be able to just have some AI magic black box that just solves all your problems.
Julia Blaise
Yeah, maybe someday, but not now.
Shawn Falconer
So based on that, like where do you see things in observability and troubleshooting tools like evolving over the next couple years?
Julia Blaise
Yeah, I mean, certainly in observability everyone says OpenTelemetry, open standards. I'm gonna say it. I think we do. I hear it more and more from customers. No one wants vendor lock-in. And by vendor lock-in I mean no one wants proprietary formats that they don't control. It gets to that no-black-boxes thing. It gets to the kind of, like, I can choose where I put what tools I use, how I combine and recombine this data. So I do think, like, an observability theme will continue to be OpenTelemetry. That standard has also matured tremendously. So I think we're close to a tipping point where it becomes easier to adopt OpenTelemetry than to not adopt OpenTelemetry. I think the other thing is, I mentioned tool sprawl, right, and needing to learn a lot of tools. I hope that a trend in the coming years is kind of fewer and fewer special-purpose tools (this is my log tool, this is my trace tool, this is my event tool) and more platform-based tools where we do kind of bring relevant insights together to give you a full picture. I hope for my engineer friends' sake that that is true. I just think we're going to get better insights when we bring the data together and can combine analysis from all sides. AI is of course not going away. We're going to see it evolve. Right. I would be remiss if I didn't say that that's going to continue to be a trend in observability. I hope we see it progress. I'm really excited to see what it can do, and I hope that we can do AI in observability with accuracy and with transparency so that developers can really start to lean on that tool and trust that tool. And I guess the other thing is I just see the acceleration of data accumulation, like, that data growth. That's just going to keep getting faster. Right. We're going to keep doing microservices. People are going to migrate over to containerized infrastructure. The data volume problem is not going away, and if anything it's going to grow.
Shawn Falconer
Yeah. I think that first point that you made about moving to these open standards like OpenTelemetry, I think that's a trend that we're seeing across the industry, even outside of observability. If you look at sort of the decomposition of the warehouse, investment in open table formats like Iceberg, and then even infrastructure as code, with Terraform and then OpenTofu and things like that, like, people don't want to be vendor locked in, essentially. And I also think your point about sort of moving away from some of these, like, point-solution types of approaches to more of a platform where you're bringing a lot of this data together makes a ton of sense because even outside of your, you might only have like a snapshot of what's really going on if you're using these, like, more narrow point solutions. No one wants to also have to go to, like, seven different tools to try to figure out what's going on.
Julia Blaise
You're right. You're so right. I don't want to have to do that.
Shawn Falconer
Absolutely. Julia, this has been great. Anything else you'd like to share?
Julia Blaise
Sean, thank you so much. It's just been a pleasure, and, you know, I hope we can talk again when AI really starts to transform the observability industry and talk about what that's doing and how that's going to work in the future.
Shawn Falconer
Fantastic. Well, thanks so much for being here and cheers.
Julia Blaise
Thank you.
Podcast Summary: Troubleshooting Microservices with Julia Blase
In this insightful episode of Software Engineering Daily, host Shawn Falconer engages in a deep conversation with Julia Blaise, a Product Manager at Chronosphere. Julia brings her unique perspective on troubleshooting distributed systems and microservices, drawing from her diverse background and extensive experience in the tech industry. The discussion delves into the complexities of microservices, the challenges they introduce, and innovative strategies to streamline troubleshooting processes.
[01:38] Julia Blaise:
“I started out as a librarian... working at the Library of Congress to digitally focused librarianship. Transitioning to tech was driven by my passion for organizing and analyzing data to provide insights.”
Julia’s unconventional career path from librarianship to technology underscores her expertise in data organization and analysis. Her tenure at Palantir allowed her to immerse herself in observability, working closely with government agencies to manage and troubleshoot complex data systems. This experience naturally led her to Chronosphere, where she now focuses on developing tools that enhance developer efficiency in managing distributed systems.
[09:20] Julia Blaise:
“Microservices introduce a first-order problem of where did it actually happen? Where did it actually start?... there are so many more places where that could be coming from.”
Julia articulates the inherent complexity of microservices compared to monolithic architectures. While microservices offer agility and scalability, they exponentially increase the potential points of failure. This distributed nature makes it significantly harder to isolate and identify the root causes of issues, necessitating more sophisticated troubleshooting tools and strategies.
[12:15] Julia Blaise:
“You're over-reliant on having the right people in the right incident room at the right time to fix a problem. That's extremely brittle.”
One of the critical challenges Julia highlights is the dependence on specialized “heroes” within organizations who possess the deep expertise required to resolve complex incidents. This reliance is risky and unsustainable, as it creates bottlenecks and vulnerabilities if those key individuals are unavailable.
[16:26] Julia Blaise:
“If you can cut things down by 60%, then it's just easier to essentially deal with that volume of data because you're probably going to have less noise.”
Julia emphasizes the importance of filtering out unnecessary data to focus on high-signal information. By reducing data noise, teams can streamline the troubleshooting process, making it more manageable and efficient.
[14:40] Julia Blaise:
“Making data accessible without requiring expertise in the tool is crucial. Tools should be walk-up friendly or built for your novice user.”
Simplifying data access ensures that a broader range of team members can engage in troubleshooting without needing specialized training, thereby democratizing the process and reducing dependence on experts.
[18:49] Julia Blaise:
“Differential Diagnosis is inspired by what heroes do during incidents... It takes all the data about the problematic endpoint and splits it into 'good' and 'bad' piles to identify outliers.”
DDX is Chronosphere’s innovative tool designed to emulate the diagnostic processes of expert engineers. By automatically analyzing and comparing different facets of data, DDX helps identify the root causes of issues swiftly and accurately.
Data Segregation: DDX divides incoming data into "good" and "bad" categories based on specific criteria like error rates or latency spikes.
Facet Analysis: It examines various dimensions (e.g., build version, cloud region) within these categories to pinpoint anomalies.
Outlier Identification: By highlighting what differs between the good and bad data, DDX surfaces potential causes of incidents, enabling faster resolution.
[23:00] Julia Blaise:
“We rank results based on what is highly prevalent in errors and low in successes... helping you rule out what's common in both.”
This methodical approach ensures that troubleshooting is both comprehensive and targeted, minimizing the time spent sifting through irrelevant data.
[30:30] Julia Blaise:
“Hypothesis-driven troubleshooting is about being honest with yourself about what the data is showing you... helping people fix problems faster.”
Julia advocates for a structured approach to troubleshooting, akin to medical diagnosis, where hypotheses are formed and tested systematically. This not only accelerates issue resolution but also reduces the risk of confirmation bias, ensuring that teams remain objective and effective.
[35:17] Julia Blaise:
“AI relies on good data to work from... we need to build trust by making everything transparent and verifiable.”
While AI holds significant promise for automating aspects of troubleshooting, Julia cautions against over-reliance. She underscores the necessity of high-quality data and transparency to ensure AI-generated insights are trustworthy and actionable. The integration of AI must complement human expertise, rather than replace it, to avoid potential pitfalls like erroneous automated rollbacks.
[39:59] Julia Blaise:
“OpenTelemetry is becoming easier to adopt than not. We hope to see fewer point solutions and more platform-based tools that bring data together for comprehensive insights.”
Julia envisions a future where observability tools are more unified and standardized, reducing tool sprawl and enhancing data interoperability. The adoption of open standards like OpenTelemetry is pivotal in achieving this integration, fostering a more cohesive and efficient observability ecosystem.
Throughout the episode, Julia Blaise provides a compelling narrative on the evolution of troubleshooting in microservices environments. From her unique background to her innovative work with DDX, Julia offers valuable insights into overcoming the complexities of distributed systems. The discussion underscores the importance of reducing data noise, democratizing data access, and leveraging structured troubleshooting methodologies to enhance system reliability and developer efficiency.
[42:33] Julia Blaise:
“I hope we can talk again when AI really starts to transform the observability industry and talk about what that's doing and how that's going to work in the future.”
As the observability landscape continues to evolve, Julia’s perspectives highlight the critical balance between automation and human expertise, setting the stage for future advancements in the field.
Notable Quotes:
Julia Blaise [01:38]:
“Going from information to insight was my goal, and that led me naturally from librarianship into tech.”
Julia Blaise [09:20]:
“Microservices introduce a first-order problem of where did it actually happen? Where did it actually start?”
Julia Blaise [12:15]:
“You're over-reliant on having the right people in the right incident room at the right time to fix a problem.”
Julia Blaise [18:49]:
“Differential Diagnosis does what heroes do with one click.”
Julia Blaise [30:30]:
“Hypothesis-driven troubleshooting is about being honest with yourself about what the data is showing you.”
Julia Blaise [35:17]:
“AI relies on good data to work from... we need to build trust by making everything transparent and verifiable.”
Julia Blaise [39:59]:
“OpenTelemetry is becoming easier to adopt than not.”
This comprehensive summary captures the essence of Julia Blaise's insights on troubleshooting microservices, emphasizing the need for smarter tools, structured methodologies, and the judicious use of AI to navigate the complexities of distributed systems effectively.