
Eric Schabell
Modern cloud native systems are highly dynamic and distributed, which makes it difficult to monitor cloud infrastructure using traditional tools designed for static environments. This has motivated the development and widespread adoption of dedicated observability platforms. Prometheus is an open source observability tool designed for cloud native environments. Its strong integration with Kubernetes and pull-based data collection model have driven its popularization in DevOps. However, a common challenge with Prometheus is that it struggles with large data volumes and has limited cost optimization capabilities. This raises the question of how best to handle Prometheus deployments at large scale. Eric Schabell works in DevRel at Chronosphere, where he's the Director of Community and Developer. He is also a CNCF Ambassador. Eric joins the show with Kevin Ball to talk about metrics collection, time series data, managing Prometheus at scale, trade-offs between self-hosted and managed observability, and more. Kevin Ball, or K. Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K. Ball on Twitter or LinkedIn or visit his website K. Ball LLC.
Host
Eric, welcome to the show.
Eric Schabell
Hey, thank you very much. Nice to be here.
Host
I'm excited to get to talk with you. So let's maybe start out with a little bit of introduction about you and what brings you here and maybe start touching on Prometheus and Chronosphere.
Eric Schabell
Sure. So my name is Eric Schabell. I work at Chronosphere, an observability company. I have a position that has a pretty long, weird title, so I generally just introduce myself as the Director of Evangelism. You recognize in a startup that stuff gets pretty fluid. Every year we want to reset the targets, reset the focus, you know, so things are constantly growing and evolving in different directions and you end up getting more and less and other things under your umbrella. So I do a lot of stuff around DevRel, basically, in the observability space, with my team. That's a good way to put it.
Host
Awesome. Well, let's maybe then talk about what is Chronosphere and what is Prometheus, which we came on to talk about.
Eric Schabell
Yeah, as Prometheus is the topic we decided to discuss here, I'll kind of twist it in that direction. So Chronosphere is an observability platform. It's a SaaS offering that ingests open standards. The Cloud Native Computing Foundation is something that me and my team are a big part of. Two of us, including me, are ambassadors of the CNCF. We like to help out, contribute, and talk about and do everything we can to promote, you know, open source, in the sense that that's what we believe to be the best way to get started with basically anything. I devoted most of my life in the app dev space to it before I came over to the observability side. So it's just a continuation of what I normally talk about in relation to Prometheus and OpenTelemetry and Perses, a new project, and Fluent Bit, which is also something that we're very integrated with, and the founder and a couple of others work at Chronosphere. All these things are the standard way of delivering and communicating over telemetry protocols. Collecting, storing, querying, that kind of stuff is all integrated into our offering. So anybody that starts out initially at a smaller scale in the open source world finds themselves growing and scaling and having more and more difficulties, managing teams that are growing and that are spending all their time managing the infrastructure and trying to make it all scale. I find it quite nice to be able to unplug something like that and find a vendor that works with open standards and is easily able to simulate and recognize the query protocols or the query languages and the things that we use from Prometheus.
Host
Cool. So let's maybe dive into what Prometheus is. I saw it described as a cloud native observability platform. What does that mean?
Eric Schabell
Yeah, so Prometheus is a metrics collection project, is the best way to put it. Metrics are one of the signals that you would want to collect in an observability platform. There are some very unique things about what they do. So this was originally written back in the day, I believe, by some people at SoundCloud. And the idea being that it has to be highly performant, it has to be highly scalable, and they want it to be as unobtrusive as possible. So one of the things that it does is it scrapes, which means it goes out and gets the data from an endpoint and doesn't require you to set up things to push data to it. It doesn't involve collectors, it doesn't have agents, it doesn't involve anything like that. They allow you to fine-tune that as you wish as a developer. So you can use, you know, auto-instrumentation, they call it, like you just flip a switch on a Java library and it starts spitting out all kinds of Java metrics. Usually not a great experience, because Java metrics are just a watch list of stuff that's just crazy. So you want to get more specific. So then you can use that library to trim that stuff down and deploy applications again with the proper instrumentation. And they usually expose some kind of endpoint where they publish these metrics. And you set up Prometheus to go out there and, you know, scrape these endpoints every so many seconds. 10, 5, 15, whatever you choose to do. You also need a backend for this. By default, if you're just playing around with Prometheus, it just does it in memory. But you really don't want to do that in real life. So you have to put some kind of backend storage in there. And what you're storing is a time series. So normally in a database query, you know, it's a finite set of whatever that you're collecting, like select star on whatever table and get this stuff and get a bunch of data. Here, you're basically trying to represent constantly collecting data on a machine.
So for example, CPU usage constantly, every second of what's going on, in milliseconds or whatever, is a little bit hard to store and would probably overload almost any organization's network in no time, especially when you're doing it across thousands of machines. So they have a way of sampling the data and then re-representing that when you do queries. So that allows you to do time series queries and figure out what's going on and create the dashboards and all that kind of stuff. It also provides alerting mechanisms, so it has the ability to set up alerts and rules around that kind of stuff that then trigger and send off and can integrate with things like PagerDuty or, you know, send you a Slack message or whatever it is that you're using in your org. The query language is also pretty much a standard in the industry. It's called the Prometheus Query Language, PromQL. Doing that, you can generate database queries, basically, against the stored metrics data and create visualizations. And there have been embedded dashboards inside of Prometheus, but that's not really generally how people do that. They tend to query with an external dashboarding tool. That used to be centralized an awful lot in Grafana, but there are up-and-coming projects. The very first visualization and dashboarding project just reached sandbox status this last fall. At KubeCon EU here, one of the little stands in the project pavilion is the Perses project, and they're basically focusing first on the metrics querying through Prometheus instances. They're starting to expand into traces and some of the other stuff. It's a very young project. Yeah, but that's pretty much the infrastructure, what you're going to end up with in Prometheus.
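To illustrate the pull model described above (this sketch was not part of the conversation), here is a minimal Python example of an application exposing a `/metrics` endpoint in the Prometheus text exposition format, which a Prometheus server could then scrape on an interval. The metric name `http_requests_total`, its labels, the port, and the helper names are assumptions chosen for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by (method, status code).
REQUESTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counters):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever a scraper asks for them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The application never pushes anything: it only answers when the scraper asks, which is the unobtrusive behavior described in the answer above.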
Host
I'd love to dig in on a couple of different pieces there. So first, talking about the sort of piece by piece you talked about, one of the things that's distinct about Prometheus is it is going out to find the data and things like that. Can we talk a little bit about that? I saw particularly in like the cloud native environment, there was some stuff around service discovery and how do you find things within Kubernetes to go and look things up? So how does that all work and what does it enable?
Eric Schabell
So initially, some of the links I sent you, which I assume you'll put in the show notes, are around the workshop that we have. You initially start by doing that statically, just defining where to go, what endpoint, what, you know, IP address, where are the targets I'm trying to scrape. And that's fun and games when you have one or two things, but you don't want to be managing a list of thousands of those things. And let's be honest, Kubernetes is set up to be an environment that you write a description for and say go out there and run these services, and when certain things happen, react this way. But that's about all the control you have. So you really don't know what their IP addresses are going to be. You don't know where they're going to be located and how many there are going to be. So they have something called service discovery, and you can integrate various tools that do that. They support several different ones. Zookeeper is a pretty famous one that people know. And what that does is just monitor the space you're running your clusters in. And so when new nodes, new pods, new containers and stuff come up, it just automatically adds those to the list and starts looking for endpoints and scrapes those also. And generally speaking, you've ensured that those are providing those endpoints, otherwise you're going to have a lot of down targets. And yeah, that's kind of how the service discovery works. That makes it all dynamic, and that's much more realistic for a cloud native environment.
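The service discovery idea above can be sketched as a small reconciliation loop: take whatever the discovery source currently reports and diff it against the list of targets being scraped. This is an illustrative Python sketch only, not how Prometheus implements it; the pod dictionaries, the `ready` field, and the port `8080` are hypothetical.

```python
def scrape_url(ip, port=8080, path="/metrics"):
    """Build the scrape URL for a discovered pod (port and path are assumed)."""
    return f"http://{ip}:{port}{path}"


def reconcile_targets(current_targets, discovered_pods):
    """Return (desired, added, removed) target sets.

    current_targets: set of URLs currently being scraped.
    discovered_pods: list of dicts such as {"ip": "10.0.0.2", "ready": True},
                     a stand-in for whatever the discovery source reports.
    """
    desired = {scrape_url(p["ip"]) for p in discovered_pods if p.get("ready")}
    added = desired - current_targets      # new pods that just came up
    removed = current_targets - desired    # pods that went away
    return desired, added, removed


current = {"http://10.0.0.1:8080/metrics"}
pods = [{"ip": "10.0.0.2", "ready": True}, {"ip": "10.0.0.3", "ready": False}]
desired, added, removed = reconcile_targets(current, pods)
print(sorted(added), sorted(removed))
```

Running this reconciliation on every discovery update is what removes the need to maintain a static list of thousands of addresses by hand.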
Host
Yeah, so another thing that you talked about here and sort of alluded to: you said these are collecting time series, as contrasted to, you mentioned, another project that's doing logging or tracing or things like that. Can you maybe dive in a little bit on what is the distinction between time series collection versus what you might get in a logging or tracing type of observability?
Eric Schabell
Okay, yeah. So a little bit of the difference between metrics, logs and tracing. So starting with logs, that's very standard. It's like text lines, right? Going right down the chain, right? Just a bunch of data that you're storing there, and it's log lines. That's pretty normal. Maybe you turned it into JSON with something like Fluent Bit that parses it into machine-readable stuff, but it's still text, pretty normal storage. Traces are also something where you're setting points in your service calls that you're catching and tracking as a call goes by, which is also a pretty straightforward kind of data set that you're putting together. It's not a constant stream of stuff. And then the metrics, where they have to have a lot of really smart ways to deal with this because of things like cardinality explosions. So if you look at object-oriented programming and you make an object for a person, and it has a name and it has an age, has a Social Security number, has whatever, an address, these are all like metrics with labels. So your person might be the metric and the labels might be all these things underneath it. And if you happen to put something in there like the IP address of the machine that you're getting this information from, that's a unique thing. And every time this gets sent out, you're getting a unique thing you have to save in the database. That's a cardinality explosion. And the more these things can create that kind of stuff, that's where you get the mistakes, where it gets really hard to query this kind of stuff. And in a time series you want to have a way to store this kind of streaming data, these massive amounts of data, in a way that you can measure it periodically and just connect the lines between those short little periods. That's the best analogy I can give you around that.
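The cardinality explosion described above comes down to multiplication: in the worst case, the number of distinct time series for one metric is the product of the number of distinct values of each label. A short Python sketch of that arithmetic, using made-up label names and counts:

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series for one metric: every combination
    of label values that could occur becomes its own series."""
    return prod(label_cardinalities.values())


# Hypothetical labels with small, bounded sets of values.
bounded = {"method": 4, "status_code": 5, "region": 3}
print(series_count(bounded))  # 60

# Add one unbounded label, e.g. a client IP with 10,000 distinct values.
unbounded = dict(bounded, client_ip=10_000)
print(series_count(unbounded))  # 600000
```

One label with effectively unique values, such as an IP address, multiplies everything else, which is why it is usually kept out of metric labels.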
Host
I guess then a question becomes how do you decide when is it you want to use time series collection as compared to sort of full traces or something along those lines. Is it when you start?
Eric Schabell
Well, I mean, don't get too wrapped up in the way it's stored. I mean, time series databases are just needed for metrics. They just are. So all of them require this. It's not an option where you can choose one or the other. If you're looking at the signal type that you want to collect, whether I should use metrics, logs or traces, that kind of has to do with what you're trying to observe and what you're trying to watch. So if you want dashboards with visualizations of what my infrastructure looks like, you have things like thresholds. You're setting and watching for things that get beyond the threshold. When you're looking at resource consumption, when you're looking at network lag and things like that, these kinds of measurements only come through metrics. When you're looking to find out where my calls went through my service network, that's traces. And logs are just dumping information that every developer put into his app. Every application that's running on that machine is logging something. Right. It's putting messages in the log that I started up, that I'm having trouble, that I'm dying now, or whatever the deal is. It's three distinct signal types that you're using.
Host
Yeah, that makes sense.
Eric Schabell
And each one is probably stored a little different. Yeah, well.
Host
And part of why I'm asking this is I think metrics are much more common in sort of operational monitoring and things like this. And a lot of software developers are thinking, oh, I want my logs so I can debug or trace. So I'm kind of translating that mindset into what happens here. So let's maybe look a little bit further. You have your metrics, they're stored in Prometheus, and you're starting to want to query them or think about them. What types of questions (you've already started answering this) are best answered by this data, and what does the query language look like to explore them?
Eric Schabell
I think you've kind of alluded to it there at the end. So I think a lot of DevOps, modern organizations are no longer in that boat where developers are just like, I want to see my logs. That was when I was a developer. That was like 15 years, 20 years ago. Right. I used to make jokes about this with a couple of guys here, because I do quite a bit of stuff in Fluent Bit now, and there's a lot of logs and stuff that we're simulating and dealing with. I used to spend time at the coffee machine because the machines were slow enough to compile the Java code we were writing that you had time to go get coffee and talk about problems you were having. One of the things we used to discuss quite a bit was trying to figure out how to make a good Java exception message for the logs, because some human's reading that downstairs on the third floor, right? That's your ops guy. Now it doesn't work that way anymore. That's not really what you're focusing on. You want that complete picture of your cloud native environment. So in a place deploying apps and services, your team owns the service and you want to have a visualization of that service and everything about it. And luckily we have a lot of organizations dealing with things like platform engineering, SRE teams and all this different stuff that have a really good idea of how they want to do this and provide a developer view, provide an SRE view for whoever's carrying the beeper, who has this stuff and knows who to call if something breaks. But when you own a service like that, you're not so much digging around looking at individual metrics. You're having a predefined look at your running object, element, whatever, the service that you own, and that's your landing page, that's your heart monitor for what you own. And depending upon what this organization has gone through, they pretty much have a good idea what they want to know.
If I'm the order service, I want to make sure nothing's blocking orders. That's where the money comes in for the retail org. So they'll be watching really closely the connections to whatever credit card companies they're using, or maybe the incoming stuff from the website. And if something starts slowing down orders and they see that little thing go all the way down like this, maybe they're going to say, hey, you know, it's time to look at our dashboard. They should be able to really quickly find out where the problem is. Are the orders not coming in? Are they not being processed out? Is it timing out going to the credit card company, you know, that kind of stuff. Yeah. So I think what you end up with is having their own view of their world. Right. And that's a combination of all three of those things. Not necessarily just metrics or just logs or just traces. Not saying that that doesn't exist, but generally you don't want to set up your dashboards with here's my metrics, here's my logs and here's my traces. And then when the SRE's beeper goes off, he or she is like, let's look at a log, let's look at traces. No, what you're trying to get is a high-level view you can drill down into to get to the problem.
Host
So with Prometheus then Prometheus is still just providing one of these pieces, which is the metrics tracking. And I think as you highlighted, that's probably why people are then setting up queries to query against it from an external rather than using built in visualizations because they want to integrate this with their logs, their tracing, things like that. What does that end up looking like for someone who hasn't for example, written something in PromQL like what is this as a query language, what does it look like and how does it work?
Eric Schabell
I'm going to say right up front it's not easy. This is something that a lot of people struggle with, including me when I'm writing labs and digging around and trying to find the right thing to give an example of. But within the workshop that I alluded to, there are about four chapters that are involved with learning PromQL and kind of walk you through all the various aspects of it. There are so many different ways to represent your data. You can do a simple counter, you can do a simple, like, tachometer kind of thing. You can do a graph, you can do histograms, which, you know, histograms are going in the direction of looking at a graph over time. It's so complex what they're looking at that they're basically bucketing chunks of it. So the first minute is bucketed, the second bucket represents the first bucket plus the second bucket, and it gives you a different look at your data. You have heat maps. You ever watch baseball when they show the pitcher's throws and which pitches get hit the most, you know, and it gets red in the areas where they get hit the most? You can represent your data or your view of what's going on that way too. You have topologies for your service calls, where you're basically showing the network of the services and how the calls are going, and the lines get thicker the more calls that go over them. PromQL gives you the ability to do that. And some of the tooling embedded right now into the newest Prometheus is really nice, where they explain the query. So you're putting the query together, and it has command line completion in there, luckily, helping us find the metrics that are available at the point you're trying to do that. And you can see the result sets, you can dissect them, because you're also doing sort of like queries within queries to build up a bigger query. You'll apply things like rate across a query, to kind of generalize a specific query out over time.
Maybe you'll only want to look at a small window, only five minutes or only one minute or over 10 minutes. There's so much flexibility in that. And yeah, it depends upon where you want to land. So generally speaking, say you're a service owner. When you come in there, you have a couple of things that are important to you. Probably that it's up, you know, so you'd like to have a green bar when it's up and a red one when it's not, right, or a yellow one when it's degrading or things like that. There'll be a threshold when it starts degrading. It's allowed to get to a certain point and then it's a problem. Things like that are the very high-level look. And then when things start going wrong, you have to be able to dig down into it. So if it's a specific service, maybe I want to go look at the traces, see one of those topology kind of graphs and see that, hey, this one is not getting anything. Click on that and I can go down and look at the trace, and then maybe I can look at the log for that specific thing. And in the old days we liked to talk an awful lot about observability being the pillars, where they talked about logs, metrics, traces being the three pillars. For quite a few years now, we've been saying at Chronosphere that we think it's the phases. I think we should speak a business language. It's much more complex than an individual tool. And I like to tell the story that it's just like you driving around in an old car you just restored. You're really, really proud of this old car, and you're driving along and all of a sudden the temperature of the engine starts going up and it starts getting kind of weird how it's driving. You're like, oh, I'm close to the mechanic I use. Let's whip in here quick and see what he says. You pull up and it's getting worse and sounds bad. And you get out and he comes running out and he goes, oh, come over here and look at this. And he starts opening up a toolbox and showing you all the tools.
Meanwhile, you look outside the window and your car is overheating, catching on fire. And he's in here talking about these great tools he's got. So we like to talk about the phases of observability, where you want to know as fast as you can what's wrong. And then you want to be able to triage it as quickly as possible, preferably fix it with remediation. And you want to be able to go back and look at root cause analysis and find what the long-term solution is for this thing. And to do that, we don't care whether it's two metrics, one label, you know, three traces and half a log line. It doesn't really matter as long as I can get to the answer as quickly as possible and fix the problem. And I think that's pretty much modern observability in a nutshell, where Prometheus has a very important role in helping collect, manage and display these kinds of stories.
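To make the idea of applying a rate over a small window a bit more concrete, here is a deliberately simplified Python sketch of a per-second rate over a counter's samples. It is not PromQL's actual `rate()` algorithm, which also extrapolates to the window edges; the function name `simple_rate` and the sample data are assumptions for illustration.

```python
def simple_rate(samples, now, window):
    """Per-second increase of a counter over the last `window` seconds.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset (for example, the process
        # restarted), so the new value counts as an increase from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


# A counter scraped every 15 seconds, growing by 30 per scrape.
samples = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 120)]
print(simple_rate(samples, now=60, window=60))  # 2.0
print(simple_rate(samples, now=60, window=30))  # 2.0
```

Changing `window` is the equivalent of choosing one, five, or ten minutes in the conversation above: a wider window smooths the result, a narrower one reacts faster.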
Host
Let's maybe if we can make it a little more concrete by going through an example. So let's say I don't know if we want to use a car example or something like that. You have a service you're monitoring. Yeah, maybe let's take it from ground up first. Like what do you need to configure to get Prometheus starting to track things in there? Is there a thought process around what needs to be tracked or everything or it's there by default. Then how do you design those queries that are going to give you that dashboard that tells you oh shoot, my car's on fire, what do I need to do here? And then take it through that route?
Eric Schabell
Right. So for Prometheus itself there's exporters, for example. So let's say that your service is a node service. Node.js, it's very easy to turn on node exporter. It creates an endpoint and generates a bunch of node-type metrics that are pretty standard. You're able to get quite far with just doing that. And what's nice about that is you don't have to redeploy anything. There are no code changes involved. You target the service, you turn the exporter on, and you start watching what's going on. Once you start collecting data, you now have all the generic metrics around things like CPU usage, what the memory usage is. I don't know what specific things are in this exporter, but if it was a Java one, it's heap sizes and all that kind of stuff. Everything that you normally would be able to monitor around something like that without having to design it yourself. So the people that write these exporters and manage these exporters and maintain them are trying to make life as easy as possible for people who just want to kickstart it and see what they like. That's usually how you start anyway, just so you can see what's available, and then you can start trimming stuff down if it's too much for you. Because one of the things that gets out of control really quickly is that it's very easy to collect a lot of data. And if you're doing this in the actual cloud, data going in and out is like water through the pipe: it's money, you know. And so you need the ability to see it, which is one of the big things that Chronosphere is good at, using a control plane to see your data coming in and out and tell you when you're not using your metrics. It turns out that, across our customer base anyway, on average 60% of the metrics that are collected are not used. And that's not a bad thing at this point in time. It's just a big chunk of your bill you don't really want to be paying, right?
And so the first thing you start looking to do is, how can I trim this down, or at least stop it from being stored? You do want to ingest everything, because you might need it in the future. But it's not a problem to not pay for storage if you're not using it. If it doesn't show up in a dashboard, it doesn't show up in a query or an alert or anything like that, no ad hoc queries from the user, nobody's touching it, what are you doing? So that's kind of what you run into really quickly when you use a standard exporter or a standard library that just spits out everything. But for us developers, that's a nice way to start, right? And then you say, okay, I only want to know CPU. I only want to know how much memory it's using. I only want to know whatever. And so then you go back and you start instrumenting it using code, you know, going actually in there and saying, I want this, this and this, and the rest I don't need. Redeploy the thing. There you go. Now we have it trimmed down to just what I want. And that's the stuff that you're querying to create your dashboards. So I'm saying I want to see a chart that shows me how much memory consumption is going on in the last five minutes or in the last 10, or in the last day or whatever it is. By default, it might be the last hour, but you can expand that and drill down in it, cut and slice and dice any way you want inside that stuff. So that's kind of the evolution you go through to get to something that makes you happy. And trust me, when you start getting somebody carrying the beeper, you're going to start seeing things in there that also include documentation in your dashboard. So there might even be playbooks or runbooks that they have: when certain things happen, go down this path and do this and call this person, alert that person, look for a feature flag that got changed, maybe there was a new deployment. Oh, yep. Definitely reverse that stuff. That kind of stuff.
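The "nobody's touching it" check described above can be approximated with a very crude sketch: collect the metric names being ingested, collect every query used by dashboards and alerts, and report the names that never appear in any query. This Python example is only an illustration of the idea, not how any vendor's tooling works; the metric names and queries are made up, and the tokenizer does not understand PromQL (for instance, it would also match a metric name that appears inside a quoted label value).

```python
import re

# Matches identifier-like tokens, which includes metric names,
# function names, and label names.
TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def unused_metrics(collected, queries):
    """Return collected metric names that no dashboard or alert query mentions."""
    referenced = set()
    for query in queries:
        referenced.update(TOKEN.findall(query))
    return sorted(set(collected) - referenced)


collected = [
    "http_requests_total",
    "process_cpu_seconds_total",
    "jvm_gc_pause_seconds",
    "jvm_buffer_pool_used_bytes",
]
queries = [
    'rate(http_requests_total{code="500"}[5m])',
    "process_cpu_seconds_total > 0.9",
]
print(unused_metrics(collected, queries))
# ['jvm_buffer_pool_used_bytes', 'jvm_gc_pause_seconds']
```

The names that come back are the candidates for trimming from instrumentation, or at least for dropping before storage.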
Host
Yeah, that makes sense. Okay, so just to make sure I'm understanding the lifecycle: first you build this exporter, which you can probably start with just a package off the shelf that's sending a bunch of data. Prometheus is then going to start tracking that in time series. You look at that and you can start building your dashboard by setting up these queries against it. And I think one of the things you highlighted there that is kind of interesting and might be worth exploring is that what time series enable is querying over ranges. And I did see that there looked like there's a core distinction: you can do sort of instantaneous, moment-in-time queries, and you can also do range queries. Maybe we want to look at what differences there are there.
Eric Schabell
You studied hard.
Host
Do what I can. So, all right, you build your dashboard on that. Maybe you set up some alerts on that. Question I actually have is, is there any difference between the queries you're doing for dashboard versus Alerts? Does Prometheus have native alerting support or are you querying that?
Eric Schabell
It's a separate binary that you can install, but it definitely has an alert manager. The interesting part about this, and I ran into this quite a bit when I first built the workshop, is you tend to forget that it's in memory, but it's also writing it to a little file system in your directory there. So every time I would adjust something and restart, you know, and things like that, I'd go look at it and it would start a new graph. You know, it has standard graphs. I'd be querying, say, uptime. You know, you can just do up and it'll show you all the instances that you're tracking. Are they on or are they off? And you'll see this little thing going. If you query something else that happens to have a counter that's running or whatever it is, it'll, you know, show you whatever the graph is. And it's really hard to get interesting stuff when you've just started. So you've got to remember that your student is out there doing this. He just started this up on his machine. And his graph starts with an hour with this little blip in the corner. And you basically have to get him to go over here and turn this thing into, like, a minute. And then it starts looking like something. But it's very different than what I have that's been running, because I know this. I let it run for half a day and then I start working on whatever I'm going to give you an example of. Otherwise we have no nice examples. Also, what's really kind of freaky is you'll go away, come back and redo stuff, and then there'll be stuff with a big blank spot and then stuff way back there and stuff way over here. And if your query doesn't span all that, you won't see it all until it does. And so you're slicing and dicing stuff that is long dead and long gone, but is still in your database because it collected that time series when it was alive.
So some container that was running or some instance of your service may not be there anymore, but you're getting what feels like false positives, you know. You're like, hey, wait a minute, this one isn't running. Well, no, you're looking at legacy data there.
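The "legacy data" effect described above follows from the difference between an instant query and a range query. A simplified Python sketch, assuming plain lists of samples: an instant query only considers samples within a short lookback window before the evaluation time (Prometheus's default lookback is five minutes), while a range query returns everything stored in the requested span, including series from instances that no longer exist. Real Prometheus also uses staleness markers, which this sketch ignores; the pod names and timestamps are hypothetical.

```python
LOOKBACK = 300  # seconds; mirrors Prometheus's default five-minute lookback


def instant_query(series, at):
    """Latest value per series, ignoring series with no recent sample."""
    result = {}
    for name, samples in series.items():
        recent = [(t, v) for t, v in samples if at - LOOKBACK < t <= at]
        if recent:
            result[name] = recent[-1][1]
    return result


def range_query(series, start, end):
    """All samples per series that fall inside [start, end]."""
    result = {}
    for name, samples in series.items():
        hits = [(t, v) for t, v in samples if start <= t <= end]
        if hits:
            result[name] = hits
    return result


# "pod-a" stopped reporting long ago; "pod-b" is the current instance.
series = {
    "pod-a": [(0, 1), (60, 1), (120, 1)],
    "pod-b": [(3600, 1), (3660, 1)],
}
print(instant_query(series, at=3660))               # only pod-b
print(sorted(range_query(series, 0, 3660).keys()))  # pod-a and pod-b
```

A dashboard panel with a wide time range is effectively the second function, which is why an instance that is long gone can still show up on the graph.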
Host
Yeah, that is kind of an interesting question, especially as we talked about, if you're changing your metrics that you're collecting or trimming them down, like, how do you deal with versioning in this type of metrics database system?
Eric Schabell
Versioning of what exactly?
Host
I mean, the example you use, right, Okay, I had it running and then there's this dead spot and something changed. Maybe that dead spot is because I took things down and I'm actually changing my collector a little bit and now I have a new set of data from this point forward.
Eric Schabell
Yeah, this is a little bit more of a development environment we're talking about. So you should never see this in production. You know, a blank spot means it all went south. That means you were offline, your retail store is selling nothing. You know, people can't get to your website, that kind of stuff. It does happen, but that's not really a good sign. You do everything you can not to do that. Right. So when you do updates or new versions of whatever, it's a rolling thunder kind of thing. Right. That's why we're in the cloud native environment. So you can bring up a new instance and then take down the other one once the new one has taken over the traffic. A really good example is how people try to take care of this. We haven't really got that far yet, but when you start scaling Prometheus, one of its weaknesses, I think, is that it was not built for high availability. That's not really part of the design. So what that means is if I have one instance collecting all my stuff and I have dashboards and I have alerting attached to that, and it gets too much traffic, it'll overload and die. Everything dies. No dashboard, you know, you can't query it anymore. You know, everything kind of goes bad. And if you try to load balance that with another instance, you can't put one alerting and one dashboard behind it, because an alert will go off on one of these and it won't see the other one. So then you have to do two alerting setups. You see this scaling starting to spread out, and we're not even talking about putting a database behind it. You see this spread out into this weird world of incredibly complex topologies to try and load balance all this stuff. And one of the things that you saw was that you could set it up as: I have an instance running, I have another instance on the service, and then I have a third instance on a service, and if traffic starts really flooding this service, it'll spin off its own instance just to cover that, to deal with that heavy load and keep the other ones running.
And that's really nice. But the new one you spun up has no history beyond the point that it came alive. And the other ones lose all history from the point that the other one took it over. Unless you know that in your dashboards and can account for that and query that together into the one dashboard. You know what I'm saying? So there's a real complex problem you start juggling when you're doing this by hand.
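The history gap described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Prometheus or any vendor stitches data together: the sample lists, the `merge_instances` and `find_gaps` helpers, and the 15-second scrape interval are all hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def merge_instances(older: List[Sample], newer: List[Sample]) -> List[Sample]:
    """Union two sample lists by timestamp; the older instance wins on overlap."""
    merged: Dict[int, float] = dict(newer)
    merged.update(dict(older))
    return sorted(merged.items())


def find_gaps(samples: List[Sample], scrape_interval: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs where neighbouring samples are more than one interval apart."""
    gaps = []
    for (t_prev, _), (t_next, _) in zip(samples, samples[1:]):
        if t_next - t_prev > scrape_interval:
            gaps.append((t_prev, t_next))
    return gaps


# Hypothetical data: the first instance stops at t=30, the new one starts at t=60.
first_instance = [(0, 1.0), (15, 1.2), (30, 1.1)]
new_instance = [(60, 1.4), (75, 1.3)]

combined = merge_instances(first_instance, new_instance)
gaps = find_gaps(combined, scrape_interval=15)
```

Queried on its own, neither instance has the full picture; only the combined view shows the whole series, along with the window in which nothing was collected. That is the bookkeeping a dashboard has to do by hand in this kind of setup.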
Host
Let's maybe actually dive down that a little bit. Because, you know, one of the desirable perks of going cloud native is if you have that big viral hit, you go, and it's relatively straightforward to scale, right? You're not dealing with, oh, I have to figure out, how do I set up a new server? You're like, okay, take this, this service and this set of pods and scale it up, go. What happens to Prometheus as you do that? It sounds like there's some amount of automatic failover or trying to scale in there. But like, how does that end up playing out?
Kevin Ball
You can define when something gets... like I said, you can spin up a new instance with it, but the high availability is not built in. There's not a feature, there's not a function you can turn on that accounts for that. So you're getting another non-highly-available instance of Prometheus that starts from that moment onwards, collecting data for whatever it's monitoring. And you're applying your own tricks of the trade to spin this stuff up and to automate that kind of thing. This is where vendors start becoming interesting, right? Because you can already smell and feel, if you're any kind of a software developer or a person that's had to manage stuff like this, that now I'm starting to do a lot of proprietary work myself and it's going to take more hands and more people to maintain this. And I always do this with, like, imagine your DevOps people all over here: your whole team is on the right side of the room doing everything they're supposed to be doing around developing whatever your business is doing, say a retail site or whatever, and managing all the services and having a good time. And then slowly but surely about half the team is on the left side of the room managing your very successful environment, which is now scaled way up and has a whole bunch of metrics, monitoring going on and all that kind of stuff. Wouldn't it be nice to get half those resources back on the right side of the room doing what you want to do and not messing around managing the infrastructure anymore? And that's where vendors start coming into play, where you're happy you went down the open source road and you can unplug and reroute your stuff directly into something that looks an awful lot like what you've been doing. You recognize the query language, you recognize the protocols being used. Your dashboards, you know, those efforts are not lost; they're easy to replicate wherever you land in something like that, and you take the management out of it.
And with most of these vendor platforms, like some of the stuff I just described, the control plane from Chronosphere is a big, big help that's basically quantified all that for you and lets you just concentrate on what you really want to concentrate on.
Host
So let's maybe talk about that then. What do you get if you're ready to move from Prometheus to Chronosphere or another vendor platform? Actually, maybe even just, like, how do you know? Is it when you start seeing that complexity? Is it when you first get a spike that overwhelms your Prometheus? Like, what are the signs that maybe you're outgrowing the manage-it-yourself infrastructure?
Kevin Ball
There's several things. They're pretty classic in almost any open source environment, right? So there's a really funny marketing kind of story that goes around where you say "killing your heroes." Who hasn't run into someplace where they've worked where there's one, maybe two people that are like the big rock star guys that know it all? Might even be girls, doesn't matter. But I mean they know everything, right? They were there since whenever. They know where all the bodies are buried. And what happens when they leave, you know what I'm saying? Those are usually the ones that are pretty core to running a complex, highly, you know, scaling-up open source environment. Some places are really happy to do it. I know both our founders, our CTO and our CEO, were at Uber in the beginning and spent the first, I think it was, three years running it. They built the M3 database out from scratch, and that's a time series database, and set up the whole infrastructure there for Uber and ran it for three years. He did an article not too long ago about how, I think, they have 400 or 450 engineers or something like that who are just doing the infrastructure for the observability. I mean, who does that?
Host
Apparently they do.
Kevin Ball
And yet I know why they do it, because he said how much they would have to pay if they came over and did it at ours: it would have been like $65 million. So you're like... we all saw some of the leaked information about some of the customers that were on somebody's yearly earnings calls, and you're like, with those kind of numbers you can pretty much run a pretty nice department, you know. And I've worked in universities where money they didn't have but time they had. So they didn't care how long it took you or how many hands were involved or whatever was going on to manage the open source stuff. It just couldn't cost anything. And I think you get to the point in time where, you know, you're a CIO or somebody that's responsible for these organizations, head of the central observability team, or you're the head of the SREs, and you just want your guys focused on what they've got to focus on. You're seeing the burnout, you're seeing the stress, you're seeing the too many incidents, things like that. And it's often not hard to figure out where you can start cutting costs and where you can take some of the load off, right? And I think people are getting a lot better at it. You see the conferences we go to and you hear the talks, the examples from the organizations that are setting up pretty good observability teams and environments. And they're doing it at big scales. I mean, good lord, look at the DoorDashes and things like that. They're doing it at a mega scale and they're not running a team of 450 observability guys. That's not for everybody. You know, I don't think there's any one specific thing, but we've all been in the environments where you're just firefighting too much. Your team is doing stuff that they don't want to do. I used to always make kind of jokes about it when DevOps first started coming out, because I was like, I signed up for dev, you know, I didn't sign up for ops. You'd hear that around.
And if you're not careful... even now, if you look on the Internet and kind of Google around, all these developer reports that come out about what they're doing and what languages they're using, all this stuff, I think they say it's like 35, 36% of your time that is spent on actually coding. Think about that. That's barely a day in the week. It's like, we did our own research and stuff, and 10 hours a week were spent on this kind of observability problems. That's crazy. Out of 40, you know, come on. Is that what you signed up for? And that's why people leave. You know, if I want to be a developer, I want to be a developer. I don't want to be a troubleshooter the whole time, and I don't want to carry a beeper that's going off all the time when I'm trying to have Christmas dinner and things like that. So let's be honest. It's a complicated thing. It's not easy to run all these very complex things that we're doing.
Host
Yeah, well, I feel like what you're describing here is there's sort of a curve where you start out and you have more time than money. Maybe you're in a university or you're like in a cash strapped startup environment or something like that. And you're like, okay, great, open source, get it going right? Or just a small environment. It's not a big deal. You get to a point where you are hitting the limits of what open source gets you easily.
Kevin Ball
Right?
Host
For Prometheus, that might be, it sounds like, when you go from one instance to having two, and now you've got to navigate all of these questions, like, am I sharding versus am I just replicating, and all the different federation questions. And maybe we can go into a little bit of detail on some of those, though you've covered some already. And you say, okay, hopefully by the time you hit that, you actually have a little more money in the bank and you can pay for a service like Chronosphere to take care of it for you. And I do want to put a pin in this and come back to what migration looks like if I, you know, do that. And then at some point, though, once again, you get to the point where you're so big that the costs of paying for the service are high, but you also have so much money you can pay for a whole department to manage it, and then maybe you migrate back out. Let's kind of maybe look at that migration path.
Kevin Ball
Well, I think you touched on a really good one. I didn't want to let it slip away. I think the big part of open source that we're always chasing is the open standards. As an architect, whether you're designing apps or designing infrastructure, you want to have the ability to stand the test of time as much as possible. And you also want to have components that can be replaced by another component, but are still speaking the same standardized language. When people in all these organizations take enough time and effort to create a standard, whether it's TCP/IP or whether it's an observability protocol like the OpenTelemetry protocol, and you've chosen that road, that means... and I think containers is a great one. I mean, Docker in the early days owned that, right? They had their thing. They were the big cat on the block. And when they started getting approached about, we need to standardize this kind of stuff, they didn't really want to hear it. And so what happens is the open source world gets together, and it's a bunch of companies too, you know, but I mean, it's all these people contributing. They sit down, they write the OCI, the Open Container Initiative, and now you can write any engine you want against the containers. Now you have Podman, now you have Docker, now you have whatever you want that's coming down in the future. It's all standardized, right? Kubernetes, standardized. YAML, standardized. You want to have these kind of tools. And I think that is what you're doing and what you're positioning yourself for by watching the CNCF, those kind of projects in the observability space: Prometheus, OpenTelemetry, the Jaegers, whatever you're using. They're, generally speaking, doing their best to try and not, you know, tie your stuff into a knot. And what that means is, when you're ready to actually migrate to something, it should be relatively painless, right?
And you're going to find out really fast if a vendor is compatible or not. You know, one of the things we do is a pilot, so we show it to you. You get a trial run, you get to take your environment, put it in ours and plug it in and see what happens. I've just seen some really neat stuff happen when they do that. People are finding things they didn't find before. They're figuring out that they have so many metrics coming in that they didn't even touch. You know, things like that. It's because we're all so busy trying to do the day-to-day that you don't have the time to step back. And who hasn't been in those positions where you'd love to step back and do some real strategic stuff and you just don't get the time?
Host
We've talked a couple times about, okay, you need all these different pieces to have your observability solution. You need metrics so you're keeping track of what, at a high level, is going on with your machine. You need logs and tracing so you can dive deeper. And in the open source world, you might have spun up some of each of these on your own. Maybe you're spinning your logs out to Kibana, and you've got your tracing going with Jaeger so you can do that, and you've got Prometheus covering your metrics. If you're moving to something like Chronosphere, can you pull all of that under one umbrella?
Kevin Ball
Pretty much, yeah. That's what you're trying to do. And it's not necessary either. It's kind of a funny thing. So say that your organization has been experimenting with this stuff as you go, right? And you might be big on the OpenTelemetry ecosystem. It's just the thing you've bought into. It's what you've seen. You've got the collectors out there already, fine. But you also have some legacy stuff over here in the corner that is kind of a problem, right? It's expensive and coming due and maybe you want to get off of it. There's tools like Fluent Bit, a very lightweight collector, a telemetry pipeline. Basically "anything in, anything out" is their catchphrase. And so they have inputs and exporters on both sides. Doesn't matter where you get it. And it's very lightweight. It's written for the cloud native stuff. It's a sub-project of Fluentd, you know, which was for the monolithic stuff, APMs. The Fluent Bit thing can go out there, and you can even go to edge cases and really lightweight. It's able to handle high volumes and high streams, and it's really quick at processing all this stuff, and it lets you, on the edge already, take down the volume of stuff that's coming in. They can expose it as a metrics endpoint so Prometheus can go scrape it, they can pass it on to a logging backend, they can pass it on to OpenTelemetry. You know, they can use the forwarding protocol they have from Fluent Bit, or they can turn it into an OpenTelemetry envelope that the collector understands and pass that off to it. So the infrastructure you already have in place can be maximized. But for what's out there that isn't able to yet, you can put something like a Fluent Bit in front of it and easily kind of work around the fact that it's not the right protocol yet. And then work on that on your own time and hopefully get rid of it before you've got to renew. You just unplug all that stuff and you're onwards with your OpenTelemetry.
The OpenTelemetry collectors can just be redirected to another destination, right? Which could be Chronosphere and whatever cloud that you need, or it could continue to be some back end you have, or whatever the deal is, it's quite flexible. I think that kind of covers the whole roundabout question.
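The "anything in, anything out" idea, with volume reduced at the edge before fan-out, can be sketched as a toy router. This is not Fluent Bit's or the OpenTelemetry Collector's API: the `run_pipeline` function, the record shape, and the output names are hypothetical, chosen only to show the filter-then-route pattern.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def run_pipeline(
    records: Iterable[Record],
    keep: Callable[[Record], bool],
    outputs: Dict[str, Callable[[Record], bool]],
) -> Dict[str, List[Record]]:
    """Filter records at the edge, then copy each survivor to every matching output."""
    routed: Dict[str, List[Record]] = {name: [] for name in outputs}
    for record in records:
        if not keep(record):
            continue  # dropped before it ever leaves the edge
        for name, matches in outputs.items():
            if matches(record):
                routed[name].append(record)
    return routed


# Hypothetical incoming telemetry from mixed sources.
incoming = [
    {"type": "log", "level": "debug", "msg": "cache warm"},
    {"type": "log", "level": "error", "msg": "payment failed"},
    {"type": "metric", "name": "http_requests_total", "value": "42"},
]

routed = run_pipeline(
    incoming,
    keep=lambda r: r.get("level") != "debug",  # trim noisy records early
    outputs={
        "metrics_endpoint": lambda r: r["type"] == "metric",
        "log_backend": lambda r: r["type"] == "log",
        "otel_collector": lambda r: True,  # everything that survives the filter
    },
)
```

Redirecting to a different backend in this sketch means changing one entry in `outputs`; the sources and the filter stay untouched, which is the flexibility being described.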
Host
No, that makes a ton of sense. So the one thing that I did want to come back to briefly is we touched a little bit on what happens as you start to scale up, these decisions around, okay, in Prometheus maybe you're deciding, am I doing failover? Am I sharding things? What do I need to deal with to scale? Is there a federation type of thing? Maybe we can talk a little bit about what that looks like in the open source world and then bringing that into Chronosphere. Does that whole problem just dissolve, or are there still things to think about even if you're using a managed solution?
Kevin Ball
So when you have to start taking decisions around how you're sharding backend databases, or storage I guess you should call it, because time series is not exactly the same thing, that's you managing your infrastructure. And I don't know what you think, but when I start hearing sharding and stuff like that, and high availability and load balancers, it sounds like we're getting complicated. I mean, I'm not running that infrastructure, but it sounds like it's starting to get hard. The whole idea of a managed service is that you don't really care. You know what I'm saying? It's not that you don't care, but one of the things that we spend a lot of time on is working with the customer and trying to give them the most optimized whatever we can give them. There was a lot of effort a couple years ago, when everybody was kind of cutting back their spending in the marketplace, to optimize storage and transmission and collection and all this different kind of stuff. I think those are the kind of things you're happy to look at, but you don't want to spend your time on that, right? That's the reason you try to get off of the infrastructure you're managing yourself. And the bill is important. Monitoring the bill and providing insights into what my consumption looks like, what teams are using it, being able to balance that kind of stuff. You're starting almost to get into the FinOps kind of view of what's going on in your organization, right? Financial operations. And those things are all baked into a more mature observability platform, and it can include pipelines, telemetry data, tracing, logs, whatever. Events, all kinds of stuff get integrated into that. The look and feel shouldn't be so dramatically different, which is what's nice about coming into something like Chronosphere. I think there's an awful lot you're going to recognize from the open source world. It's the same, you know, query language. It's the same idea of what you're looking at.
Dashboards are relatively the same. It's an experience you're looking for. How do I get my stuff set up as quick as possible? How am I able to dissect stuff? There's definitely things under the hood that you will not find in, you know, the younger open source environments. It's stuff that you might want to write yourself, but I mean, we're talking about serious organizations before you get to that kind of level. We have various features like that that take you right down to the problem in a couple of clicks. We've got some great quotes from customers when they tried it. You know, I'm not really trying to sell you anything here, but it's just that that's what the maturity looks like, and that's the difference between, you know, doing it on your own and seeing it scale up and starting to have problems. And to be really honest, everybody's environment's way different, right? Everybody has their own specific problems, their own specific legacy stuff and their own specific issues. We have some customers that have discovered that they use less than 10% of their ingested telemetry data. That's extreme, but that's a use case where they're very much focused on something specific and that's the only thing they monitor. And fair enough, if that's the case, that's the case. But they do it at massive scale, and so that leads to a lot of, you know, automatic ingestion of garbage, basically, for them. But even if you can just halve it, right, get half the chaff out of the way, that's gotta be an ROI you're interested in. And we show that. You get to actually run it for a couple weeks and see what it looks like in a pilot. It's not to alleviate fears and stuff; it's to show what it looks like in your environment, what it can do. We try to use real data in real environments. It's not meant to be just a test bed. So that's kind of the experience that you're looking for, right?
Can I get off of what I'm doing and stop thinking about sharding and stop thinking about, oh my God, there's another instance and another one and another one and I got beeped again.
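The "less than 10% used" observation comes down to comparing what is ingested with what anything actually reads. Below is a minimal Python sketch under stated assumptions: the metric names are invented, `unused_fraction` is a hypothetical helper, and a real analysis would weigh data volume rather than just counting metric names.

```python
from typing import Set


def unused_fraction(ingested: Set[str], referenced: Set[str]) -> float:
    """Share of ingested metric names that no dashboard or alert refers to."""
    if not ingested:
        return 0.0
    return len(ingested - referenced) / len(ingested)


# Hypothetical inventory: ten metric names ingested, one actually queried.
ingested_metrics = {f"service_metric_{i}" for i in range(10)}
referenced_metrics = {"service_metric_3"}

ratio = unused_fraction(ingested_metrics, referenced_metrics)
print(f"{ratio:.0%} of ingested metric names are never queried")
```

The set of referenced names would in practice have to be collected from dashboard definitions and alerting rules, which is the part that takes the time teams rarely have.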
Host
Yeah, no, absolutely. It's. How do I focus on getting the value out of this thing, not just all the work to keep it running.
Kevin Ball
I was going to say one of the things we often talk about is like, nobody would care what it cost if you had better customer experience, if I had better on call experience, if I had happy engineers, if I had more money in the bank and less downtime. Right. And very often that's not the case. It's just a bucket load of money going somewhere and everything is all a mess.
Host
You know, I heard somebody say at some point, they said, you know, data is growing exponentially. But I have yet to find the company whose data budget is also growing exponentially.
Kevin Ball
Yes, yes, that's a really good example. Yeah. And we wouldn't care if you're using it, but you're just not using it. That's the problem.
Host
So every time I have one of these conversations, I come away being like, holy smokes. All the different things I learned. And hopefully, hopefully folks listening along also have that sense of like, I'm coming away smarter than I was an hour ago.
Kevin Ball
And I would say, if you go take a look at all the workshops we have, you can get hands-on, from zero to installing it, to learning about Fluent Bit, the Perses project, Prometheus, or OpenTelemetry. We have all of that online for free.
Host
Awesome. Well, thank you, Eric.
Kevin Ball
You're very welcome.
Software Engineering Daily: Prometheus and Open-Source Observability with Eric Schabell
Release Date: April 15, 2025
In this insightful episode of Software Engineering Daily, host Kevin Ball engages in an in-depth conversation with Eric Schabell, Director of Community and Developer Relations at Chronosphere and a CNCF Ambassador. The discussion centers around Prometheus, an open-source observability tool, and the challenges and solutions associated with scaling observability platforms in dynamic, cloud-native environments.
[01:35] Eric Schabell: "Hey, thank you very much. Nice to be here."
[02:19] Eric Schabell:
Eric introduces himself as the Director of Evangelism at Chronosphere, emphasizing the fluid and evolving nature of his role within the observability space. He highlights Chronosphere's commitment to open standards and cloud-native solutions, positioning the company as a key player in managing and scaling observability infrastructures.
[04:00] Eric Schabell:
Eric delves into what makes Prometheus a standout tool in the observability landscape. Originally developed by SoundCloud, Prometheus is praised for its high performance, scalability, and unobtrusive nature. He explains the pull-based data collection model, where Prometheus scrapes metrics from various endpoints at configurable intervals without relying on agents or collectors.
Notable Quote:
"Prometheus is designed to be highly performant and scalable, making it ideal for dynamic cloud-native environments where traditional monitoring tools fall short." — Eric Schabell [04:00]
He contrasts time series metrics with logs and traces, underscoring Prometheus's specialization in handling continuous, high-volume data streams essential for real-time monitoring and alerting.
[09:22] Eric Schabell:
Eric breaks down the differences between metrics, logs, and traces. He emphasizes that while logs and traces provide valuable information for debugging and understanding service interactions, metrics are crucial for measuring the ongoing performance and health of systems. He highlights the challenges of managing time series data, particularly the issue of cardinality explosions—where an excessive number of unique metric labels can overwhelm storage and querying capabilities.
Notable Quote:
"In a time series database, managing high cardinality is essential to prevent overload and ensure efficient querying." — Eric Schabell [09:22]
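A rough way to see why cardinality explodes: each unique combination of label values is its own time series, so the worst case is the product of the number of distinct values per label. The sketch below assumes every combination actually occurs, which makes it an upper bound; the label names and counts are made up for illustration.

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical labels on a single request counter.
labels = {"method": 4, "status": 5, "pod": 50}
baseline = max_series(labels)

# Adding one unbounded label, such as a user ID, multiplies everything.
with_user_id = max_series({**labels, "user_id": 10_000})

print(baseline, with_user_id)
```

A single high-cardinality label multiplies the total rather than adding to it, which is why trimming labels tends to matter more than trimming metric names.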
[20:19] Eric Schabell:
Eric outlines the typical lifecycle of deploying Prometheus for service monitoring.
He also discusses the importance of trimming unnecessary metrics to optimize storage costs, noting that on average, 60% of collected metrics may go unused.
Notable Quote:
"Once you start collecting data, the challenge shifts to managing and optimizing what you store to avoid unnecessary costs." — Eric Schabell [20:19]
[26:03] Eric Schabell:
Scaling Prometheus introduces significant complexities, particularly around high availability and data sharding. Prometheus was not initially designed for high availability, making it difficult to manage failover scenarios without incurring additional overhead. Eric points out that scaling often leads to intricate topologies that require manual intervention to handle increased load and ensure data consistency.
Notable Quote:
"Prometheus doesn't have built-in high availability, so scaling it requires managing multiple instances and dealing with complex topologies." — Eric Schabell [26:03]
[30:51] Eric Schabell:
To address the scaling challenges of Prometheus, Eric advocates for transitioning to managed observability platforms like Chronosphere. He explains that such platforms handle the underlying complexities of scaling, high availability, and cost optimization, allowing engineering teams to focus on developing their applications rather than managing observability infrastructure.
Notable Quote:
"Managed platforms take over the heavy lifting of scaling and maintaining observability tools, freeing your team to concentrate on what they do best." — Eric Schabell [30:51]
[35:52] Eric Schabell:
Eric discusses the migration process from self-hosted Prometheus to Chronosphere. Emphasizing open standards and compatibility, he assures that migrating should be relatively straightforward due to shared protocols and query languages like PromQL. He highlights the flexibility of Chronosphere in integrating with existing telemetry pipelines, ensuring that organizations can seamlessly transition without significant disruptions.
Notable Quote:
"Because we adhere to open standards, migrating to Chronosphere from Prometheus can be done with minimal friction, ensuring continuity and reliability." — Eric Schabell [35:52]
[38:42] Eric Schabell:
Eric explains that platforms like Chronosphere aim to unify various observability signals—metrics, logs, and traces—under one umbrella. This consolidation simplifies the observability stack, providing a cohesive view of system performance and health. He underscores the importance of flexibility, allowing organizations to integrate legacy systems while adopting modern telemetry standards like OpenTelemetry.
Notable Quote:
"Consolidating metrics, logs, and traces into a single platform enhances visibility and simplifies troubleshooting across your entire infrastructure." — Eric Schabell [38:42]
[44:47] Eric Schabell:
Cost is a critical consideration in observability. Eric emphasizes that unmanaged, self-hosted solutions can lead to exponentially increasing data costs without proportional value. Managed platforms like Chronosphere offer cost optimization features, enabling organizations to monitor data usage, eliminate unused metrics, and maintain control over their observability expenses.
Notable Quote:
"Without proper management, your observability costs can spiral out of control. Managed platforms provide the tools you need to keep expenses in check while maintaining comprehensive monitoring." — Eric Schabell [44:47]
As the discussion wraps up, Eric encourages listeners to explore Chronosphere's workshops and resources to gain hands-on experience with observability tools and practices.
[45:35] Eric Schabell:
"If you explore our workshops, you can gain practical experience with installing Prometheus, configuring Fluent Bit, and leveraging OpenTelemetry—all available online for free." — Eric Schabell [45:35]
[45:49] Host:
Kevin thanks Eric for the insightful conversation, leaving listeners with a comprehensive understanding of Prometheus, the challenges of scaling observability, and the benefits of managed solutions like Chronosphere.
For those interested in enhancing their observability practices, exploring Chronosphere's resources and workshops is a valuable next step.