
Abdel Sghiouar
Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.
Kaslin Fields
And I'm Kaslin Fields.
Abdel Sghiouar
In this episode, we spoke to Karthik Satchitanand. Karthik is a principal software engineer at Harness and co-founder and maintainer of LitmusChaos, a CNCF incubated project. We talked about Chaos Engineering, the Litmus project, and more.
Kaslin Fields
But first, let's get to the news. Kubernetes 1.31, codename Elli, is released. The detailed blog with enhancements, graduations, deprecations, and removals can be found in the show notes. And don't forget to listen to our episode with the release lead, Angelos Kolaitis.
Abdel Sghiouar
The schedule for KubeCon + CloudNativeCon North America 2024 is announced. As a reminder, the event will take place in Salt Lake City, Utah, between November 12 and 15.
Kaslin Fields
Score has been accepted by the CNCF as a sandbox project. Score is an open source, platform-agnostic tool that allows you to write an application configuration using the Score spec in YAML format and then convert it into a deployable manifest using one of the supported Score implementations. Currently, the implementations include Docker Compose, Kubernetes, Helm, Cloud Run, and Humanitec. And that's the news.
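For a rough sense of what that looks like in practice, here is a minimal Score file sketch modeled on the project's hello-world example; the workload name, image, and port numbers are illustrative placeholders rather than anything from this episode.

```yaml
# score.yaml - a minimal Score workload spec (illustrative values)
apiVersion: score.dev/v1b1
metadata:
  name: hello-world
containers:
  web:
    image: nginx:latest    # the container to run
service:
  ports:
    www:
      port: 8080           # port exposed by the generated manifest
      targetPort: 80       # port the container listens on
```

A supported implementation such as score-compose or score-k8s then converts this one file into a Docker Compose file or a Kubernetes manifest, respectively.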
Abdel Sghiouar
Today we're talking to Karthik. Karthik is a principal software engineer at Harness. He is the co-founder and maintainer of LitmusChaos, a CNCF incubated project. Karthik has worked closely with the cloud native ecosystem, first with OpenEBS, a Kubernetes storage solution, and now with LitmusChaos. In the last six years, you also co-founded something called Chaos Native, which was acquired by Harness. Welcome to the show, Karthik.
Karthik Satchitanand
Thanks, Abdel. It's great to be here on the Google Kubernetes Podcast. Really looking forward to this conversation.
Abdel Sghiouar
Awesome. Thanks for being with us. So I guess we'll have to start where this conversation has to start, because we're going to be talking about LitmusChaos, and Chaos is in the name. So I assume that means Chaos Engineering.
Karthik Satchitanand
Yes, this is about Chaos Engineering. I'm sure all of you know about Chaos Engineering already. It's been around for more than a decade and a half, I should say. It's really become popular, and it's become a little mainstream over the last few years, and there are a lot of projects in the CNCF landscape that are around Chaos Engineering. So yes, LitmusChaos was one of the first Chaos Engineering projects that was accepted into the CNCF. We were one of the earliest projects to get into sandbox and then incubating status. And the community around the project has really grown over time. A lot of great feedback, a lot of releases that were led by the community. So I think it's really brought about a change in how people look at Chaos Engineering, especially for cloud native environments. So yeah, I think it's been a great journey so far.
Abdel Sghiouar
Awesome. So, because, you know, our audience is quite diverse in terms of their experiences, and we receive a lot of feedback about people saying that sometimes we go a little bit too deep and sometimes we are too high level, I want to start a little bit high level. What is Chaos Engineering, for those who don't know what it is?
Karthik Satchitanand
Okay, so Chaos Engineering, you know, the standard textbook definition is: it's the process of testing a distributed computing system to ensure that it can withstand unexpected failures or disruptions.
Abdel Sghiouar
Right.
Karthik Satchitanand
There is a principlesofchaos.org website that was put together by the initial pioneers of chaos, Netflix, Amazon, etc., which gives you more details about the principles around Chaos Engineering: how it should be carried out, what a typical Chaos Engineering setup, or the practice of Chaos Engineering, would look like. They talk about being able to inject different kinds of failures that actually simulate real world events. There's something called Murphy's Law, which you might all be aware of: if there is something that can fail, it will fail at some point. Right? That's the gist of it. So Chaos Engineering is mainly about understanding your distributed system better, how it withstands different kinds of failures, because failures are bound to happen in production, and then also trying to create some kind of an automation around it, because you would want to test your system continuously. So Chaos Engineering is not like a one-off event. Chaos Engineering is carried out as experiments, but it's not like you perform something called a chaos experiment one day and then you're revisiting that after months or weeks. It's something that you would need to do constantly. So there's a need to simulate these failures in a very predictable and controlled way. We are talking about Chaos Engineering and we are talking about unexpected disruptions, but now, and this is really interesting, the experimentation itself is actually a very controlled event. So you scope the blast radius, what you want to cause, and you try and simulate failures, and then you go armed with something called the steady-state hypothesis. So you have an expectation or a notion of how your application should behave under ideal circumstances: what its steady state is, and how much of a deviation from that steady state you expect. That is the hypothesis. So you go armed with that hypothesis and you inject a particular failure. You see how the system is behaving. You see whether that conforms to your expectation, or you learn something new. And sometimes you learn something about the system that needs some kind of fixing. You discover something suboptimal inside of your system. So you uncovered and understood a weakness, and typically you would go back and fix it. It could be a process fix, it could be an actual fix that you're making to your software, or it could be something in the way you deploy it, some kind of a deployment control. It could be any of these things. So you make that fix, and then you repeat the experiments, and you try and take your experiments from a very sanitized, controlled, low-level environment to higher environments. You always have different kinds of environments that build up towards your production: you have various dev environments, your QA environments, your performance test environments, then your staging environments, and eventually your actual production. So Chaos Engineering starts out typically in some of the lower environments, just for you to understand how the experiment itself is carried out. Then over time, you increase the stakes. You do it in an environment that really mimics what happens in production, and then eventually you do the experiment in production itself. This is a very quick introduction to what Chaos Engineering is. It's all about experimenting.
It's all about injecting some kind of failure that simulates a real world event and then trying to understand whether the system behaves as expected or not. That's the essence of Chaos Engineering. And when it was initially conceived, we talked about the Principles of Chaos Engineering that were put together by Netflix and co, and they built it initially. They really advocated its use in production, because that is where, you know, the real value of the experiment is, because that is where you have the system experiencing dynamic workloads. That is where the system has been soaked, right? You have a lot of patches going into production. It has seen a lot of changes. And then you have a lot of real world load coming in. There are a lot of maintenance and upgrade actions going on in your production. So it's really a very complex and dense system. So that is where you get the most value for money when you run the experiment. But that's not where most organizations start today, because an experiment that has gone wrong and inadvertently causes downtime can have a lot of negative consequences. So the chaos is actually exercised in lower environments until you're comfortable on both aspects: doing the experiment in a controlled way, as well as how tolerant your application is to this kind of a failure. You mature on both sides, and then eventually you take it to production. That's how people are doing it today.
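To make the steady-state hypothesis concrete, here is a hedged sketch of how it might be written declaratively, in the style of a Litmus HTTP probe (probes are discussed later in the episode); the service URL and thresholds are hypothetical, and exact field names and units vary across Litmus versions.

```yaml
# A steady-state hypothesis expressed as a Litmus-style HTTP probe (sketch).
# Hypothesis: the checkout service keeps answering 200 while chaos runs.
probe:
  - name: checkout-availability-check
    type: httpProbe
    mode: Continuous               # evaluate throughout the chaos window
    httpProbe/inputs:
      url: http://checkout.shop.svc.cluster.local:8080/healthz  # hypothetical endpoint
      method:
        get:
          criteria: "=="           # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5              # units differ between Litmus versions
      interval: 2
      retry: 3
```

If the probe keeps passing while the fault runs, the hypothesis holds; a failure is the uncovered weakness Karthik describes.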
Abdel Sghiouar
Got it. So you covered quite a lot of things, and I want to unpack one thing at a time. One exercise that I have participated in in the past, that I was thinking about while you were talking: at Google, we call this DiRT, Disaster Recovery Testing. Right?
Karthik Satchitanand
Yeah.
Abdel Sghiouar
And I remember that when I was doing DiRT exercises in the past, there were kind of two types of DiRT exercises. There were, like, real DiRT exercises and simulated DiRT exercises. Right. So real, where you are actually taking stuff down, and simulated is more like you simulate, or you pretend, something went down, just to test the process of recovery. Right. And then the other thing was what you talked about, which is the controlled disaster. So essentially you're not just randomly shutting things down in production. You start with lower environments and then you graduate toward the production environments. Right. I think my question to you is, it feels to me that, Chaos Engineering versus disaster recovery testing, Chaos Engineering is something you will do continuously, right? You will continuously run experiments in kind of an automated way, right?
Karthik Satchitanand
Yes.
Abdel Sghiouar
Instead of like you go, okay, once a week we're going to do our disaster exercise and see how it goes. Or once a year sometimes, right?
Karthik Satchitanand
Yeah, yeah, I think you're right. The philosophy has sort of evolved over time. When Chaos Engineering was initially introduced, it centered around the concept of game days. Game days are these events where all the stakeholders come together. You have the actual SREs that are managing the application infrastructure, you have the developers, you have the support folks, you even have somebody representing the customers. And then you all take a decision to inject a specific kind of a failure at a very small level and then see how things are behaving. You have your APMs primed, you're verifying whether you have the right alerts, you're looking at receiving the right notifications. And if at all things go bad, you know exactly what to revert or what to change to get back to normal. So people used to do Chaos Engineering experiments only as part of game days. But then with the introduction of cloud native, the release times have become 10x or maybe multi-x faster. Right? Everything is independently deployable. You have multiple moving parts and there are a lot of dependencies. You have an entire dependency tree. For example, if you look at Kubernetes, you have a very dense orchestration infrastructure that's sitting on some kind of hosts. And on top of that you have, you know, the actual Kubernetes microservices, you have your container runtimes and things like that. Then you have your actual application dependencies, you have message queues and databases, etc. You have your own application's middleware, and then you have its own backend, frontend, etc. So it's basically a pyramid, and there are so many moving components there, individual deployments that are getting upgraded all the time. And the notion of CI/CD today, the CI/CD ecosystem, allows you to deploy every day or maybe even quicker. When so many changes happen so rapidly, and when there are as many dependencies as there are today, you really need to be testing continuously. So Chaos Engineering went from being this specialized game day model to becoming a continuous event. So there is a concept called continuous resilience. You basically test every time you deploy, or you use Chaos Engineering experiments as a way to greenlight your deployments, as promotions to production. These kinds of use cases are what we see predominantly in the Chaos Engineering world today. And disaster recovery testing, by definition, they are still one-off events. You probably would do them once a quarter or once in a few months, where you could be actually taking down systems, like you said, or you could be simulating the loss of certain systems. You basically cut off access to a specific zone so that everything in the zone is not accessible, or you actually physically take things down, when you shut down, let's say, cloud instances in a specific zone. Both are valid disaster recovery scenarios. It is another matter that you could use the Chaos Engineering tooling of today to carry out the disaster recovery tests whenever you choose to do them. But I think by and large the disaster recovery tests still continue to be one-off events, like you said, whereas Chaos Engineering has moved into the realm of continuous testing.
Abdel Sghiouar
Yeah, that's kind of what I was thinking about when you were explaining Chaos Engineering. And as you said, when I used to be part of these DiRT exercises, it was more one-off big events that a lot of people are aware are happening. Right. So.
Karthik Satchitanand
Right.
Abdel Sghiouar
It was a lot of fun. I had a lot of fun doing these kinds of things. So then, in this context of automation and continuous chaos testing, where does LitmusChaos fit, and what does it do?
Karthik Satchitanand
Yeah. I can give you a little bit of history on how Litmus Chaos came into being.
Abdel Sghiouar
Yeah, sure.
Karthik Satchitanand
In fact, it was this need for continuous resilience and testing that led us to build Litmus. So this was sometime around 2017, and we were trying to operate a SaaS platform that was based on Kubernetes, which was using a lot of stateful components. And one of the things that we wanted to do was test the resilience of our Kubernetes-based SaaS. Every time we released something into our control plane, every time we released something into our SaaS microservices, we wanted to go and test it. And what we had initially was an assortment of scripts to do different things. There was no one standardized way of being able to inject something. So if I had some particular failure intent, this is my failure intent or chaos intent, and this is what I would like to validate when I go ahead and do my fault, and this is how I would like to see the results of my experiments; there was no one standardized way of doing that, because different groups of developers and different teams were testing the services in different ways using different tooling, some of which was actually already called chaos tooling by that time. We already had some open source tools and also some commercial tools that were available at that point in time. We wanted to do all this standardization: how you want to define your chaos intent, how you want to do the hypothesis validation, how you want to see results, how we would attach it to pipelines, and how people would write newer experiments. There should be one standardized API for writing newer experiments. There should be some kind of homogeneity. And we wanted to do all this standardization in a cloud native way, because we were primarily dealing with Kubernetes. So something that, let's say, a Kubernetes DevOps or developer person would understand, something that is storable in a Git repository, something that can be reconciled via an operator, something that conforms to a resource definition in Kubernetes. So we had all these requirements coming together, which is why we built Litmus. When it began, it was just doing failure injection via Kubernetes custom resources. So you would basically define your chaos intent in a custom resource. There would be an operator that would read it, inject the failure, and give you the results in a standard form. The result was also another custom resource which you could read off. That's how we began. But over time, as we moved this into the community, as we open sourced it and learned more about what people want in the space, it grew into an end-to-end chaos platform that actually implements everything that is talked about in the Principles of Chaos. The Principles of Chaos ask you to be able to inject different kinds of failures that correspond to different kinds of real world events. So we built up a huge library of different kinds of faults, and then we added something called probes, which are a way for you to validate your hypothesis. Probes are entities or means of validating certain application behavior. You could be doing some API calls, you could be doing some metric parsing, you could be running some custom commands, or you could be doing some Kubernetes operations, all the standard stuff that you would want to do, which actually tells you or gives you insights about how your application or infrastructure is behaving. We built that framework. Then we also added the ability to schedule experiments. We added the ability to trigger experiments based on certain events.
We added the ability to control the blast radius. How do you isolate your fault? How do you ensure that your failure is getting injected only in a specific namespace, only against a specific application, only for a specific period of time? And how would you define which user has the ability to do what kind of faults, in what environments, for how long? So all this governance and control, we brought that in, and then we also made it easy for people to scale this entire chaos operation from a single portal. So you have the ability to add different target environments into a centralized control plane, and we have the ability to orchestrate chaos against each of those targets the way you want it. So this kind of multitenancy was also built in. And then slowly we went from doing single faults or specific failures to more complex scenarios where you can string faults together in some patterns that you would want. Because when real world failures happen, oftentimes they're a result of multiple components behaving in an unexpected manner. Yes, sometimes there are single points of failure, but many times it's a combination of several events that leads to a big outage. So let's say you want, like...
Abdel Sghiouar
A cascading failure, basically.
Karthik Satchitanand
Exactly. So how do you simulate that kind of a cascading event? So we brought in the concept of workflows, where you can string together different faults at different times. Then also the ability to reuse experiments across teams. We made the experiment essentially a resource. It's a YAML file, so it's a reusable entity. So you create templates out of it, store them in a Git repository, and anyone who has a requirement to create that kind of a scenario can just pick one, tune it with their application instance details, and then run it. All this ability to reuse these experiments. So this entire framework was built over a period of two, three years, and that's how Litmus became really popular in the community. We have users that are using it even for non-Kubernetes use cases as well. Though it was initially built in a very Kubernetes-centric way, we also gave enough flexibility within the platform to use Kubernetes as a base, as an execution plane, while you're doing chaos against entities that are residing outside of Kubernetes as well. For example, you would like to take down something in AWS or GCP; you could still do that. For example, you have a managed service that you're doing chaos against, or you have some kind of a vanilla compute instance somewhere that you want to bring down. You can do these things via the cloud provider specific API calls, but they're all getting executed from inside of a Kubernetes cluster, which has some kind of permissions to do the chaos against your cloud provider. You may be using Workload Identity in the case of Google Cloud, you could be using IRSA in the case of AWS, those kinds of things. So we built that system. While you could reuse Kubernetes as a way to orchestrate the chaos, a way to define the chaos, et cetera, you could still do the chaos inside of Kubernetes or outside of Kubernetes, and that sort of caught on, and that's where we are right now in the community. So Litmus was very useful for us. We built it as something that would aid us in testing our Kubernetes SaaS platform, and over time it acquired a life of its own. And today we have organizations across domains that are using it, people who are majorly users of the cloud or Kubernetes, but they are across domains. For example, there are telco organizations, there are food delivery organizations, folks in medtech, software vendors. There are different kinds of users, including other open source projects, that are leveraging it today. So that's a quick snapshot of how Litmus began, what it does, and where it is today.
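As a rough illustration of the custom-resource approach Karthik describes, a chaos intent in Litmus is declared in a ChaosEngine resource along these lines; this is a minimal sketch, and the namespace, labels, and durations are placeholder values.

```yaml
# ChaosEngine: declares the chaos intent. The Litmus operator reconciles it,
# injects the fault, and records the verdict in a ChaosResult resource.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-pod-delete
  namespace: shop                  # placeholder namespace
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: shop
    applabel: app=cart             # blast radius: only pods with this label
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"          # seconds between pod kills
```

Reading the outcome back is then roughly a matter of `kubectl get chaosresults -n shop`.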
Abdel Sghiouar
One question I had was about something you described, which is that you could target basically any environment, although Litmus itself was built to run on top of Kubernetes.
Karthik Satchitanand
Right.
Abdel Sghiouar
But the target environment against which you run an experiment could be whatever, right?
Karthik Satchitanand
That's true. Yes, yes.
Abdel Sghiouar
And those integrations, those specifics, like how do you shut down a VM on a specific cloud provider? Is this built by the open source community, or is this something that people have to build themselves? How does this work?
Karthik Satchitanand
There are some experiments that are already available. The community has built them, and we've pushed them onto a public ChaosHub. But we've also laid out exactly how someone could do it if they want to build their own experiments for something that is not already put up on the public ChaosHub. So there is a bootstrapper that we provide, and some templates. What this bootstrapper does is basically ask you to construct a very simple YAML file that consists of some metadata about your experiment and the target that you're trying to do chaos against, etc. Then it uses this information to generate the code, at least the scaffolding of the experiment. In Litmus, the experiment has a very specific structure. You have something called a pre-chaos check that you would perform, like a gating condition to say, I can actually go ahead and do chaos. Then you have the actual fault injection, then you have the post-chaos phase. Then you have all these different probes: HTTP probes, Prometheus probes, and command probes, things like that. They are all brought in. So you basically have a scaffolding that allows you to go ahead and add your business logic for doing the fault that you want on any infrastructure. We have some documentation that helps you package this entire code into a Kubernetes job and then eventually into a custom resource. And once you have that, you're ready to orchestrate from the control plane. It becomes a first-class citizen on the Litmus platform. So there is a specific approach to how you could construct your experiments. There is some aid that is always provided in terms of this bootstrapping utility and the documentation, etc. But there's also a huge community. There is a Litmus Slack channel on the Kubernetes workspace which is vibrant. There are a lot of conversations happening there. There are a lot of folks who have actually built out their own experiments and created their own private chaos hubs that they are using, and you could definitely interact with them and see what you can reuse, what you can contribute upstream, and what methodologies they follow to create a certain experiment, and go from there.
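The scaffolding the bootstrapper generates ends up packaged as a ChaosExperiment custom resource, roughly shaped like the hedged sketch below; the experiment name, image, and permissions are hypothetical placeholders.

```yaml
# ChaosExperiment: the packaged fault logic that a ChaosEngine references.
# Sketch of the scaffold shape; your business logic lives in the image.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-fault            # hypothetical experiment name
  labels:
    name: my-custom-fault
spec:
  definition:
    scope: Namespaced
    permissions:                   # least privilege the fault needs
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "delete"]
    image: registry.example.com/my-custom-fault:latest  # your experiment image
    command: ["/bin/bash"]
    args: ["-c", "./experiment -name my-custom-fault"]  # pre-check, inject, post-check
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
```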
Abdel Sghiouar
Got it. This makes me think of something. This is a little bit off, maybe not off topic, but off from the questions that I had in mind. Would you potentially be able to use Litmus for user validation testing? I'm thinking about a scenario where you could inject some erroneous messages to simulate an actual user and see. So, for example, let's say you have a queue-based system, right? And you have microservices that subscribe to queues, receive data, and act on them. And you want to see if a microservice will behave properly if the data is malformed, right?
Karthik Satchitanand
Right.
Abdel Sghiouar
I guess that's something you could do. Like, you could just inject erroneous data, data with errors, into a queue and see how a subsequent microservice would behave, right?
Karthik Satchitanand
Yes, that's definitely something that you can build out. When you look at failures, there are different kinds of failures. Failures could mean different things to different personas. They could mean different things to different target applications. The platform has been designed in a flexible way, so you could write something like what you just described, erroneous messages, and see how the service handles them. That's something that you can definitely do. Then there are things on the periphery of chaos, for example, load testing. You would basically want to put an insane amount of load on your services and see whether you have the right rate limiting in place. What happens to all the genuine requests that were going on? Let's say you're putting a spurious load on your service and you're rendering it incapable of handling the actual genuine requests. Is that happening? Do you have the right controls against that? That is also a chaos experiment. What you describe is also a chaos experiment. And then you have the more traditional forms of failures: you take down a node, you take down a pod, you cut off the network, you inject latencies. These are probably what you would find in chaos tools when they say, we have chaos experiments, but a lot of these other things are valid chaos experiments too, and we have some users writing some very innovative experiments that are still being orchestrated by Litmus. So that's the idea.
Abdel Sghiouar
You should be.
Karthik Satchitanand
You should have the flexibility to use one platform to inject the different kinds of failures that you want, and track all your results and all the resilience aspects in one place. So that's the idea.
Abdel Sghiouar
And that's actually what I wanted to come to. Because when we think about failure, it's not necessarily always something being down or up, or something not being able to handle load. It could also be all sorts of random, weird stuff that could happen. You know, malformed data, maybe on purpose or not on purpose. It could be a SQL injection, it could be cross-site scripting, all the security-related stuff, right?
Karthik Satchitanand
Absolutely.
Abdel Sghiouar
It could be maybe failed authentication. You explicitly try to authenticate with the wrong credentials multiple times to see if the system behaves properly, if you have any logic that says, if the same user tried to authenticate a couple of times with the same username and password and it fails, you should block them, stuff like that, right?
Karthik Satchitanand
Absolutely. These are all valid scenarios. We recently had someone try to write experiments to test their security checks. If you're able to do a certain operation, and it goes through, that means you're not secure. Are you able to create privileged containers? Or, let's say, you presumably have the right settings on your S3 bucket; if you're able to access it, then that's a problem. The experiment can also incorporate negative-logic tests like this. So if at all you're able to do a certain thing which you're originally not expected to be able to do, then that's a failure. You could use the platform for doing things like that as well.
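As a hedged sketch of the negative-logic idea, a security check like the one Karthik mentions might be expressed as a Litmus-style command probe that passes only when the insecure operation is refused; the command, file name, and comparator values here are illustrative.

```yaml
# Negative-logic security check as a Litmus-style command probe (sketch).
# Creating a privileged pod should be DENIED; the probe passes only
# when the cluster rejects the attempt.
probe:
  - name: privileged-pod-denied
    type: cmdProbe
    mode: EOT                      # evaluate once, at the end of the run
    cmdProbe/inputs:
      command: >
        sh -c 'kubectl apply -f privileged-pod.yaml >/dev/null 2>&1
        && echo allowed || echo denied'
      comparator:
        type: string
        criteria: equal
        value: "denied"            # expect the admission controls to block it
```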
Abdel Sghiouar
Yeah. So who are the main audiences of LitmusChaos? Who are the main personas?
Karthik Satchitanand
Yeah, so when it began, and when the traditional approach to Chaos Engineering held sway, early 2019, around that time, 2018-19, it was mainly the SREs. They were the folks who were trying to ascertain the resilience of the services that they had deployed and were maintaining. But as the awareness around continuous resilience grew, we had more folks whom you would associate with DevOps functions, people who are writing pipelines, managing pipelines, et cetera. They were interested in adding chaos as part of pipelines. So we started getting requests to create some kind of integrations with GitHub Actions. We started providing some remote templates for GitLab, integrations with Spinnaker, and things like that, where people started adding it into their pipelines. And then, as things grew more interesting, we partnered with another open source project called Okteto, which helps you do some kind of testing even before you create your image and push it to your registry. This is specifically for Kubernetes, where they give you namespaces so you can basically sync code between your code workspace and your pod on the cluster. And people were using chaos experiments in that kind of an environment too. So this is the actual core developers, even before they shipped anything or committed something. So we had persona groups evolving over time. It went from being, you know, the SRE and sort of cluster admin kind of persona, to the folks who are doing the continuous delivery, to the actual developers, the innermost loop. So we have Chaos Engineering being looked at by all of them. But I should say predominantly users are still of the first type. It is the SREs, or somebody with that kind of an allied function. It could be somebody that is looking at signing off the QA, or people doing performance testing. And this has caught on. People have started doing chaos experiments as part of their performance testing routines. So they have standard benchmarks, the pure benchmarks that they do with the different workload parameters, with different kinds of IO profiles and things like that. And then they have these mixed benchmarks, where they're trying to benchmark the system under a specific condition, under a certain kind of degraded condition, the degradation having been caused by a chaos experiment. So that is something that we are really seeing evolve. So these are the different kinds of personas that are looking at chaos today.
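To illustrate the pipeline use case, here is a hedged GitHub Actions sketch that gates a promotion on a chaos run; the manifests, resource names, and the naive sleep-then-check wait are all assumptions building on the ChaosEngine example earlier, not an official Litmus integration.

```yaml
# Chaos as a pipeline gate (sketch): deploy to staging, run an experiment,
# and promote only if the ChaosResult verdict is Pass.
name: deploy-with-chaos-gate
on: [push]
jobs:
  resilience-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy candidate to staging
        run: kubectl apply -f k8s/staging/           # hypothetical manifests
      - name: Run chaos experiment
        run: kubectl apply -f chaos/pod-delete-engine.yaml
      - name: Gate on the chaos verdict
        run: |
          sleep 120                                  # naive wait for the run to finish
          verdict=$(kubectl get chaosresult cart-pod-delete-pod-delete \
            -n shop -o jsonpath='{.status.experimentStatus.verdict}')
          test "$verdict" = "Pass"                   # fail the job, block the promotion
```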
Abdel Sghiouar
I see. Cool. So at this stage, LitmusChaos is an incubated project and you are on your way to graduation, right? How is that going?
Karthik Satchitanand
Yeah, graduation is a long process. We are very excited as the Litmus project team, and our community is excited. They were made aware of the fact that we've applied for graduation, and they showed a lot of love on our graduation PR, and they have also been asking us this question as to how it is going. So I think the CNCF incubation and graduation process takes its own time. There is a specific due diligence process. There are certain criteria that they look for. One of the things that, as a project team, we've been prepping ourselves on is acing the security audit. There are a lot of security features that we built into Litmus as we grew, but we wanted to actually get an audit done and get a lot of feedback on where we can improve our security posture. That's something that we worked on, and we've submitted all our improvements to the auditing authority, which is very shortly going to do the retest. We've also gone ahead and added more maintainers over time. And this was not specifically something that we decided to do once we got to graduation; this has been an evolving process from the time we moved from sandbox to incubation and onward. We've added more committers into the project, people who are invested in the project, some because they're using it in their organizations and they depend on Litmus very strongly for testing the resilience of their solutions, and they've made it part of their release processes. And there are some other contributors who are probably not actually using it within their organizations, but just because of the love for chaos, they've been doing a lot over a period of time. So they've become maintainers. And then we've gone ahead and improved our community base: people who have started adopting the project, both at an individual level and organizations who've actually publicly come out and said, we've adopted Litmus. Many times organizations might be using it, but they might not be very open in saying so. But the number of organizations publicly stating that they've used LitmusChaos has grown a lot, especially in the end user community. So we've been working on that. We've been working on adding more mentorship programs as part of the Litmus project. So there is the LFX program in the CNCF, there's Google Summer of Code. A lot of folks who participate in these programs have contributed to Litmus. So there's always some mentorship program or other going on where LitmusChaos is participating. And we are also trying to work with other projects in the CNCF community, where we are trying to get them to use Litmus. We had a very fruitful relationship with the Telco working group, or the CNF working group, who have been actively using Litmus as part of their test beds. There have been other CNCF projects who have been using Litmus to test their resilience. We are also integrating with other projects where we see an actual fit. For example, Backstage is one of the integrations we've done. We are also trying to integrate with other tooling which is allied with chaos, which is on the periphery, like we said, load, for example, k6 integrations. All these things have been going on, and they've been improving our presence in the community and our relevance, thereby indirectly helping our graduation efforts. Right now, we've created the graduation PR. We have some folks who are interested in sponsoring or, you know, carrying out the due diligence. And that's where it is right now.
While we prep on our side to help with whatever the process needs, whatever information is needed by the TOC for evaluating the project, we're trying to get that in place, and we're looking forward to engaging with them. So we are on the journey. Hopefully we make more progress more quickly and get there. But I'm sure that we will get there at some point in the near future.
Abdel Sghiouar
Nice. And my last question to you was going to be: you have an upcoming conference called LitmusChaos Con, right?
Karthik Satchitanand
Yeah.
Abdel Sghiouar
Can you tell us a little bit about that, and where people can find information if they want to check it out?
Karthik Satchitanand
Definitely. LitmusChaos Con is something that we really wanted to conduct. We were enthused by the sort of reaction and enthusiasm we saw for a Chaos Day event. It was a co-located event that we did at one of the previous KubeCons, where a lot of Litmus users, individual users and organizations, came and spoke about how they were using it and what they wanted to see in Litmus going ahead. We had a lot of booth traction during the KubeCons. The project meetings and some of our other talks were very well received. We have had chaos talks accepted in the main KubeCon event over the last few years, and we saw all this as a positive indicator of, you know, people's need for, let's say, getting full confidence around chaos itself, around Litmus itself. So we decided to do LitmusChaos Con. It is on September 12. It is a full-day event, and you can find details about it on the events page, community.cncf.io/events. That's where you'll find details about LitmusChaos Con. And we have a very interesting lineup of speakers: folks who are Litmus users, and there are some general chaos practitioners in there as well. And there are speakers from different kinds of end user organizations, I should say: people who run online poker, people who do food delivery, people who are maintaining video streaming services, people who are software vendors. Different kinds of users of Litmus are coming and speaking about their unique challenges, what they wanted to do, and how they used Litmus to achieve that. I think it's going to be very interesting for the community to learn from the experiences of these various speakers. We've got some amazing speakers, and the agenda and other details are all available on community.cncf.io. So yeah, as the project team, we're really looking forward to hearing from all the speakers during the conference.
Abdel Sghiouar
We will make sure to add a link about your upcoming event to our show notes. Karthik, thank you very much for your time. I learned quite a lot from you. I had no idea what Litmus was, and now I have a basic idea of what it does. Thank you very much.
Karthik Satchitanand
Yeah, thank you so much, Abdel, for giving us this opportunity to talk about LitmusChaos. Really enjoyed this podcast. Great questions, and looking forward to interacting more with the audience.
Abdel Sghiouar
Thanks for your time and have a good one.
Karthik Satchitanand
Thank you.
Kaslin Fields
Thank you very much, Abdel, for that interview. I am really excited about this one, because I've always kind of been interested in Chaos Engineering, because something with chaos in the name has to be fun, right? And also because I started out my career as a quality assurance engineer, writing tests in Perl for a storage area network (SAN) system.
Abdel Sghiouar
So wait, wait, wait, wait, wait, wait. This is a scoop. You've written Perl code?
Kaslin Fields
Yes.
Abdel Sghiouar
Wow. All right. I have to bow towards you. Perl is such an interesting language.
Kaslin Fields
And that's why now I'm just a YAML engineer.
Abdel Sghiouar
Got it. All this trauma from Perl, I guess.
Kaslin Fields
Bash and YAML, that's what I do.
Abdel Sghiouar
All right. I didn't know that. Okay.
Kaslin Fields
Yeah. So I was excited to hear about Chaos Engineering and I really liked how you all discussed. I mean, when he was talking about the basics of what Chaos Engineering was, I was like, hey, testing. Yeah, I know this world.
Abdel Sghiouar
Yeah, it's kind of, I mean, I think it's all different words to say the same thing. I have to admit that I had not been hearing that much about Chaos Engineering recently, at least up to when we decided to do this episode. I hadn't been hearing it that much. I think maybe we just call it something else. But as you say, the principles are the same. You just want to make sure that your system works, right?
Kaslin Fields
In my early days as a quality assurance engineer, I remember I was always coming up with ideas about, like, what testing needed to be in different ways that you could do testing. Like the concept of writing tests before you create the system versus writing them after you have the system and things like that. And so I feel like this kind of took me back to those days and kind of reminded me of how much fun it can be to think about ways that you can break a system.
Abdel Sghiouar
Yeah. Yeah. So the only experience I have personally with Chaos Engineering, generally speaking, is using Chaos Monkey, which is a very popular tool. I used that in the past, at a much smaller scale and for VMs, which is kind of the same concept. It's just orchestrating your tests, if you want to call them that; I call it orchestrating chaos through a tool.
Kaslin Fields
I've never used Chaos Monkey, but I've definitely heard of it. You also talked in the interview, though, about DiRT, which is an acronym that I have seen at Google, though I have never been involved with it. And it sounded like that was quite related as well.
Abdel Sghiouar
Yeah. So that's also something I did in my time at Google, when I was working in data centers. I can't give that much detail about it for obvious reasons, but I think the concept of DiRT is public; how we conduct it is not. So DiRT stands for Disaster Recovery Testing, which is essentially the same idea. You basically come up with a scenario. Sometimes it's a hypothetical scenario, sometimes it's an actual scenario, and you simulate, or you actually take stuff down, and you see how other things behave.
Kaslin Fields
Which I think is a really fun approach to testing. That is very chaotic.
Abdel Sghiouar
Pretty much. The only thing I will say, and this is, I think, public information, is that DiRT, or okay, a disaster recovery testing exercise, doesn't necessarily have to always be about IT systems. It can actually be about physical stuff.
Kaslin Fields
Yeah. And I feel like this is a good point to throw in the constant reminder of. Test your recovery plans.
Abdel Sghiouar
Yes, please test your backup plan.
Kaslin Fields
Having a recovery plan is not the same thing as having a recovery plan that works.
Abdel Sghiouar
Yes. Test that backup you've taken of the database last month to make sure it works.
Kaslin Fields
Yeah. Kind of a random aside on this: since I worked for a storage company, I worked for NetApp in the past, that is public information. And so I worked on testing their SAN systems, and I also worked in Vault at one point. And so a lot of my work at that time was, you know, about bad things that can happen to your storage. And of course, we work in tech, and so Silicon Valley, the TV show, comes up periodically.
Abdel Sghiouar
Oh yeah.
Kaslin Fields
And it's a very painful show to watch. It's fantastic, but very painful. And I had to stop watching at the point where they had like a catastrophic data loss.
Abdel Sghiouar
Yes.
Kaslin Fields
And I was like, no, this is what I do for work, and I can't handle this.
Abdel Sghiouar
Reminded you too much of work.
Kaslin Fields
Uh huh. Yep. So chaos engineering.
Abdel Sghiouar
I watched Silicon Valley. I don't think that it's that disconnected from reality.
Kaslin Fields
And unfortunately, yes, the only thing we can...
Abdel Sghiouar
Well, one of the things we can say is, just remember what happened a few weeks ago with the airlines, and we'll just stop there. The whole thing, we won't go into the chaos. So I think testing was at the core of that fiasco.
Kaslin Fields
True, Good point. Wouldn't that be an interesting interview to do?
Abdel Sghiouar
Yeah. So there was actually a YouTube show. I'll try to find it to include in the show notes. There is actually a YouTube, not a show, I think it's an interview, where, and this is related to what I talked about earlier about disaster recovery exercises being physical exercises. Some physical exercises even sometimes involve trying to physically break into places, like physical intrusion as part of testing, where you're not testing an IT system; you are testing whether your security systems, as in physical security systems, are actually up to standards. Right. And those are really fun exercises to do, actually.
Kaslin Fields
Yeah. And so for Chaos Engineering, I think this concept of disaster recovery testing is one thing that can fit inside of the box of Chaos Engineering. It seems like Chaos Engineering is a very big umbrella: introducing chaos into your systems through testing. And that's kind of what tests are meant to do. So I feel like most forms of testing could arguably fit into the world of chaos, as long as it's intentionally doing something that will probably break, instead of a test that's intentionally doing something that is supposed to be a good path.
Abdel Sghiouar
Yes.
Kaslin Fields
I guess any of the tests that are doing a bad thing could be chaos engineering.
Abdel Sghiouar
Yeah. But I think the term is interesting. I mean, chaos is probably a scary word, but in the context of what we discussed with Karthik, we could say that it's controlled chaos, because you know what you're doing.
Kaslin Fields
I did like the use of that term. Yeah, yeah.
Abdel Sghiouar
So you're trying to come up with a realistic scenario, but you execute it in a controlled environment. And you also collect metrics and logs, and you see how your system behaves in general.
Kaslin Fields
A realistic failure scenario.
Abdel Sghiouar
Yes.
Kaslin Fields
That you implement in a controlled way.
Abdel Sghiouar
Exactly. I think the controlled way is key here, because, I mean, technically anybody can just walk into a data center and start pulling cables out, right? That would technically be chaos. I don't know how many people would want to do that. So I think in this context it really means you do it in a controlled environment. And I like that Karthik talked about the fact that the recommendation is to always do it in the lower environments and then kind of bring it up to the higher environments as you feel comfortable with how you execute your tests.
Kaslin Fields
Excellent. So let's talk a little bit then about LitmusChaos. It's a CNCF project, an open source incubating project, and it's meant to help folks do Chaos Engineering, right?
Abdel Sghiouar
Pretty much, yeah. It's an orchestrator for chaos scenarios, if you want to call them that. I think they call them recipes, if I'm not mistaken.
Kaslin Fields
Oh, yes.
Abdel Sghiouar
And basically experiments. Sorry, experiments.
Kaslin Fields
Yeah, experiments.
Abdel Sghiouar
Yes.
Kaslin Fields
And so there's a special word.
Abdel Sghiouar
Yes.
Kaslin Fields
And the special word is experiments. We're scientists. Yeah, mad scientists.
Abdel Sghiouar
So they have a framework for building experiments, and you can build your own experiments, or you can use community experiments. But essentially, you have an orchestrator which runs on top of Kubernetes, and then you give it an experiment. And the experiment would be doing a set of actions and monitoring, collecting information about how your system behaves. So that's essentially, that's the TL;DR of what LitmusChaos really is.
Kaslin Fields
I do really like the term experiments for this. It makes sense. And in a scientific concept of you're trying to develop a plan, your hypothesis and test it, essentially. So it kind of makes sense. There's that connection there. But also you're being a mad scientist, introducing chaos into the system. So I like that a lot.
Abdel Sghiouar
Yeah. Actually, I remember when I was preparing for the episode, I did some research. I went to the website; they have a hub which is sort of like a marketplace for these experiments, so already pre-made experiments that you can just reuse. And I was looking at them, and at some point I was trying to figure out: are these actually realistic? So one example I can give that I looked at in the hub was a cloud provider experiment. Without mentioning which one, it doesn't matter, the name is not important. But essentially, you have a load balancer and you have a bunch of virtual machines attached to it, and what you do is detach those virtual machines from the load balancer. Right? That's the experiment. And I was thinking about it: is this actually realistic? But then I was like, yes, it is, because, and you tell me what you think about this, imagine you are doing this as part of infrastructure as code. You're running some Terraform code and then your run breaks in the middle. So the load balancer is created, but your VMs are not attached, or the other way around. Right? That could happen, and that's actually a realistic scenario. Or somebody runs the wrong command and ends up replicating the same behavior. So I find that quite interesting: these scenarios that maybe don't look realistic, but when you really think about them, you're like, yeah, this could actually happen.
Kaslin Fields
Yeah, certainly. I mean, VMs disconnecting sounds like something that's going to inevitably happen. You all talked about Murphy's Law as well at the beginning of this. Yes, that just seems like Murphy's Law waiting to happen.
Abdel Sghiouar
Exactly. Another thing that I was also thinking about, another example, would be: you accidentally add the wrong firewall rule, you know.
Kaslin Fields
Oh, yep, that is going to happen for sure.
Abdel Sghiouar
Or you remove the wrong firewall rule. I mean, it could be either way, right?
Kaslin Fields
Misimplement a firewall rule.
Abdel Sghiouar
Exactly.
Kaslin Fields
Accidentally mistype it.
Abdel Sghiouar
Yeah, wrong. Using the wrong tag, the wrong label, selecting the wrong virtual machines, or writing a firewall rule that maybe overlaps with or disables another one, because of priority. In firewall rules, that's usually how firewall rules in cloud work: they have priorities. So yeah, these are actually scenarios that could happen.
Kaslin Fields
Right, that's interesting. So there's definitely this level of chaos testing that's pretty obvious, I feel like. At, like, a cloud level, you have all this infrastructure; basic things like turn it off and on again. Yes. Misconfigure something. There's a whole world of kind of generic Chaos Engineering tests, or experiments I suppose, that could work on all sorts of systems and provide valuable testing. But Litmus specifically is a Kubernetes tool, right? Or does it work at the cloud level?
Abdel Sghiouar
Yeah, it runs on top of Kubernetes. The orchestrator itself is Kubernetes-based. Yes.
Kaslin Fields
Okay. Yeah, it's a CRD, basically. Yeah.
Abdel Sghiouar
A bunch of CRDs and operators, yeah, pretty much. Several.
Kaslin Fields
Yeah, that makes sense. So did you all go over any examples of Kubernetes use cases? Because I could imagine, like, pod disconnects. That's something that we test ourselves all the time. When I set up my blog, actually, that was the first thing that I tested. I deleted the container and then recreated it to see if all of the data was still there, testing my volumes for my Kubernetes workloads.
Abdel Sghiouar
I don't remember, to be honest, if we discussed this, but I can see how these kinds of scenarios could be valid. Right? I mean, I don't know, random example from the top of my head: if you don't have auto repair enabled and you just take stuff down. Just a very simple example, right? Take down the kubelet, you know, just shut it down, or block a port. Again, back to the firewall example: write a firewall rule that blocks certain ports that Kubernetes needs for the communication between the nodes.
Kaslin Fields
Right, the networking. Yeah, the networking test could get ugly.
Abdel Sghiouar
Yes, yes. Those would actually be very fun to execute, to be honest with you. Yeah, let's see what can happen. You know what, it reminds me of the episode we had with David, about Clustered. I remember we had David on one of our episodes, and that was essentially what Clustered, the show, was about. It was like, hey, break this thing and let me try to figure out how to fix it.
Kaslin Fields
That's very interesting. I wonder if you could set up a scenario like Clustered. So in Clustered, of course, David Flanagan, or Rawkode, sets up a cluster with bad things happening on it, and you have to figure out what's going wrong and fix it.
Abdel Sghiouar
Yes.
Kaslin Fields
I wonder if you could use existing, like, standard Chaos Engineering experiments and just run those on your cluster to get it into a bad state.
Abdel Sghiouar
I guess you could. That's a chaotic use of Chaos Engineering.
Kaslin Fields
Yeah.
Abdel Sghiouar
I'm actually looking at the ChaosHub, and there are some experiments for Kubernetes.
Kaslin Fields
On that, so I would imagine they normally involve recovery.
Abdel Sghiouar
So I'm looking at a list here of the already existing Kubernetes experiments. So you have container kill, disk fill. Disk fill is essentially: you fill up ephemeral storage. Right? You go on the node and you just fill up the disk to see how the node will behave.
Kaslin Fields
I don't need automation to do that. I have plenty of it.
Abdel Sghiouar
I mean, dd a bunch of one-gigabyte files, I guess, right?
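The manual version of that joke could look something like this pod sketch, which fills its own ephemeral storage with dd; it is illustrative only, with placeholder sizes, and it is exactly the kind of thing the disk fill experiment automates, bounds, and cleans up.

```yaml
# Manually simulating a disk fill: a pod that writes 1 GiB files into its
# ephemeral storage. Illustrative only - do not run this anywhere you care about.
apiVersion: v1
kind: Pod
metadata:
  name: disk-filler
spec:
  restartPolicy: Never
  containers:
    - name: filler
      image: busybox:latest
      command: ["sh", "-c"]
      args:
        - |
          for i in $(seq 1 10); do
            dd if=/dev/zero of=/tmp/junk-$i bs=1M count=1024   # one 1 GiB file per pass
          done
          sleep 3600
```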
Kaslin Fields
Yeah. In one of my old prototyping projects that I did at one point, we were constantly running out of memory on the node.
Abdel Sghiouar
There you go. There are actually some experiments for hogging up memory and CPU resources. Right.
Kaslin Fields
Yep.
Abdel Sghiouar
Returning some HTTP codes from the pod: you receive an HTTP request, but instead of returning a 200, you return a different code. Right? Draining a node, causing some IO stress. I'm just reading off the list. So yeah, there are quite a lot of, basically what you're describing, just a bunch of random things. Kill a random pod and see what will happen. Right?
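Pointing one of those hub experiments at a workload looks much like the earlier pod-delete sketch; here is a hedged disk fill variant, with placeholder names and values, following the env names the public ChaosHub documents (they may differ by version).

```yaml
# Hub experiment sketch: fill a target pod's ephemeral storage and watch
# how the node and the eviction machinery react.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: fill-ephemeral-storage
  namespace: shop
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: shop
    applabel: app=cart             # placeholder target
    appkind: deployment
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: "80"          # % of the container's ephemeral-storage limit
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```

One caveat worth noting: this style of fault generally assumes the target container actually declares an ephemeral-storage limit to fill against.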
Kaslin Fields
So, between now and our next episode, we need to set up a cluster with some Chaos Engineering experiments run on it, leave it in a bad state, and then give it to each other to try to fix.
Abdel Sghiouar
Yes. It would be fun to have David and Karthik on the same show, actually. I don't know if they have ever been together, but I should probably DM them. I don't know.
Kaslin Fields
Chaotic.
Abdel Sghiouar
Yeah. I don't know if David has decided to bring the show back, because I think he stopped at some point.
Kaslin Fields
Oh, yeah. Because he's been busy.
Abdel Sghiouar
Yes, yes.
Kaslin Fields
But definitely check out Rawkode and his community. He has a Discord. He makes all sorts of really good content.
Abdel Sghiouar
Yeah.
Kaslin Fields
If you want to see more of people trying to fix broken clusters, it's a good place to go.
Abdel Sghiouar
It was a lot of fun. It was a lot of fun. Yeah. No, it was really cool to discuss this. It's definitely not something that we get a chance to talk about very often, especially if you're a developer. But for people who are test engineers, like you used to be, it's pretty fun. Yeah.
Kaslin Fields
So let's wrap it up then. Thank you very much, Abdel, for the interview, and I'm excited that we got to learn about chaos engineering and explore testing.
Abdel Sghiouar
Yeah. I hope this episode was not chaotic. All right, thank you.
Kaslin Fields
Thank you. And we'll see you next time.
Abdel Sghiouar
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod, or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thank you for listening, and we'll see you next time.
Title: LitmusChaos with Karthik Satchitanand
Hosts: Abdel Sghiouar, Kaslin Fields
Release Date: August 20, 2024
In this episode of the Kubernetes Podcast from Google, hosts Abdel Sghiouar and Kaslin Fields engage in an insightful conversation with Karthik Satchitanand, a Principal Software Engineer at Harness and the co-founder and maintainer of LitmusChaos, a CNCF-incubated project. The discussion revolves around Chaos Engineering, the evolution of the LitmusChaos project, its role in the Kubernetes ecosystem, and the broader implications for continuous resilience in cloud-native environments.
Karthik begins by elucidating the concept of Chaos Engineering, emphasizing its foundational principle: testing distributed systems to ensure they can withstand unexpected failures or disruptions.
Karthik Satchitanand [03:22]:
"Chaos Engineering is mainly about understanding your distributed system better, how it withstands different kinds of failures because failures are bound to happen in production, and then also trying to create some kind of an automation around it because you would want to test your system continuously."
He highlights that Chaos Engineering is not a one-off activity but a continuous process involving controlled experiments that simulate real-world failures to validate system resilience.
The conversation delves into the genesis of LitmusChaos, which was born out of a necessity for continuous resilience testing within Kubernetes-based SaaS platforms.
Karthik Satchitanand [13:55]:
"Litmus is basically an end-to-end chaos platform that actually implements everything that is talked about in the Principles of Chaos."
Initially developed to standardize chaos experiments across different teams, LitmusChaos evolved into a comprehensive platform featuring a vast library of failure scenarios, probes for hypothesis validation, scheduling capabilities, and governance controls to manage the blast radius of experiments.
LitmusChaos offers a robust framework for conducting Chaos Engineering experiments with features designed to integrate seamlessly into Kubernetes environments:
Karthik Satchitanand [18:52]:
"We built a huge library of different kinds of faults and then we added something called probes that are a way for you to validate your hypothesis."
LitmusChaos caters to a diverse set of personas within the cloud-native ecosystem:
Karthik notes the shift from specialized game-day events to continuous resilience practices driven by the dynamic nature of cloud-native deployments.
Karthik Satchitanand [23:23]:
"Chaos Engineering has moved from being this specialized game day model to becoming a continuous event."
Currently an incubated project within the CNCF, LitmusChaos is on the path toward graduation. The project team has been actively enhancing the platform's security posture, expanding the community of committers, and fostering integrations with other CNCF projects.
Karthik Satchitanand [30:55]:
"Graduation is a long process. We are very excited as the Litmus Project team, our community is excited."
Efforts include undergoing security audits, increasing community engagement through mentorship programs, and collaborating with other open-source projects to broaden LitmusChaos's applicability and integration within the CNCF landscape.
Karthik announces the upcoming LitmusChaos Con, scheduled for September 12. This full-day event aims to bring together LitmusChaos users and Chaos Engineering practitioners to share experiences, challenges, and best practices.
Karthik Satchitanand [35:08]:
"Litmus Chaos Con is a full day event... We have a very interesting lineup of speakers, folks from different people who are Litmus users and there are some general Chaos practitioners in there as well."
The conference will feature speakers from various industries, including telco, food delivery, and MedTech, highlighting diverse use cases and the impact of Chaos Engineering on system resilience.
The episode wraps up with the hosts and Karthik reflecting on the broader implications of Chaos Engineering and the role of tools like LitmusChaos in fostering resilient, reliable cloud-native systems. They underscore the importance of continuous testing and proactive resilience strategies to navigate the complexities of modern distributed systems.
Kaslin Fields [44:02]:
"It seems like chaos engineering is a very big umbrella... I feel like most forms of testing could arguably fit into the world of chaos."
The conversation emphasizes that while the term "chaos" may evoke apprehension, its controlled and systematic application is crucial for building robust infrastructures capable of withstanding real-world disruptions.
Karthik on Chaos Engineering Continuity [03:36]:
"Chaos Engineering is not like a one-off event... It's something that you would need to do constantly."
Karthik on Litmus Evolution [13:55]:
"Litmus is basically an end-to-end chaos platform that actually implements everything that is talked about in the Principles of Chaos."
Karthik on Graduation [30:55]:
"Graduation is a long process. We are very excited as the Litmus Project team, our community is excited."
Karthik on LitmusChaos Con [35:08]:
"Litmus Chaos Con is a full day event... We have a very interesting lineup of speakers."