Kubernetes History Inspector, with Kakeru Ishii - Kubernetes Podcast from Google


Kubernetes Podcast from Google – Episode Summary

Episode Title: Kubernetes History Inspector, with Kakeru Ishii
Hosts: Abdel Sghiouar, Kaslin Fields
Release Date: February 13, 2025

Introduction

In this episode of the Kubernetes Podcast from Google, host Abdel Sghiouar steps in solo as Kaslin Fields is on vacation. Abdel welcomes Kakeru Ishii, the initiator of the Kubernetes History Inspector (KHI), an innovative open-source tool designed to visualize Kubernetes logs and streamline the troubleshooting process. The conversation delves deep into the functionalities of KHI, its development journey, and the motivations behind its open-sourcing.

Guest Background

Kakeru Ishii is based in Tokyo and works as a support engineer at Google. With a strong background in handling complex Kubernetes issues, Kakeru leveraged his experience to develop KHI, addressing the challenges faced by support teams when diagnosing cluster problems through extensive log analysis.

Notable Quote:

"I wanted to understand the macroscopic view of the cluster. So that's why I needed to create it." – Kakeru Ishii [07:02]

Kubernetes History Inspector (KHI) Overview

KHI is introduced as a log visualizer tailored for Kubernetes environments. Before KHI, troubleshooting required meticulous examination of many log types, such as pod logs, kubelet logs, containerd logs, kube-apiserver logs, and kube-controller-manager logs, and it often took considerable expertise to craft the filters needed to gather and interpret that volume of data.

Notable Quote:

"It provides detailed timeline visualization or resource relationship diagrams just from logs available on your log backend." – Kakeru Ishii [04:20]
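The timeline idea can be sketched in a few lines. This is only an illustration and not KHI's implementation: the `LogEntry` type, the source names, and the sample messages are assumptions made for the example. It merges per-source logs, each already sorted by time, into a single chronological view.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical, simplified log record (not KHI's data model)."""
    timestamp: datetime
    source: str
    message: str


def merge_timeline(*streams: Iterable[LogEntry]) -> list[LogEntry]:
    """Merge per-source streams, each already sorted by time, into one timeline."""
    return list(heapq.merge(*streams, key=lambda entry: entry.timestamp))


# Made-up sample data for two log sources.
kubelet = [
    LogEntry(datetime(2025, 1, 1, 9, 0, 5), "kubelet", "container started"),
    LogEntry(datetime(2025, 1, 1, 9, 2, 0), "kubelet", "container exited"),
]
containerd = [
    LogEntry(datetime(2025, 1, 1, 9, 1, 30), "containerd", "daemon restarting"),
]

timeline = merge_timeline(kubelet, containerd)
for entry in timeline:
    print(entry.timestamp.isoformat(), entry.source, entry.message)
```

Reading the merged list top to bottom gives the kind of cross-component ordering that is hard to see when each log is inspected separately.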

Features and Functionality

KHI offers several key features that simplify the troubleshooting process:

Timeline Visualization: Creates a detailed timeline of events from logs, making it easier to trace issues.
Resource Relationship Diagrams: Illustrates relationships between various Kubernetes resources based on log data.
No Installation Required: Operates as a Docker container, allowing users to quickly deploy without modifying their clusters.
Extensible Architecture: Utilizes a Directed Acyclic Graph (DAG) based log parser system, enabling easy extension and customization for different log types.

Notable Quote:

"KHI is basically based on the directed acyclic graph. It's DAG based log parser system. So that makes KHI to be extensible." – Kakeru Ishii [12:30]
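A minimal sketch of what a DAG-based ordering of parsers could look like, using Python's standard `graphlib` module. The parser names and their dependencies below are hypothetical and chosen only to mirror the examples mentioned in the episode; KHI's real task graph is not described in this summary.

```python
from graphlib import TopologicalSorter

# Hypothetical parsers, each mapped to the set of parsers it depends on.
parser_deps = {
    "project_id": set(),                  # input needed before any query
    "kubelet": {"project_id"},            # yields container ID and pod name
    "containerd": {"kubelet"},            # needs the container ID mapping
    "timeline": {"kubelet", "containerd"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every parser runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(parser_deps)
print(order)
```

Adding a new parser in this model means adding one entry with its dependencies; the runner derives the execution order, and `graphlib.CycleError` is raised if the dependencies form a cycle.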

Use Cases and Benefits

One prominent use case discussed is troubleshooting intermittent authentication errors related to Workload Identity in Google Kubernetes Engine (GKE). KHI enabled Kakeru to correlate logs from multiple sources (customer pods, containerd, third-party security tools, and system workloads), thereby identifying that a third-party security product restarting containerd was the root cause.

Notable Quote:

"This kind of difficult problem involving multiple components on the cluster can be solved with KHI easily." – Kakeru Ishii [09:15]
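The correlation step in this use case can be approximated as "show everything that happened around the error". The sketch below is an assumption-laden illustration rather than KHI's logic: the `Event` type, source names, messages, and the 10-second window are all invented for the example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified log event."""
    timestamp: datetime
    source: str
    message: str


def events_near(events: list[Event], anchor: datetime, window: timedelta) -> list[Event]:
    """Return events from any source within `window` of `anchor`, oldest first."""
    nearby = [e for e in events if abs(e.timestamp - anchor) <= window]
    return sorted(nearby, key=lambda e: e.timestamp)


# Made-up events loosely modeled on the Workload Identity case.
events = [
    Event(datetime(2025, 1, 1, 9, 30, 0), "customer-pod", "request ok"),
    Event(datetime(2025, 1, 1, 10, 0, 1), "security-agent", "applying policy update"),
    Event(datetime(2025, 1, 1, 10, 0, 2), "containerd", "daemon restarting"),
    Event(datetime(2025, 1, 1, 10, 0, 5), "customer-pod", "authentication error"),
]

error_time = events[-1].timestamp
related = events_near(events, error_time, timedelta(seconds=10))
for e in related:
    print(e.timestamp.time(), e.source, e.message)
```

Events that cluster in time across components are only candidates for a causal link; a person still has to judge whether they are actually related, which is where a visual timeline helps.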

Technical Insights

The hosts explore the technical underpinnings of KHI:

WebGL Interface: KHI leverages WebGL for rendering its graphical interface, providing a rich and responsive visualization experience.
In-Memory Processing: Designed for rapid investigation, KHI processes logs in memory to deliver quick insights without the need for persistent storage.
Parser Dependencies: Implements a sequence of parsers based on their dependencies to accurately correlate related log entries.
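The parser-dependency point can be made concrete with the example given in the episode: containerd logs carry only a container ID, while kubelet logs carry both the container ID and the pod name, so the kubelet pass has to run first. The following is a hedged sketch with hypothetical field names (`container_id`, `pod_name`, `message`), not KHI's actual log schema.

```python
def build_container_index(kubelet_logs: list[dict]) -> dict[str, str]:
    """First pass: map container ID to pod name using kubelet logs."""
    return {log["container_id"]: log["pod_name"] for log in kubelet_logs}


def attach_pod_names(containerd_logs: list[dict], index: dict[str, str]) -> list[dict]:
    """Second pass: annotate containerd logs with the pod name, if known."""
    return [
        {**log, "pod_name": index.get(log["container_id"])}
        for log in containerd_logs
    ]


# Made-up sample records.
kubelet_logs = [{"container_id": "abc123", "pod_name": "web-0"}]
containerd_logs = [
    {"container_id": "abc123", "message": "task exited"},
    {"container_id": "zzz999", "message": "task started"},
]

index = build_container_index(kubelet_logs)
annotated = attach_pod_names(containerd_logs, index)
print(annotated)
```

Running the second pass before the first would leave every `pod_name` empty, which is the reason the parsers need an explicit dependency order.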

Notable Quote:

"For me, the usual web development is harder than the WebGL for me because I needed to learn AngularJS to build this application." – Kakeru Ishii [14:25]

Open Source and Community Engagement

KHI is available on GitHub under the Google Cloud organization, encouraging community contributions and star ratings. Kakeru emphasizes the tool's readiness for extension, hinting at future documentation to guide users in customizing log parsers to fit their specific needs.

Notable Quote:

"We'll leave the link in the show notes. Go check it out. Go try it out. Give it a star on GitHub." – Abdel Sghiouar [17:04]

Future Developments and AI Integration

The discussion touches upon the potential integration of Large Language Models (LLMs) with KHI. While Kakeru acknowledges the possibilities, he underscores the continued importance of visualization in understanding complex issues, suggesting that AI can complement but not entirely replace visual tools.

Notable Quote:

"The visualization is still important for the LLM." – Kakeru Ishii [16:36]

Conclusion and Key Takeaways

KHI emerges as a powerful tool for Kubernetes administrators and support engineers, simplifying the complex task of log analysis through intuitive visualizations. Its open-source nature invites community collaboration, promising continual enhancements and broader applicability.

Key Takeaways:

Simplified Troubleshooting: KHI reduces the complexity of diagnosing Kubernetes issues by visualizing log data.
Collaborative Utility: Facilitates better collaboration among support teams through shared visual insights.
Extensibility: The DAG-based parser system allows users to tailor KHI to their unique logging environments.
Open Source Advantage: Availability on GitHub encourages contributions and widespread adoption.

Notable Quote:

"That's why this visualization is still important." – Kakeru Ishii [16:37]

For more information about the Kubernetes History Inspector, visit the GitHub repository and consider contributing or starring the project to support its development.


Transcript

[00:00]
Abdel Sghiouar
Hello and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar. This week I am alone; Kaslin is on vacation and no one else was available, so I have to do this by myself. This week I spoke to Kakeru Ishii. Kakeru is the initiator of the Kubernetes History Inspector, or KHI, an open source tool that allows you to visualize Kubernetes logs and troubleshoot issues. We discussed what the tool does, how it was built, and what was the motivation behind open sourcing it. But let's get to the news. The schedule for the KubeCon and CloudNativeCon 2025 Maintainer Summit is live. The event, in its new format, takes place on March 31st at ExCeL London. The CNCF published their 2024 review of the top 30 projects. The ranking measures the projects by number of contributions, and surprisingly the podium is taken by Kubernetes, followed very closely by OpenTelemetry. Then Argo, Backstage, and Prometheus are all in the top five. The CNCF is looking for an end user study to highlight during the keynote of KubeCon and CloudNativeCon London this year. If you have an interesting case and you want to get an opportunity to speak about it for five minutes, fill in the form in the show notes. Applications are open until Friday, March 7, 2025. Google, AWS, and Azure announced kro, or Kube Resource Orchestrator. kro is a Kubernetes-native, cloud-agnostic framework that allows platform teams to define groupings of resources that users can consume as standard Kubernetes APIs. Check out the announcement blog and GitHub links in the show notes. AWS announced the general availability of EKS Hybrid Nodes. The feature was announced at re:Invent 2024. It allows users to connect on-prem and edge nodes to a managed EKS cluster on AWS. The company says this feature could help with modernization and migration of existing applications. CoreWeave announced the availability of NVIDIA GB200 NVL72 instances on their platform. With this announcement, CoreWeave becomes the first cloud provider to make the NVIDIA Blackwell platform generally available. And that's the news.
[02:25]
Abdel Sghiouar
Hello everyone, we are talking to Kakeru today. Kakeru is the initiator of the Kubernetes History Inspector, a new open source project released under the Google Cloud GitHub organization. This project helps visualize logs, and it has already helped the support team at Google troubleshoot GKE problems through logs. Kakeru made it using his experience from working on the support team, obviously. Welcome to the show, Kakeru.
[02:52]
Kakeru
Hello. Thank you for inviting me to this podcast, Abdel. I'm really excited to be here.
[02:58]
Abdel Sghiouar
I just have to say. You're based in Japan, right?
[03:00]
Kakeru
Yes, I'm based in Tokyo.
[03:02]
Abdel Sghiouar
And what time is it for you right now?
[03:04]
Kakeru
It's 5:00 pm. All right, so I know you are in early morning, right?
[03:08]
Abdel Sghiouar
It's 9 am for me. It's not too bad.
[03:10]
Kakeru
I'm sorry to wake you up earlier.
[03:12]
Abdel Sghiouar
It's fine. It's 9 am. I am actually, it's funny, I live in Sweden, but I am in the very far north part of Sweden. I'm in a ski resort very, very far up north. So wherever I look, there is just snow everywhere right now.
[03:24]
Kakeru
Nice. In Tokyo it's rarely see the snows and I like snows.
[03:28]
Abdel Sghiouar
Oh yeah, I seen. I have a friend who went to Sapporo and I think it snows in Sapporo, right?
[03:32]
Kakeru
Yeah, Sapporo is really cold place and it's good place for the skiing maybe. But here in Tokyo it's relatively warmer than Sapporo.
[03:41]
Abdel Sghiouar
Got it, got it.
[03:42]
Kakeru
Rarely see snow recently. Got it.
[03:45]
Abdel Sghiouar
All right, so. All right, let's talk about this tool that you guys have open sourced recently, the Kubernetes History Inspector. Can you tell us what's that?
[03:52]
Kakeru
Yeah, sure. It's a rich log visualizer designed for troubleshooting Kubernetes issues. Before this tool existed, we needed to troubleshoot pods by checking the container log. But if the problem can't be solved with the container log alone, we needed to rely on various kinds of logs, like the kubelet log, container log, containerd log, kube-apiserver log, and kube-controller-manager log. There are so many various logs that need to be used in troubleshooting. So it requires high expertise to craft the log filter to gather these logs and investigate them, because it just generates really tons of logs. And we needed to understand what happened in the past around the pod just from the lines of logs. It was so hard for us. So this tool provides us a detailed timeline visualization or resource relationship diagram just from logs available on your log backend. Currently this only supports Cloud Logging, but we are expanding this support to other clusters, especially open source Kubernetes clusters.
[05:10]
Abdel Sghiouar
Nice. And so one important detail actually that I want to talk about is that it says history in the name, which means that the tool actually allows you to go back in time.
[05:21]
Kakeru
Yes. So when I say inspector or something related to Kubernetes, maybe users think this is a kind of agentful tool like Prometheus or something. But actually this is just a log visualizer. It visualizes the history of the cluster resources just from logs, by parsing all of them.
[05:41]
Abdel Sghiouar
Yeah, we're going to talk about it, but it's important to understand you don't need to install anything in your cluster. I tried this yesterday. I just fired up Cloud Shell in the console and started the Docker container. And as long as it has permissions to pull the logs, it will just pull them, right?
[05:56]
Kakeru
Yeah.
[05:57]
Abdel Sghiouar
And give you this rich visualization, as you said. Where did the idea of open sourcing the tool come from?
[06:03]
Kakeru
Well, I am working as a support engineer at Google, and I need to troubleshoot customer cluster issues when I get a ticket. But the problem is that it may be my first day seeing the customer's cluster, or the customer says "my pod was dead yesterday" or something, but when I touch the cluster, maybe the cluster is running healthy without any issue. Troubleshooting a current ongoing incident is easier than troubleshooting a past issue, because I can interact with the cluster if the customer allows it. However, troubleshooting a past issue requires me to look through many kinds of logs, and it takes a really long time. So I wanted to understand the macroscopic view of the cluster. So that's why I needed to create it. And after I created the prototype, I showed it to my colleagues and other support team members, and it gained popularity among my support team and many other teams in Google. And I decided to make this available to other customers on the Internet.
[07:21]
Abdel Sghiouar
Nice. Nice.
[07:22]
Abdel Sghiouar
Yeah.
[07:22]
Abdel Sghiouar
So the tool is obviously on GitHub, so everybody who's listening to us, you should check it out and maybe give it a little star if you want. It's very easy to start. It's just a Docker container. So I did a little bit of troubleshooting and support back in my days before my current role. And I remember, yes, troubleshooting past problems is difficult, but so is troubleshooting problems that don't happen very often. Right? When you have, like, a transient issue. So how does this tool help support engineers in particular? How does it help them in troubleshooting these kinds of problems?
[07:56]
Kakeru
Well, this is very important for the support team, because querying the logs also depends on skill. At first, we needed to understand the explicit time the incident happened. But once this tool was used in the support team, after one support agent queries the logs, that agent can share the visualized timeline with the other support engineer the case is handed over to. So they can continue troubleshooting with a detailed understanding of the cluster.
[08:31]
Abdel Sghiouar
I see, I see. And so can you talk about some issues that the tool has helped you actually solve, that Kubernetes History Inspector has helped solve?
[08:39]
Kakeru
Well, let me introduce a somewhat difficult, complex ticket from before. All right, so this is about GKE with Workload Identity, which is a feature to get an access token to access the GCP API from a pod, checking the permissions granted to the Kubernetes service account. So my customer asked me to troubleshoot an intermittent Workload Identity error. I mean, the customer said, "my pod got an authentication error intermittently," but it only happens very few times in a month or something. And I realized that it was caused by a third-party security product restarting containerd, because Workload Identity needs to communicate with containerd to verify the pod is actually running on the node before returning the access token. But to troubleshoot this kind of issue, I needed to check the customer pod log, the containerd log, the third-party security product pod log, and also the Workload Identity system workload logs. So crafting a log query that checks all these logs would be a little hard. And even if I could get the list of log lines, it wouldn't make sense. But this tool helps me by showing how these logs happened. I can see a kind of line of dots of the logs happening at the same time on the visualization. So I could easily see, oh, this is caused by containerd restarting, triggered by the third-party security product. So this kind of difficult problem involving multiple components on the cluster can be solved with KHI easily.
[10:41]
Abdel Sghiouar
Yeah, and so I think what's important in this particular use case you're talking about is the correlation part. Like, how can you correlate events happening in multiple parts of the system using logs and understand that those events are related to each other? You are talking about Workload Identity, which has its own pod. Then you have the customer pod, then you have containerd, and then you have the pod of this third-party security tool. And you have to pull the logs across all these things and try to understand how they line up with each other. Right? So I was using the tool yesterday, as I said. I fired up the Docker container, I launched the interface. It has a graphical interface, a rather interesting graphical interface, because it's built in WebGL. I haven't heard of WebGL for a very long time, but we're going to talk about it. But how does that correlation work? Is it the tool that correlates these things together, or is it you who has to try to find that correlation?
[11:35]
Kakeru
Well, is it meaning users needed to customize correlation settings or something?
[11:42]
Abdel Sghiouar
Or does the tool find like those events and then says okay, these events are related in time? Like how does that work?
[11:49]
Kakeru
So basically KHI has various parsers already implemented in its code base. So it parses the structured log fields and decides which resources are related to the logs. But the log parsers are not so simple. Some log parsers need to be run before other log parsers. For example, the containerd log parser shows container behavior with a container ID, but it won't show any pod name. Yeah, but kubelet also shows the container ID and pod name. So I can correlate the container ID with a pod, because I'm running the containerd parser after getting the pod name from the other logs. But this is a little hard, you know, because many parsers depend on other parsers. So KHI is basically based on the directed acyclic graph. It's DAG based log parser system. So that makes KHI to be extensible. Currently we are not publishing any document on how to extend KHI log parsers, but I will do that later, and that helps customers extend the log parsers to support their own custom controllers or something.
[13:17]
Abdel Sghiouar
I see.
[13:18]
Kakeru
And this extensibility lets KHI support many kinds of logs.
[13:24]
Abdel Sghiouar
Okay, cool.
[13:24]
Abdel Sghiouar
Yeah.
[13:25]
Abdel Sghiouar
So then I want you to talk to us quickly about how it actually works behind the scenes. So I tried it. Essentially you start the Docker container, you select, well, in our case you select a project ID, then you select which cluster you want. Then there is a little thing happening on the interface that says "I'm running, I'm pulling some logs," and then you get an interface. But how does it actually work behind the scenes?
[13:48]
Kakeru
Well, each parser has dependencies, like, okay, this parser needs to have the project ID before querying, or something like that. These dependencies are defined on each parser as a DAG-based graph. So KHI is just running this graph-based task runner and generates one single log bin file, and then that is parsed by the front end and shown as a diagram.
[14:17]
Abdel Sghiouar
I see. And as I said, it uses WebGL for the rendering of the interface. Right.
[14:23]
Kakeru
Yeah.
[14:23]
Abdel Sghiouar
Did you learn WebGL to build this, or did you work with WebGL before joining Google?
[14:25]
Kakeru
Actually, I joined Google as a new grad, but my hobby was doing some open source work, especially for a WebGL framework.
[14:39]
Abdel Sghiouar
Okay.
[14:40]
Kakeru
So I made a WebGL framework before joining Google. So that's why I had experience with WebGL.
[14:45]
Abdel Sghiouar
I see.
[14:46]
Kakeru
So for me, the usual web development is harder than WebGL, because I needed to learn AngularJS to build this application. But for the WebGL side, I know the basics really well, and I could realize this performant visualization with my existing knowledge.
[15:08]
Abdel Sghiouar
I see, I see. All right. And so I think I have to ask this question because we are in 2025 and AI is all around us these days. So before I go there, the tool is pretty much in memory only, right? So when you fire up the container, all the logs are in memory, right?
[15:24]
Kakeru
Yeah, yeah. Actually, that is intended to be in memory. If the tool were for storage, I think that should be done in the backend. But I built the application to investigate quickly. So that's why I wanted to keep all the logs in memory on the front end side. Yeah, that design is intentional, and yeah, it works in memory.
[15:52]
Abdel Sghiouar
Nice. And so do you see a future in which an LLM could be integrated into KHI and help troubleshoot issues?
[16:02]
Kakeru
Well, maybe I can consider it. But the important role of KHI in the AI era would be like, even if an LLM said, "Okay, this problem happened because of this configuration issue, and this intermittent issue was triggered by this pod" or something, maybe people wouldn't be so convinced. So they want to understand why that happens with visualization, not just by text.
[16:37]
Abdel Sghiouar
Yeah.
[16:37]
Kakeru
So that's why this visualization is still important for the LLM.
[16:43]
Abdel Sghiouar
I see, I see. Nice.
[16:45]
Kakeru
Cool.
[16:45]
Abdel Sghiouar
Well, this is actually pretty cool.
[16:47]
Abdel Sghiouar
So I highly recommend people to go check it out. It's on GitHub. We'll leave the link in the show notes. Go check it out. Go try it out. Give it a star on GitHub. If there is any features missing, either implement it or open an issue, I guess. Right. Kakeru is on Twitter. But your Twitter is mostly in Japanese.
[17:05]
Kakeru
Yeah. So you can follow me. But maybe it's only limited for. No, no. It will be a little hard for the non Japanese speakers.
[17:16]
Abdel Sghiouar
Yeah. So that's fine. Maybe then people can talk to you on GitHub.
[17:19]
Kakeru
Yeah, yeah.
[17:20]
Abdel Sghiouar
Well, thank you for joining us on the show, Kakeru.
[17:22]
Kakeru
Thank you.
[17:25]
Abdel Sghiouar
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check out the website kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help people find and enjoy the show. Thank you for listening and we'll see you next time.