
Detecting and resolving issues in a Kubernetes environment

 

Implement a scalable observability solution that provides an overview of your Kubernetes architecture, highlights real-time issues, and allows you to act fast and mitigate impact.

Let’s dig into a scalable observability solution that provides an easy-to-digest overview of our Kubernetes (K8s) architecture, highlights real-time issues, and lets us act fast to mitigate impact.

Before diving into this article, you might want to check out this blog post on common problems you could encounter in a Kubernetes environment, their impacts, and the importance of fast resolution.

Kubernetes navigators in Splunk Infrastructure Monitoring

To resolve performance issues faster and increase the reliability of Kubernetes environments, Kubernetes navigators in Splunk Infrastructure Monitoring provide real-time insight into your K8s architecture and performance. With the Kubernetes navigators you can detect, triage, and resolve cluster issues quickly and easily.

You can enter the Kubernetes navigators from related content links throughout Splunk Observability Cloud – for example, if you’re diagnosing an issue in Splunk Application Performance Monitoring, and you want to explore your cluster health to see if it’s causing the problem. But for the purposes of this article, we'll go directly into the Kubernetes navigators.

From Splunk Infrastructure Monitoring, under Containers > Kubernetes, you can see the two navigators: one for Kubernetes nodes and one for Kubernetes workloads. The Kubernetes workloads navigator provides insight into workloads or applications running on K8s. The Kubernetes nodes navigator provides an overview of the performance of clusters, nodes, pods, and containers. Since our current focus is on cluster health, we’ll look at the Kubernetes nodes navigator.

[Screenshot: Kubernetes navigators in Splunk Infrastructure Monitoring]

In the Kubernetes nodes navigator, we see an overview of our clusters and their statuses and node dependencies. If we scroll down, we’ll see out-of-the-box charts that provide fast insight into common issues like resource pressure and node status.

[Screenshot: Kubernetes nodes navigator overview]

Using the K8s analyzer

Click the K8s analyzer tab. Here we’ll see an overview of common problems – nodes with memory pressure, high CPU, containers that are restarting too frequently, and abnormal pod and node statuses. As with any navigator, we can filter this data to examine specific clusters and scope our overview.

[Screenshot: K8s analyzer tab]
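If you want to spot-check the same conditions by hand, the sketch below shows roughly what the analyzer is surfacing, using the official Kubernetes Python client. It assumes a working kubeconfig and is only a rough manual equivalent of the out-of-the-box views, not how Splunk Observability Cloud collects this data.

from kubernetes import client, config

# Rough manual equivalent of the K8s analyzer's checks (a sketch only).
# Requires `pip install kubernetes` and a working kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

# Nodes with memory pressure or an abnormal Ready status
for node in v1.list_node().items:
    conditions = {c.type: c.status for c in node.status.conditions}
    if conditions.get("MemoryPressure") == "True" or conditions.get("Ready") != "True":
        print(f"{node.metadata.name}: {conditions}")

# Containers that are restarting
for pod in v1.list_pod_for_all_namespaces().items:
    for status in pod.status.container_statuses or []:
        if status.restart_count > 0:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"[{status.name}] restarts: {status.restart_count}")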

We can dig into a specific node in the cluster by clicking on it in the heat map view of the cluster, by applying a filter, or by selecting a namespace of interest from one of the analyzer tables. While we’re viewing the specific node, we get insight into the same helpful health info, and we can quickly diagnose node status, resource pressure, pod status, and container health.

Hovering over one of the nodes in this scoped cluster, you can see that it appears to be experiencing some memory pressure.

[Screenshot: node heat map showing memory pressure]

When we click into it, we can see this node is Not Ready:

[Screenshot: node detail showing Not Ready status]

Scrolling down through the metrics for this node, it looks like there’s one pod consuming significantly more memory than the others. If we click on that chart, we’ll see this data view.

[Screenshot: memory usage by pod]
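Outside the UI, you could approximate this "which pod is eating the memory?" view with the Kubernetes metrics API. The sketch below assumes metrics-server is installed in the cluster (it reads the same data as kubectl top pods), and the node name is a hypothetical placeholder.

from kubernetes import client, config

# Sketch: find which pods on a given node are using the most memory,
# via the metrics.k8s.io API exposed by metrics-server.
config.load_kube_config()
core = client.CoreV1Api()
metrics_api = client.CustomObjectsApi()

NODE = "ip-10-0-1-23.ec2.internal"   # hypothetical node name

def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity like '262144Ki' to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)

# Pods scheduled on the node of interest
on_node = {
    (p.metadata.namespace, p.metadata.name)
    for p in core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={NODE}"
    ).items
}

pod_metrics = metrics_api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="pods"
)

usage = []
for item in pod_metrics["items"]:
    key = (item["metadata"]["namespace"], item["metadata"]["name"])
    if key in on_node:
        total = sum(to_bytes(c["usage"]["memory"]) for c in item["containers"])
        usage.append((total, *key))

# The memory-hungry pod should stand out at the top of this list.
for mem, namespace, name in sorted(usage, reverse=True):
    print(f"{mem / 1024**2:8.0f} MiB  {namespace}/{name}")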

When the node condition chart is outlined in red, we know that it’s already linked to a detector and has an active alert firing. Our team was probably already paged for this alert before our casual exploration even started, meaning resolution is likely already underway. If we select the alert at the top right of the screen, we can view the open alert and expand its details to explore it further. We can also jump into Splunk Application Performance Monitoring from here and see the effect this infrastructure issue is having on our app.

AutoDetect detectors

You might have noticed this alert is tagged with "Autodetect". That’s because it comes from a Splunk AutoDetect detector. AutoDetect detectors are created automatically in Splunk Observability Cloud to quickly surface common, high-impact anomalies in your Kubernetes infrastructure. No manual creation of custom detectors is required, although you can still build your own if you want to.

Out-of-the-box Kubernetes AutoDetect detectors include:

  • Cluster DaemonSet ready versus scheduled
  • Cluster Deployment is not at spec
  • Container Restart Count > 0
  • Node memory utilization is high
  • Nodes are not ready

These AutoDetect detectors alert on all common Kubernetes problems discussed here. So rather than constantly running commands or going into your Kubernetes dashboard, you’ll proactively get alerted to issues with your cluster for faster diagnosis and resolution.

Using this process, you can quickly find the problem in one of your K8s nodes and allocate some additional memory so that your node and application can recover to a healthy state. But you could also dig into pods and containers, either by clicking into them or by applying a filter. From the pod and container views, you can see information such as resource usage per pod and the number of active containers.
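Before we look at those views, here is what that remediation step might look like outside the UI: a minimal sketch that raises a container's memory request and limit with a strategic-merge patch through the Kubernetes Python client. The deployment, namespace, and container names, and the memory values, are hypothetical placeholders.

from kubernetes import client, config

# Sketch: raise a container's memory request/limit via a strategic-merge patch.
# "checkout", "default", "web", and the 512Mi/1Gi values are placeholders.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "resources": {
                            "requests": {"memory": "512Mi"},
                            "limits": {"memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="checkout", namespace="default", body=patch)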

[Screenshot: pod and container views]

By exploring the pods and containers in these views, you can compare real-time usage to limits, uncover where limits aren’t set, and ensure resources are properly allocated to prevent possible pressure.
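One way to uncover missing limits outside the UI is to walk the pod specs and flag any container with no CPU or memory limit configured. The sketch below does exactly that with the Kubernetes Python client; it is illustrative only and assumes the same kubeconfig access as the earlier snippets.

from kubernetes import client, config

# Sketch: flag containers that have no CPU or memory limit configured.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        limits = (container.resources.limits or {}) if container.resources else {}
        missing = [r for r in ("cpu", "memory") if r not in limits]
        if missing:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container '{container.name}' has no {' and no '.join(missing)} limit")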