Splunk Lantern

Using AI for observability troubleshooting

Splunk delivers a unified observability experience. At the core is the Splunk platform, a unified data platform for analyzing all data and metrics. Splunk AppDynamics optimizes monitoring of hybrid and three-tier environments, while Splunk Observability Cloud optimizes monitoring of cloud-native environments. Finally, Splunk ITSI sits on top of everything and provides business service monitoring and AIOps.

Even with this wide range of capabilities, there is room for improvement. AI assistants embedded in the software can accelerate incident response.

This article shows how these AI features help with three key use cases.

How to use Splunk software for these use cases

Use case 1: Eliminate siloed visibility

Scenario: You are an IT analyst who owns the monitoring and operations of tier-1, business-critical services and the underlying apps and infrastructure that affect them. When payments fail, customers are dissatisfied and the business can lose up to $100,000 per minute. You want to reduce the time it takes to find a problem, isolate its domain, and identify the root cause.

Solution: Using Splunk ITSI (ITSI) powered by AI and its integrations with Splunk AppDynamics and Splunk Observability Cloud, you can detect and isolate problems faster, and accelerate root cause analysis.

As an administrator, you can set machine learning-assisted KPI thresholds. These give recommendations on healthy thresholds and save them on a daily basis, dynamically adjusting to the pattern of the KPI without manual intervention from users.

AI Thresholding.png
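
To make the idea of a threshold that follows the pattern of a KPI more concrete, here is a minimal sketch in Python. It is not the algorithm ITSI uses; it assumes a simple rule (mean plus `k` standard deviations, learned separately for each hour of the day), and the function names and sample latency values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_thresholds(samples, k=3.0):
    """Learn an upper threshold per hour of day as mean + k * standard deviation.

    samples: iterable of (hour_of_day, value) pairs from historical KPI data.
    Returns a dict mapping hour_of_day -> threshold.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        # stdev needs at least two points; fall back to no spread otherwise.
        spread = stdev(values) if len(values) > 1 else 0.0
        thresholds[hour] = mean(values) + k * spread
    return thresholds


def is_anomalous(hour, value, thresholds):
    """Flag a value that exceeds the threshold learned for its hour."""
    return hour in thresholds and value > thresholds[hour]


# Hypothetical history: latency (ms) is normally higher at 09:00 than at 03:00.
history = [(3, 100), (3, 110), (3, 90), (9, 400), (9, 420), (9, 380)]
thresholds = hourly_thresholds(history)

print(is_anomalous(3, 300, thresholds))  # True: 300 ms is unusual at 03:00
print(is_anomalous(9, 300, thresholds))  # False: 300 ms is normal at 09:00
```

Rerunning `hourly_thresholds` on a fresh window of history each day is one way to approximate the daily adjustment described above, so the same 300 ms reading is judged against what is normal for that time of day rather than against a single static limit.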

As an analyst, you can troubleshoot issues faster using the AI Assistant. In the Splunk Application Performance Monitoring service map, if a service shows as unhealthy, you can ask the AI Assistant for an explanation of the problem. It describes key issues, such as latency, errors, or dependencies, so you don't have to hunt them down, which shortens your time to resolve. It also recommends what you could do next and links to other potential issues you might want to investigate.

AI Assistant.png

Use case 2: Signals from noise

Scenario: You are an IT or NOC operator who owns the centralized event management operations for alerts coming from across the entire IT estate and disparate tools. Each incident costs $13 to triage and $47 to resolve. You want to reduce the number of actionable incidents and the time it takes to triage them.

Solution: Using ITSI and its AIOps capabilities, you can correlate alerts quickly, which reduces the number of actionable incidents and supports closed-loop incident management.
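
As a rough illustration of what fewer actionable incidents could be worth, the following Python sketch applies the per-incident costs from the scenario. It assumes every actionable incident is both triaged and resolved, and the incident volumes are hypothetical, not measured results.

```python
TRIAGE_COST = 13   # dollars per incident, from the scenario
RESOLVE_COST = 47  # dollars per incident, from the scenario


def incident_cost(incidents):
    """Total handling cost, assuming each incident is triaged and resolved."""
    return incidents * (TRIAGE_COST + RESOLVE_COST)


def savings(incidents_before, incidents_after):
    """Cost difference when correlation reduces the actionable incident count."""
    return incident_cost(incidents_before) - incident_cost(incidents_after)


# Hypothetical volumes: 1,000 raw incidents correlated down to 400 episodes.
print(savings(1000, 400))  # 36000
```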

Set up Event iQ to create event correlation policies based on an analysis of historical event data. After you identify the attributes you want to use in grouping policies and their relative importance, Event iQ uses machine learning algorithms to compare field values and correlate notable events into episodes. This eliminates manual rule configurations.

Then, during an investigation, the Common Fields tab in Episode Review groups events based on the fields you defined. This helps analysts quickly differentiate real signals from noise and reduce the number of actionable incidents. In the example below, the system has been configured to group by specific alert values and by role in the service topology. Near the top of the screen, you can select Event iQ to see the configuration and make changes, if necessary.

EventIQ.png
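
The idea of grouping events by weighted field similarity can be sketched in a few lines of Python. This is an illustrative simplification, not the Event iQ algorithm: the field names, weights, similarity rule, and threshold are all assumptions chosen for the example.

```python
def similarity(a, b, weights):
    """Weighted fraction of fields whose values match between two events."""
    total = sum(weights.values())
    matched = sum(
        weight
        for field, weight in weights.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )
    return matched / total


def group_events(events, weights, threshold=0.6):
    """Greedily add each event to the first group whose seed event is similar enough."""
    groups = []
    for event in events:
        for group in groups:
            if similarity(group[0], event, weights) >= threshold:
                group.append(event)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([event])
    return groups


# Hypothetical fields and relative importance.
weights = {"host": 3, "alert_type": 2, "severity": 1}
events = [
    {"host": "pay-01", "alert_type": "latency", "severity": "high"},
    {"host": "pay-01", "alert_type": "latency", "severity": "medium"},
    {"host": "web-07", "alert_type": "disk", "severity": "high"},
]

episodes = group_events(events, weights)
print(len(episodes))  # 2: the two pay-01 latency alerts share one group
```

Raising the weight of a field makes a match on that field count for more, which mirrors the step of ranking the relative importance of the attributes used in grouping.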

Use case 3: Finding root cause

Scenario: You are an application owner who owns the development and operation of services supporting a retail web store. When customers can't search and purchase products, your organization loses one million dollars per hour. You want to decrease the time it takes to find and solve issues.

Solution: Episode summarization in ITSI helps you quickly pinpoint the right problem domain. By using the AI troubleshooting agent in Splunk Observability Cloud, you can detect and identify root causes more efficiently.

The AI troubleshooting agent provides context-aware, AI-driven troubleshooting directly from incidents, helping to reduce mean time to resolve. It summarizes incidents by gathering context from alerts and logs, then combining and correlating this data to deliver actionable insights. This feature is available for applications hosted on Kubernetes infrastructure.

ITSI Episode summarization is currently only available in beta.

During an investigation, the Impact tab in Episode Review provides an AI-generated analysis of an episode that includes the suspected root cause, as shown in the following screenshot. It also describes impact trends and provides quick links so you can drill down into each for more detail. Finally, it describes which aspects of the system were checked, including notable events, KPIs, and elements of the service topology.

AI Generated Analysis.png

Next steps

Now that you have an idea of how Splunk Observability Cloud's AI capabilities can help you troubleshoot your applications, watch the full .conf25 Talk, Troubleshooting made easy with AI for Observability. In the talk, you'll see demos of each of these use cases in action to get a better idea of how you might deploy these AI features in your environment. The presenters also discuss future AI-native experiences with Splunk Observability Cloud that will help with root cause analysis, detection, investigation, and remediation.

Written by Tapan Shah and Annu Kath, Software Observability Product Experts at Splunk