Using AI for observability troubleshooting

Splunk delivers a unified observability experience. At the core is the Splunk platform, a unified data platform for analyzing all data and metrics. Splunk AppDynamics optimizes the monitoring of hybrid and three-tier environments, while Splunk Observability Cloud optimizes the monitoring of cloud-native environments. Splunk ITSI sits on top of everything and provides business service monitoring and AIOps.

Even with this wide range of capabilities, there is room for improvement. AI assistants embedded in the software accelerate incident response in the following ways:

Faster, more accurate detection
Intelligent investigation
- Alert correlation
- Event prioritization
- AI-guided root cause analysis
- Trend discovery
- AI/LLM monitoring (On-premises or SaaS)
Automation and remediation
- Alert actions
- Similar episodes
- Suggested responders

This article shows how these AI features help with three key use cases.

How to use Splunk software for these use cases

Use case 1: Eliminate siloed visibility

Scenario: You are an IT analyst who owns the monitoring and operations of tier-1 business-critical services and the underlying apps and infrastructure that affect them. When payments fail, customers are dissatisfied and the business can lose up to $100,000 per minute. You want to reduce the time it takes to find a problem, isolate its domain, and identify the root cause.

Solution: Using Splunk ITSI (ITSI) powered by AI and its integrations with Splunk AppDynamics and Splunk Observability Cloud, you can detect and isolate problems faster, and accelerate root cause analysis.

As an administrator, you can set machine learning-assisted KPI thresholds. These recommend healthy thresholds and save them on a daily basis, dynamically adjusting to the pattern of the KPI without manual intervention from users.

AI Thresholding.png
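
To make the idea of thresholds that adjust to a KPI's pattern more concrete, here is a minimal sketch in Python. It is not the algorithm ITSI uses, which this article does not describe. It assumes a simple approach in which thresholds are derived per hour of day from the mean and standard deviation of historical values. The function names, the sigma multipliers, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def recommend_thresholds(samples, warn_sigma=2.0, crit_sigma=3.0):
    """Derive (warning, critical) thresholds for each hour of day.

    samples: iterable of (hour_of_day, kpi_value) pairs from history.
    Assumption: a mean-plus-sigma rule; a real product may use other models.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    thresholds = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two data points
        mu, sigma = mean(values), stdev(values)
        thresholds[hour] = (mu + warn_sigma * sigma, mu + crit_sigma * sigma)
    return thresholds


def severity(thresholds, hour, value):
    """Classify a new KPI value against the thresholds for its hour."""
    warn, crit = thresholds[hour]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "normal"


# Hypothetical history: requests per minute, quiet at 03:00 and busy at 09:00.
history = [(3, v) for v in (10, 12, 8, 11, 9)] + [(9, v) for v in (100, 110, 90, 105, 95)]
thresholds = recommend_thresholds(history)

# The same raw value is judged differently depending on the hour's usual pattern.
night_result = severity(thresholds, 3, 20)
morning_result = severity(thresholds, 9, 20)
```

In this sketch, a value of 20 is critical at 03:00 but normal at 09:00, which is the kind of behavior a single static threshold cannot express. Re-running `recommend_thresholds` each day on recent history would approximate the daily adjustment described above.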

As an analyst, you can troubleshoot issues faster using the AI Assistant. In the Splunk Application Performance Monitoring service map, if a service shows as unhealthy, you can ask the AI Assistant for an explanation of the problem. It describes key issues, such as latency, errors, or dependencies so you don't have to hunt them down, improving your time to resolve. It also provides a recommendation on what you could do and even links you to other potential issues you might want to investigate.

AI Assistant.png

Use case 2: Signals from noise

Scenario: You are an IT or NOC operator who owns the centralized event management operations for alerts coming from across the entire IT estate and disparate tools. Each incident costs $13 to triage and $47 to resolve. You want to reduce the number of actionable incidents and the time it takes to triage them.

Solution: Using ITSI and its AIOps capabilities, you can correlate alerts quickly, reducing the number of actionable incidents and supporting closed-loop incident management.
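
As a rough illustration of what reducing actionable incidents is worth, the following Python sketch applies the per-incident costs from the scenario. The incident counts are hypothetical, and the sketch assumes every actionable incident is both triaged and resolved.

```python
# Per-incident costs taken from the scenario above.
TRIAGE_COST = 13
RESOLVE_COST = 47


def incident_cost(incidents):
    """Total cost, assuming each incident is triaged and then resolved."""
    return incidents * (TRIAGE_COST + RESOLVE_COST)


# Hypothetical monthly volumes before and after alert correlation.
before = incident_cost(1000)
after = incident_cost(400)
savings = before - after
```

With these assumed volumes, correlating 1,000 incidents down to 400 changes the cost from $60,000 to $24,000, a difference of $36,000.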

Set up Event iQ to create event correlation policies based on an analysis of historical event data. After you identify the attributes you want to use in grouping policies and their relative importance, Event iQ uses machine learning algorithms to compare field values and correlate notable events into episodes. This eliminates manual rule configurations.
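
The idea of comparing field values and weighting them by importance can be sketched in a few lines of Python. This is only an illustration of the general technique, not Event iQ's actual algorithm, which this article does not detail. The field names, weights, similarity rule, and threshold below are all hypothetical.

```python
def similarity(a, b, weights):
    """Weighted share of fields on which two events have the same value."""
    total = sum(weights.values())
    matched = sum(
        weight
        for field, weight in weights.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )
    return matched / total


def group_events(events, weights, threshold=0.6):
    """Greedily group events into episodes.

    Each event joins the first episode whose first event is similar enough;
    otherwise it starts a new episode.
    """
    episodes = []
    for event in events:
        for episode in episodes:
            if similarity(event, episode[0], weights) >= threshold:
                episode.append(event)
                break
        else:
            episodes.append([event])
    return episodes


# Hypothetical relative importance of each field.
weights = {"service": 3, "alert_type": 2, "host": 1}

events = [
    {"service": "payments", "alert_type": "latency", "host": "web-01"},
    {"service": "payments", "alert_type": "latency", "host": "web-02"},
    {"service": "search", "alert_type": "latency", "host": "web-01"},
    {"service": "payments", "alert_type": "errors", "host": "web-01"},
]

episodes = group_events(events, weights)
```

Here the three `payments` events land in one episode and the `search` event forms its own, so four alerts become two items to review. Raising the weight of a field makes a match on that field count for more when deciding whether events belong together.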

Then, during an investigation, the Common Fields tab in Episode Review groups events based on the fields you defined. This helps analysts quickly differentiate real signals from noise, reducing the number of actionable incidents. For example, the system can be configured to group by specific alert values and by role in the service topology. Near the top of the screen, you can select Event iQ to see the configuration and make changes, if necessary.

Use case 3: Finding root cause

Scenario: You are an application owner who owns the development and operation of services supporting a retail web store. When customers can't search and purchase products, your organization loses one million dollars per hour. You want to decrease the time it takes to find and solve issues.

Solution: Episode summarization in ITSI helps you quickly pinpoint the right problem domain. By using the AI troubleshooting agent in Splunk Observability Cloud, you can detect and identify root causes more efficiently.

The AI troubleshooting agent provides context-aware, AI-driven troubleshooting directly from incidents, helping to reduce mean time to resolve. It summarizes incidents by gathering context from alerts and logs, then combining and correlating this data to deliver actionable insights. This feature is available for applications hosted on Kubernetes infrastructure.

ITSI Episode summarization is currently only available in beta.

During an investigation, the Impact tab in Episode Review provides an AI-generated analysis of an episode that includes the suspected root cause, as shown in the following screenshot. It also describes impact trends and provides quick links so you can drill down into each for more detail. Finally, it describes which aspects of the system were checked, including notable events, KPIs, and elements of the service topology.

AI Generated Analysis.png

Next steps

Now that you have an idea of how Splunk Observability Cloud's AI capabilities can help you troubleshoot your applications, watch the full .conf25 Talk, Troubleshooting made easy with AI for Observability. In the talk, you'll see demos of each of these use cases in action to get a better idea of how you might deploy these AI features in your environment. The presenters also discuss future AI-native experiences with Splunk Observability Cloud that will help with root cause analysis, detection, investigation, and remediation.

In addition, you might find these Splunk resources helpful:

Splunk Blog: Introducing Event iQ: Smarter event correlation in Splunk ITSI
Splunk Lantern Article: The definitive guide to best practices for IT Service Intelligence
Splunk Resource: Cisco supercharges observability with Agentic AI for real-time business insights
Splunk Resource: Splunk Observability Product Tours
Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their Success Plan. Engage the ODS team at ondemand@cisco.com if you would like assistance.