Speeding up root cause analysis with artificial intelligence
Most observability practices suffer from a common set of challenges:
- The more stacks and data you have, the harder it is to pinpoint where an issue stems from.
- You have too many tools and too little insight from those tools.
- The amount of data coming in surpasses the ability of humans to analyze it.
Fortunately, artificial intelligence now alleviates some of these problems. The typical incident workflow in observability looks like this:
1. Incident occurs
2. Detect and prioritize: Is there an issue? Is it important?
3. Triage and investigate: Where did the incident come from? What should you look at?
4. Respond: Who do you call?
5. Remediate: How do you fix it?
6. Postmortem: What did you learn? How can you improve?
Splunk AppDynamics currently includes AI and machine learning capabilities that help you through steps 3-6, providing the following benefits:
- Faster investigations
  - One-click RCA
  - More context on your incidents
  - Guided troubleshooting
- Insight-driven decision making
  - Surface and summarize suspected root causes for health rule violations
- Lower learning curve
  - Remove knowledge barriers
  - Democratize observability
How to use Splunk software for this use case
This article presents two use cases that show how you can benefit from AI in Splunk software to get to the root cause of an issue quickly. Both of these start on the Application Dashboard. On the right side, under the Metrics tab, Events are split into Health Rule Violations Started and Anomalies Started. This is where you'll start your root cause analysis with AI.

AI troubleshooting agent for health rule violations
A health rule is violated when the health rule processor detects that the rule's critical or warning condition is true. Traditionally, it took an average of seven clicks to dig into a health rule violation: clicking into various transactions, filtering, sorting, and reviewing the information before you could determine the root cause. The AI analysis brings that down to a single click on the violation in the Application Dashboard.
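Conceptually, a health rule is a pair of conditions evaluated against a metric. The following is a minimal, hypothetical sketch of that idea in Python; the metric name, thresholds, and evaluation logic are invented for illustration and are not the AppDynamics health rule processor or API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class HealthRule:
    """Simplified stand-in for a health rule: one metric, two thresholds."""
    metric: str
    warning_threshold: float
    critical_threshold: float

def evaluate(rule: HealthRule, observed_value: float) -> Optional[Literal["warning", "critical"]]:
    """Return the violation severity, or None if the rule is healthy."""
    if observed_value >= rule.critical_threshold:
        return "critical"
    if observed_value >= rule.warning_threshold:
        return "warning"
    return None

# Hypothetical example: average response time for a business transaction.
rule = HealthRule(metric="avg_response_time_ms", warning_threshold=500, critical_threshold=1000)
print(evaluate(rule, 820))   # -> "warning"
print(evaluate(rule, 1350))  # -> "critical"
```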
The analysis screen, as in the sample shown below, provides:
- A summary
- The exact metrics in violation
- The suspected root cause

The suspected root cause is the key output here, and you reached it with a single click. From there, you can resolve the issue. The suspected root cause section also shows snapshots; you can click into any one of those to see individual transactions that experienced the problem. Then, you can drill down further into a potential issue to see the Call Graph, which lists the methods in a call stack and provides information about each call.
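To make the Call Graph idea concrete, here is a small, hypothetical sketch of a call stack with per-call timing. The method names, times, and structure are made up for illustration; they are not output from AppDynamics.

```python
from dataclasses import dataclass, field

@dataclass
class Call:
    """One method call in a call graph, with its own time and child calls."""
    method: str
    time_ms: float
    children: list["Call"] = field(default_factory=list)

def print_call_graph(call: Call, depth: int = 0) -> None:
    """Print the call stack with indentation showing caller/callee structure."""
    print(f"{'  ' * depth}{call.method} ({call.time_ms:.0f} ms)")
    for child in call.children:
        print_call_graph(child, depth + 1)

# Hypothetical snapshot: a slow checkout transaction dominated by a database call.
snapshot = Call("CheckoutServlet.doPost", 1280, [
    Call("OrderService.placeOrder", 1210, [
        Call("InventoryDao.reserveItems", 940),   # the likely suspect
        Call("PaymentClient.charge", 230),
    ]),
])
print_call_graph(snapshot)
```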

Note that beneath the title of the violation screen, you can click the name of the health rule to edit it directly. You might find you need to adjust connectors, conditions, expressions, and more.

Also note that you don't have to change your existing workflows to take advantage of this. If you use health widgets in a dashboard, you can scroll down to Widget as List and click on any violation to get to the summarization screen.

AI troubleshooting agent for anomaly detection
As with health rule violations, AI in anomaly detection takes away much of the clicking, sorting, and analysis, getting you to root cause faster. When you select an anomaly from the Application Dashboard, a screen like the sample below opens with the following information.
- Concise AI summary: Receive an AI-generated summary of the issue, with specific times and dates.
- Top deviating metrics: See the exact metrics that created the anomaly, rather than needing to hunt for them yourself.
- Suspected root causes: Review causes ranked in order of probability and paired with a flow diagram that shows where the problem originated and where it propagated (see the sketch after this list for one simplified way such a ranking could work).
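As a rough mental model of "top deviating metrics," you can think of each metric being scored by how far its current value strays from its baseline, then sorted by that score. The sketch below is a simplified illustration using a standard score; the metric names and values are invented, and this is not the algorithm Splunk AppDynamics actually uses.

```python
import statistics

def deviation_score(baseline: list[float], current: float) -> float:
    """Standard score of the current value against a historical baseline."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline) or 1e-9  # avoid division by zero
    return abs(current - mean) / stdev

# Hypothetical per-service metrics: baseline samples vs. the latest reading.
metrics = {
    "checkout.avg_response_time_ms": ([210, 195, 220, 205, 215], 640),
    "inventory.errors_per_min":      ([2, 1, 3, 2, 2], 4),
    "payments.calls_per_min":        ([120, 118, 125, 122, 119], 121),
}

ranked = sorted(
    ((name, deviation_score(hist, now)) for name, (hist, now) in metrics.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: deviation score {score:.1f}")
```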

When you select a root cause, you get similar AI-generated information (a summary, metrics, recommendations, and snapshots), but specifically for that issue.

You can activate this feature with no configuration. Toggle it on at the top left of the Anomaly Detection screen. If you do want to configure severity, sensitivity, or other parameters, you can open the settings from the top left of the screen. Note that this feature needs about 24 hours to train on the data.
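The training window and sensitivity setting make more sense if you picture anomaly detection as a rolling baseline that needs some history before it can judge new points. The sketch below is purely conceptual; the window size, sensitivity multiplier, and threshold logic are assumptions for illustration, not the Splunk AppDynamics anomaly detection implementation or its settings.

```python
from collections import deque
import statistics

class RollingBaseline:
    """Rolling-window baseline that flags points straying far from recent behavior."""
    def __init__(self, window: int = 288, sensitivity: float = 3.0):
        # e.g., 288 five-minute samples is roughly 24 hours of history
        self.samples: deque[float] = deque(maxlen=window)
        self.sensitivity = sensitivity  # lower value = more sensitive

    def observe(self, value: float) -> bool:
        """Return True if the value looks anomalous relative to the window."""
        anomalous = False
        if len(self.samples) >= 30:  # require some history before judging
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples) or 1e-9
            anomalous = abs(value - mean) > self.sensitivity * stdev
        self.samples.append(value)
        return anomalous

baseline = RollingBaseline(sensitivity=2.5)
for v in [200, 205, 198, 210, 202] * 10 + [480]:
    flagged = baseline.observe(v)
print("last point anomalous:", flagged)  # the spike to 480 is flagged
```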

Next steps
Now that you have an idea of how Splunk AppDynamics AI can help you conduct root cause analysis more quickly when events occur, watch the full .Conf25 talk, AI-driven observability. In the talk, you'll learn more about the troubleshooting workflow, the specific AI features in Splunk AppDynamics and Splunk Observability Cloud, and how anomaly detection and health rules work.
In addition, you might find these Splunk resources helpful:
- Splunk Help: Dynamic baselines
- Splunk Help: Drill into an anomaly
- Splunk Help: Drill into a suspected cause
- Splunk Help: What is root cause analysis?

