Knowing proper adaptive threshold configurations

Adaptive thresholding is powerful and effective when configured properly, but confusing, noisy, and ineffective if not. You need to learn to use adaptive thresholding correctly to take advantage of it.

This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI administrators and end users will benefit from adopting this practice as they work on Service Insights.

Solution

First, let's review the components of adaptive thresholds:

Training Window. How much historical data to use when training
Time Policies. Windows of time where the KPI is expected to behave differently
Outlier Exclusion. Outliers to be removed in order to clean up the training dataset
Algorithm. The machine learning math used to compute threshold values
Algorithm Parameters. User-specified sensitivity settings that the algorithm uses to determine normal, high, and critical values

Now that you understand the basics, you can move on to best practices for applying adaptive thresholds.

Be cautious when using pre-configured threshold templates

There are many out-of-the-box templates available, but you cannot select one at random and hope it works. Invest some time in understanding how your KPI data behaves so that you can select appropriate templates. For more information, see Building your own custom threshold templates.

Resist the urge to use a different time policy for every hour of the day

You likely do not need 168 different policies for a week (one for each of the 24 hours in each of the 7 days). Configure as few as possible, but as many as necessary to encapsulate expected behavior changes in a KPI. In the following sample chart, the same policy has been applied on weekdays from 8 AM to 10 AM because the KPI behaves similarly across each of these windows of time.
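To illustrate the idea of a small set of time policies, here is a minimal Python sketch. It is not how ITSI stores or evaluates time policies. The policy names, the hour ranges other than the 8 AM to 10 AM weekday window, and the `policy_for` helper are all hypothetical and exist only for this example.

```python
from datetime import datetime

# Hypothetical policy table: (name, weekdays, start_hour, end_hour).
# Weekdays follow datetime.weekday() numbering: Monday=0 ... Sunday=6.
# end_hour is exclusive, so (8, 10) covers 8:00 through 9:59.
POLICIES = [
    ("weekday_morning_peak", {0, 1, 2, 3, 4}, 8, 10),
    ("weekday_business_hours", {0, 1, 2, 3, 4}, 10, 18),
    ("weekend", {5, 6}, 0, 24),
]
DEFAULT_POLICY = "default"


def policy_for(ts: datetime) -> str:
    """Return the name of the first policy whose window contains ts."""
    for name, days, start, end in POLICIES:
        if ts.weekday() in days and start <= ts.hour < end:
            return name
    return DEFAULT_POLICY
```

In this sketch, every hour of the week maps to one of four policies (three named windows plus a default) rather than to 168 separate ones.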

If your KPIs contain outlier data, use outlier exclusion

Most KPIs will have some outlier data. For example, response time commonly includes outliers. Excluding the outliers from the training data tightens your threshold ranges and makes them more effective. If you aren't sure which algorithm to use to determine outliers, a good place to start is standard deviation with a sensitivity of three sigmas (3σ).
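The effect of a three-sigma exclusion rule can be sketched in a few lines of Python. This is an illustration of the general technique, not ITSI's implementation. The `exclude_outliers` function is a hypothetical helper, and using the population standard deviation is an assumption made for this example.

```python
from statistics import mean, pstdev


def exclude_outliers(values, sigmas=3.0):
    """Drop values more than `sigmas` standard deviations from the mean.

    Assumption for this sketch: the mean and population standard deviation
    are computed once over the whole training window.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) <= sigmas * sd]


# Example: twenty typical response times and one extreme spike.
training = [100.0] * 20 + [1000.0]
cleaned = exclude_outliers(training)
```

Here the single spike of 1000 is removed and the twenty typical values remain, so thresholds trained on `cleaned` are not stretched by the spike.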

Percent of baseline is the preferred adaptive threshold algorithm

Among the options, percent of baseline is the best. Standard deviation is generally the second best option. The configuration for these looks like the following:

Algorithm	Critical Severity	High Severity	Medium Severity (Optional)	Base Severity
Percent of baseline	~200%	~150%	~125%	Normal
Standard Deviation	~3.0σ	~2.5σ	~2.0σ	Normal
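As a rough illustration of what the values in the table mean, the following Python sketch derives threshold levels from a set of training values. It is not ITSI's internal calculation. Treating the mean of the training data as the baseline is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def percent_of_baseline_thresholds(values, percents=(125, 150, 200)):
    """Thresholds at fixed percentages of the baseline (assumed: the mean)."""
    baseline = mean(values)
    medium, high, critical = (baseline * p / 100 for p in percents)
    return {"medium": medium, "high": high, "critical": critical}


def stdev_thresholds(values, sigmas=(2.0, 2.5, 3.0)):
    """Thresholds at fixed numbers of standard deviations above the mean."""
    mu, sd = mean(values), pstdev(values)
    medium, high, critical = (mu + s * sd for s in sigmas)
    return {"medium": medium, "high": high, "critical": critical}


def severity(value, thresholds):
    """Return the highest severity whose threshold the value reaches."""
    for name in ("critical", "high", "medium"):
        if value >= thresholds[name]:
            return name
    return "normal"
```

With training values averaging 100, the percent of baseline settings in the table place medium at 125, high at 150, and critical at 200, so a reading of 160 would be classified as high.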

New UI improvements for tuning adaptive thresholds

Time range window. You can customize this window to see more of the historical behavior than only the last week.

Full granularity. Zoom in and out of the KPI results. The timechart binning is now as fine as one minute, instead of the previous 30-minute buckets.

Advanced display options. While zoomed out of the results, you can see maximum and minimum values in the data set, as well as the 75th and 90th percentiles.

You can also specify the y-axis boundaries. Use this to zoom into specific ranges that might be difficult to see when the default y-axis minimum and maximum values are too far apart.

Compare current with historical threshold configurations. As you tune the thresholds, the panel outlined in the screenshot below is where you can see how effective your changes will be and whether your new configurations make sense. Use the percent of critical, high, and normal results compared to the historical data to decide.

Assisted adaptive threshold configurations

This is a new feature, powered by Splunk AI, that you can experiment with to decide whether it is helpful for you.

Pros

Provides automated adaptive threshold configuration recommendations
Drives upcoming “bulk” threshold tuning workflow

Cons

Will not always produce correct results
Might produce more complex configurations which are harder to tune

Next steps

This content comes from the Splunk .Conf presentation, The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results. Part one was presented at .Conf23 and part two at .Conf24. In the session replays, you can watch Jason Riley and Jeff Wiedemann share the many best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.

You might also be interested in the following Splunk resources:

Splunk Docs: Service insights manual
Splunk Docs: Create adaptive KPI thresholds in ITSI
.Conf Talk: Adaptive thresholding...demystified