Using the correct KPI statistical functions for alerting

When configuring key performance indicators (KPIs), particularly those split by entity, you need to select an entity calculation and a service/aggregate calculation. If you accept the defaults without understanding how these functions work, you are unlikely to choose the combination that produces accurate alerts. You want to understand the options so you can configure your KPIs correctly.

This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI end users will benefit from adopting this practice as they work on Service Insights.

Solution

To set up a KPI:

1. From the ITSI main menu, click Configuration > Services.
2. Select an existing service.
3. Go to the KPIs tab.
4. Click New.

In the calculation section of this window, you will select the entity and service/aggregate calculations.

For more comprehensive configuration instructions, see Configure KPI monitoring calculations in ITSI.

There are quite a few calculation options to choose from for each, but the following are recommended.

Entity calculation. When the KPI runs, many search results are aggregated into a single alert value for each entity based on this calculation type. The most useful options, along with when you should consider using them, are:

Latest. Choose this when only the most recent value matters, for example, when you want to know:
- Process up / host up
- Disk free
- Queue depth
Count. Choose this when you are counting raw events, for example, when you want to know:
- Number of logins
- Count of errors
Average. This is good for most other situations. For example:
- Response time
- CPU Utilization
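
As a rough illustration of the three entity calculations above, the following Python sketch reduces a handful of raw results to one value per entity. The event tuples, host names, and the `entity_values` helper are hypothetical and only mimic the idea; this is not how ITSI is implemented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw search results: (timestamp, entity, value).
events = [
    (1, "host-a", 40.0),
    (2, "host-a", 60.0),
    (3, "host-a", 95.0),
    (1, "host-b", 20.0),
    (2, "host-b", 30.0),
]

def entity_values(events, calculation):
    """Reduce each entity's results to a single value."""
    by_entity = defaultdict(list)
    # Sorting puts each entity's values in timestamp order.
    for _timestamp, entity, value in sorted(events):
        by_entity[entity].append(value)
    reducers = {
        "latest": lambda values: values[-1],  # most recent value only
        "count": len,                         # number of raw events
        "average": mean,                      # mean over the window
    }
    reduce_values = reducers[calculation]
    return {entity: reduce_values(values) for entity, values in by_entity.items()}

for name in ("latest", "count", "average"):
    print(name, entity_values(events, name))
```

For host-a, Latest yields 95.0, Count yields 3, and Average yields 65.0, which shows how much the choice of entity calculation changes the value that later gets compared against thresholds.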

Service/aggregate calculation. This is the statistical operation that ITSI performs on the alert values for each entity to produce a single aggregate KPI alert value. The most appropriate aggregate calculation to use depends on the type of KPI you are building. The most useful options, along with when you should consider using them, are:

Min/Max. This turns the KPI display red when one entity is unhealthy.
Perc90/Perc10. This turns the KPI display red when several entities are unhealthy.
Average. This turns the KPI display red when most entities are unhealthy.
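
The difference between these three aggregate calculations can be sketched in Python. The example assumes ten hypothetical hosts reporting a CPU-style value where higher is worse, and it uses a simple nearest-rank percentile as a stand-in; ITSI's own percentile calculation may differ in detail.

```python
import math
from statistics import mean

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (a simplification, not ITSI's exact method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def aggregate(entity_values):
    """Apply each candidate aggregate calculation to the per-entity values."""
    values = list(entity_values.values())
    return {
        "max": max(values),
        "perc90": percentile_nearest_rank(values, 90),
        "average": mean(values),
    }

def fleet(unhealthy, total=10, healthy_value=30.0, unhealthy_value=98.0):
    """Hypothetical per-entity values with a given number of unhealthy hosts."""
    return {
        f"host-{i}": unhealthy_value if i < unhealthy else healthy_value
        for i in range(total)
    }

one_bad = aggregate(fleet(unhealthy=1))
three_bad = aggregate(fleet(unhealthy=3))
eight_bad = aggregate(fleet(unhealthy=8))

for label, result in (("1 unhealthy", one_bad), ("3 unhealthy", three_bad), ("8 unhealthy", eight_bad)):
    print(label, result)
```

With one unhealthy host, only Max reaches 98.0 while Perc90 stays at 30.0 and Average is 36.8. With three unhealthy hosts, Perc90 also reaches 98.0. Average climbs more slowly (50.4 with three unhealthy hosts, 84.4 with eight), so it reacts mainly when most entities are unhealthy.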

The following table shows how combinations of these two recommended calculations can yield the best results for alerting to problems.

Entity Calculation	Aggregate Calculation	Example KPIs	Behavior
Latest	Min	Process/host up; remaining free disk space	Determine if a host is currently down or if a disk is currently full.
Average	Min/Max	Response time; error rate	Determine if only one entity is experiencing issues.
Average	Perc90/Perc10	Response time; error rate	Determine if several entities are experiencing issues.
Average	Average	Response time; error rate	Determine if most entities are experiencing issues.
Count	Sum	Number of logins; count of errors	Determine the total volume across all entities.
Sum	Sum	Total revenue	Determine the total quantity summed across all entities.

Other pairings, as shown in the following table, might seem logical, but can result in unwanted behavior. You should avoid using the pairings shown in this table, unless you are certain the outcome is what you want.

Entity Calculation	Aggregate Calculation	Behavior
Count	Count	Returns the number of split-by entities that have data, not the total number of events.
Average	Average	These are the defaults, but an “average of averages” is fuzzy and can obscure issues.
Latest	Latest	This randomly selects an entity result.
Min/Max	Min/Max	These high and low configurations are very sensitive to even one high or low outlier value in the raw data. When you use Min/Max as the aggregate calculation, Average or Perc90/Perc10 is a better choice for the entity calculation.
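
Two of these pitfalls are easy to reproduce in a small Python sketch. The data and the `kpi` helper are hypothetical: two web servers each report ten response times in milliseconds, and one server has a single outlier.

```python
from statistics import mean

# Hypothetical raw response times (ms) per entity; web-1 has one spike.
response_ms = {
    "web-1": [100.0] * 9 + [2000.0],
    "web-2": [100.0] * 10,
}

def kpi(raw_by_entity, entity_calc, aggregate_calc):
    """Apply the entity calculation per entity, then aggregate the results."""
    entity_values = [entity_calc(values) for values in raw_by_entity.values()]
    return aggregate_calc(entity_values)

count_count = kpi(response_ms, len, len)  # counts entities, not events
count_sum = kpi(response_ms, len, sum)    # total events across entities
max_max = kpi(response_ms, max, max)      # driven by the single spike
avg_max = kpi(response_ms, mean, max)     # spike is smoothed per entity first

print(count_count, count_sum, max_max, avg_max)
```

Here Count/Count returns 2 (the number of entities) while Count/Sum returns 20 (the number of events). Max/Max reports the single 2000 ms spike as the KPI value, whereas Average/Max reports 290.0, the worst per-entity average.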

Next steps

This content comes from the Splunk .conf presentation, The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results. Part one was presented at .conf23 and part two at .conf24. In the session replays, you can watch Jason Riley and Jeff Wiedemann share the best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.

You might also be interested in the following Splunk resources:

Splunk Docs: Service insights manual