Splunk Lantern

Using the correct KPI statistical functions for alerting

 

When configuring key performance indicators (KPIs), particularly those split by entity, you need to select an entity calculation and a service/aggregate calculation. If you accept the defaults without understanding how these functions work, you are likely to end up with KPIs that produce inaccurate alerts. You want to learn how each calculation behaves so you can choose the right ones for your KPIs.

This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI end users will benefit from adopting this practice as they work on Service Insights.

Solution 

To set up a KPI:

  1. From the ITSI main menu, click Configuration > Services.
  2. Select an existing service.
  3. Go to the KPIs tab.
  4. Click New.

In the calculation section of this window, you will select the entity and service/aggregate calculations.

For more comprehensive configuration instructions, see Configure KPI monitoring calculations in ITSI.

There are quite a few calculation options to choose from for each, but the following are recommended.

Entity calculation. When the KPI runs, many search results are aggregated into a single alert value for each entity based on this calculation type. The most useful options, along with when you should consider using them, are:

  • Latest. Choose this when only the most recent value matters, for example, when you want to know:
    • Process up / host up
    • Disk free
    • Queue depth
  • Count. Choose this when measuring the number of raw events, for example, when you want to know:
    • Number of logins
    • Count of errors
  • Average. This is good for most other situations. For example:
    • Response time
    • CPU Utilization
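
The three entity calculations above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how ITSI implements it; the host names, values, and the `entity_values` helper are all hypothetical.

```python
from statistics import mean

# Hypothetical search results for one KPI run: (entity, value) in time order.
results = [
    ("web-01", 120.0),
    ("web-01", 180.0),
    ("web-02", 90.0),
    ("web-01", 165.0),
    ("web-02", 110.0),
]

def entity_values(rows, calculation):
    """Collapse each entity's results into one value with the named calculation."""
    grouped = {}
    for entity, value in rows:
        grouped.setdefault(entity, []).append(value)
    reducers = {
        "latest": lambda values: values[-1],  # most recent result only
        "count": len,                         # number of results
        "average": mean,                      # arithmetic mean of the results
    }
    reducer = reducers[calculation]
    return {entity: reducer(values) for entity, values in grouped.items()}

for calculation in ("latest", "count", "average"):
    print(calculation, entity_values(results, calculation))
```

For web-01, Latest returns 165.0, Count returns 3, and Average returns 155.0, which shows how much the choice of calculation changes the value each entity contributes.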

Service/Aggregate Calculation. This is the statistical operation that ITSI performs on the alert values for each entity to produce a single aggregate KPI alert value. The most appropriate aggregate calculation to use depends on the type of KPI you are building. The most useful options, along with when you should consider using them, are:

  • Min/Max. This turns the KPI display red when one entity is unhealthy.
  • Perc90/Perc10. This turns the KPI display red when several entities are unhealthy.
  • Average. This turns the KPI display red when most entities are unhealthy.
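
A rough way to see the difference between these aggregate calculations is to run them over a hypothetical fleet of ten entities. In this sketch the threshold, the fleet values, and the helper names are assumptions, and the percentile uses a simple nearest-rank method, which might not match ITSI's own percentile estimate exactly.

```python
import math
from statistics import mean

THRESHOLD_MS = 500.0  # hypothetical critical threshold for a response-time KPI

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (an assumption; ITSI's estimate may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def fleet(unhealthy, size=10):
    """Per-entity alert values: `unhealthy` entities at 1000 ms, the rest at 100 ms."""
    return [1000.0 if i < unhealthy else 100.0 for i in range(size)]

def breaches(values):
    """Report which aggregate calculations exceed the threshold."""
    aggregates = {
        "max": max(values),
        "perc90": nearest_rank_percentile(values, 90),
        "average": mean(values),
    }
    return {name: value > THRESHOLD_MS for name, value in aggregates.items()}

for unhealthy in (1, 3, 7):
    print(unhealthy, "unhealthy:", breaches(fleet(unhealthy)))
```

With one unhealthy entity only Max crosses the threshold, with three Perc90 crosses it as well, and Average crosses it only once seven of the ten entities are unhealthy.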

The following table shows how combinations of these two recommended calculations can yield the best results for alerting to problems.

| Entity Calculation | Aggregate Calculation | Example KPIs | Behavior |
| --- | --- | --- | --- |
| Latest | Min | Process/Host Up; Remaining Free Disk Space | Determine if a host is currently down or if a disk is currently full. |
| Average | Min/Max | Response time; Error rate | Determine if only one entity is experiencing issues. |
| Average | Perc90/Perc10 | Response time; Error rate | Determine if several entities are experiencing issues. |
| Average | Average | Response time; Error rate | Determine if most entities are experiencing issues. |
| Count | Sum | Number of logins; Count of errors | Determine the total volume across all entities in the system. |
| Sum | Sum | Total revenue | Determine the summarized quantity across all entities in the system. |
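
As a small illustration of the Latest/Min pairing, the following Python sketch uses hypothetical heartbeat results (1 = up, 0 = down) and compares the result with what an Average/Average pairing would report for the same data. The host names and values are invented for the example.

```python
from statistics import mean

# Hypothetical heartbeat results in time order: 1 = up, 0 = down.
heartbeats = [("db-01", 1), ("db-02", 1), ("db-01", 1), ("db-02", 0)]

latest = {}   # entity calculation: Latest
samples = {}  # kept only to compare against an Average entity calculation
for host, status in heartbeats:
    latest[host] = status  # later rows overwrite earlier ones
    samples.setdefault(host, []).append(status)

# Aggregate calculation Min: 0 as soon as any host is currently down.
latest_min = min(latest.values())

# Average/Average on the same data blurs the outage into a fraction.
average_average = mean(mean(values) for values in samples.values())

print(latest_min, average_average)
```

Here Latest/Min returns 0 because db-02 is down right now, while Average/Average returns 0.75, a value that does not clearly say whether any host is currently down.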

Other pairings, as shown in the following table, might seem logical, but can result in unwanted behavior. You should avoid using the pairings shown in this table, unless you are certain the outcome is what you want.

| Entity Calculation | Aggregate Calculation | Behavior |
| --- | --- | --- |
| Count | Count | Returns the number of split-by entities that have data. |
| Average | Average | These are the defaults, but an "average of averages" is imprecise and can obscure issues. |
| Latest | Latest | This randomly selects an entity result. |
| Min/Max | Min/Max | These high and low configurations are very sensitive to even one high or low outlier value in the raw data. When using Min/Max for the aggregate calculation, Average or Perc90/Perc10 is a better choice for the entity calculation. |
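
Two of these pitfalls are easy to reproduce with plain arithmetic. The sketch below uses made-up event data and host names; it mimics the calculations in Python rather than calling ITSI.

```python
from statistics import mean

# Hypothetical raw data per entity for one KPI run.
error_events = {"web-01": [1] * 98, "web-02": [1] * 2}
response_ms = {
    "web-01": [100.0, 110.0, 90.0, 2000.0],  # a single outlier sample
    "web-02": [100.0, 105.0, 95.0, 100.0],
}

# Count/Count: counting the per-entity counts yields the number of entities.
entity_counts = {entity: len(events) for entity, events in error_events.items()}
count_of_counts = len(entity_counts)
# Count/Sum: summing the per-entity counts yields the total event volume.
sum_of_counts = sum(entity_counts.values())

# Max/Max: one raw outlier drives the whole KPI.
max_of_max = max(max(values) for values in response_ms.values())
# Average/Max: the entity average dampens the outlier before Max is applied.
max_of_avg = max(mean(values) for values in response_ms.values())

print(count_of_counts, sum_of_counts, max_of_max, max_of_avg)
```

Count/Count reports 2 even though there were 100 error events, and Max/Max reports 2000.0 because of a single sample, whereas Average/Max reports 575.0 for the same data.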

Next steps

This content comes from the .conf23 session, The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results. In the session replay, you can watch Jason Riley and Jeff Wiedemann share the best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.

You might also be interested in the following Splunk resources: