Using the correct KPI statistical functions for alerting
When configuring key performance indicators (KPIs), particularly those split by entity, you need to select an entity calculation and a service/aggregate calculation. If you accept the defaults without understanding how these functions work, your KPIs might not produce accurate alerts. You want to learn which calculations to choose so you can set up your KPIs correctly.
This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI end users will benefit from adopting this practice as they work on Service Insights.
Solution
To set up a KPI:
- From the ITSI main menu, click Configuration > Services.
- Select an existing service.
- Go to the KPIs tab.
- Click New.
In the calculation section of this window, you will select the entity and service/aggregate calculations.
For more comprehensive configuration instructions, see Configure KPI monitoring calculations in ITSI.
There are quite a few calculation options to choose from for each, but the following are recommended.
Entity calculation. When the KPI runs, many search results are aggregated into a single alert value for each entity based on this calculation type. The most useful options, along with when you should consider using them, are:
- Latest. Choose this when only the most recent value matters, for example, when you want to know:
  - Process up / host up
  - Disk free
  - Queue depth
- Count. Choose this when measuring the count of raw events, for example, when you want to know:
  - Number of logins
  - Count of errors
- Average. This is good for most other situations. For example:
  - Response time
  - CPU utilization
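The following Python sketch illustrates how the three entity calculations above reduce the same raw results to one alert value per entity. It is not ITSI code: the `raw_results` data, the `entity_values` function, and the host names are hypothetical and exist only to show the arithmetic.

```python
from statistics import mean

# Hypothetical raw search results from one KPI run, in time order: (entity, value).
raw_results = [
    ("host-a", 40.0),
    ("host-b", 10.0),
    ("host-a", 60.0),
    ("host-b", 20.0),
    ("host-a", 95.0),
]

def entity_values(results, calculation):
    """Collapse each entity's results into a single alert value."""
    grouped = {}
    for entity, value in results:
        grouped.setdefault(entity, []).append(value)
    reducers = {
        "latest": lambda values: values[-1],  # only the most recent value matters
        "count": len,                         # number of raw events
        "avg": mean,                          # typical value over the window
    }
    reducer = reducers[calculation]
    return {entity: reducer(values) for entity, values in grouped.items()}

latest_by_entity = entity_values(raw_results, "latest")
count_by_entity = entity_values(raw_results, "count")
avg_by_entity = entity_values(raw_results, "avg")

print(latest_by_entity)  # {'host-a': 95.0, 'host-b': 20.0}
print(count_by_entity)   # {'host-a': 3, 'host-b': 2}
print(avg_by_entity)     # {'host-a': 65.0, 'host-b': 15.0}
```

In this sample, Latest reports the spike on host-a (95.0), while Average smooths it to 65.0. That difference is why Latest suits point-in-time states and Average suits measurements such as response time.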
Service/aggregate calculation. This is the statistical operation that ITSI performs on the alert values for each entity to produce a single aggregate KPI alert value. The most appropriate aggregate calculation to use depends on the type of KPI you are building. The most useful options, along with when you should consider using them, are:
- Min/Max. This turns the KPI display red when one entity is unhealthy.
- Perc90/Perc10. This turns the KPI display red when several entities are unhealthy.
- Average. This turns the KPI display red when most entities are unhealthy.
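To see why these three aggregate calculations respond to different numbers of unhealthy entities, consider the following Python sketch. It is an illustration only: the threshold of 80, the sample values, and the helper names are hypothetical, and the percentile uses a simple nearest-rank method that might differ from how ITSI computes Perc90.

```python
import math
from statistics import mean

CRITICAL_THRESHOLD = 80  # hypothetical threshold for a CPU utilization KPI

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (an assumption, not necessarily ITSI's method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def aggregate(values):
    """Apply each aggregate calculation to the per-entity alert values."""
    return {
        "max": max(values),
        "perc90": nearest_rank_percentile(values, 90),
        "avg": mean(values),
    }

def breaching(values):
    """Return the aggregate calculations whose result crosses the threshold."""
    return [name for name, value in aggregate(values).items()
            if value >= CRITICAL_THRESHOLD]

# Per-entity alert values for ten hypothetical hosts.
one_unhealthy = [30] * 9 + [98]
several_unhealthy = [30] * 7 + [98] * 3
most_unhealthy = [30] * 2 + [98] * 8

print(breaching(one_unhealthy))      # ['max']
print(breaching(several_unhealthy))  # ['max', 'perc90']
print(breaching(most_unhealthy))     # ['max', 'perc90', 'avg']
```

With these sample values, Max reacts to a single unhealthy host, Perc90 needs several, and Average crosses the threshold only when most hosts are unhealthy.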
The following table shows how combinations of these two recommended calculations can yield the best results for alerting to problems.
Entity Calculation | Aggregate Calculation | Example KPIs | Behavior |
---|---|---|---|
Latest | Min | | Determine if a host is currently down or if a disk is currently full. |
Average | Min/Max | | Determine if only one entity is experiencing issues. |
Average | Perc90/Perc10 | | Determine if several entities are experiencing issues. |
Average | Average | | Determine if most entities are experiencing issues. |
Count | Sum | | Determine the total volume across all entities into the system. |
Sum | Sum | | Determine the summarized quantity across all entities into the system. |
Other pairings, as shown in the following table, might seem logical, but can result in unwanted behavior. You should avoid using the pairings shown in this table, unless you are certain the outcome is what you want.
Entity Calculation | Aggregate Calculation | Behavior |
---|---|---|
Count | Count | Returns the number of split-by entities that have data, not the total number of events. |
Average | Average | These are the defaults, but an “average of averages” is fuzzy and can obscure issues. |
Latest | Latest | This randomly selects an entity result. |
Min/Max | Min/Max | These high and low configurations are very sensitive to even one high or low outlier value from the raw data. For the entity calculation, the Average or Perc90/Perc10 is better when using Min/Max for aggregate. |
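Two of these pairings are easy to check with a little arithmetic. The following Python sketch uses hypothetical response-time samples (in milliseconds) for three made-up entities. It only illustrates the math behind the table rows and is not ITSI code.

```python
from statistics import mean

# Hypothetical raw response times (ms) per entity for one KPI run.
raw_by_entity = {
    "web-1": [120, 130],
    "web-2": [110],
    "web-3": [900, 950, 1000, 980],
}

# Entity calculation = Count: one count of raw events per entity.
entity_counts = {entity: len(values) for entity, values in raw_by_entity.items()}
count_of_counts = len(entity_counts)         # Count | Count: number of entities
sum_of_counts = sum(entity_counts.values())  # Count | Sum: total event volume

# Entity calculation = Average: one average per entity.
entity_averages = {entity: mean(values) for entity, values in raw_by_entity.items()}
average_of_averages = mean(entity_averages.values())  # Average | Average
max_of_averages = max(entity_averages.values())       # Average | Max

print(count_of_counts, sum_of_counts)        # 3 7
print(average_of_averages, max_of_averages)  # 397.5 957.5
```

In this sample, Count with Count returns 3 even though seven events occurred. The average of averages (397.5 ms) also hides how slow web-3 is (957.5 ms), which an aggregate of Max exposes.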
Next steps
You might also be interested in the following Splunk resources:
- Splunk Docs: Service insights manual