Some of your services have ten or twelve KPIs. You've noticed that even when several of those KPIs are at critical status, the overall service health score still appears normal. You need better awareness of when your services have problems.
This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI end users will benefit from adopting this practice as they work on Service Insights.
The simplest solution is to limit your services to three to six KPIs so that each one has more impact on the overall service health. To do this:
- Use only the most important key performance indicators.
- Break your services into sub-services.
- Consolidate similar KPIs. For example, error count and error rate present similar information, but error rate provides more context.
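To see why fewer KPIs give each one more impact, here is a rough sketch. It assumes every KPI has equal importance, that the health score is the average of the KPI scores, and that a Normal KPI scores 100 and a Critical KPI scores 0. The function name and those score values are illustrative assumptions, not settings taken from ITSI.

```python
def health_with_critical_kpis(total_kpis: int, critical_kpis: int) -> float:
    """Approximate health score when all KPIs carry equal importance.

    Assumption: Normal KPIs score 100 and Critical KPIs score 0.
    """
    if total_kpis <= 0 or not 0 <= critical_kpis <= total_kpis:
        raise ValueError("need 0 <= critical_kpis <= total_kpis and total_kpis > 0")
    normal_kpis = total_kpis - critical_kpis
    return (normal_kpis * 100 + critical_kpis * 0) / total_kpis


# Two Critical KPIs in services of different sizes.
for size in (12, 6, 3):
    print(size, "KPIs:", round(health_with_critical_kpis(size, 2), 1))
```

Under these assumptions, the same two Critical KPIs pull a twelve-KPI service down far less than they pull down a six-KPI or three-KPI service.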
However, if you must have more than six KPIs, you can lower the importance value of the less important KPIs, or set their thresholds so that they only ever reach the Info severity level. Either way, the less important KPIs have little or no effect on the health score and don't cause unnecessary concern.
A service health score is an importance-weighted average of the scores of all assigned KPIs. Each KPI's contribution is calculated by multiplying its score by its importance and dividing by the total importance of all KPIs, and the contributions are then added up. So by increasing the importance of a KPI, you also increase its influence on the health score.
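The calculation described above can be sketched in a few lines. This is only an illustration of the weighted average, not ITSI's implementation: the severity-to-score mapping, the function name, and the two extra KPIs (`Memory Usage`, `Response Time`) are assumptions made for the example.

```python
# Assumed severity-to-score mapping, for illustration only.
SEVERITY_SCORE = {"normal": 100, "low": 70, "medium": 50, "high": 30, "critical": 0}


def service_health(kpis):
    """Return the importance-weighted average score of (name, severity, importance) tuples.

    Returns None when the total importance is 0, because no KPI carries any weight.
    """
    total_importance = sum(importance for _, _, importance in kpis)
    if total_importance == 0:
        return None
    weighted_sum = sum(SEVERITY_SCORE[severity] * importance
                       for _, severity, importance in kpis)
    return weighted_sum / total_importance


# CPU Utilization is Critical but has importance 0, so it has no influence.
ignored = [("CPU Utilization", "critical", 0),
           ("Memory Usage", "normal", 5),      # hypothetical KPI
           ("Response Time", "normal", 5)]     # hypothetical KPI
print(service_health(ignored))

# Raising its importance to 5 lets the Critical status pull the score down.
weighted = [("CPU Utilization", "critical", 5),
            ("Memory Usage", "normal", 5),
            ("Response Time", "normal", 5)]
print(round(service_health(weighted), 1))
```

With importance 0 the Critical KPI drops out of both the numerator and the denominator, which is why the score stays at 100 in the first case.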
The following example shows how this works with the simulator on the Settings tab of Configuration > Services. In the first screenshot, though the Simulated Severity of CPU Utilization is at Critical, the Simulated Health Score is still at 100. Because the importance of CPU Utilization is at 0, this critical status doesn't affect the service health overall.
Now, when KPI title is set to Critical, the Simulated Health Score drops significantly because the importance of KPI title is at 5.
You can use the simulator to weigh your KPIs and validate that the service health score will change as appropriate.
This content comes from the .conf23 session, The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results. In the session replay, you can watch Jason Riley and Jeff Wiedemann share the best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.
You might also be interested in the following Splunk resources:
- Splunk Docs: Service insights manual