Using caution when cascading service health scores upwards
Service health scores that are configured to cascade up a service tree can be misleading. To understand why, consider the following example. A service tree shows unhealthy scores on several services. Where should you start your investigation?
After you identify that the Cascading Database is the best place to start, open it up. In this example, you can see that CPU utilization is critical.
Even with that information, you shouldn't assume that the whole tree is unhealthy. If you open the Cascading Client, you might see that the service has no KPIs of its own, so its score alone cannot tell you whether it is actually unhealthy.
So why are the other services showing yellow? CPU utilization is critical (1), and because this service tree has been configured to cascade scores upward (2), everything above it, like the Cascading Client (3), shows as unhealthy as well. This configuration can be visually misleading and cause your team to waste incident response resources.
This article is part of the Definitive Guide to Best Practices for IT Service Intelligence (ITSI). ITSI end users will benefit from adopting this practice as they work on Service Insights.
Solution
When you configure service dependencies, base more of each service's health score on its own local KPIs rather than on health scores that cascade upward from lower-level services.
Cascading behavior is controlled by importance scores. The default importance scores vary, depending on how you create a service dependency, as shown in the table below.
Service Dependency Creation | Default Importance | Cascading Behavior |
---|---|---|
Dependent service is manually added in service configuration | 11 | Parent service completely impacted by child service health |
Dependent service is automatically created via CSV or search | 11 | Parent service completely impacted by child service health |
Dependent service is manually added in service sandbox UI | 5 | Parent service partially impacted by child service health |
Dependent service is added via content pack (CP) installation | Depends on the CP | Depends on the CP |
Rather than using those defaults, a best practice is to set them all to zero and, instead, create meaningful KPIs for each part of a service to tell you whether that part is healthy.
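To see why the default importance values matter, the following Python sketch models a parent score under three importance settings. It is a deliberately simplified illustration and not ITSI's actual health score algorithm: the names `Contribution` and `parent_health` are hypothetical, and the rules (importance 0 is ignored, 1 to 10 is a weighted average, 11 caps the parent at the child's score) are assumptions based on the "completely" and "partially impacted" descriptions in the table above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Contribution:
    """One input to a parent score: a local KPI or a dependent service."""
    name: str
    score: float     # 0.0 (critical) to 100.0 (healthy)
    importance: int  # 0 = ignored, 1-10 = weighted, 11 = can override the parent


def parent_health(contributions: list[Contribution]) -> float | None:
    """Simplified, hypothetical model; NOT the real ITSI computation."""
    counted = [c for c in contributions if 1 <= c.importance <= 11]
    if not counted:
        return None  # assumption: with nothing measured, health is unknown
    total_weight = sum(c.importance for c in counted)
    average = sum(c.score * c.importance for c in counted) / total_weight
    # Assumption: an importance-11 input caps the parent at its own score.
    caps = [c.score for c in counted if c.importance == 11]
    return min([average, *caps])


# A healthy local KPI on the parent, plus a critical child database.
local_kpi = Contribution("Client response time", score=100.0, importance=5)
for importance in (11, 5, 0):
    child = Contribution("Cascading Database", score=0.0, importance=importance)
    print(importance, parent_health([local_kpi, child]))
```

Under these assumptions, the parent reports 0.0 when the child has importance 11, 50.0 at importance 5, and 100.0 at importance 0. Only in the last case does the parent's score reflect what its own KPI measures.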
However, in some situations, you might have to cascade service health scores upwards. In this case, you must be absolutely certain that an unhealthy child means an unhealthy parent.
Say, for example, you have a database that runs on VMware. Your VMware environment is huge, which means there are many reasons it could be unhealthy, and it's possible none of those reasons affect the database. In this case, the cascading behavior can be misleading.
When you do configure service health scores to cascade upward, use the simulator to understand behavior for various anticipated failure scenarios.
Next steps
Now that you have this information, review service health score cascading configurations in existing service trees to understand current behavior. Then, adjust them as needed.
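As a starting point for that review, a minimal sketch like the following could flag the dependencies that currently cascade. It assumes you have already gathered each dependency's parent, child, and importance into a list by some means. The `Dependency` record, the `cascading_dependencies` helper, and the sample service names other than those in the example above are hypothetical and do not represent an ITSI API or export format.

```python
from typing import List, NamedTuple


class Dependency(NamedTuple):
    """Hypothetical record of one parent-to-child service dependency."""
    parent: str
    child: str
    importance: int  # 0 means the child's health does not cascade upward


def cascading_dependencies(deps: List[Dependency]) -> List[Dependency]:
    """Return dependencies that let child health affect the parent, highest importance first."""
    flagged = [d for d in deps if d.importance > 0]
    return sorted(flagged, key=lambda d: d.importance, reverse=True)


# Illustrative data only; real values come from your own service definitions.
inventory = [
    Dependency("Cascading Client", "Cascading Database", 11),
    Dependency("Web Store", "Payment API", 0),
    Dependency("Web Store", "Search Service", 5),
]

for dep in cascading_dependencies(inventory):
    print(f"{dep.parent} <- {dep.child} (importance {dep.importance})")
```

Each flagged dependency is a candidate for the question raised earlier: does an unhealthy child always mean an unhealthy parent? If not, consider setting its importance to zero and adding local KPIs to the parent instead.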
You might also be interested in the following Splunk resources:
- Splunk Docs: Service insights manual
- Splunk Docs: Overview of creating KPIs in ITSI
- Splunk Docs: Use the service sandbox in ITSI
- Splunk Docs: Add service dependencies in ITSI