Splunk Lantern

Using caution when cascading service health scores upwards

 

Service health scores that are configured to cascade up a service tree can be misleading. To understand why, consider the following example: a service tree shows unhealthy scores on some of its services. Where should you start your investigation?

 

After you identify that the Cascading Database is the best place to start, open it up. In this example, you can see that CPU utilization is critical.


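Even with that information, you shouldn't assume that the whole tree is unhealthy. If you open the Cascading Client, you might see that the service has no KPIs of its own. Its score is inherited from the services below it, so the score alone can't tell you whether the Cascading Client is actually unhealthy.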


So why are the other services showing yellow? CPU utilization on the Cascading Database is critical, and because this service tree has been configured to cascade scores upward, every service above it, such as the Cascading Client, shows as unhealthy as well. This configuration can be visually misleading and can cause your team to waste incident response resources.


This article is part of the Definitive Guide to Best Practices for IT Service Intelligence (ITSI). ITSI end users will benefit from adopting this practice as they work on Service Insights.

Solution

When configuring service dependencies, prefer service health scores that are computed from each service's local KPIs over scores that cascade upward from lower-level services.

Cascading behavior is controlled by importance scores. The default importance score varies depending on how you create the service dependency, as shown in the following table.

| Service dependency creation | Default importance | Cascading behavior |
| --- | --- | --- |
| Dependent service is manually added in service configuration | 11 | Parent service completely impacted by child service health |
| Dependent service is automatically created via CSV or search | 11 | Parent service completely impacted by child service health |
| Dependent service is manually added in service sandbox UI | 5 | Parent service partially impacted by child service health |
| Dependent service is added via content pack (CP) installation | Depends on the CP | Depends on the CP |
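
To make the effect of these importance values concrete, here is a minimal Python sketch of a toy scoring model. It is not ITSI's actual calculation. It assumes a parent's score is an importance-weighted average of its inputs on a 0 to 100 scale, that an input with importance 0 is ignored, and that an input with importance 11 acts as a ceiling on the parent's score. The `Input` class, the `parent_health` function, and the "Client response time" KPI are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Input:
    """One contributor to a parent's health score (a local KPI or a child service)."""
    name: str
    score: float     # 0 (worst) to 100 (healthy)
    importance: int  # 0 (ignored) to 11 (parent completely impacted)


def parent_health(inputs):
    """Toy model: importance-weighted average, capped by any importance-11 input."""
    weighted = [i for i in inputs if i.importance > 0]
    if not weighted:
        return None  # nothing contributes, so no meaningful score exists
    total = sum(i.importance for i in weighted)
    average = sum(i.score * i.importance for i in weighted) / total
    # Assumption: an importance-11 input drags the parent down to its own score.
    ceilings = [i.score for i in weighted if i.importance == 11]
    return min([average] + ceilings)


# A healthy local KPI on the parent, and a child service in a critical state.
local_kpi = Input("Client response time", score=100, importance=5)
for importance in (0, 5, 11):
    child = Input("Cascading Database", score=10, importance=importance)
    print(f"child importance {importance:>2}: parent score = "
          f"{parent_health([local_kpi, child])}")
```

Under these assumptions, the same critical child leaves the parent fully healthy at importance 0, pulls it to a mid-range score at importance 5, and forces it to the child's own score at importance 11. A parent with no contributing inputs returns no score at all, which mirrors the situation where a service has no KPIs and its true health is unknown.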

Rather than using those defaults, a best practice is to set them all to zero and, instead, create meaningful KPIs for each part of a service to tell you whether that part is healthy.


However, in some situations, you might have to cascade service health scores upwards. In this case, you must be absolutely certain that an unhealthy child means an unhealthy parent.

Say, for example, you have a database that runs on VMware. Your VMware environment is huge, which means there are many reasons it could be unhealthy, and it's possible none of those reasons affect the database. In this case, the cascading behavior can be misleading.


When you do configure service health scores to cascade upward, use the simulator to understand behavior for various anticipated failure scenarios.
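
The kind of comparison the simulator supports can also be reasoned through offline. The following Python sketch is a toy model, not ITSI's scoring engine and not the simulator itself. It assumes a plain importance-weighted average on a 0 to 100 scale and a hypothetical two-service tree in which a Database service depends on a VMware service. All KPI names, importance values, and scenarios are invented for illustration.

```python
def service_score(name, tree, kpi_scores):
    """Toy health score: importance-weighted average of local KPIs and child services."""
    node = tree[name]
    inputs = [(kpi_scores[kpi], weight) for kpi, weight in node["kpis"].items()]
    inputs += [(service_score(child, tree, kpi_scores), weight)
               for child, weight in node["depends_on"].items()]
    # Importance 0 removes an input; a child with no score contributes nothing.
    inputs = [(score, weight) for score, weight in inputs
              if weight > 0 and score is not None]
    if not inputs:
        return None
    return sum(s * w for s, w in inputs) / sum(w for _, w in inputs)


def build_tree(cascade_importance):
    """Hypothetical tree: Database depends on VMware with the given importance."""
    return {
        "VMware": {"kpis": {"host_cpu": 5, "datastore_latency": 5},
                   "depends_on": {}},
        "Database": {"kpis": {"db_query_time": 5},
                     "depends_on": {"VMware": cascade_importance}},
    }


# Anticipated failure scenarios, expressed as KPI scores (100 = healthy, 0 = critical).
scenarios = {
    "all healthy": {"host_cpu": 100, "datastore_latency": 100, "db_query_time": 100},
    "VMware problem unrelated to the database":
        {"host_cpu": 0, "datastore_latency": 100, "db_query_time": 100},
    "database queries slow":
        {"host_cpu": 100, "datastore_latency": 100, "db_query_time": 0},
}

for importance in (11, 0):
    tree = build_tree(importance)
    for label, kpis in scenarios.items():
        score = service_score("Database", tree, kpis)
        print(f"importance={importance:>2}  {label:<42} Database={score}")
```

In this toy model, cascading with importance 11 gives the Database nearly the same degraded score for an unrelated VMware problem as for a genuine database problem, so the score cannot distinguish the two. With importance 0 and a meaningful local KPI, the Database stays healthy in the first case and turns critical in the second.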


Next steps

Now that you have this information, review service health score cascading configurations in existing service trees to understand current behavior. Then, adjust them as needed.
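
One way to approach that review is to list every dependency whose importance is greater than zero and to flag parents that have no KPIs of their own. The Python sketch below assumes you have already exported your service definitions into a simple list of dictionaries. The field names (`title`, `kpis`, `depends_on`, `child`, `importance`) are a hypothetical, simplified shape chosen for this example and are not the ITSI service schema, and the sample data is made up.

```python
def find_cascading_links(services):
    """Return (parent, child, importance, parent_has_local_kpis) for each
    dependency with importance > 0, strongest cascade first."""
    links = []
    for service in services:
        has_local_kpis = len(service.get("kpis", [])) > 0
        for dep in service.get("depends_on", []):
            if dep["importance"] > 0:
                links.append((service["title"], dep["child"],
                              dep["importance"], has_local_kpis))
    return sorted(links, key=lambda link: -link[2])


# Made-up sample data in the hypothetical simplified shape described above.
services = [
    {"title": "Cascading Client", "kpis": [],
     "depends_on": [{"child": "Cascading Database", "importance": 11}]},
    {"title": "Cascading Database", "kpis": ["CPU utilization"],
     "depends_on": []},
]

for parent, child, importance, has_kpis in find_cascading_links(services):
    note = "" if has_kpis else " (parent has no local KPIs)"
    print(f"{parent} <- {child}: importance {importance}{note}")
```

Links with high importance and parents without local KPIs are reasonable first candidates for review, because those parents' scores reflect only the health of the services below them.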

This content comes from the Splunk .Conf presentation, The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results. Part one was presented at .Conf23 and part two at .Conf24. In the session replays, you can watch Jason Riley and Jeff Wiedemann share the best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.
