Splunk Lantern

Understanding anomaly detection in ITSI

 

Anomaly detection can be a powerful tool, and many users want to leverage it. However, its detection algorithms are widely misunderstood, so it helps to clarify their use cases and how to deploy them effectively.

This article is part of the Definitive Guide to Best Practices for IT Service Intelligence. ITSI administrators and end users will benefit from adopting this practice as they work on Service Insights.

Solution

Entity cohesion algorithm

The entity cohesion algorithm detects when similarly performing or grouped entities have different data patterns, as shown in the following diagram.

[Diagram: entity cohesion detection flags an entity whose data pattern diverges from similarly performing entities]

A common misconception is that this algorithm relies on magnitude. For example, if one server runs at 80 percent disk utilization while the other servers in the KPI run at 50 percent, users assume an anomaly will be detected. This is not true. In this case, an entity cohesion anomaly would be detected only if that pattern of one entity at 80 percent and three at 50 percent changed.
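The point about magnitude can be made concrete with a small sketch. This is illustrative code, not ITSI's actual implementation: it only shows how centering each entity's series on its window median (the trend step described below) removes constant magnitude differences, so a steady 80 percent and a steady 50 percent look identical, while a change in pattern survives.

```python
import statistics

def center(series):
    """Subtract the window median from each value (illustrative trend removal)."""
    med = statistics.median(series)
    return [v - med for v in series]

steady_high = [80, 80, 80, 80]  # one server steady at 80% disk utilization
steady_low = [50, 50, 50, 50]   # peer servers steady at 50%
spiking = [50, 50, 50, 90]      # a peer whose pattern changes

# Both steady series center to all zeros: the 30-point magnitude gap
# alone produces no difference, so no cohesion anomaly.
# The pattern change in the spiking series survives centering.
high_centered = center(steady_high)
low_centered = center(steady_low)
spike_centered = center(spiking)
```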

Requirements

  • Entity cohesion operates on a key performance indicator (KPI).
  • The KPI needs to be split by entity.
  • There need to be at least four entities for the entity cohesion algorithm to function.

Detection

The algorithm normalizes the data for each entity in the KPI.

  • The median value over the time window is calculated.
  • A noise value is calculated for each entity by analyzing the difference between the median in the window and the actual values.
  • The normalized value is calculated from the trend (the window median) and the noise, and this normalized value is what is used to calculate anomalies.
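The normalization steps above can be sketched as follows. This is an assumed reading of the description, not ITSI source code: the trend is taken as the window median, the noise as a typical absolute deviation from that median, and each value is normalized by removing the trend and dividing by the noise.

```python
import statistics

def normalize(series):
    """Normalize a per-entity series per the steps above (assumed, illustrative):
    trend = window median; noise = median absolute deviation from the trend;
    normalized value = (value - trend) / noise."""
    trend = statistics.median(series)
    deviations = [abs(v - trend) for v in series]
    noise = statistics.median(deviations) or 1.0  # guard against zero noise
    return [(v - trend) / noise for v in series]

# A value that breaks the entity's usual pattern stands out after
# normalization, regardless of the series' absolute magnitude.
scores = normalize([10, 12, 11, 30])
```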

Trending algorithm

The trending algorithm detects unusual trending patterns in KPI value changes, even if those values are all in the healthy range for the KPI, as shown in the following diagram.

[Diagram: trending detection flags unusual trending patterns in KPI value changes, even within the healthy range]

A common misconception is that the trending algorithm triggers an anomaly when a value falls outside the normal range for a KPI. That is not true; that behavior describes thresholding, not trending anomaly detection.

Requirements

  • The trending algorithm operates on a key performance indicator (KPI).
  • Trending works at the aggregate KPI level, not on a per-entity basis.
  • If there is an entity split on the KPI, that does not affect trending anomaly detection.

Detection

  • Trending algorithm anomaly detection looks for changes in values, not the values themselves.
    • The algorithm normalizes the data for the KPI.
      • The median value over the time window is calculated.
      • A noise value is calculated for the KPI by analyzing the difference between the median in the window and the actual values.
      • The normalized value is calculated from the trend and noise, and that is what is used to calculate anomalies.
  • The Canberra Distance method is used to calculate the distance between normalized data points.
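The Canberra distance mentioned above is a standard metric: for two series it sums, point by point, the absolute difference divided by the sum of absolute values, so each term is scale-weighted. Here is a minimal self-contained version (how ITSI applies it internally, such as which pairs of normalized windows it compares, is not specified here).

```python
def canberra(a, b):
    """Canberra distance between two equal-length sequences:
    sum of |x - y| / (|x| + |y|) over paired points.
    Pairs where both values are zero contribute nothing."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total

# Identical series are at distance zero; diverging series accumulate
# a per-point penalty between 0 and 1, so larger distances suggest
# a change in the normalized pattern.
same = canberra([1, 2, 3], [1, 2, 3])
different = canberra([0, 1], [1, 0])
```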


Next steps

In summary, keep in mind the following points.

  • General behavior
    • Anomalies do not influence a service score or KPI.
    • Detection in ITSI is dependent on changes in patterns, not raw values.
    • Notable events are created but must be acted on via NEAP and appropriate policies.
  • Entity cohesion
    • Entity cohesion does not detect differences in magnitude.
    • At least four entities with similar patterns are needed.
    • Entity cohesion requires that the KPI include an entity split.
  • Trending
    • Trending does not detect values outside of normal behavior.
    • Trending detection works on the aggregate and not entity level.
    • An entity split for the KPI is not a factor.

This content comes from the Splunk .conf presentation The Definitive List of Best Practices for Splunk® IT Service Intelligence: How to Configure, Administer, and Use ITSI for Optimal Results, with part one presented at .conf23 and part two at .conf24. In the session replays, you can watch Jason Riley and Jeff Wiedemann share the many awesome best practices they've amassed for designing key performance indicators (KPIs), services, episodes, and machine learning to maximize end-user experience and insights. Whether you're new or experienced, you'll come away with tactical guidance you can use right away.

You might also be interested in the following Splunk resources: