You are swamped with alerts in your ITSI instance. Even though you use event aggregation policies and can group alerts by well-understood factors, you still need help identifying the alerts or groups of alerts that are unusual compared to what you normally see.
This article augments the Content Pack for Monitoring and Alerting by helping you identify the “unknown unknowns” in your alert storms and determine whether your alerts are particularly unusual. We are going to cover two approaches, by looking at:
- The different source type combinations against each service
- The different alert descriptions against each service (alert description analysis)
Recognize unusual source type combinations
An unusual source type combination can indicate activity you want to investigate. The following search collects the set of source types seen in the tracked alerts index in each five-minute window for each service, then counts how often each combination occurs and totals those counts per service. The ratio of a combination's count to the service total indicates how likely that source type combination is for the service.
index=itsi_tracked_alerts | bin _time span=5m | stats values(orig_sourcetype) AS sourcetypes BY _time service_name | eval sourcetypes=mvjoin(mvsort(sourcetypes),"|") | stats count by sourcetypes service_name | eventstats sum(count) AS total BY service_name
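The results of the search list each source type combination for each service, along with the number of five-minute windows in which it was seen (count) and the total for that service (total).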
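The search stops at the per-service total, so the ratio described earlier still has to be calculated. A minimal sketch of that last step follows. The field name likelihood and the 0.05 cutoff are illustrative assumptions, not values from the content pack, and should be tuned to your own alert volumes.

```spl
index=itsi_tracked_alerts
| bin _time span=5m
| stats values(orig_sourcetype) AS sourcetypes BY _time service_name
| eval sourcetypes=mvjoin(mvsort(sourcetypes),"|")
| stats count BY sourcetypes service_name
| eventstats sum(count) AS total BY service_name
| eval likelihood=round(count/total, 4)
| where likelihood < 0.05
| sort service_name likelihood
```

The where clause keeps only the combinations that account for a small share of the five-minute windows for their service, so the rarest combinations appear first for each service. Removing the where clause shows the full distribution instead.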
Recognize unusual alert descriptions
Here we’re going to use the Smart Ticket Insights App for Splunk to determine the likely alert descriptions in our data. Note that this app is not Splunk supported.
Start with a search that returns the alert descriptions, IDs, and services from your correlation search index.
index="itsi_tracked_alerts" | table event_id, service_name, description | eval type="everything"
Then, select the following fields from the dropdown to get a report on the data.
- Incident ID Field: event_id
- Category Field: type
- Subcategory Field: service_name
- Description Field: description
After the report panels have populated, select a single threshold from the dropdown and click identify frequently occurring types of tickets. On the next dashboard, you might want to modify some of the selections depending on the groups that are identified.
After you have checked over the groups, save the model and move on to the Manage Smart Groups dashboard. Selecting a group gives you an Open in Search button. Click it to run a search that provides information about the expected alert descriptions for each service that has correlation searches running against it. You might want to make a few changes to the provided search, such as carrying the _time field throughout the search or replacing the last few lines to calculate some statistics.
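The search that the app generates is not reproduced here, so the following is only a sketch of the kind of time-aware statistics you might calculate. It works directly on the raw description field in the tracked alerts index rather than on the app's smart groups, and the field names description_total, service_total, likelihood, rarest_likelihood, and alerts are illustrative assumptions.

```spl
index=itsi_tracked_alerts
| bin _time span=5m
| stats count BY _time service_name description
| eventstats sum(count) AS description_total BY service_name description
| eventstats sum(count) AS service_total BY service_name
| eval likelihood=round(description_total/service_total, 4)
| stats min(likelihood) AS rarest_likelihood sum(count) AS alerts BY _time service_name
```

For each service and five-minute window, this returns the number of alerts and the likelihood of the rarest description seen in that window. Raw descriptions often contain variable text such as host names, which makes many of them look rare. If the generated search exposes a grouping field, substituting it for description is likely to give more meaningful results.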
In summary, we have shown how statistical techniques can be applied to non-numeric data to see whether alert descriptions and source type combinations appear unusual. By combining a few different techniques, we have been able to find event storms that appear to correlate with service degradation, hopefully guiding you toward the alerts that really matter. You can also apply some unsupervised machine learning to your event data to see what hidden patterns exist in your correlation search results.
Note that these results only tell you about unusual volumes of alerts. They don’t have much context about what has made up those alerts.
The content in this article comes from a previously published blog, one of the thousands of Splunk resources available to help users succeed. These additional Splunk resources might help you understand and implement the recommendations in this article: