Summary indexing techniques for efficient KPI searches in busy ITSI environments
As your Splunk IT Service Intelligence (ITSI) deployment evolves and manages more services over time, ensuring that ITSI runs as efficiently as possible becomes increasingly important.
This article shows you three summary indexing techniques that ensure your KPI searches run over as small a dataset as possible to get the information they require. These techniques are especially useful when two ITSI base searches have been created to search over the same data but with different analysis periods (for example, KPI 1 analyzes the last 15 minutes of data and KPI 2 analyzes the last 24 hours of data).
Overall, the techniques covered here can be useful in situations where ITSI is being asked to run large historical searches over large datasets.
Process
In this example, you'll track the ingestion of your local Splunk instance by source type. To achieve this you need to create two KPIs:
- A count of events for each source type in the _internal index every 5 minutes
- A count of the total number of events received into the _internal index over the last 24 hours
To do this, you could create two KPI base searches - one to look back every 5 minutes and take a count by source type to feed KPI 1, and a second to look back over 24 hours to feed KPI 2. But in a large environment, searching over the last 24 hours of the _internal index will take a long time. All the information you need for a 24 hour count is already collected by the 5 minute base search. You just need to sum the counts collected by that search to get the 24 hour count.
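To put the saving in perspective, a 24 hour window contains 288 five-minute intervals, so summing 288 pre-computed counts is far cheaper than re-reading every raw event with a naive search like this (shown only to illustrate the approach you are avoiding):
index=_internal earliest=-24h | stats count AS daily_event_count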
There are three techniques you can use to achieve this.
- Method 1 - Search the itsi_summary index. Use the built-in itsi_summary index to take a count of all the 5 minute counts it has collected via one KPI base search that is searching over 5 minute periods. You can read more about this index in Splunk Docs.
- Method 2 - Search the itsi_summary_metrics index. Use the built-in itsi_summary_metrics index (introduced in ITSI 4.6.0) to take a count of all the 5 minute counts it has collected via one KPI base search searching over 5 minute periods. Searching over this metrics index introduces even more savings than searching the summary index due to the small size of the metrics data and the ability to search the tsidx files directly using the mstats command. You can read more about this index in Splunk Docs.
- Method 3 - Search a custom metrics index. Push the output of the KPI base search to a metrics index that KPI 2 will use to ascertain a 24 hour count. This might be useful in a very large, busy ITSI environment where you don't want to search over the itsi_summary or itsi_summary_metrics index. To do this, you need to create metrics from log searches; these metrics can then be searched much more quickly than traditional log searches, using the mstats command.
Monitoring design and configuration
Whichever method you choose to use, you'll need to do three things first.
- Create an entity search that creates an entity for each source type, so you can see the event count per source type in ITSI. This also means you can set per-entity thresholds to drive alerting, if needed.
- Create a KPI base search that searches over the last 5 minutes of the _internal index and returns a count of the number of events per source type.
- Create a service containing entities and the count per source type KPI.
After you've completed these steps, you can move on to using methods 1, 2 or 3 to find out the 24 hour count.
Create an entity search
Under Entity management > Create entity > Import from search, use a simple entity import search to create the source types:
index=_internal | stats count by sourcetype | eval custom_entity_type="splunk_sourcetype", splunk_sourcetype=sourcetype
Extra configuration is applied at creation time using the "Aliases" and "Info Fields" settings. Each source type found has an info field called "custom_entity_type", which is used to filter the entities into the service, and a splunk_sourcetype alias.
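For example, after the import runs, the entity for the splunkd source type would look roughly like this (an illustrative sketch; your source types and values will vary):
Title: splunkd
Aliases: splunk_sourcetype = splunkd
Info Fields: custom_entity_type = splunk_sourcetype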
Create a KPI base search to count events over the last 5 minutes of the _internal index
- Use this search to create your first KPI:
index=_internal | stats count BY sourcetype
- For the Split by Entity field, select Yes.
- For the Filter to Entities in Service field, select Yes.
- Create one metric, named EventCount, which is the count of events.
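For reference, running the base search over a small lab instance returns one row per source type, along these lines (counts are purely illustrative):
sourcetype          count
splunkd             15234
splunkd_ui_access    1876
scheduler             412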
Create a service that contains entities and the count per source type KPI
- Create a service named “Splunk - Ingestion” with entity rules that use the filter created at entity import time: match entities whose custom_entity_type info field equals splunk_sourcetype.
- Select the base search you created above and the EventCount metric.
- Set the KPI schedule to run every 5 minutes, searching over the last 5 minutes of data.
The resulting service tracks the event count per source type for the last 5 minutes, every 5 minutes.
Now, you'll need to continue configuration by choosing method 1, 2, or 3 to learn the count of events for the last 24 hours.
Method 1 - Search the itsi_summary index to find out the 24 hour count
Create a new KPI with the following configuration.
- Select Ad hoc Search and use a search similar to the one below. Specify the KPI name you previously created to collect 5 minute count data, and ignore the events associated with the entity_title “service_aggregate” as these will skew your results.
index=itsi_summary kpi="Splunk - Ingestion - Sourcetype Event Count" entity_title=* entity_title!=service_aggregate | stats sum(alert_value) AS daily_event_count
- Leave the defaults as-is on the “Entities” screen. You don't need to split this KPI by entity.
- Configure this search to run every 15 minutes, searching back over the last 24 hours.
You will now see your KPI populated with values from the internal itsi_summary index, saving you the need to create another base search to search over the raw indexed data just to get a simple count.
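If you want to spot-check the figure this KPI produces, you can run a one-off raw count over the same window and compare it with daily_event_count (an ad hoc check only; scheduling this would defeat the purpose of the technique):
index=_internal earliest=-24h | stats count AS raw_event_count
Expect small differences, because the KPI's 5 minute buckets won't align exactly with the moment you run the check.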
Method 2 - Search the itsi_summary_metrics index to find out the 24 hour count
- To view the ITSI metric and dimension data, use the following search. The metric you are looking for is alert_value, and its dimensions include entity_title, kpi_base_search, and itsi_kpi_id, which the next search filters on:
| mcatalog values(_dims) WHERE index=itsi_summary_metrics AND metric_name=* BY metric_name
- Create a new KPI. Select Ad-hoc Search. Use the following search:
| mstats sum(_value) AS value WHERE index=itsi_summary_metrics AND metric_name=alert_value AND entity_title!=service_aggregate AND kpi_base_search=6085592a194ba56eeb52e486 span=5m BY entity_title, kpi_base_search, itsi_kpi_id | stats sum(value) AS daily_event_count
Note the use of the internal reference for the base search.
To find this ID, run the following command:
| inputlookup service_kpi_sbs_lookup
| mvexpand kpis.base_search
| search kpis.base_search!=*itsi_summary*
| dedup kpis.base_search
| mvexpand kpis.base_search_id
| dedup kpis.base_search_id
- On the "Entities" screen, leave the defaults as-is. You don't need to split this KPI by entity.
- Configure this search to run every 15 minutes, searching back over the last 24 hours.
- You do not need to add any filtering, so accept the presented configuration to complete.
You should now see a new KPI with matching values in the service analyzer.
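As a quick sanity check, you can run the same mstats search split by entity only and confirm the 5 minute values match what the service analyzer shows per source type (substitute your own base search ID; the one shown is from this example):
| mstats sum(_value) AS value WHERE index=itsi_summary_metrics AND metric_name=alert_value AND entity_title!=service_aggregate AND kpi_base_search=6085592a194ba56eeb52e486 span=5m BY entity_title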
Method 3 - Search a custom metrics index to find out the 24 hour count
In a busy ITSI environment, searching over the built-in ITSI summary and metrics indexes might be difficult, as they can get too large.
Method 3 still makes use of a KPI base search like the other methods, but also pushes the metrics into a custom metrics index at the same time. The KPI results will still be housed in the ITSI indexes, but moving the metrics into your own indexes can help segregate data and improve performance as your requirements grow. Having disparate, smaller metrics indexes helps search performance and allows you to focus specific KPIs on specific indexes.
- Create a metrics index to house the metrics. The indexes.conf configuration should look like this (the frozenTimePeriodInSecs value of 2629746 seconds retains roughly one month of data):
[kpi_metrics_5m]
coldPath = $SPLUNK_DB/kpi_metrics_5m/colddb
enableDataIntegrityControl = 0
enableTsidxReduction = 0
homePath = $SPLUNK_DB/kpi_metrics_5m/db
maxTotalDataSizeMB = 512000
thawedPath = $SPLUNK_DB/kpi_metrics_5m/thaweddb
repFactor = auto
datatype = metric
frozenTimePeriodInSecs = 2629746
- Similar to the KPI base search above, set up a KPI base search that gathers a count of events every 5 minutes for each source type, using this search:
index=_internal | stats count BY sourcetype | rename count AS metric_name:event_count, sourcetype AS splunk_sourcetype | mcollect index=kpi_metrics_5m
- The results are piped to the mcollect command and an index is specified. This pushes the metrics created into your custom metrics index.
- The field for the metric name is called metric_name:event_count. You must use this format so the metrics index can successfully index the payload.
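Once the base search has run at least once, you can confirm the metrics are landing in the new index with a quick catalog check, mirroring the mcatalog search used in method 2:
| mcatalog values(_dims) WHERE index=kpi_metrics_5m AND metric_name=event_count BY metric_name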
- Now that you have your base search, create your KPIs. In the KPI creation screen, create the first KPI using the base search you have just created. This step is identical to the creation of the first KPI base search in the other sections of this document.
- Now you have created your first KPI and the base search is running, you can use the statistics it is gathering into your custom metrics index to drive the second KPI to get a 24 hour count of events in the _internal index. To do this, create a new KPI using the following search, and select Ad hoc Search:
| mstats sum(_value) AS event_count WHERE index=kpi_metrics_5m AND metric_name=event_count
- You do not need to split or filter by entities, so leave those options as-is. Set the search schedule to search back over the last 24 hours.
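Because the base search stores splunk_sourcetype as a dimension, the same metric data can also produce the 24 hour figure broken down by source type if you ever need it (a variant sketch, not required for this KPI):
| mstats sum(_value) AS event_count WHERE index=kpi_metrics_5m AND metric_name=event_count BY splunk_sourcetype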
You should now see a new KPI with matching values in the service analyzer.
Note that there can be a small discrepancy in the values of the mcollect-based KPI due to the different internal scheduling of the searches.
Next steps
Thinking about the most efficient way to drive KPIs in a large ITSI environment becomes ever more important as the ITSI workload increases.
Some good questions to ask yourself when approaching KPI onboarding in environments like this are:
- Do you really need to create another base search?
- Can you amend an existing base search to cover a new requirement from the same dataset?
- Can you re-use ITSI summarized data?
- Does the data already exist in summarized form?
- Can you summarize the data prior to running a base search or KPI search over it?
- Does the data lend itself to creating metrics? If so, consider creating summarized metric data prior to creating KPIs.