Skip to main content
 
 
 
Splunk Lantern

Reducing Smartstore cache churn with smart Workload Management rules

 

When using Smartstore in the Splunk platform, data indexed is stored in S3 storage, while the local hard drives (or SSD) on the indexers become Smartstore "cache" for the indexed buckets. This is demonstrated in the following architecture diagram:

 

Overview of Smartstore architecture

 

Efficient cache management is crucial when using Smartstore. The effectiveness of the cache manager is optimized when queries use recent data over a short time duration (for example, from 24 hours up to a few weeks, depending on the available cache) and have spatial and temporal locality.

However, users often run queries that deviate from these criteria. For instance, a user might search for the last 180 days of data on an index with high data ingestion. This can lead to a high cache miss rate, potentially causing the eviction of a large number of buckets from other indexes to make space for retrieving the requested data. This impacts not only the performance of that specific query but also any other queries, and even the performance of data ingestion.

The following chart shows a comparison of cache churn versus search time range. The higher the search time range, the higher the number of cache misses.

Cache churn vs Search time range. The higher the Search time range, the higher the number of Cache misses.

Solution

To address this challenge proactively, Splunk administrators can leverage a recently released feature in Workload Management (WLM). Starting with Splunk Cloud Platform 9.0.2305, administrators can define WLM rules under admission or workload rules using the search_time_duration predicate. 

search_time_duration>24h AND role=novice_user → do not run the query

This rule filters queries written by users in the novice_user role that search for more than a 24-hour time window. It allows administrators to manage queries scanning large data sets across multiple buckets.

You can also tie this search to indexes to boost its effectiveness. Different indexes receive varying daily data volumes, so each one must scan different numbers of buckets over the same search time duration. For example, an index receiving firewall logs will have a significantly higher volume than an index with application build logs.

index=security_events AND search_time_duration>48h AND role!=security_user → do not run the query

In this example, the security_events index, a high-volume index, experiences significant cache churn when searching for more than 48 hours of data. The rule ensures that queries from non-security users searching beyond this duration are not executed.

If your system experiences cache churn issues, utilizing WLM rules tied to indexes provides a proactive approach to reduce risk and maintain a healthy Splunk instance.

Next steps

These resources might help you understand and implement this guidance:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at ondemand@splunk.com if you would like assistance.

  • Written by Troy Wollenslegel and Karthikeyan Sabhanatarajan
  • Engineering Team Members at Splunk