Sampling data with ingest actions for data reduction
As a Splunk admin, there are many reasons you might not want to index all the data sent to your Splunk instance. Common reasons include saving costs (on storage or ingest licensing) or reducing the noise of certain log sources (to make search-time exploration a bit easier). With ingest actions (in Splunk Enterprise or Splunk Cloud Platform), you can set up sampling through a UI that supports both creating the sampling logic and deploying those changes so they take immediate effect at whatever tier you want to sample. You want to learn how to implement a few different sampling strategies available with ingest actions.
Solutions
A direct sampling strategy
In the simplest case, you might want to index 10% of your events. This would reduce ingest volume by 90%, which could be quite a large cost saving.
With the Filter using Eval Expression rule, you can take a 10% sample of data with this eval expression:

(random() % 10) > 0
There are two things happening in this expression:
- random() % 10 generates a random number between 0 and 9.
- > 0 returns false if the number is 0, and true otherwise.
Because this is the Drop Events Matching Eval Expression setting, the matching check has to be inverted to get a 10% sample. We want to drop 90% of the data, so the expression must match whenever the number is NOT 0. If we had written = 0 instead, the rule would drop the 10% and keep 90% of the data.
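To make the inversion concrete, here is a short simulation in plain Python (not Splunk code; the function name is illustrative). It assumes Splunk's eval random() returns a non-negative pseudo-random integer, which it then reduces modulo 10 exactly as the expression above does:

```python
import random

def keep_event(rng: random.Random) -> bool:
    """Simulate the drop expression (random() % 10) > 0.

    The expression matches ~90% of events, and matching events are
    dropped, so an event survives only when the remainder is 0.
    """
    value = rng.randrange(0, 2**31) % 10  # reduce a random integer to 0-9
    drop = value > 0                      # the eval expression's result
    return not drop                       # keep only when it is false

rng = random.Random(0)
trials = 100_000
kept = sum(keep_event(rng) for _ in range(trials))
print(f"kept {kept / trials:.1%} of events")  # roughly 10%
```

Running this confirms the intuition: about one event in ten survives the filter, and the other nine are dropped.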
What this ruleset does:
- Calculates a 10% filter and drops 90% of the events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the filter - the Final Destination rule.
Expanded sampling strategies
The above example is great in its simplicity; however, it is very rarely sufficient as a general sampling strategy. You could have compliance requirements for all data to be stored, you could have keywords that you always want indexed, or you might need to meet both of these requirements and others at the same time. With ingest actions, you can meet these requirements by adding additional rules.
Store all data in S3, index a sample
As an example of how to implement the requirement to store all data but still save on indexing costs, we can write all data to S3 but only index a sample. This can be accomplished by adding the Route to Destination rule to the sampling strategy from before.
- Add a Route to Destination rule to the ruleset before the filter.
- Set the condition to None because you want to send everything to S3.
- Set the Immediately send to option to a bucket you want to receive these events, under the S3 heading.
- Toggle the Clone events and apply more rules option.
That process creates a rule that sends all data to your configured S3 bucket while also keeping the data so it can be processed by the remaining rules. The prior Filter using Eval Expression rule is unchanged: it still filters out 90% of the data, which is exactly what you want for this example.
What this ruleset does:
- Sends all data to the configured S3 bucket - the Route to Destination rule.
- Calculates a 10% filter and drops 90% of the events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the filter - the Final Destination rule.
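The cloning behavior is the key detail here, and it can be pictured with a brief sketch (again plain Python, not Splunk code; the function and event names are made up for illustration):

```python
import random

def process(events, rng):
    """Sketch of the ruleset: archive everything, index a 10% sample."""
    s3_archive, indexed = [], []
    for event in events:
        # Route to Destination (condition None, clone enabled): every
        # event goes to S3, and a copy continues through later rules.
        s3_archive.append(event)
        # Filter using Eval Expression: only ~10% of the copies survive.
        if rng.randrange(10) == 0:
            indexed.append(event)  # Final Destination
    return s3_archive, indexed

events = [f"event {i}" for i in range(1_000)]
s3, idx = process(events, random.Random(1))
print(len(s3), len(idx))  # all 1000 archived, roughly 100 indexed
```

Without the clone toggle, events sent to S3 would stop there and nothing would reach the index.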
Always index certain kinds of data, sample the rest
Sampling all data is great in its simplicity, but it is admittedly a blunt method. Realistically, there are certain kinds of data whose indexing (and subsequent detection in monitoring) you can never leave to chance. For example, let's set up the ingest actions ruleset to always index events containing the keyword error in any case pattern, for example Error, ERROR, or eRroR.
- Start with a Route to Destination rule with a Regex condition of: (?i)error
- Set the Immediately send to option to Default Destination. This should already be pre-filled, but it is under the SPLUNK heading in the drop-down menu for this field.
- Leave the Clone events and apply more rules option not toggled.
- Keep the original Filter using Eval Expression sample rule in place.
What this ruleset does:
- Checks whether an event matches the case-insensitive regex (?i)error. If it does, indexes the event - the Route to Destination rule.
- For the events that don't contain some variant of error, calculates a 10% filter and drops 90% of those events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the above filter - the Final Destination rule.
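Because the Route to Destination rule runs before the sample filter, an error event is never subject to the coin flip. This Python sketch (illustrative only, not Splunk code) shows that precedence, using Python's re.IGNORECASE as the stand-in for the (?i) flag:

```python
import random
import re

ERROR_RE = re.compile(r"error", re.IGNORECASE)  # same intent as (?i)error

def route(events, rng):
    """Sketch: always index error events, sample the rest at 10%."""
    indexed = []
    for event in events:
        if ERROR_RE.search(event):
            indexed.append(event)  # Route to Destination (Regex condition)
        elif rng.randrange(10) == 0:
            indexed.append(event)  # survived the sample filter
    return indexed

events = ["ERROR: disk full", "eRroR in module", "routine heartbeat"] * 100
idx = route(events, random.Random(2))
print(len(idx))  # all 200 error events, plus roughly 10 sampled heartbeats
```

Every case variant of error is indexed deterministically, while the routine events still arrive at the expected 10% rate.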
Store all data in S3, index all of certain kinds of data, sample the rest
Combining all of the above, we can use the following rules to make a ruleset that stores all data in S3, always indexes certain kinds of data, and then indexes a sample of the remaining data.
- A Route to Destination rule with the following configuration:
  - Set the condition to None.
  - Set the Immediately send to option to a bucket you want to receive these events, under the S3 heading.
  - Toggle the Clone events and apply more rules option.
- A Route to Destination rule with Regex toggled and this regular expression: (?i)error
- A Filter using Eval Expression rule with Drop Events Matching Eval Expression set to: (random() % 10) > 0
- The Final Destination rule that is inherent to all IA rulesets. There is no action for you to take for this.
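Putting the pieces together, the combined ruleset can be sketched end to end in plain Python (illustrative only; the function and event names are made up, and the real routing happens inside Splunk):

```python
import random
import re

ERROR_RE = re.compile(r"error", re.IGNORECASE)  # stand-in for (?i)error

def combined_ruleset(events, rng):
    """Sketch of the combined ruleset, applied in rule order."""
    s3_archive, indexed = [], []
    for event in events:
        s3_archive.append(event)       # rule 1: clone everything to S3
        if ERROR_RE.search(event):
            indexed.append(event)      # rule 2: always index error events
        elif rng.randrange(10) == 0:
            indexed.append(event)      # rule 3: 10% sample of the rest
        # rule 4: Final Destination receives whatever reached `indexed`
    return s3_archive, indexed

events = ["ERROR: timeout", "healthy response"] * 500
s3, idx = combined_ruleset(events, random.Random(3))
print(len(s3), len(idx))
```

Every event lands in the S3 archive, every error event lands in the index, and only about a tenth of the remaining events consume indexing capacity.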
Next steps
The content in this guide is just one of the thousands of Splunk resources available to help users succeed. These additional resources might help you understand ingest actions and implement data reduction strategies: