Sampling data with ingest actions for data reduction
As a Splunk admin, there are many reasons you might not want to index all the data sent to your Splunk instance. Common reasons include saving costs (on storage or ingest licensing) or reducing the noise of certain log sources (to make search-time exploration a bit easier). With ingest actions (in Splunk Enterprise or Splunk Cloud Platform), you can set up sampling through a UI that supports both creating the sampling logic and deploying those changes so they take immediate effect at whatever tier you want to sample. You want to learn how to implement a few different sampling strategies available with ingest actions.
Solutions
A direct sampling strategy
In the simplest case, you might want to index 10% of your events. This would reduce ingest volume by 90%, which could be quite a large cost saving.
With the Filter using Eval Expression rule, you can take a 10% sample of data with this eval expression:

(random() % 10) > 0
There are two things happening in this expression:
- random() % 10 generates a random number between 0 and 9.
- > 0 returns false if the number is 0, and true otherwise.
Because this is the Drop Events Matching Eval Expression setting, the matching check has to be inverted to get a 10% sample. We want to drop 90% of the data, so the expression must match whenever the number is NOT 0. If we had written = 0 instead, the rule would drop the 10% and keep 90% of the data.
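To make the inversion concrete, here is a short simulation in plain Python (not Splunk code; the function name is illustrative). It assumes Splunk's eval random() returns a non-negative pseudo-random integer, which it then reduces modulo 10 exactly as the expression above does:

```python
import random

def keep_event(rng: random.Random) -> bool:
    """Simulate the drop expression (random() % 10) > 0.

    The expression matches ~90% of events, and matching events are
    dropped, so an event survives only when the remainder is 0.
    """
    value = rng.randrange(0, 2**31) % 10  # reduce a random integer to 0-9
    drop = value > 0                      # the eval expression's result
    return not drop                       # keep only when it is false

rng = random.Random(0)
trials = 100_000
kept = sum(keep_event(rng) for _ in range(trials))
print(f"kept {kept / trials:.1%} of events")  # roughly 10%
```

Running this confirms the intuition: about one event in ten survives the filter, and the other nine are dropped.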
What this ruleset does:
- Calculates a 10% filter and drops 90% of the events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the filter - the Final Destination rule.
Expanded sampling strategies
The above example is great in its simplicity; however, it is very rarely sufficient as a general sampling strategy. You could have compliance requirements for all data to be stored, you could have keywords that you always want indexed, or you might need to meet both of these requirements and others at the same time. With ingest actions, you can meet these requirements by adding additional rules.
Store all data in S3, index a sample
As an example of how to implement the requirement to store all data but still save on indexing costs, we can write all data to S3 but only index a sample. This can be accomplished by adding the Route to Destination rule to the sampling strategy from before.
- Add a Route to Destination rule to the ruleset before the filter.
- Set the condition to None because you want to send everything to S3.
- Set the Immediately send to option to a bucket you want to receive these events, under the S3 heading.
- Toggle the Clone events and apply more rules option.
That process creates a rule that sends all data to your configured S3 bucket while also keeping the data so it can be processed by the remaining rules. The prior Filter using Eval Expression rule is unchanged: it still filters out 90% of the data, which is exactly what you want for this example.
What this ruleset does:
- Sends all data to the configured S3 bucket - the Route to Destination rule.
- Calculates a 10% filter and drops 90% of the events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the filter - the Final Destination rule.
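The cloning behavior is the key detail here, and it can be pictured with a brief sketch (again plain Python, not Splunk code; the function and event names are made up for illustration):

```python
import random

def process(events, rng):
    """Sketch of the ruleset: archive everything, index a 10% sample."""
    s3_archive, indexed = [], []
    for event in events:
        # Route to Destination (condition None, clone enabled): every
        # event goes to S3, and a copy continues through later rules.
        s3_archive.append(event)
        # Filter using Eval Expression: only ~10% of the copies survive.
        if rng.randrange(10) == 0:
            indexed.append(event)  # Final Destination
    return s3_archive, indexed

events = [f"event {i}" for i in range(1_000)]
s3, idx = process(events, random.Random(1))
print(len(s3), len(idx))  # all 1000 archived, roughly 100 indexed
```

Without the clone toggle, events sent to S3 would stop there and nothing would reach the index.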
Always index certain kinds of data, sample the rest
Sampling all data is great in its simplicity, but it is admittedly a blunt method. Realistically, there are certain kinds of data whose indexing (and subsequent detection in monitoring) you can never leave to chance. For example, let's set up the ingest actions ruleset to always index events containing the keyword error in any case pattern, for example Error, ERROR, or eRroR.
- Start with a Route to Destination rule with a Regex condition of: (?i)error
- Set the Immediately send to option to Default Destination. This should already be pre-filled, but it is under the SPLUNK heading in the drop-down menu for this field.
- Leave the Clone events and apply more rules option not toggled.
- Keep the original Filter using Eval Expression sample rule in place.
What this ruleset does:
- Checks whether an event matches the case-insensitive regex (?i)error. If it does, indexes the event - the Route to Destination rule.
- For the events that don't contain some variant of error, calculates a 10% filter and drops 90% of those events - the Filter using Eval Expression rule.
- Indexes the remaining 10% that passed through the above filter - the Final Destination rule.
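Because the Route to Destination rule runs before the sample filter, an error event is never subject to the coin flip. This Python sketch (illustrative only, not Splunk code) shows that precedence, using Python's re.IGNORECASE as the stand-in for the (?i) flag:

```python
import random
import re

ERROR_RE = re.compile(r"error", re.IGNORECASE)  # same intent as (?i)error

def route(events, rng):
    """Sketch: always index error events, sample the rest at 10%."""
    indexed = []
    for event in events:
        if ERROR_RE.search(event):
            indexed.append(event)  # Route to Destination (Regex condition)
        elif rng.randrange(10) == 0:
            indexed.append(event)  # survived the sample filter
    return indexed

events = ["ERROR: disk full", "eRroR in module", "routine heartbeat"] * 100
idx = route(events, random.Random(2))
print(len(idx))  # all 200 error events, plus roughly 10 sampled heartbeats
```

Every case variant of error is indexed deterministically, while the routine events still arrive at the expected 10% rate.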
Store all data in S3, index all of certain kinds of data, sample the rest
Combining all of the above, we can use the following rules to make a ruleset that stores all data in S3, always indexes certain kinds of data, and then indexes a sample of the remaining data.
- A Route to Destination rule with the following configuration:
  - Set the condition to None.
  - Set the Immediately send to option to a bucket you want to receive these events, under the S3 heading.
  - Toggle the Clone events and apply more rules option.
- A Route to Destination rule with Regex toggled and this regular expression: (?i)error
- A Filter using Eval Expression rule with Drop Events Matching Eval Expression set to: (random() % 10) > 0
- The Final Destination rule that is inherent to all IA rulesets. There is no action for you to take for this.
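Putting the pieces together, the combined ruleset can be sketched end to end in plain Python (illustrative only; the function and event names are made up, and the real routing happens inside Splunk):

```python
import random
import re

ERROR_RE = re.compile(r"error", re.IGNORECASE)  # stand-in for (?i)error

def combined_ruleset(events, rng):
    """Sketch of the combined ruleset, applied in rule order."""
    s3_archive, indexed = [], []
    for event in events:
        s3_archive.append(event)       # rule 1: clone everything to S3
        if ERROR_RE.search(event):
            indexed.append(event)      # rule 2: always index error events
        elif rng.randrange(10) == 0:
            indexed.append(event)      # rule 3: 10% sample of the rest
        # rule 4: Final Destination receives whatever reached `indexed`
    return s3_archive, indexed

events = ["ERROR: timeout", "healthy response"] * 500
s3, idx = combined_ruleset(events, random.Random(3))
print(len(s3), len(idx))
```

Every event lands in the S3 archive, every error event lands in the index, and only about a tenth of the remaining events consume indexing capacity.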
Next steps
The content in this guide is just one of the thousands of Splunk resources available to help users succeed. These additional resources might help you understand ingest actions and implement data reduction strategies: