
Optimizing data model acceleration for better performance

 

If you’ve been using Splunk Enterprise Security for a long time, you’ve probably experienced plenty of staff turnover. As people leave, they take context with them, such as why a scheduled search that runs every hour looking for foo=bar was ever a good idea. New users then make changes without that original context, and sooner or later the system starts to slow down or behave in ways you don’t expect:

  • Missed security detections
  • Skipped scheduled searches
  • Incomplete search results
  • Data models that never finish accelerating
  • Slow user interface
  • High CPU, memory, and SVC usage

The easy answer might seem to be buying more hardware or purchasing more Splunk Virtual Compute. At first, the extra capacity might look like it’s helping, but it is most likely masking the underlying issues and making them worse. Eventually, adding more compute won’t help at all.

A better first step is to optimize the system to perform better data model acceleration by following these guidelines.

Prerequisites

The following background information will be helpful before continuing with this article:

Solutions

While optimizing accelerated data model performance might seem daunting, you can solve most issues, including misconfigured data model acceleration jobs, leftover artifacts from old apps and searches, and poor data normalization. At a high level, the process is:

  1. Let Splunk run more scheduled searches.
  2. Configure data model acceleration.
    1. Turn off unused data model acceleration.
    2. Configure cim_<datamodel>_indexes macros to only include relevant indexes.
    3. Set the backfill time to 4-12 hours to limit the number of index buckets scanned.
    4. Ensure that data model acceleration (DMA) configurations are the default, or that there’s a good reason they’re not.
  3. Remove unused content from your environment.
    1. Remove legacy searches and unused apps.
    2. Refine the DMA optimized search string.
      • Examine the optimized search used to populate a data model.
      • Disable event types and tags from being set on out-of-scope data sources.
      • Add source types to the cim_<datamodel>_indexes macros to refine results.
  4. Review search time normalization.
    1. Profile time to query each data source.
    2. Review how each data source contributes to data model acceleration time and optimize its underlying normalization.
    3. Test normalization using a private source type renaming.
  5. Configure search earliest time correctly to ensure that all in-scope events are searched.

Let the Splunk platform run more scheduled searches

Splunk Enterprise Security (ES) is a batch-heavy job processor that runs a lot of scheduled searches to detect security events, match threat intelligence, build lookup tables, and accelerate data models. Considering the number of detection searches most environments run, and that each accelerated data model can initiate more than one job at the same time, it is important to increase the number of scheduled searches that an ES search head can handle concurrently.

The Splunk platform determines the maximum number of concurrent searches it can run based on the search head core count. By default, 50% of those searches can be scheduled jobs, and 50% of the scheduled jobs can accelerate data models. For example, a 24-core search head can run 30 total searches by default; 15 of those can be scheduled, and 7 of those 15 can be data model acceleration jobs.
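
To see where those numbers come from, here is a minimal sketch of the limits.conf settings involved. The values shown are the shipping defaults, so verify them against your own deployment before relying on the math.

# limits.conf defaults (verify against your version)
[search]
# baseline number of searches added to the per-core allowance
base_max_searches = 6
# concurrent searches allowed per CPU core
max_searches_per_cpu = 1

# Example 24-core search head: (1 x 24) + 6 = 30 concurrent searches.
# 50% of 30 = 15 scheduled searches; 50% of 15 = 7 DMA searches.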

Change the system defaults to allow more scheduled jobs to run

In the Splunk platform, go to Settings > Server settings > Search preferences. Here you will adjust two settings.

  • Relative concurrency limit for scheduled searches. Increase this to 75%. This is the maximum number of searches the scheduler can run, expressed as a percentage of the maximum number of concurrent searches. With this change, our example 24-core search head can run 22 scheduled searches and 11 data model acceleration searches.
  • Relative concurrency limit for summarization searches. Increase this to between 75% and 100%. This is the maximum number of scheduled searches that can be used for auto summarization, expressed as a percentage of the maximum concurrent searches that the scheduler can run. If you set this to 100% (a best practice), all 22 scheduled searches in our example can be used for data model acceleration.
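
For reference, these UI settings correspond roughly to the following limits.conf values on the search head. This is a sketch only; making the change through the Search preferences page (or your normal configuration management process) is the safer path.

# limits.conf (search head)
[scheduler]
# scheduled searches can use up to 75% of total search concurrency
max_searches_perc = 75
# all scheduler slots can be used for summarization (DMA) searches
auto_summary_perc = 100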

Configure data model acceleration

Turn off unused data models, limit the data model search to specific indexes, and set backfill time to a reasonable period between 4 and 12 hours.

Turn off unused data model acceleration

If a data model isn’t needed, turn off acceleration. Most CIM data models are accelerated by default, and they all consume resources. DMA is built by scheduled searches that run every 5 minutes, scanning all indexes for tagged events over the backfill period.

In the Splunk platform, go to Settings > Data inputs > Data Model Acceleration Enforcement Settings. If acceleration isn’t needed for a data model, select Disable in the Status column. For example, if you have no use cases for the Network Sessions data model, disable its acceleration there.

Configure indexes searched by the data model

Configure data models to only search indexes that contain relevant events. By default, data models search for events in all indexes (index=* OR index=_*). Use the ES CIM Configuration tool to configure the cim_<datamodel>_indexes macros to limit searches by index. For finer control, manually edit the macros in the Splunk_SA_CIM application to include additional filters, such as source type.
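
As a hypothetical example, a locally overridden macro for the Network Traffic data model might look like the following. The macro name follows the real cim_<datamodel>_indexes pattern, but the index names are placeholders.

# macros.conf (for example, in Splunk_SA_CIM/local)
[cim_Network_Traffic_indexes]
definition = (index=netfw OR index=netproxy)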

Set the backfill time to between 4 and 12 hours

Backfill time tells the Splunk platform how far to look back for modified buckets of indexed data. The shorter it is, the fewer buckets your deployment has to scan for tags. Why scan through thousands of buckets when a hundred will do?

New events, no matter their timestamp (past, current, or future), are written to hot buckets when they are ingested. Under optimal conditions, a backfill time of 10 minutes could work, but setting it to between 4 and 12 hours adds resiliency to the underlying build process.

Longer backfill times (12+ hours) are helpful to prepopulate an accelerated data model with historical events. Most security use cases care about data going forward, so 4 hours is often appropriate. If you need to prepopulate a DMA, start with a longer initial backfill time, expect delays, and reduce the backfill to 4 hours as soon as the DMA is built.

To review the number of buckets that are queried during an accelerated data model search, run | dbinspect against the relevant indexes for the past 4 hours versus 24 hours versus 7 days.
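
For example, a search along these lines (the index names are placeholders) counts the buckets in scope. Run it with the time range picker set to the last 4 hours, then 24 hours, then 7 days, and compare the results.

| dbinspect index=netfw index=netproxy
| stats dc(bucketId) AS buckets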

Verify default data model configurations

Data models are defined in JSON files, and data model acceleration settings are stored in configuration files. Any default setting that has been changed locally is not updated during upgrades and might now be configured incorrectly; this is how the Splunk platform treats all locally modified configurations.

These parameters can have a significant impact on data model acceleration search performance, so verify that they are set correctly:

  • Accelerate until maximum time: Unchecked
  • Max summarization search time: 3600 (seconds)
  • Max concurrent summarization searches: 4
  • Manual rebuilds: Unchecked
  • Schedule priority: Highest
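
One quick way to confirm whether any of these acceleration settings have been overridden locally is btool, which shows the file each value comes from. For example, on a *nix search head:

$SPLUNK_HOME/bin/splunk btool datamodels list Network_Traffic --debug | grep acceleration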

Remove unused content from your environment

Remove legacy searches and unused apps

The Splunk platform is built for speed and flexibility by a variety of users and can easily become cluttered. Improve system performance, reduce the size of the knowledge bundle, and speed up DMA searches by finding and removing unused apps (which are filled with knowledge objects), terminating unused scheduled searches, and removing unused lookup tables. In one Splunk environment, Global Services reduced the optimized search string that the Splunk platform uses to populate the Network Traffic data model from 75,000 characters to under 25,000 by removing old apps and cleaning up unused configuration elements. The smaller search was also much easier to read.

  1. Identify searches that are run by users who no longer work for the company or that no longer serve a clear purpose, and retire them. A sample search for listing scheduled searches and their owners follows this list.
  2. If you aren’t using an app, disable or remove it. Every Splunk app has configuration files and parameters that are associated to it. These configuration artifacts can affect system performance, especially if they contain tags that are used by data models.
  3. Look for large unused lookup files. These typically get replicated to the indexers and consume resources. You can use the | bundlefiles command that is provided by the Admins Little Helper for Splunk app.
  4. Bonus: Move all production artifacts to be owned by a common service account. This simplifies knowing what’s real and expected versus what is someone’s experiment.
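
To build a starting inventory for step 1, a search like the following lists scheduled searches along with their owners and apps, which makes orphaned or abandoned content easier to spot:

| rest splunk_server=local /servicesNS/-/-/saved/searches
| search is_scheduled=1
| table title, eai:acl.owner, eai:acl.app, cron_schedule, next_scheduled_time
| sort eai:acl.owner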

Refine the DMA optimized search string

A data model's populating search contains the details required to construct the tags that locate matching events. Tags are the final stage of search-time normalization and might be present in source types that the data models don't require. Removing unnecessary event types improves DMA performance.

Examine the optimized search used to populate a data model: Run the following search, substituting the data model type as needed.

|datamodel Network_Traffic search

When the search completes, look at the Search Job Inspector > Job Details Dashboard view and locate the underlying optimized search string. Look for source types that you don't expect and look for opportunities to disable them or to refine the configuration.

Disable unused event types and tags: Determine what event types are setting tags and turn off the ones that aren’t needed. Tags are the most expensive component of data normalization, and they are used to populate data models.

  1. Determine what tags are needed for a data model. For example, Network_Traffic uses both tag=network and tag=communicate.
  2. Run a search to find all event types that set the tag you are reviewing. The following sample search uses tag.network=* to find event types that set the network tag.
    | rest splunk_server=local /services/saved/fvtags 
    | search tag.network=* 
    | table eai:acl.app, title, tags 
    | search title=*eventtype* 
    | rex field=title mode=sed "s/eventtype=//" 
    | join type=left title 
        [| rest splunk_server=local /services/saved/eventtypes 
        | table title, search ]
    | eval tags=mvjoin(tags,", ")
    
  3. If an event type isn’t needed, disable it, as shown in the sketch that follows this list.
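
As a sketch of what disabling an event type can look like in configuration (the event type name here is hypothetical), a local eventtypes.conf override does the job; the same change can be made in Settings > Event types.

# eventtypes.conf (local)
[legacy_vendor_network_events]
disabled = 1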

Add source types to the cim_<datamodel>_indexes macros: The ES CIM configuration wizard in Splunk Enterprise Security simplifies limiting a data model's search by index name. By editing the cim_<datamodel>_indexes macros directly, you can add criteria that limit the search to a specific source type or to a value of some other field.
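
Extending the earlier macro example, a manually edited macro might constrain each index to the source types that actually feed the data model. The index and source type values shown here are placeholders.

# macros.conf (Splunk_SA_CIM/local)
[cim_Network_Traffic_indexes]
definition = (index=netfw sourcetype=vendor:firewall) OR (index=netproxy sourcetype=vendor:proxy)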

Review search time normalization

Determine how each data source contributes to data model acceleration performance, and review the underlying data normalization configuration for ways to improve efficiency. In one extreme case, we found four DMA searches running bad regex extractions while trying to accelerate 1 million received events per second.

Step 1: Profile time to query each data source that is contained in a data model

  1. Identify all indexes and source types contained in a data model, using a search like the following:
    | tstats summariesonly=t count FROM datamodel=Network_Traffic BY index sourcetype
  2. Determine how long it takes to query each index and source type separately. Use a sample period of 5 minutes or longer and run a verbose query.
  3. Using the Search Job Inspector, record how long a search took and the number of events reviewed. For example:
    index=security sourcetype=firewall tag=network tag=communicate

Step 2: Perform a deep dive review of normalization for each source type

  1. Run the search (from Step 1) against one data source.
  2. Review the optimized search, and question its contents.
  3. Review the search logs for warnings and errors.
  4. Review the configuration used to perform search-time field extraction.
    • EXTRACT – Inline regular expressions defined in props.conf. Bad regular expressions slow searches and burn cash fast, especially when run billions of times per day.
    • REPORT – Another way to run regular expressions and perform field extractions, with the extractions stored in transforms.conf.
    • LOOKUP – Great for enrichment, such as enriching assets (src, dest, dvc) and identities (user, src_user) or providing friendly names for error codes.
    • EVAL – Multivalue commands can do magic, can be nested, can be resource intensive, and can even contain a bad regex in a replace or match function.
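
As orientation, the following props.conf sketch shows what each of these configuration types typically looks like. The source type, class names, lookup name, and fields are all hypothetical.

# props.conf
[firewall]
# EXTRACT: inline regular expression with a named capture group
EXTRACT-action = action=(?<action>\w+)
# REPORT: points to a regex-based extraction defined in transforms.conf
REPORT-fw_fields = firewall_field_extractions
# LOOKUP: enriches events from a lookup table
LOOKUP-vendor_action = vendor_action_lookup vendor_action OUTPUTNEW action
# EVAL: calculated field
EVAL-dest = coalesce(dest_ip, dest_host)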

An easy way to review the configuration that is used to normalize events is with the Config Quest app. It's a lightweight utility that can help you search and review configurations on any Splunk server directly from your search head.

Test normalization using a private source type renaming

When reviewing normalization performance, consider the resources required to return each field. Run a search in verbose mode using the fields command to limit the fields returned and watch the difference in the Search Job Inspector.

While troubleshooting slow searches, start by returning groups of fields with wildcards, such as a* and b*, and refine to the individual field level when you locate a potential bottleneck. This narrows down which area of the underlying configuration needs review.
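
For example (the index and source type are placeholders), run a search like this in verbose mode, then widen or narrow the field list and compare the timings in the Search Job Inspector:

index=security sourcetype=firewall
| fields a* b*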

Test data normalization in prod: The Splunk platform has features that let you safely test data normalization in production. By cloning a source type and using private source type renaming, you can search for and make changes to your new or renamed source type without affecting existing operations.

From a testing app (which can be Search and Reporting):

  1. Clone, or create, a source type to work on. Example: sourcetype=firewall was cloned to firewall-test.
  2. Create a private configuration to rename the source type. Example: Rename sourcetype firewall to firewall-test.
  3. Search for the new source type from your testing app. Example: index=* sourcetype=firewall-test.
  4. Make any configuration change you want to the firewall-test source type, and it will only be seen by you.
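
As a sketch of step 2, the rename can be expressed in a user-level props.conf within your testing app (the source type names are the hypothetical ones from the example above); the clone-and-rename workflow in Splunk Web accomplishes the same thing.

# props.conf (user-level, testing app)
[firewall]
rename = firewall-test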

Ensure that searches have earliest time set correctly

Now that your data models are optimized and should be accelerating all the events you need, make sure that your searches are configured to use those accelerated events.

Set the earliest time for a search long enough to account for the time it takes for event ingestion, data model acceleration lag, and a buffer for maintenance periods, event spikes, and late arriving events.

Try setting the earliest time to data model acceleration backfill + 10 minutes. That way, the DMA backfill becomes your SLO for event availability.
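
For example, assuming a 1-hour backfill on the Network Traffic data model, a scheduled detection search might set earliest to -70m@m (backfill plus a 10 minute buffer) so it only reads events the DMA has had time to summarize:

| tstats summariesonly=t count FROM datamodel=Network_Traffic WHERE earliest=-70m@m latest=now BY All_Traffic.dest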

Security detection or alert searches usually run frequently over overlapping windows of time. To prevent duplicate findings or alerts, use throttling or suppression and include _time as a field if you need to alert on each occurrence of a finding.

How can you verify that all events are really in the data model, even if it says "fully accelerated"? Run a brute force search to compare accelerated data model events with the candidate events. A search like this one, run over small windows of time, shows a count of events in the accelerated summaries versus all candidate events that the data model matches.

| tstats summariesonly=f count FROM datamodel=Network_Traffic WHERE earliest=-60m@m latest=-55m@m BY _time span=1m  
| eval series="indexed" 
| append 
    [| tstats summariesonly=t count FROM datamodel=Network_Traffic WHERE earliest=-60m@m latest=-55m@m BY _time span=1m 
    | eval series="accelerated" ] 
| timechart span=1m max(count) BY series 
| eval GAP=indexed-accelerated 
| eval ago=now()-_time

Next steps

Improving performance for data model acceleration should be the start of your ongoing performance improvement journey. DMA is one of the most resource intensive parts of a Splunk Enterprise Security deployment, and improving its performance usually provides returns that outweigh the effort.

Hopefully this article gave you the tools to get more value out of your Splunk deployment. The same techniques can be applied across all your Splunk searches, and there is always something that can be improved. Recursively follow this advice, and if you look long enough, you will eventually find the issue that is right in front of you but masked by years of technical debt.

The following additional resources might help you implement the guidance in this article: