Performance tuning the search head tier
This article explores various options for optimizing the search head tier in the Splunk platform. Enhancing this tier is crucial for efficient search performance and overall system stability.
There are a number of settings you can tune on the search head tier. This article shares tips for ensuring optimal performance in each of the following areas:
- Challenges with the search head tier
- Search optimization
- Scheduler optimization
- Search concurrency
- Knowledge object backups
- Knowledge object management
- Orphaned scheduled searches
- Lookup files
- KV Store collections
- Lookup definitions
- Macros
- Calculated fields
- Field aliases
- Field extractions
- Field transformations
- Detecting scheduled searches accessing non-existent indexes
- Frequently updated lookups
- Tracking Splunkbase app updates
- Configuration settings
- Disk recommendations for the search head tier
- Linux-specific options
- Controlling search head cluster (SHC) restarts
This article is one of three that explore performance improvements for the indexing tier, forwarding tier, and search head tier. Check out the other articles to gain insights into optimizing each tier.
Challenges with the search head tier
The search head tier is where you have the most control over searches run on the indexing tier. At this level, you can dictate how many concurrent searches to run in parallel on the indexing tier, along with managing the user base's daily activities.
Several challenges exist with the search head tier:
- The UI can slow down as the number of knowledge objects increases. A search head with 8,000 saved searches responds more slowly to both REST requests and UI browsing than one with fewer than 1,000 saved searches. See this Splunk Idea (login required) for more information.
- This scaling issue also applies to search heads with a large number of dashboards (also raised as a Splunk Idea).
- Saved searches are easy to "set and forget"; they could remain active for extended periods, long after the data they search becomes irrelevant.
While new Splunk platform versions have improved scalability, these challenges persist. Solutions are documented later in this article under the Knowledge object management section.
The knowledge object bundle replicated to the indexers can also present issues concerning both total size and frequency of replication. Solutions to address these are discussed later in the article.
Search optimization
Optimizing searches involves understanding the Search Job Inspector, as detailed in Splunk Docs and this blog post. Ideally, users tune searches before scheduling them; in reality, the administration team might need to impose guardrails to prevent performance issues. Workload management (WLM) is useful for this, particularly WLM admission rules, which can prevent a search from running at all. You can also limit the total search runtime using the srchMaxTime setting in authorize.conf.
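As an illustrative sketch, srchMaxTime is applied per role in authorize.conf; the role name and one-hour limit below are assumptions, not recommendations:

```
# authorize.conf -- hypothetical role stanza; srchMaxTime caps how long
# a search launched by users holding this role is allowed to run
[role_heavy_search_users]
srchMaxTime = 1h
```

A role-based limit like this applies the guardrail once, without editing individual searches.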
Some resources to help educate users to tune their searches include:
- Splunk Community site and Splunk Community Slack
- Splunk4Champions2 workshop
- Power of SPL
- Splunk SPL examples
- The Sideview UI app, which provides a method to assess search expense, aiding in identifying heavy queries.
- The Alerts for Splunk Admins app, which includes “SearchHeadLevel — Maximum memory utilization per search” to detect memory usage on a per-search basis.
- Dashboards like “troubleshooting_resource_usage_per_user” and “troubleshooting_indexer_cpu” can help find searches that use the most CPU, memory, or disk I/O.
- Finally, if you would like a more interactive “fun” session, consider the SPLing bee app.
If you, as a user or administrator, are interested in analyzing and managing Splunk indexes, you can take advantage of features and resources provided by Splunk, such as:
- The _introspection index records information about searches if they run long enough to be sampled by the introspection process (approximately 10 seconds or longer).
- The newer search_telemetry source type records performance information (CPU, memory, and I/O) for searches outside the sampling windows, with a few exceptions (subsearches, for example).
- The _audit index records general information about search runtime, total bucket downloads in a SmartStore environment, and time spent downloading buckets (for example, “SearchHeadLevel — SmartStore cache misses — combined”).
- Maximizing Splunk Core: Analyzing Splunk searches using audittrail and native Splunk telemetry on GitHub is a good resource.
- Splunk app for Redundant or Inefficient Search Spotting is a promising app for improving search head performance.
Scheduler optimization
Scheduling searches to a specific hour is simple but can result in a busy scheduler. The allow_skew feature in the Splunk platform helps distribute search load more evenly across time, as detailed in the Splunk Docs page Offset scheduled search start times, which covers this topic as it relates to the savedsearches.conf config file. A skew of 5 or 10 minutes is a reasonable starting point.
Adjusting acceleration.allow_skew in datamodels.conf can also help to balance the scheduler.
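A minimal sketch of both skew settings; the saved search and data model names are hypothetical:

```
# savedsearches.conf -- let the scheduler start this search up to
# 10 minutes after its nominal schedule to smooth scheduler load
[My hourly report]
allow_skew = 10m

# datamodels.conf -- skew the data model acceleration searches similarly
[My_Data_Model]
acceleration.allow_skew = 10m
```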
There are also dashboards that can assist in improving a busy scheduler.
Search concurrency
The limits.conf file provides settings for the number of searches that can run simultaneously:
- base_max_searches
- max_searches_per_cpu
- total_search_concurrency_limit
While increasing these settings can improve concurrency, it might not enhance the user experience, because the indexing tier handles the vast majority of the search workload.
Increasing concurrent searches can eventually slow down individual search runtimes, so it's important to calculate both search head and indexing tier CPU usage and test thoroughly before implementing changes.
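For reference, the historical search concurrency limit is computed as max_searches_per_cpu multiplied by the number of CPU cores, plus base_max_searches, unless capped by total_search_concurrency_limit. A hedged limits.conf sketch with illustrative values only:

```
# limits.conf -- illustrative values, not recommendations
[search]
# baseline number of concurrent historical searches (default 6)
base_max_searches = 6
# extra searches allowed per CPU core (default 1)
max_searches_per_cpu = 1
# hard cap for the instance; overrides the computed limit when set
total_search_concurrency_limit = 100
```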
Knowledge object backups
Backup strategies are essential before automating object deletion. Resources that detail best practices around this include:
Knowledge object management
Over time, knowledge objects might be created and abandoned. To address this, Alerts for Splunk Admins (also available on GitHub) includes saved searches that track object usage. Although this approach is effective for most knowledge objects, scheduled reports and alerts require special attention, as they seem to be "always in use", making it difficult to determine whether an alert is still active.
Criteria in the app for confirming whether a scheduled report is no longer useful and a candidate for disabling or deletion include:
- Searching indexes that no longer exist.
- Orphaned status due to the report author's departure.
For other reports, an annual report review process runs in the app with the following criteria:
- Reports or alerts scheduled and not updated in the past 6 months.
- A lookup tracks whether the user or a colleague has indicated the report is still needed.
- A dashboard shows the logged-in user the list of reports they own and need to validate.
- Individuals can validate reports on behalf of others.
- A “lookup check” report compares a backup copy of the lookup with the latest update to prevent accidental overrides.
- Three reminders are sent 30 days apart, and after 100 days, the report is disabled from scheduling but not removed, allowing another chance to reschedule if needed.
Tracking for non-scheduled reports and dashboards is more straightforward. Reports in the app help find nearly all usage of saved searches via:
- “SearchHeadLevel — platform_stats access summary”
- “SearchHeadLevel — Search Queries summary loadjob and savedsearch usage in audit logs”
If you summary index this data, you can query over long periods, such as nine months, to determine which dashboards are in use using queries like:
| rest splunk_server=local timeout=900 /servicesNS/-/-/data/ui/views f=eai:acl* search=eai:acl.removable=1 | rename eai:acl.app AS app, eai:acl.sharing AS sharing | stats latest(updated) AS updated by app, sharing, title
For saved searches, the query is similar but uses a different endpoint:
| rest splunk_server=local timeout=900 /servicesNS/-/-/saved/searches f=eai:acl* search=eai:acl.removable=1 | rename eai:acl.app AS app, eai:acl.sharing AS sharing | stats latest(updated) AS updated by app, sharing, title
Orphaned scheduled searches
You can use this search to find orphaned scheduled searches:
| rest /servicesNS/-/-/saved/searches count=0 search="disabled=0" search="is_scheduled=1" f=next_scheduled_time splunk_server=local f=title f=eai:* | search next_scheduled_time="" | table author, eai:acl.owner, eai:acl.app, title, next_scheduled_time, id | rename eai:acl.app AS app, eai:acl.owner AS owner
Finding other orphaned knowledge objects can be more complicated as lookups replicate to the indexing tier by default and are more difficult to track. “Automatic lookups” do not appear in the logs so they are excluded from any deletion operations in Alerts for Splunk Admins.
Lookup files
Before automating deletion processes for lookup files, it's essential to establish a robust backup system, as lookups are not recorded in any log files. Consider using a script to download lookups via the Splunk app for lookup file editing REST endpoints or performing a simple filesystem backup of lookup files. The Splunkconf-backup app on Splunkbase can also assist with this.
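As a hedged sketch of the filesystem backup option (the paths and the helper name backup_lookups are illustrative assumptions):

```shell
# Archive every per-app lookups directory under $SPLUNK_HOME/etc into a
# dated tarball; paths without spaces are assumed.
backup_lookups() {
    splunk_home="$1"
    backup_dir="$2"
    stamp=$(date +%Y%m%d)
    mkdir -p "$backup_dir"
    # collect apps/*/lookups directories relative to $splunk_home/etc
    dirs=$(cd "$splunk_home/etc" && ls -d apps/*/lookups 2>/dev/null)
    [ -n "$dirs" ] || return 1
    tar -czf "$backup_dir/lookups-$stamp.tar.gz" -C "$splunk_home/etc" $dirs
}
```

For example, `backup_lookups /opt/splunk /var/backups/splunk-lookups` produces one dated archive; schedule it via cron and add your own retention policy.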
If you're interested in tracking lookup size or details over time, the TA-Alerts for SplunkAdmins app on Splunkbase includes a modular input called lookup_watcher. This feature measures lookup sizes on a per-lookup basis and works in search head clusters.
One option to manage lookup files is excluding larger lookups from the knowledge bundle using the excludeReplicatedLookupSize setting in distsearch.conf. While this approach can be effective, it has limitations: users might encounter errors if the lookup file is absent from the indexing tier. To mitigate this, it might be necessary to include local=true in search queries.
Alternatively, you can move the lookup into the KV Store and avoid the replicate=true setting (false by default), which excludes the lookup from the knowledge bundle.
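A minimal sketch of a KV Store-backed lookup that stays out of the knowledge bundle; the collection name and field list are hypothetical:

```
# collections.conf -- replicate defaults to false; shown here for clarity
[my_asset_collection]
replicate = false

# transforms.conf -- lookup definition backed by that collection
[my_asset_lookup]
external_type = kvstore
collection = my_asset_collection
fields_list = _key, host, owner
```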
The Alerts for Splunk Admins search “SearchHeadLevel — Detect lookups that have not been accessed for a period of time” relies on three other searches. When summary indexed, these searches can help detect unused lookups:
- IndexerLevel — RemoteSearches — lookup usage
- SearchHeadLevel — audit.log — lookup usage
- SearchHeadLevel — Lookup Editor lookup updates
Additional reports in the app ensure that lookups relied upon by saved searches, dashboards, or regularly updated files are not deleted:
- SearchHeadLevel — Lookups within dashboards
- SearchHeadLevel — Lookups within saved searches
- SearchHeadLevel — Lookup Watcher Recent Modification Summary
Removing excess lookup files can reduce the knowledge object bundle size and the number of knowledge objects on the search head. The app uses a schedule where zero access for nine months is followed by three months of notice before actually removing the file. Any access to the lookup resets the deletion clock, and an “exclusion” list is maintained for lookups that should not be removed.
KV Store collections
The Alerts for Splunk Admins report “SearchHeadLevel — User created kvstore collections” can be used to find the KV Store collections that might be in scope. You can then check these against the results of “SearchHeadLevel — access logs kvstore usage”, along with previous lookup related checks.
Lookup definitions
Removing lookup files and KV Store collections might result in broken lookup definitions which are no longer useful. The Alerts for Splunk Admins saved search “SearchHeadLevel — Lookup definitions with no lookup file or kvstore collection” handles this.
Macros
The Alerts for Splunk Admins report “SearchHeadLevel — macros in use” can be utilized to track the usage of macros via the audit.log file. You can run a query such as:
| rest /servicesNS/-/-/data/macros count=0 timeout=900 splunk_server=local search=eai:acl.removable=1 | rename title as name, eai:acl.app as app, eai:acl.owner as owner, eai:acl.removable as is_removable, eai:acl.sharing AS sharing
This query identifies macros not originating from the default directory within a search head cluster, which can then be compared to summary indexed data to determine if they are candidates for removal.
Calculated fields
For calculated fields, you can check if the source type is in use by utilizing summary indexed data from the Alerts for Splunk Admins report “SearchHeadLevel — Sourcetypes usage from search telemetry data”:
| rest /servicesNS/-/-/data/props/calcfields count=0 timeout=900 splunk_server=local search=eai:acl.removable=1 f=stanza f=eai:* | rename title as Name, eai:acl.app as App, eai:acl.owner as Owner, eai:acl.removable as is_removable, stanza AS sourcetype | eval sourcetype=trim(sourcetype) (combined with) | search NOT [| metadata type=sourcetypes index=* | table sourcetype ]
The metadata query serves as additional safety since telemetry data might miss some searches, and should exclude any source type that remains in use.
Field aliases
Detection searches for source type usage remain the same as in the Calculated fields section. This search can be employed to find field aliases potentially suitable for removal:
| rest /servicesNS/-/-/data/props/fieldaliases count=0 timeout=900 splunk_server=local search=eai:acl.removable=1 f=stanza f=eai:* | rename title AS Name, eai:acl.app AS App, eai:acl.owner AS Owner, eai:acl.removable AS is_removable, stanza AS sourcetype, eai:acl.sharing AS Sharing | eval sourcetype=trim(sourcetype) (combined with) | search NOT [| metadata type=sourcetypes index=* | table sourcetype ]
Field extractions
Similar to the Calculated fields section, this search can identify field extractions that are candidates for removal:
| rest /servicesNS/-/-/data/props/extractions count=0 timeout=900 splunk_server=local search=eai:acl.removable=1 f=stanza f=eai:* | rename title as Name, eai:acl.app as App, eai:acl.owner as Owner, eai:acl.removable as is_removable, stanza AS sourcetype, eai:acl.sharing AS Sharing | eval sourcetype=trim(sourcetype) (combined with) | search NOT [| metadata type=sourcetypes index=* | table sourcetype ]
Field transformations
Field transformations can also utilize the query mentioned in the Calculated fields section. You must additionally check if any field extractions reference the transform in question:
| rest /servicesNS/-/-/data/transforms/extractions count=0 timeout=900 splunk_server=local search="eai:acl.removable"=1 f=stanza f=eai:* | rename title AS name, eai:acl.app AS app, eai:acl.owner AS owner, eai:acl.removable AS is_removable, eai:acl.sharing AS sharing ``` exclude source:: as it's hard to determine usage ``` ``` exclude any field transforms actively in use by extractions, this originally included sharing level but I later excluded it to reduce the chance of missing an edge case ``` | search NOT [| rest /servicesNS/-/-/data/props/extractions splunk_server=local search=type!="Inline" ``` search=eai:acl.removable=1 ``` count=0 f=value f=eai:* | rename eai:acl.app AS app, eai:acl.owner AS owner, eai:acl.sharing AS sharing, value AS name | stats count BY app, owner, name | eval name=split(name,",") | mvexpand name | fields - count ]
Detecting scheduled searches accessing non-existent indexes
The Alerts for Splunk Admins search “SearchHeadLevel — indexes per savedsearch” can be used to determine indexes used on a per-search basis. Compare the result against a REST call:
| rest /services/data/indexes-extended count=0 f=title datatype=all | stats values(title) AS index
If all the indexes in question no longer exist, it might be safe to disable the scheduled search.
Frequently updated lookups
A common challenge for the knowledge bundle is the frequency of new knowledge bundles sent to the indexing tier. When a lookup within the knowledge bundle is regularly updated, a “delta” bundle is transmitted to the indexing tier, creating new copies of the bundle with required updates for every indexer.
Frequent updates can lead to performance challenges and difficulties in maintaining synchronization between the "common bundle" and the indexing tier.
You can use the monitoring console to determine how often bundles are pushed and how long they take to “apply”. Additionally, in Alerts for Splunk Admins, reports like “IndexerLevel — Knowledge bundle upload stats” and “SearchHeadLevel — Knowledge bundle replication times metrics.log” provide further insights.
If you are interested in the contents of the knowledge bundle itself, the search “SearchHeadLevel — Knowledge Bundle contents” provides this information, although note that it also requires the Admin's little helper app from Splunkbase.
For detecting lookup replication, the report “SearchHeadLevel — Lookup updates within SHC” exists, along with the more generic “SearchHeadLevel — SHC conf log summary” to detect general updates within the search head cluster.
A common solution is moving the lookup file to a KV Store collection, which by default does not replicate to the indexing tier (replicate=false). However, this might incur a performance cost for some queries, so it's not always the best option.
If the issue is the size of the lookup rather than replication frequency, you can use a gzip-compressed lookup to reduce size.
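A hedged sketch of compressing a lookup (the path and helper name compress_lookup are illustrative; the lookup definition must then reference the .csv.gz file):

```shell
# Compress a CSV lookup to .csv.gz; -k keeps the original for rollback,
# -f overwrites any stale .gz left by a previous run
compress_lookup() {
    lookup="$1"
    [ -f "$lookup" ] || return 1
    gzip -kf "$lookup"
}
```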
Tracking Splunkbase app updates
The Admin Ninja App combined with the Admin Ninja TA works well for tracking app versions across different instances. The lookup created by the Admin ninja app can be compared against installed versions on various Splunk instances.
Alternatives include Regularly index Splunkbase app listings or Analysis of SplunkBase apps for Splunk to obtain lists from Splunkbase and compare them to installed app versions.
Configuration settings
You can test setting a higher lookup size limit to avoid the indexing of lookup files. Additionally, you might wish to allocate extra file descriptors to the input process.
limits.conf
# Increase to 1000MB to avoid the indexing of lookups unless we really need to
[lookup]
max_memtable_bytes = 1048576000
# Allow the indexers to read the various on-disk files they now track (such as telemetry)
[inputproc]
max_fd = 4000
In larger Splunk environments, you might wish to disable the fetching of remote_search_log in limits.conf and utilize indexed_realtime mode. Additionally, in Splunk platform versions 8 and above, enable the mcollect setting:
[search]
fetch_remote_search_log = disabled
# the memory tracker can be used to limit large searches from using memory
# enable_memory_tracker = true
# post 8.2.x performance tweak
async_quota_update = true
[realtime]
indexed_realtime_use_by_default = true
[mcollect]
always_use_single_value_output = false
You can also use the enable_memory_tracker setting with a set MB limit on search memory usage; problematic searches have been seen exceeding 50GB of memory per search.
Within limits.conf, the [kvstore] section allows KV Store limit tuning; there is also a max_mem_usage_mb setting adjustable by default or per search command. You can try increasing it to 10,000MB by default.
For slow indexers, consider the slow_peer_disconnect parameter, documented on the Splunk Docs page Handle slow search peers.
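A hedged limits.conf sketch covering both points above; the values are illustrative assumptions, not recommendations:

```
# limits.conf -- illustrative values only
[default]
# max_mem_usage_mb applies to several search commands and can also be
# overridden in individual command stanzas
max_mem_usage_mb = 10000

[slow_peer_disconnect]
# disconnect search peers that respond too slowly during a search
disabled = false
```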
server.conf
In server.conf, many settings can be tuned, for example:
[shclustering]
# recommended to reduce SHC load
conf_replication_include.ui-prefs = false
# We have increased this on busy SHC members
max_peer_rep_load = 140
# I prefer jobs to retry multiple times; with the defaults I've seen failures on concurrency that were fixed 30 seconds later
# refer to https://ideas.splunk.com/ideas/EID-I-7 for further information
remote_job_retry_attempts = 5
# This should be enabled if you have an SHC of 5 or more members
#captain_is_adhoc_searchhead = true
Many SHC level timeouts can be configured; default settings might suffice unless the SHC is large or busy.
distsearch.conf
distsearch.conf includes settings related to the search head to indexing tier connection; slower indexers might require increased timeouts. You can denyList the bin directories of applications by default, allowing only those needing distribution to the indexing tier. These settings are provided in the idea Splunk indexer knowledge bundle — only replicate the bin directory where required.
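A hedged distsearch.conf sketch; the app and entry names are hypothetical. Note that the denylist takes precedence over the allowlist, so only denylist bin directories whose scripts are never needed on the indexers:

```
# distsearch.conf -- hypothetical app name; keep this app's bin directory
# out of the knowledge bundle
[replicationDenylist]
exclude_my_custom_app_bin = apps[/\\]my_custom_app[/\\]bin[/\\]...
```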
Disk recommendations for the search head tier
A high-performance disk for /opt/splunk/var is recommended, as the directory is heavily write-intensive in busy environments, impacting search runtime.
Minimizing the number of search head cluster members is advised, since more members do not necessarily improve performance. While additional indexers are usually beneficial, additional search head cluster members can have the opposite effect.
Linux-specific options
Newer systemd unit files set reasonably high limits. If you are not using systemd, set limits in the Linux limits.conf file (/etc/security/limits.conf).
Other settings to consider adjusting:
LimitCORE=infinity
LimitDATA=infinity
LimitNICE=0
LimitFSIZE=infinity
LimitSIGPENDING=385952
LimitMEMLOCK=65536
LimitRSS=infinity
LimitMSGQUEUE=819200
LimitRTPRIO=0
LimitSTACK=infinity
LimitCPU=infinity
LimitAS=infinity
LimitLOCKS=infinity
LimitNOFILE=1024000
LimitNPROC=512000
Additionally, you can set this on all enterprise instances in sysctl:
kernel.core_pattern = /opt/splunk/%e-%s.core
This ensures core dumps are written by the Splunk process to a writable directory. Use /opt/splunk/var on Kubernetes instances, as the var partition persists to a larger disk, allowing investigation into core dumps of Splunk Enterprise.
Consider disabling transparent huge pages, as recommended in Splunk Docs.
Controlling search head cluster (SHC) restarts
SHC-level restart processes can be controlled by pushing a bundle and delaying the restart, as detailed in the Use the deployer to distribute apps and configuration updates section “Control the restart process”.
You can prevent bundle application if it might cause a restart until a particular change window. The GitHub script "check_if_restart_required.sh" should be effective for most environments.
To minimize unnecessary restarts, it's best to organize configurations into separate applications based on whether they trigger restarts or not. For instance, you could store files like authentication.conf and authorize.conf in an application that doesn't initiate restarts. You can also place files like limits.conf in a different application designed to always trigger rolling restarts.
Configuration files like distsearch.conf are more complicated. Stanzas such as replicationDenyList and replicationAllowList can be reloaded, while other stanzas require restarts. Because of this, you might wish to maintain multiple distsearch.conf files, dividing them into a restart-required version and a reloadable version.