Performance tuning the indexing tier

 

This article explores various tuning options for the indexing tier in the Splunk platform. Optimizing this tier is crucial for efficient data ingestion and overall system performance.

There are a number of settings you can tune on the indexing tier. This article shares tips for ensuring optimal performance in each of the areas covered below.

To determine how many indexers might be required, refer to the Splunk Validated Architectures Topology selection guidance. You might also want to read the Lantern article Planning for infrastructure and resource scalability.

This article is one of three that explore performance improvements for the indexing tier, forwarding tier, and search head tier. Check out the other articles to gain insights into optimizing each tier.

Queue size tuning

You can tune the queueSize setting in inputs.conf for the data receiving queue, and server.conf settings for most other in-memory queues. You can use larger values than those shown in the examples if that suits your needs, although lower values encourage upstream forwarders to switch to less congested indexers when queues begin to fill.

Using very large queues can mask issues and result in longer ingestion times overall.
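
The data receiving queue is tuned in inputs.conf. The following is a minimal sketch that assumes the common splunktcp receiving port 9997; the 10MB value is only an illustrative starting point to test against your own workload.

inputs.conf example

[splunktcp://9997]
# Larger than the default (500KB) to smooth bursts from forwarders
queueSize = 10MB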

server.conf settings

[queue]
maxSize = 4MB

# default 1MB
[queue=aggQueue]
maxSize = 10MB

# default 6MB
[queue=parsingQueue]
maxSize = 10MB

# default not set (500KB?)
[queue=indexQueue]
maxSize = 20MB

Asynchronous load balancing

Asynchronous load balancing can dramatically improve performance on the indexing tier by balancing data more evenly among indexer peers, both in ingestion performance and in replication queue sizes, which can be used as a proxy measure for ingestion health. This improved distribution of data can also enhance search performance.

You can find more information on asynchronous load balancing in Performance tuning the forwarding tier.

Parallel ingestion pipelines

The parallelIngestionPipelines setting is documented for the indexing tier under index parallelization. There are limits to how many pipelines are useful on the indexing tier, and testing has shown that migrating to Kubernetes can improve performance beyond what parallelIngestionPipelines alone can achieve. See the article Understanding how to use the Splunk Operator for Kubernetes for more information.

The Splunk Docs guidance Manage pipeline sets for index parallelization provides further detail on how parallel ingestion pipelines work. Each pipeline has a unique set of hot buckets, affecting per-pipeline tuning settings such as queue sizes.
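
If you decide to enable parallel pipelines, the setting lives in server.conf on each indexer. The value of 2 below is only an illustrative sketch; size it against the spare CPU, memory, and disk I/O headroom on your indexers.

server.conf example

[general]
# Each additional pipeline set maintains its own queues and hot buckets,
# and consumes extra CPU, memory, and disk I/O
parallelIngestionPipelines = 2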

Although the Splunk Operator for Kubernetes can optimize hardware utilization, be aware that Kubernetes is a complex technology and the decision to move into that space should involve careful consideration.

Maximizing indexer disk performance

Maintaining free disk space is crucial for performance. In server.conf, the minFreeSpace setting accepts a percentage value, and combining it with the eviction_padding setting can prevent temporary pauses in searching during SmartStore bucket evictions.

server.conf example

[diskUsage]
minFreeSpace = 5%

[cachemanager]
eviction_padding = 2180170

Benchmarking has shown that ext4 might have lower latency than XFS for production workloads, but conducting your own testing is advised. The article Benchmarking filesystem performance on Linux-based indexers provides insights into this comparison. In that testing and others, latency started to degrade once free disk space dropped below around 20%.

OS and Splunk platform version

Kernel versions across indexers can impact disk write latency, with newer versions tending to correlate with lower latency. Keeping versions up-to-date, for example by using an n-1 version, generally improves performance.

Indexer cluster size

Larger indexer clusters can take longer to recover to a valid and complete state. Running multiple indexer clusters with fewer nodes each is advised, to mitigate issues with high bucket counts and single-threaded components in the cluster manager.

Giving each indexer cluster a unique set of indexes, and minimizing the number of search heads that access those indexes, can enhance performance. Clusters with identical configurations can also distribute workloads effectively.

Advantages of this approach include:

  • When an indexer restarts, only one cluster is affected, so you can reboot multiple indexers in a single maintenance window because each one belongs to a different cluster.
  • Cluster managers recover more quickly from restarts of the cluster manager, and also from indexer restarts, due to the smaller total bucket count per cluster.
  • The impact of a single cluster manager failure is reduced because it only affects a subset of the overall environment.

One disadvantage of this approach is that you have more cluster managers to monitor and maintain, although in practice this might not cause any real issues.

Disabling the KV Store

The KV Store is required on search heads, and on forwarders that use it for checkpoint tracking, but it is generally not needed on indexers. You can disable it in server.conf:

[kvstore]
disabled = true

Cluster manager — server.conf tuning

Several parameters can be tuned on the cluster manager for a moderately sized indexer cluster (for example, 80 indexers). Research each setting before changing it:

[clustering]
max_peers_to_download_bundle = 20
send_timeout = 300
rcv_timeout = 300
cxn_timeout = 300
heartbeat_timeout = 120
restart_timeout = 120
percent_peers_to_restart = 6
heartbeat_period = 10
backup_and_restore_primaries_in_maintenance = True
rolling_restart_condition = up
constrain_singlesite_buckets = false
searchable_rolling_peer_state_delay_interval = 120
localization_based_primary_selection = auto
# You will want these settings to be lower, such as 2 or 3, if you want a slower
# recovery with better performance. I've prioritised restoring the replication/search factor
max_peer_build_load = 20
max_peer_rep_load = 50
#max_peer_sum_rep_load = 2

# throttle the amount of bandwidth used for non-hot (warm/cold) replication
# defaults to 0 or unlimited
#max_nonhot_rep_kBps = 10000

Settings like backup_and_restore_primaries_in_maintenance and localization_based_primary_selection might enhance performance when using SmartStore.

Indexer settings — limits.conf and server.conf

limits.conf

[search]
# this relates to a support case, added for consideration only
max_rawsize_perchunk = 500000000

# also related to a support case of buckets failing to localize in time
bucket_localize_max_timeout_sec = 600

# related to regex issues
idle_process_regex_cache_hiwater = 210000

# Increase to 1000MB so lookup files are not indexed unless they are genuinely large
[lookup]
max_memtable_bytes = 1048576000

# Allow the indexers to read the various on-disk files they now track (such as telemetry)
[inputproc]
max_fd = 4000

# spath is a distributed command; it does not work as expected from the search head if this setting is not also present on the indexers
[spath]
extraction_cutoff = 300000

server.conf

[clustering]
# these can be increased if you are seeing indexer to indexer timeouts
#rep_max_send_timeout = 180
#rep_max_rcv_timeout = 180

[general]
# regex related; the default is 2500
regex_cache_hiwater = 210000

[httpServer]
# allow just under 5GB of bundle to be uploaded
max_content_length = 5000000000
# these timeouts are based on older versions and may no longer be required
streamInWriteTimeout = 30
busyKeepAliveIdleTimeout = 180

Indexer settings — indexes.conf

Here are example default settings for all indexes:

[default]
tsidxWritingLevel = 4
journalCompression = zstd
repFactor = auto

For non-SmartStore setups, review how indexer clusters handle report and data model acceleration summaries; enabling summary_replication might be useful.
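
If you do enable summary replication on a non-SmartStore cluster, it is configured on the cluster manager in server.conf. This is only a sketch; confirm the default and behavior for your Splunk version first.

server.conf example (cluster manager)

[clustering]
# Replicate report and data model acceleration summaries alongside their buckets
summary_replication = true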

On SmartStore clusters, summaries are uploaded to the remote store, so this setting is not required. For details, see How is the replication of summary bucket managed in Splunk Smartstore?

SmartStore issues

Within SmartStore environments, downloads can be tuned:

[cachemanager]
max_concurrent_downloads = <unsigned integer> 

You should test how different download settings (for example, increasing from the default of 8 to 12) affect your system's performance. Heavy SmartStore downloads can max out CPU or block ingestion queues, although these effects are less noticeable in newer Splunk platform versions.
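
For example, a starting point based on the values discussed above might look like the following; 8 is the default, and 12 is only a suggested test value.

server.conf example

[cachemanager]
# Default is 8; test higher values such as 12 while watching CPU and ingestion queues
max_concurrent_downloads = 12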

You can visit the Getting smarter about Splunk SmartStore GitHub repository for dashboards. The “SmartStore Stats” dashboard in Alerts for Splunk Admins and the s2_traffic_report dashboard on GitHub are also recommended.

To track problematic searches, use the “SearchHeadLevel — SmartStore cache misses — combined” or “IndexerLevel — SmartStore cache misses — remote_searches” dashboards to detect an actively running search downloading many buckets.

Linux-specific options

The newer systemd unit files set high limits. If you are not using systemd, set the equivalent ulimit values in the operating system's limits.conf file instead.

Settings to adjust (shown here in systemd unit file syntax) are:

LimitCORE=infinity
LimitDATA=infinity
LimitNICE=0
LimitFSIZE=infinity
LimitSIGPENDING=385952
LimitMEMLOCK=65536
LimitRSS=infinity
LimitMSGQUEUE=819200
LimitRTPRIO=0
LimitSTACK=infinity
LimitCPU=infinity
LimitAS=infinity
LimitLOCKS=infinity
LimitNOFILE=1024000
LimitNPROC=512000

You can also set the following kernel parameter through sysctl on all Splunk Enterprise instances:

kernel.core_pattern = /opt/splunk/%e-%s.core

This ensures core dumps are written to a directory the splunkd process can write to, allowing investigation of Splunk Enterprise core dumps. On Kubernetes instances, use /opt/splunk/var instead, as the var partition is persisted to a larger disk.

Consider disabling transparent huge pages, as recommended in Splunk Docs.

Data parsing

The quality of data parsing directly impacts indexer performance. Articles like Improving data onboarding with props.conf configurations and Clara-fication: Data onboarding best practices cover this topic well.

Setting SHOULD_LINEMERGE to False and using an appropriate LINE_BREAKER setting can relieve pressure on the aggregation queue.
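
For example, a minimal props.conf sketch for a hypothetical sourcetype; the stanza name, timestamp format, and lookahead value are placeholders to adapt to your data.

props.conf example

[my_custom_sourcetype]
# Break events on newlines without invoking line merging
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Explicit timestamp settings also reduce parsing work
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19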

Use the “indexer_max_data_queue_sizes_by_name” dashboard in Alerts for Splunk Admins, or dashboards in the monitoring console to view queue-based performance.

Avoiding hot buckets on a single indexer

Automating the rolling of hot buckets can help avoid data loss when an instance using local NVMe disks fails and the bucket has not been replicated. Consider creating a script such as roll_and_resync_buckets_v2.sh.

Next steps

The primary challenge on the indexing tier is ensuring you have enough indexers with reasonable hardware capacity, where the capacity required is driven by your search and ingestion workloads.

Tuning involves testing different filesystems, setting valid limits in configurations, and adjusting queue sizes to improve performance.

If issues arise with cluster managers reaching the “all data is searchable” state, consider creating smaller indexer clusters or moving the cluster manager to a host with a faster CPU.