Performance tuning the forwarding tier

This article explores various tuning options for the forwarding tier in the Splunk platform. Optimizing this tier is crucial for efficient data ingestion and overall system performance.

There are a number of settings you can tune on the forwarding tier. This article shares tips for ensuring optimal performance in each of the areas covered below.

This article is one of three that explore performance improvements for the indexing tier, forwarding tier, and search head tier. Check out the other articles to gain insights into optimizing each tier.

Universal or heavy forwarders for an intermediate tier?

The debate between using universal forwarders (UFs) and heavy forwarders (HFs) as an intermediate tier is ongoing. While a 2016 blog post suggested UFs for many purposes, discussions such as this post on Community have shown that a UF isn't always the best choice.

Heavy forwarders might be a good choice for the intermediate tier for reasons such as:

  • A lack of control over all universal forwarders, making it impossible to implement appropriate EVENT_BREAKER settings.
  • Network complications preventing the majority of UFs from connecting to the indexing tier.
  • The ability to configure HFs with appropriate LINE_BREAKING settings and asynchronous forwarding to optimize data distribution over the indexing tier (see the sketch after this list).
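
As a rough illustration, the props.conf sketch below shows the event-breaking settings referred to above. The sourcetype name is a placeholder and the regular expressions assume simple newline-delimited events, so adjust both to match your data.

# props.conf on the universal forwarders (placeholder sourcetype)
[my:custom:sourcetype]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)

# props.conf on the heavy forwarders acting as the intermediate tier
[my:custom:sourcetype]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false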

Note that all intermediate tiers are susceptible to data loss upon restart because their in-memory queues are cleared during shutdown (or lost if the platform crashes). This can occur if the downstream indexing tier is unavailable during forwarder restarts.

The useACK setting has the same limitation if the source forwarder is restarted (or crashes) before flushing its output queue to the downstream forwarder/indexer.

Splunk platform 9.4 introduces a new TcpOut persistent queue, potentially helping prevent this issue during restarts.

useACK — indexer acknowledgement

While often recommended, enabling useACK generally reduces the forwarding tier's overall performance. Disabling it might increase the risk of losing "intermediate" data but can improve throughput performance.
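
For reference, useACK is enabled per output group in outputs.conf. The following is a minimal sketch, with a placeholder group name and indexer addresses:

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
useACK = true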

Asynchronous load balancing

While older versions of the Splunk platform rotate connections between indexers based on either the autoLBVolume or autoLBFrequency setting, newer versions support asynchronous load balancing, an advanced forwarding option. This solution improves both indexer ingestion and search-time performance. If you are using a load balancer, check the Caveats section of this Splunk Docs page, specifically the "When indexers are behind an NLB" bullet, for more information.

You can combine tuning asynchronous load balancing with the maxSendQSize setting, as discussed in the Splunk Community post Slow indexer/receiver detection capability.
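
The following outputs.conf sketch combines these settings. The values are illustrative starting points rather than recommendations from those posts, and the group name and server addresses are placeholders.

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Switch to another indexer after 30 seconds or roughly 1 MB of data, whichever comes first
autoLBFrequency = 30
autoLBVolume = 1048576
# Limit how much data can sit in a single connection's send buffer so slow receivers are detected sooner
maxSendQSize = 7340032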

The “splunk_forwarder_data_balance_tuning” and “splunk_forwarder_output_tuning” dashboards in Alerts for Splunk Admins can assist in tuning outputs from forwarders.

Parallel ingestion pipelines

The parallelIngestionPipelines setting is documented for the indexing tier in terms of index parallelization. It is equally useful on the forwarding tier, and the guidance about not exceeding 2–3 pipelines does not apply there.

In practice, you can go as high as 12 parallelIngestionPipelines and see improved performance without CPU issues. This HEC tuning article advises that it is safe to go as high as the number of cores on the machine.
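
As a reference point, the setting lives in server.conf; the value below is only an example and should be sized to the workload and the core count as discussed above.

[general]
parallelIngestionPipelines = 4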

Increasing the parallelIngestionPipelines setting can be useful in scenarios such as:

  • A universal forwarder needing to read multiple large files that are handled by the batch processor. Multiple pipelines allow one pipeline to work on batch files while others handle the remaining files. Consider increasing the min_batch_size_bytes setting in limits.conf for batch files.
  • A heavy forwarder needing to receive and/or send to multiple destinations. Improvements in forwarding performance have been observed in modern versions of the Splunk platform with more than one pipeline.

HEC-specific tuning

The Splunk Community post What are the best HEC perf tuning configs? covers this topic.
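
As a hedged illustration only (these are general HEC input settings rather than the specific values that post recommends), global HEC tuning is configured in the [http] stanza of inputs.conf on the receiving instance:

[http]
disabled = 0
# Dispatcher threads for the HTTP input server; scale with available cores (illustrative value)
dedicatedIoThreads = 4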

The dashboard “hec_performance” in Alerts for Splunk Admins relates to HEC tokens.

useACK with HEC

useACK at the HEC tier indicates that the forwarder has successfully received the data, not that the data itself is indexed.

useACK within HEC differs from the useACK setting in outputs.conf. You can reach a per-token request limit when using useACK on HEC inputs that you would not reach without the acknowledgement feature. An error status code is returned to the client using the specified token if the client fails to complete the acknowledgment process promptly.
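
The following inputs.conf sketch shows a HEC token with acknowledgement enabled. The token name is a placeholder, and the token value is generated when the input is created rather than typed by hand.

[http://my_hec_token]
# Placeholder: the platform generates the token value when the input is defined
token = <generated-token-value>
# With useACK enabled, clients must send events on a channel and poll the ack endpoint to confirm delivery
useACK = true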

Non-HEC forwarder tuning

In addition to asynchronous load balancing, it can be helpful to tune the sizes of the forwarders' output queues and of the processing queues in general.

Making the queues too large can mask issues at the forwarding tier, resulting in long delays where data sits inside a forwarder's in-memory queues for an extended period. There is also a risk of data loss if the UF or HF cannot flush that data on restart (unless there is a persistent queue), or in a crash scenario.

Example queue settings, set in server.conf:

[queue]
maxSize = 4MB

[queue=aggQueue]
maxSize = 10MB

[queue=parsingQueue]
maxSize = 10MB

[queue=indexQueue]
maxSize = 20MB

To view queue statuses, use the monitoring console in the Splunk platform or the dashboard “heavyforwarders_max_data_queue_sizes_by_name”. There is also the indexing tier equivalent, “indexer_max_data_queue_sizes_by_name”.

Persistent inputs

Inputs on forwarders are not persistent, and data loss on restart can occur if the downstream tier is unavailable. This issue might be resolved by the newer persistent output features in Splunk platform 9.4.

For Windows inputs, there is a persistent input solution, as discussed in the Splunk Community post Splunk Input Persistent Queue or the Use persistent queues to help prevent data loss page in Splunk Docs.
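
The following is a minimal sketch of a persistent queue on a network input, using the queueSize and persistentQueueSize settings from that Splunk Docs page; the port and sizes are placeholders, and the same settings apply to the other input types covered there.

[tcp://1514]
# In-memory queue that sits in front of the persistent queue
queueSize = 10MB
# On-disk queue that survives restarts; 0 (the default) disables it
persistentQueueSize = 100MB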

Disabling the KV Store

The KV Store is required on search heads and on forwarders using the KV Store for checkpoint tracking; however, if the forwarder's purpose is an intermediate forwarding tier, the KV Store can be safely disabled. This is done in server.conf:

[kvstore]
disabled = true

This setting only applies to heavy forwarders, as universal forwarders do not have a KV Store.

Determining when data has stopped forwarding

There are a number of options for finding hosts or sources that have stopped submitting events; one simple approach is sketched below.
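
The search below is a hedged sketch only; the index filter and the four-hour threshold are placeholders, and approaches such as the metadata command or the Monitoring Console forwarder views can serve the same purpose.

| tstats latest(_time) as last_seen where index=* by host
| eval seconds_since_last_event = now() - last_seen
| where seconds_since_last_event > 14400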

Linux-specific options

On modern Linux systems, universal forwarders might run under systemd, providing advantages such as:

  • CAP_DAC_READ_SEARCH, allowing read access to files anywhere on the filesystem without manually setting permissions, as described in Splunk UF 9.0 and POSIX capabilities.
  • Sensible limits used in the new systemd unit files, such as setting the number of file descriptors to a high value.

Other settings you might choose to tweak within the systemd unit file are:

LimitCORE=infinity
LimitDATA=infinity
LimitNICE=0
LimitFSIZE=infinity
LimitSIGPENDING=385952
LimitMEMLOCK=65536
LimitRSS=infinity
LimitMSGQUEUE=819200
LimitRTPRIO=0
LimitSTACK=infinity
LimitCPU=infinity
LimitAS=infinity
LimitLOCKS=infinity
LimitNOFILE=1024000
LimitNPROC=512000

Additionally, you can set the following on all Splunk Enterprise instances in sysctl:

kernel.core_pattern = /opt/splunk/%e-%s.core

This ensures that core dumps are written to a directory the process can write to, allowing investigation into Splunk Enterprise core dumps. You can use /opt/splunk/var on the K8s instances, as the var partition is persisted to a larger disk.

Consider disabling transparent huge pages, as recommended in Splunk Docs.