Using the Performance Insights for Splunk app: Troubleshooting

Last updated
Save as PDF
Share
1. Share
2. Tweet
3. Share

Troubleshooting

Symptom	Possible causes	Diagnose	Remediate
Slow search times	High concurrency: Any search load above the system capacity necessarily waits, leading to increased search times.	Look for search concurrency values above the single indexer CPU count. Both spikes and consistent values above system capacity are problems. Go to the Search concurrency section to see if this applies to your situation.	Flatten search load by identifying spikes in scheduled searches; these often occur at a regular cadence (for example, every 15 minutes, the top of the hour, or midnight). Where possible, move scheduled searches to less busy times. Use scheduled search windowing or skewing to spread search start times over larger ranges. Find more information in the blog Schedule windows vs. skewing.
	Inefficient SPL: Slow searches can lead to higher search concurrency.	Using the search performance metrics, look for searches with a large total run time (a combination of run times and frequency). If the total run time for a search is a significant percentage of the selected time range (for example, 360s over a 1 hour range, or 10%), this is a good candidate for improvement. Go to the Search metrics section to see if this applies to your situation.	To improve search SPL: Decrease time ranges. Use only necessary indexes. Filter out as much data as possible before non-streaming or transforming operations. Filter on indexed fields and use the `::` operator. Use the `TERM()` operator where possible for text searches. Use `tstats` instead of `stats` if you don’t need the raw data. Find more information in these links: Optimizing search Learn SPL command types: Efficient search execution order and how to investigate them Writing better queries in SPL Use CASE() and TERM() to match phrases
	Too many buckets: Searches that span large time ranges need to open and inspect many buckets, which can be inefficient.	Check your bucket upload/download or disk read/write rates. If network transfer rates or disk I/O are significant or cache hit rates are low, you might benefit from fewer buckets. Go to the Buckets or Resource monitoring sections to see if this applies to your situation.	Increase the `maxDataSize` setting in `indexes.conf` to a MB value larger than your typical warm bucket size. This is an index-specific setting. Find more information in Splunk Help.
	Not enough buckets: Large bucket writes can be slow, and a low number of files means less parallelization and higher latency on those writes.	Check your bucket upload/download or disk read/write rates. If network transfer rates or disk I/O are not significant and cache hit rates are high, you might benefit from more buckets. Go to the Buckets section to see if this applies to your situation.	Decrease the `maxDataSize` setting in `indexes.conf` to a MB value smaller than your typical warm bucket size. This is an index-specific setting. Find more information in Splunk Help.
	Connectivity issues: Unstable network I/O from saturation or latency issues can lead to excessive retries.	Scan errors and warnings looking for connectivity issues or communication retries. Go to the Environment diagnose section to see if this applies to your situation.	For any error or warning that looks like a network timeout with a retry, determine the cause. Check your network throughput and latency, and ensure you have adequate bandwidth. Check packet routing between systems to ensure data is not flowing through unexpected paths or being held up by unnecessary network rules.
	Undersized deployment: When search concurrency is not exceeding system capacity, but there is high CPU usage on indexers or search heads, the deployment might be too small.	Check for CPU usage over time and compare with search load. When not under search pressure, CPU usage should be low. At peak search load, CPU usage should be under 80%. Go to the Search resource usage, Search concurrency, and Search metrics sections to see if this applies to your situation.	The best practice for sizing a deployment is to set up a test environment using a representative subset (for example, 1/10th) of your ingest data. Then, size your search heads and indexers to run optimally at peak search load (no more than 80% CPU, and all other resources stable). Your optimal deployment will use the exact same search head configuration, but with indexers scaled linearly to handle the full ingest data set (for example, 10x larger ingest deployment). If a test environment is not an option, add resources as needed. Note that searches put load on both indexers and search heads, while ingest generally only puts load on indexers. Also, if your concurrent search counts exceed your single indexer core count, you will likely benefit from fewer indexers with a larger CPU count. For example, four 32-core indexers will outperform eight 16-core indexers.
Skipped searches	“Maximum number of a class of search types (for example, historical) reached”: This indicates that your system is trying to process too many searches at the same time.	Look for searches that are scheduled for the same time. Look for searches that are scheduled to run shortly after searches with long durations. Look for scheduled searches with long durations. Go to the Search skip details and Slow search times sections to see if this applies to your situation.	Reducing search runtimes will help reduce search concurrency. Ensure long-running searches are optimized. Spread scheduled searches out to use less busy times. Do not increase search concurrency limits. While this might be tempting to make the problem disappear, the effects are temporary and will likely worsen the situation. Find more information in these links: Reducing skipped searches Schedule windows vs. skewing Learn SPL command types: Efficient search execution order and how to investigate them Optimizing search Use CASE() and TERM() to match phrases
	“Maximum number of this search reached”: This usually means a scheduled search ran longer than its scheduled interval.	Look for scheduled searches where runtimes come close to, or exceed, the interval between runs. Go to the Search skip details and Slow search times sections to see if this applies to your situation.
Unexpected search head restarts or unavailability	Out of memory: The most frequent cause of search head restarts is memory pressure. While out-of-the-box system default limits are configured to minimize memory issues, poorly designed searches run under service accounts can exhaust memory quickly.	Look for searches that require a lot of search head memory. Go to the Search resource usage section to see if this applies to your situation.	Reduce result sets in searches to only the necessary data. Use summary indexing or `tstats`. Front-load streaming commands to put more load on the indexers. Find more information in these links: Learn SPL command types: Efficient search execution order and how to investigate them Optimizing search
	CPU: If server load causes CPU saturation, servers might fail to respond to health checks in a timely manner. The search head cluster captain will stop sending jobs to that search head, which could overload the remaining search heads.	Look for periods where fewer than the expected number of search heads are running. Look for a cadence of spikes in search load. Look for long-running ad hoc searches. Go to the Search resource usage and Running search heads sections to see if this applies to your situation.	Smooth out search spikes by moving scheduled searches to less busy times. Set user’s disk and search quota limits to avoid excess CPU load. Find more information in these links: Schedule reports Use cron expressions for alert scheduling
Ad hoc search errors	Timeouts: For most SPL searches, there are many inefficient ways to perform a search for every efficient way. Inefficiencies lead to long-running searches, which might hit timeouts.	Look for excessive concurrent searches leading to long search queues. Look for logs indicating that searches failed to complete. Go to the Search metrics and Environment diagnose sections to see if this applies to your situation.	Tips for improving search SPL: Decrease time ranges. Use only necessary indexes. Filter out as much data as possible before non-streaming or transforming operations. Filter on indexed fields and use the `::` operator. Use the `TERM()` operator where possible for text searches. Use `tstats` instead of `stats` if you don’t need the raw data. Find more information in these links: Learn SPL command types: Efficient search execution order and how to investigate them Optimizing search Use CASE() and TERM() to match phrases
	Incomplete results: Incomplete results are often due to a part of the search (for example, subsearch) failing, or an unexpected restart of an underlying server. Note: Incomplete results can also occur when join or subsearch limits are reached. Increasing these limits can have negative performance effects.	Look for exhausted system resources. Look for unstable systems and server restarts. Check logs for errors indicating data truncation. Go to the Resource monitoring, Environment diagnose, and Search metrics sections to see if this applies to your situation.	Reduce result sets in searches to only the necessary data. Use summary indexing or `tstats`. Front-load streaming commands to put more load on the indexers. Ensure your deployment is sized correctly. The cause of server instability might exist outside of your Splunk platform configuration. If you suspect this, involve your site reliability engineers. Find more information in these links: Learn SPL command types: Efficient search execution order and how to investigate them Optimizing search
Unexpected search results	Bundle replication lag: Large bundles (large lookups with frequent changes) and network saturation can lead to slow or missed bundle replications. This can, in turn, lead to the use of stale data from lookups.	Look for bundle replication times consistently higher than 100ms. Look for bundle replication delta sizes consistently over 10KB. Go to the Bundle replication section to see if this applies to your situation.	Ensure changes to large lookup files are infrequent. Use a larger number of smaller lookups rather than monolithic all-in-one lookups. Ensure network traffic bandwidth is adequate. Find more information at Splunk Docs.
	Configured search limits reached: Out-of-the-box system default limits are configured to reduce resource saturation. These limits can cause subsearches to return fewer than expected results.	Look for errors and warnings returned to the end user in the search UI.	Ensure subsearches return a reasonably small amount of data. Where necessary, increase subsearch limits. Saved searches, subsearches, joins, and commands like `top` might all have their own limit settings.
Uneven search head server load	Captain limits: If one search head differs from the rest, this might be the captain, and that is expected.	Look for a single search head that is processing fewer searches. If only one shows lower search counts, and the search run times are not drastically different for that server, no further action is likely needed. You can confirm the search head captain using this search: `\| rest /services/shcluster/status splunk_server=local \| fields captain.label` Go to the search counts section to see if this applies to your situation.	No action.
Uneven server load (any server)	Unhealthy server: A server stuck in a restart loop or suffering heavy CPU usage will likely see lower search counts and higher average search runtimes than other servers.	Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Check logs for any server reporting errors, or servers that have no error logs at all. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.	Identify causes of the problematic server behavior, either through error logs, external server diagnostic tools, or by checking server configurations. When the issue has been corrected, reduce load temporarily, then restart the problematic server. Reducing load is important so that restarting servers does not cause excess load on the remaining servers, potentially putting them in a bad state. Find more information in Splunk Help.
	Load balancer configuration: If the load balancer has incorrect information about the cluster servers or their statuses, their load might be routed incorrectly.	Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Ensure all servers are healthy with no obvious error. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.	If you find search heads that are suspect, you can confirm the cluster status using the CLI, or restart the search heads through the UI. This often corrects any search head issues. If you find indexers that are suspect, restarting them is usually the best option. Find more information at Splunk Help. Be sure to reduce load on your system before restarting servers.
	Load balancer bypassing: If forwarders or inter-server communication has been configured with IP addresses or hard-coded machine names, machine load might be unbalanced.	Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Ensure all servers are healthy with no obvious error. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.	Scan your Splunk platform configuration files looking for hard-coded IP addresses or server names. Ensure those addresses and names refer to load balancing systems or have a valid reason to be hard-coded. Find more information in Splunk Help.
System hangs	Faulty subsystem on an indexer, leading to slow or hanging search results. This can lead to a number of searches getting stuck over time, and can halt the system.	Look for sudden increases in search concurrency and deployment-averaged indexer memory, coupled with a decrease in deployment-averaged search head and indexer CPU activity. A single indexer with poor storage or network performance can degrade the entire cluster if there are no other health issues. Go to the Search concurrency and Resource monitoring sections to see if this applies to your situation.	Identify any indexer that is performing sub-optimally in terms of I/O. It is usually the first one to show a drop in CPU usage before the system performance degrades entirely. Stop the affected indexer, correct the I/O issues, and restart the indexer.