Splunk Lantern

Using the Performance Insights for Splunk app

 

This guide explains how to use the Performance Insights for Splunk (PerfInsights) app to diagnose and identify potential performance issues in Splunk platform deployments and premium applications. By diagnosing performance issues in your own environment, you can reduce time to resolution, improve user satisfaction, lower reliance on Splunk Support, and address potential problems before they become critical.

PerfInsights can be installed on both Splunk Cloud Platform and Splunk Enterprise deployments. It uses internal indexes to gather and calculate key metrics, providing a good overview of system stability and performance. It does not collect new data; instead, it uses data already gathered by the Splunk platform. This ensures there is no performance overhead when installing the tool in your environment.

PerfInsights uses additional resources during active investigations because its dashboards generate extra searches. Resource usage varies with the investigation's length and the time span of indexed data. To minimize this overhead, the dashboards are designed for efficiency, utilizing indexed field equality operators and search result reuse.

PerfInsights is not an active health monitoring app. It will not generate notifications or make assumptions about your deployment's health. The metrics it provides are open to interpretation because a potential problem in one environment could be perfectly normal in another.

How to use this guide

This guide is split into a number of different sections that you can jump to according to your needs:

  • Installation and diagnostic process: If you're new to PerfInsights, start with the Installation and Diagnostic process sections.
  • Troubleshooting: If you have a specific symptom of performance degradation (for example, slow search times), refer to the Troubleshooting table for common causes and immediate remediation steps.
  • Diagnoses: For a comprehensive understanding of performance metrics or proactive monitoring, explore the Diagnoses section. It explains what each metric indicates and what patterns to look for.

Installation

PerfInsights is available on Splunkbase. To install PerfInsights, access your Splunk platform UI as an administrator and select Apps > Find More Apps. Select Generic, Utilities and search for “Performance Insights”. Locate the application and install it.

After installation, you will need to wait for the application to be replicated to the search heads before it is fully functional. You might see search errors if you try to use it too soon. If this happens, wait a few minutes and try again.

Diagnostic process

PerfInsights is typically used with either a reactive approach or a proactive approach to diagnose performance issues.

  • Reactive diagnosis occurs when a problem has already appeared, and the goal is to discover its cause. In these cases, the system's actual behavior, from a user’s perspective, has deviated from expected behavior. An example might be searches reporting warnings that some data is missing.
  • Proactive diagnosis involves using PerfInsights to examine system metrics and identify trends that could lead to future issues. For example, increasing search times could lead to increased search concurrency and, eventually, skipped searches.

Reactive diagnosis

To conduct a reactive diagnosis:

  1. Focus the tool's dashboards around the point in time when the issue first appeared.
  2. Observe what was happening in the system at that time to see what might have caused the unexpected behavior. Look for:
    • High or saturated resources (for example, CPU, memory, network I/O, disk I/O).
    • High or saturated queues (for example, search or indexing queues, replication queues).
    • Plateaus (for example, search or indexing throughput).
    • Sudden spikes (for example, search concurrency, error rates).
    • Anything that looks unusual.
  3. Separate cause and effect. Adjust the time range to determine the order of abnormal events.
  4. Investigate abnormal events starting from the earliest.
  5. For each abnormal event, determine if it is a contributing cause, a symptom, or an unrelated event. This often requires additional searches or tools beyond PerfInsights.

Proactive diagnosis

To conduct a proactive diagnosis:

  1. Start with a large time range, ending at the current time (for example, past 7 days to now).
  2. Look for charts that are trending upwards (for example, CPU, memory, search times, queue lengths).
  3. Look for spikes in charts that last longer or become more frequent (for example, search concurrency, bundle replication).
  4. Repeat with smaller time ranges (for example, past 1 day or 1 hour). While long time ranges can show trends, they can also hide granular data. Spikes that might be aggregated away over a week might become visible over an hour.
  5. For each potentially problematic behavior observed, try to understand the cause. For example, if search times are increasing, you might also find that your data ingestion rate has been increasing. Whether this becomes an issue depends on the system's current state and any further increases in data ingestion rates.

Keep the following in mind during a proactive diagnosis:

  • Proactive diagnosis is subject to the observer effect, where the act of observing the system affects its behavior.
  • Be cautious when observing unstable systems, as the extra search load could further destabilize the system.
  • When reading PerfInsights charts and tables, consider the additional search load the tool is adding.
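
The advice to repeat the analysis with smaller time ranges can be illustrated with a small sketch. It uses hypothetical per-minute search concurrency samples (not PerfInsights output) to show how a single average over an hour hides a burst that five-minute averages reveal. The function name bucket_means and all sample values are assumptions made for illustration.

```python
from statistics import mean


def bucket_means(samples, bucket_size):
    """Average consecutive samples into buckets of bucket_size."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical data: one hour of per-minute concurrency,
# steady at 4 searches with a 5-minute burst of 40.
per_minute = [4] * 60
per_minute[30:35] = [40] * 5

hourly = bucket_means(per_minute, 60)    # one point for the whole hour
five_min = bucket_means(per_minute, 5)   # twelve points for the same hour

print("hourly average:", hourly)
print("largest 5-minute average:", max(five_min))
```

The hourly view reports a single moderate value, while the five-minute view exposes the burst at its full height.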

Deeper dives

PerfInsights aims to be a single application that helps identify all potential performance issues and provides as much detail as possible to resolve them. However, in some cases, PerfInsights might not provide sufficient information to complete the performance analysis. When this occurs, use PerfInsights as a starting point for further investigation.

Expand existing charts

Sometimes existing charts are close to what you need but require some adjustments. Using the magnifying glass link on a chart opens the search in a new window, allowing for customization. All dashboards are editable, so if you find yourself making these customizations frequently, you can modify the search directly in the dashboard or add a new chart. Be aware that reinstalling the application will overwrite your changes. If your change could benefit others, consider sending a message to the support team with the suggestion.

Add system data to indexes

By default, not all OS-level logging is available in indexes. This information can be very useful when investigating performance issues. On Splunk Cloud Platform deployments, consider adding local forwarders on your servers configured to read system logs and index them.

Forming and testing hypotheses

Performance issues are often complex and multifaceted. Determining root causes can be tricky, and even sound and logical reasoning can lead to incorrect conclusions if not all data is considered. Resolving issues can be an iterative process.

After you’ve observed the situation and developed a reasonable explanation for the behavior, it’s time to test that hypothesis. Even if you started with a reactive diagnosis, you will iterate through proactive diagnoses here.

For each iteration:

  • Start by identifying a minimal set of changes that could correct the issue.
  • Change only one thing at a time, if possible.
  • Using the same charts that helped you build your hypothesis, verify a positive change over similar time periods that exposed the issue.

Troubleshooting

Each entry below names a symptom, followed by its possible causes. Each cause includes how to diagnose it and how to remediate it.

Symptom: Slow search times

Possible cause: High concurrency. Any search load above the system capacity necessarily waits, leading to increased search times.

Diagnose: Look for search concurrency values above the single indexer CPU count. Both spikes and consistent values above system capacity are problems. Go to the Search concurrency section to see if this applies to your situation.

Remediate: Flatten search load by identifying spikes in scheduled searches; these often occur at a regular cadence (for example, every 15 minutes, the top of the hour, or midnight). Where possible, move scheduled searches to less busy times. Use scheduled search windowing or skewing to spread search start times over larger ranges. Find more information in the blog Schedule windows vs. skewing.

Possible cause: Inefficient SPL. Slow searches can lead to higher search concurrency.

Diagnose: Using the search performance metrics, look for searches with a large total run time (a combination of run times and frequency). If the total run time for a search is a significant percentage of the selected time range (for example, 360 seconds over a 1-hour range, or 10%), this is a good candidate for improvement. Go to the Search metrics section to see if this applies to your situation.

Remediate: To improve search SPL:

  • Decrease time ranges.
  • Use only necessary indexes.
  • Filter out as much data as possible before non-streaming or transforming operations.
  • Filter on indexed fields and use the :: operator.
  • Use the TERM() operator where possible for text searches.
  • Use tstats instead of stats if you don’t need the raw data.

Possible cause: Too many buckets. Searches that span large time ranges need to open and inspect many buckets, which can be inefficient.

Diagnose: Check your bucket upload/download or disk read/write rates. If network transfer rates or disk I/O are significant or cache hit rates are low, you might benefit from fewer buckets. Go to the Buckets or Resource monitoring sections to see if this applies to your situation.

Remediate: Increase the maxDataSize setting in indexes.conf to a MB value larger than your typical warm bucket size. This is an index-specific setting. Find more information in Splunk Help.

Possible cause: Not enough buckets. Large bucket writes can be slow, and a low number of files means less parallelization and higher latency on those writes.

Diagnose: Check your bucket upload/download or disk read/write rates. If network transfer rates or disk I/O are not significant and cache hit rates are high, you might benefit from more buckets. Go to the Buckets section to see if this applies to your situation.

Remediate: Decrease the maxDataSize setting in indexes.conf to a MB value smaller than your typical warm bucket size. This is an index-specific setting. Find more information in Splunk Help.

Possible cause: Connectivity issues. Unstable network I/O from saturation or latency issues can lead to excessive retries.

Diagnose: Scan errors and warnings looking for connectivity issues or communication retries. Go to the Environment diagnose section to see if this applies to your situation.

Remediate: For any error or warning that looks like a network timeout with a retry, determine the cause. Check your network throughput and latency, and ensure you have adequate bandwidth. Check packet routing between systems to ensure data is not flowing through unexpected paths or being held up by unnecessary network rules.

Possible cause: Undersized deployment. When search concurrency is not exceeding system capacity, but there is high CPU usage on indexers or search heads, the deployment might be too small.

Diagnose: Check CPU usage over time and compare it with search load.

  • When not under search pressure, CPU usage should be low.
  • At peak search load, CPU usage should be under 80%.

Go to the Search resource usage, Search concurrency, and Search metrics sections to see if this applies to your situation.

Remediate: The best practice for sizing a deployment is to set up a test environment using a representative subset (for example, 1/10th) of your ingest data. Then, size your search heads and indexers to run optimally at peak search load (no more than 80% CPU, and all other resources stable).

Your optimal deployment will use the exact same search head configuration, but with indexers scaled linearly to handle the full ingest data set (for example, 10x larger ingest deployment). If a test environment is not an option, add resources as needed.

Note that searches put load on both indexers and search heads, while ingest generally only puts load on indexers. Also, if your concurrent search counts exceed your single indexer core count, you will likely benefit from fewer indexers with a larger CPU count. For example, four 32-core indexers will outperform eight 16-core indexers.

Symptom: Skipped searches

Possible cause: “Maximum number of a class of search types (for example, historical) reached”. This indicates that your system is trying to process too many searches at the same time.

Diagnose: Look for searches that are scheduled for the same time. Look for searches that are scheduled to run shortly after searches with long durations. Look for scheduled searches with long durations. Go to the Search skip details and Slow search times sections to see if this applies to your situation.

Remediate: Reducing search runtimes will help reduce search concurrency. Ensure long-running searches are optimized. Spread scheduled searches out to use less busy times.

Do not increase search concurrency limits. While this might be tempting to make the problem disappear, the effects are temporary and will likely worsen the situation.

Possible cause: “Maximum number of this search reached”. This usually means a scheduled search ran longer than its scheduled interval.

Diagnose: Look for scheduled searches where runtimes come close to, or exceed, the interval between runs. Go to the Search skip details and Slow search times sections to see if this applies to your situation.

Symptom: Unexpected search head restarts or unavailability

Possible cause: Out of memory. The most frequent cause of search head restarts is memory pressure. While out-of-the-box system default limits are configured to minimize memory issues, poorly designed searches run under service accounts can exhaust memory quickly.

Diagnose: Look for searches that require a lot of search head memory. Go to the Search resource usage section to see if this applies to your situation.

Remediate: Reduce result sets in searches to only the necessary data. Use summary indexing or tstats. Front-load streaming commands to put more load on the indexers.

Possible cause: CPU. If server load causes CPU saturation, servers might fail to respond to health checks in a timely manner. The search head cluster captain will stop sending jobs to that search head, which could overload the remaining search heads.

Diagnose: Look for periods where fewer than the expected number of search heads are running. Look for a cadence of spikes in search load. Look for long-running ad hoc searches. Go to the Search resource usage and Running search heads sections to see if this applies to your situation.

Remediate: Smooth out search spikes by moving scheduled searches to less busy times. Set users’ disk and search quota limits to avoid excess CPU load.

Symptom: Ad hoc search errors

Possible cause: Timeouts. For most SPL searches, there are many inefficient ways to perform a search for every efficient way. Inefficiencies lead to long-running searches, which might hit timeouts.

Diagnose: Look for excessive concurrent searches leading to long search queues. Look for logs indicating that searches failed to complete. Go to the Search metrics and Environment diagnose sections to see if this applies to your situation.

Remediate: To improve search SPL:

  • Decrease time ranges.
  • Use only necessary indexes.
  • Filter out as much data as possible before non-streaming or transforming operations.
  • Filter on indexed fields and use the :: operator.
  • Use the TERM() operator where possible for text searches.
  • Use tstats instead of stats if you don’t need the raw data.

Possible cause: Incomplete results. Incomplete results are often due to a part of the search (for example, a subsearch) failing, or an unexpected restart of an underlying server. Note: Incomplete results can also occur when join or subsearch limits are reached. Increasing these limits can have negative performance effects.

Diagnose: Look for exhausted system resources. Look for unstable systems and server restarts. Check logs for errors indicating data truncation. Go to the Resource monitoring, Environment diagnose, and Search metrics sections to see if this applies to your situation.

Remediate: Reduce result sets in searches to only the necessary data. Use summary indexing or tstats. Front-load streaming commands to put more load on the indexers. Ensure your deployment is sized correctly.

The cause of server instability might exist outside of your Splunk platform configuration. If you suspect this, involve your site reliability engineers.

Symptom: Unexpected search results

Possible cause: Bundle replication lag. Large bundles (large lookups with frequent changes) and network saturation can lead to slow or missed bundle replications. This can, in turn, lead to the use of stale data from lookups.

Diagnose: Look for bundle replication times consistently higher than 100ms. Look for bundle replication delta sizes consistently over 10KB. Go to the Bundle replication section to see if this applies to your situation.

Remediate: Ensure changes to large lookup files are infrequent. Use a larger number of smaller lookups rather than monolithic all-in-one lookups. Ensure network traffic bandwidth is adequate. Find more information at Splunk Docs.

Possible cause: Configured search limits reached. Out-of-the-box system default limits are configured to reduce resource saturation. These limits can cause subsearches to return fewer than expected results.

Diagnose: Look for errors and warnings returned to the end user in the search UI.

Remediate: Ensure subsearches return a reasonably small amount of data. Where necessary, increase subsearch limits.

Saved searches, subsearches, joins, and commands like top might all have their own limit settings.

Symptom: Uneven search head server load

Possible cause: Captain limits. If one search head differs from the rest, this might be the captain, and that is expected.

Diagnose: Look for a single search head that is processing fewer searches. If only one shows lower search counts, and the search run times are not drastically different for that server, no further action is likely needed. You can confirm the search head captain using this search:

    | rest /services/shcluster/status splunk_server=local | fields captain.label

Go to the search counts section to see if this applies to your situation.

Remediate: No action.

Symptom: Uneven server load (any server)

Possible cause: Unhealthy server. A server stuck in a restart loop or suffering heavy CPU usage will likely see lower search counts and higher average search runtimes than other servers.

Diagnose: Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Check logs for any server reporting errors, or servers that have no error logs at all. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.

Remediate: Identify causes of the problematic server behavior, either through error logs, external server diagnostic tools, or by checking server configurations. When the issue has been corrected, reduce load temporarily, then restart the problematic server. Reducing load is important so that restarting servers does not cause excess load on the remaining servers, potentially putting them in a bad state. Find more information in Splunk Help.

Possible cause: Load balancer configuration. If the load balancer has incorrect information about the cluster servers or their statuses, their load might be routed incorrectly.

Diagnose: Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Ensure all servers are healthy with no obvious error. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.

Remediate: If you find search heads that are suspect, you can confirm the cluster status using the CLI, or restart the search heads through the UI. This often corrects any search head issues. If you find indexers that are suspect, restarting them is usually the best option. Find more information at Splunk Help. Be sure to reduce load on your system before restarting servers.

Possible cause: Load balancer bypassing. If forwarders or inter-server communication has been configured with IP addresses or hard-coded machine names, machine load might be unbalanced.

Diagnose: Look for servers (other than the search head captain) that are using far more or far fewer resources (CPU, memory, network, disk) than others. Ensure all servers are healthy with no obvious error. Go to the Resource monitoring and Environment diagnose sections to see if this applies to your situation.

Remediate: Scan your Splunk platform configuration files looking for hard-coded IP addresses or server names. Ensure those addresses and names refer to load balancing systems or have a valid reason to be hard-coded. Find more information in Splunk Help.

Symptom: System hangs

Possible cause: Faulty subsystem on an indexer, leading to slow or hanging search results. This can lead to a number of searches getting stuck over time, and can halt the system.

Diagnose: Look for sudden increases in search concurrency and deployment-averaged indexer memory, coupled with a decrease in deployment-averaged search head and indexer CPU activity. A single indexer with poor storage or network performance can degrade the entire cluster if there are no other health issues. Go to the Search concurrency and Resource monitoring sections to see if this applies to your situation.

Remediate: Identify any indexer that is performing sub-optimally in terms of I/O. It is usually the first one to show a drop in CPU usage before the system performance degrades entirely. Stop the affected indexer, correct the I/O issues, and restart the indexer.
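
The "Inefficient SPL" entry above judges a search by its total run time as a share of the selected time range. A minimal sketch of that arithmetic follows. The function name total_runtime_share and the example values (a search that runs every 5 minutes and takes 30 seconds per run) are hypothetical; they are chosen only to reproduce the 360 seconds over 1 hour figure from the table.

```python
def total_runtime_share(avg_runtime_s, run_count, window_s):
    """Return a search's total run time as a fraction of the time range."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return (avg_runtime_s * run_count) / window_s


# Hypothetical search: 12 runs in one hour, 30 seconds per run.
share = total_runtime_share(avg_runtime_s=30, run_count=12, window_s=3600)
print(f"{share:.0%}")  # 360 s of a 3600 s range, which is 10%
```

What counts as a significant percentage depends on the deployment, so treat the result as a way to rank candidates rather than as a fixed threshold.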

Diagnoses

This section provides detailed explanations of the information available within the PerfInsights app that you can use to pinpoint and analyze performance issues.

Performance Trend

The Performance Trend section is a good starting point to get a sense of system activity, though it is often too high-level to determine the root cause of any issue.

Event Ingestion Volume

By default, this chart shows ingestion volume over the past week, grouped by day. Knowing when and how much data flows into your system allows for proper sizing and tuning to handle peak load. You should investigate spikes or troughs in this chart.

Spikes: When looking at a long time range, aggregated over a small number of groupings, look for groups that deviate. For example, if one day is significantly larger than others in a week's data, narrow in on that day. Spikes can cause performance problems if resources are already running close to capacity. 

An example of a spike in event ingestion volume per day is shown below. An increase in ingestion rate will likely cause performance changes elsewhere in the system.

[Chart: event ingestion volume per day, showing a spike]

If you see spikes, first ensure the data is expected. If it is expected and you are experiencing issues starting at that point, determine if the incoming data can be reduced or broken into smaller chunks spread over longer time periods.

Troughs: Similar to spikes, troughs can indicate an unexpected change in data flow. Troughs are unlikely to cause performance issues but should be investigated to ensure data is not missing.

Average Search Runtimes

By default, these charts show average search times over the past week, grouped by day. Search load will likely have periods where load and type are somewhat consistent, and in those periods, average search times should also be consistent.

You should investigate any gradual increases in search runtimes. An example of an increase in scheduled search runtime is shown in the image below. Over long periods, the charts should remain flat, so upward trends indicate a potential problem. 

[Chart: scheduled search runtime increasing over time]

When search runtimes increase over time, use the Search metrics section below to determine which searches are taking longer, and act accordingly. If all searches are taking longer, look for increases in search concurrency (detailed in the Search metrics section) or exhaustion of resources (detailed in the Resource monitoring section).

Given consistent search load and ingest volume, search runtimes should also be consistent. While some variability is expected, any noticeable increase in runtimes over time is a concern. Left unchecked, searches will eventually time out or exhaust resources or queues.

Search Execution Counts

Search execution counts indicate user activity or how your scheduled searches are distributed. Scheduled searches will likely be nearly equal from day to day, so variance is caused by user-driven ad hoc searches.

You can use this chart to look back over days, weeks, or even years to get a sense of when your system experiences the most user load. Peak search load can be driven by different factors, such as Monday morning scheduled reports, month-end accounting, or seasonal activity cycles. Identify your patterns and ensure you size and tune for those peak times.

Average CPU Usage

The performance trend for average CPU usage tells you how busy your deployment is overall. Looking at long time ranges here can show you how much room you have for growth.

You can use this chart to focus on times with the highest CPU usage and ensure your system does not often exceed 80% CPU on either indexers or search heads. You should ensure any periods where your system is near or at 100% are short-lived (seconds or less). Running above 80% puts your system close to resource exhaustion and can quickly lead to failures. Running at 100% means your system is likely queuing requests, which can lead to even longer times spent at high CPU usage, resulting in skipped searches or server restarts. If you see high CPU usage, use the Resource monitoring and Search metrics sections to identify which servers or searches are consuming the CPU.

Average Memory Usage

The performance trend for average memory usage tells you how data-intensive your searches are. As your system ingests more data, the potential to use more memory increases.

A cold system always consumes memory rapidly as it fills caches. After several days running in a steady state, look for increases in memory. If memory continues to increase, use the Search metrics section to determine which searches are consuming more memory (or taking longer to run). A common cause is searches that use the “all time” time range; as more data is ingested, these searches scan more data and can return larger results.

System and Environment Data

System Environment

The System Environment tables list the versions of the installed Splunk platform software and add-ons. You can use this information to search Splunk Support or Splunk Community for any issues with the versions you are running, and upgrade them if necessary.

Data Inputs

Data inputs provide a view into where data originates and where it is stored. This can help you find any discrepancies in data routing. Often, data from a particular source type will be directed to a specific index. You might notice that the distribution of events into different indexes is not what you expect. Large indexes can negatively impact performance and lead to incorrect search results.

Indexing

The indexing section shows activity in the various indexing queues. Ideally, these queues are relatively small and similar in size. If any of the four queues (parsing, aggregation, typing, or indexing) is much larger than the rest, it can suggest a bottleneck in that queue, which can reduce ingest capacity.

  • If the parsing queue is large, consider ingesting structured data types when possible. Also look for complex data transforms, especially those involving regex.
  • If the aggregation queue is large, look for issues in the data that could lead to poor timestamp extraction.
  • If the typing queue is large, consider simplifying regex replacement logic or annotations.
  • If the indexing queue is large, look for bottlenecks at the destination (local or remote disk speed, or network transfer speeds). Increasing the number of hot buckets might also help when using storage with higher latency.

The final possibility is that the deployment is undersized, and the queues are growing because the system is too small to handle the demand. If all attempts to reduce queues don’t help, consider adding indexers.

Buckets

This section shows the bucket counts and sizes in your system and can inform you about timestamp issues leading to poor search performance.

A large number of small buckets can indicate an issue. The fewer files that need to be opened for each search, the better. If warm bucket sizes are significantly below the bucket size settings in your configuration, this suggests an issue parsing timestamps or that older events are being ingested along with newer ones in the same time span. Ensure timestamps are correct in your indexed data, and ingest event data with homogeneous time spans.

When disk performance is an issue, large buckets can also be a potential performance problem. In some cases, storage performs better with more frequent writes of smaller amounts of data. If this is the case in your environment, consider lowering the maximum bucket size to roll hot buckets to warm sooner. The SplunkOptimize process will consolidate these buckets to restore the benefits of fewer buckets.

You should optimize your bucket size for the typical time ranges over which you search.

Learn more about buckets in Splunk Help.

Search Metrics: Metrics overview

Search Concurrency

This is arguably the most important metric to monitor for a stable system. Search concurrency is a search head (or search head cluster) statistic that shows how many searches are being processed or waiting to be processed. A spike in search concurrency above the system’s capacity will quickly degrade performance and put the system in a state that can take a long time to recover from. Your system can process as many concurrent searches as there are CPU cores on a single indexer (if the indexers are non-uniform, the smallest one); everything beyond that gets queued. When search queuing starts, the recovery time back to stable depends on how many searches were queued during the spike and what the stable incoming search rate is. Think of it like going to grab a coffee just as a busload of tourists arrives in front of you: in just a few moments, a long line has formed, and it will take time for that line to shrink again.

During times of high concurrency, system resources become overburdened, and search failures become increasingly common. Other processes also take longer, leading to unexpected behaviors in other parts of the system (for example, missing health check data, slower bundle replication).
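
The dependence of recovery time on the queued backlog and the steady incoming search rate can be sketched with a deliberately simplified model. It assumes the system completes searches at a constant rate and new searches arrive at a constant rate, which real deployments do not guarantee. The function name recovery_minutes and all numbers are hypothetical.

```python
import math


def recovery_minutes(queued, completion_rate, arrival_rate):
    """Estimate minutes until a search queue drains.

    queued: searches waiting when the spike ends
    completion_rate: searches the system can finish per minute
    arrival_rate: new searches arriving per minute in steady state
    Returns math.inf when arrivals match or exceed completions.
    """
    spare = completion_rate - arrival_rate
    if spare <= 0:
        return math.inf
    return queued / spare


# Hypothetical spike that left 120 searches queued.
print(recovery_minutes(120, completion_rate=32, arrival_rate=28))  # 4 spare per minute
print(recovery_minutes(120, completion_rate=32, arrival_rate=31))  # 1 spare per minute
print(recovery_minutes(120, completion_rate=32, arrival_rate=32))  # no spare capacity
```

In this model, a steady load that sits just below capacity leaves little spare throughput, so the same backlog takes several times longer to clear.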

You should look for the following indicators:

  • Any spike in concurrency above the CPU core count of a single indexer will lead to search queuing. Due to the method used to gather search concurrency metrics and the very fast nature of some searches, a small amount of search queuing can be tolerated. The magnitude and duration of the spike, and the period between spikes, can amplify negative effects. It only takes a small amount of extra load to increase a spike from a few seconds to several minutes.
  • When using many saved searches or apps like Splunk Enterprise Security, it is common to see spikes in searches every 15 minutes, with the largest at the top of the hour. Spikes could also happen daily or even yearly, so inspect an appropriate time range. To reduce these spikes, review your saved searches (including data model acceleration (DMA) and correlation searches) and, where possible, change their CRON schedule so they start at a less busy time. You can also use schedule skewing to let the Splunk platform spread the searches out over a defined time range.
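
One way to see where scheduled searches pile up is to count how many start in each minute of the hour. The sketch below is an illustration under stated assumptions, not a PerfInsights feature: the cron strings are hypothetical, only the minute field is examined (the other fields are treated as if they were "*"), and the parser handles only "*", "*/n", single values, and comma-separated lists.

```python
from collections import Counter


def minutes_for(minute_field):
    """Expand a cron minute field into the set of minutes (0-59) it matches.

    Supports '*', '*/n', single values, and comma lists only.
    Ranges such as '0-10' are not handled and raise ValueError.
    """
    minutes = set()
    for part in minute_field.split(","):
        if part == "*":
            minutes.update(range(60))
        elif part.startswith("*/"):
            minutes.update(range(0, 60, int(part[2:])))
        else:
            minutes.add(int(part))
    return minutes


def starts_per_minute(cron_schedules):
    """Count scheduled search starts for each minute of the hour."""
    counts = Counter()
    for schedule in cron_schedules:
        counts.update(minutes_for(schedule.split()[0]))
    return counts


# Hypothetical saved search schedules.
schedules = [
    "*/15 * * * *",
    "*/15 * * * *",
    "0 * * * *",
    "0 * * * *",
    "*/5 * * * *",
    "7 * * * *",
]
counts = starts_per_minute(schedules)
minute, starts = counts.most_common(1)[0]
print(f"busiest minute: {minute} with {starts} search starts")
```

With these sample schedules the top of the hour carries the most starts, which matches the pattern described above. Moving a search to a quieter minute, or letting the platform skew it, lowers that peak.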

The red line in the chart below represents the number of CPUs on a single indexer: the maximum safe concurrency limit.

[Chart: search concurrency with a red line at the single indexer CPU count]

As concurrent search counts begin to exceed the indexer CPU count, CPUs are at capacity, as shown in the chart below.

[Chart: CPU usage at capacity as search concurrency exceeds the indexer CPU count]

An increase of 16 searches per hour (less than 1%) can be the difference between no skipped searches and several, as shown in the chart below.

unnamed - 2025-09-04T123829.648.png

The key to a stable Splunk platform system is sizing the deployment for your maximum search concurrency. Ideally, search load can be smoothed to almost flat, and the average search concurrency and 90th percentile search concurrency will be close to each other.
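
As a rough illustration of comparing the average with the 90th percentile, the sketch below summarizes two hypothetical series of concurrency samples. The sample values are invented, and the nearest-rank percentile used here is an assumption that might differ from the method PerfInsights uses.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return (average, 90th percentile) for concurrency samples."""
    return sum(samples) / len(samples), percentile(samples, 90)


# Hypothetical concurrency samples taken at regular intervals.
smooth = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11]
spiky = [4, 4, 4, 4, 4, 4, 4, 4, 30, 40]

smooth_avg, smooth_p90 = summarize(smooth)
spiky_avg, spiky_p90 = summarize(spiky)
print("smooth:", smooth_avg, smooth_p90)
print("spiky: ", spiky_avg, spiky_p90)
```

Both series average roughly 10 concurrent searches, but the spiky series has a 90th percentile of 30, so a deployment sized for the average would queue searches during every spike.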

You can add an overlay to your search concurrency chart in PerfInsights showing your single indexer CPU count. Keep your search concurrency below that line for best performance.

Search Runtimes

Search runtimes provide an overview of search performance. Generally, you want search runtimes to be as short as possible. Long ad hoc searches are what users will likely notice most. Ask yourself how long you’d be willing to wait for a search to return (for example, a maximum of 60 seconds), and try to keep your searches under that value. This might mean reducing the time period over which your searches run, improving the search SPL using more efficient search commands, or using base searches or job loading in dashboards.

It is normal for DMA and correlation searches to run longer, but they should still be less than one minute on average. Long-running searches risk running into the time allotted for the next search of their kind, leading to skipped searches. Disable any unnecessary and unused DMAs and correlation searches.

Watch out for searches that are getting longer over time. At first glance, the runtime of a search over time might appear flat. Adjust the timeline and chart scale to confirm this. Searches that increase in runtime over time are likely scanning over all time. This can eventually lead to exhausted resources or search timeouts.

Over one hour, with the chart scaling to 100 seconds, all searches look stable, as shown in the chart below.

unnamed - 2025-09-04T124045.326.png

Over one day, with the chart limited to 40 seconds, one search shows a potential problem, as shown in the chart below.

unnamed - 2025-09-04T124049.046.png
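
A gradual increase that is hard to see on a chart can also be checked numerically. The sketch below fits a least-squares line to a series of runtimes and reports its slope. The runtime values are hypothetical, and deciding what slope counts as a problem is left to you.

```python
def runtime_slope(runtimes):
    """Least-squares slope of runtime (seconds) per sample.

    A clearly positive slope suggests the search is getting slower.
    """
    n = len(runtimes)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(runtimes) / n
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(runtimes)
    )
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical daily average runtimes (seconds) for two searches.
stable = [12.0, 11.5, 12.5, 12.0, 11.5, 12.5, 12.0]
growing = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0]

print(round(runtime_slope(stable), 3))
print(round(runtime_slope(growing), 3))
```

The second series grows by about half a second per day, which is easy to miss on a chart scaled to 100 seconds but adds up over weeks.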

Search Counts

Search counts provide a quick view of the trend in the currently selected time span compared with the previous span. This can tell you if recent changes to search load have taken effect. You should look for unexpected increases or decreases in search load from the previous period.

Skipped and Failed Searches

Skipped and failed search charts provide a view into system stability. Searches can be skipped or can fail for several reasons, but both outcomes can indicate performance issues.

You should check to see if any skipped or failed searches correlate with excess search concurrency. If they do, address that problem first; you might find this problem goes away. If search concurrency is not an issue, check search runtimes and failure causes for scheduled searches, especially for DMA and correlation searches. It could be that searches are skipping because they are taking too long and overlapping the next scheduled run time.

Search Counts and Runtimes per Search Head

Breaking out search metrics by search head allows you to see if any search head is experiencing a problem. Depending on your configuration, in clustered search head environments, you might see one search head (the captain) with lower search volume than the others. This is normal. In any other case, a search head expected to be a peer of the others should be processing nearly the same amount of load.

Ideally, all else being equal, search load is evenly distributed across all search heads. The search head cluster captain is the exception: it is normal for its load to differ from the others.

If a search head is processing a very different amount of load than its peers, as shown in the chart below, configuration settings might be controlling that. Compare the server.conf values for the search heads to ensure you have the correct rules.

unnamed - 2025-09-04T125058.883.png

Search Metrics

These metrics help you identify slow and inefficient searches and can provide clues on how to improve them. Search efficiency is a topic too extensive for this document, but when you find searches that are taking a long time or using a lot of system resources, use other Splunk platform resources to improve them. Find more information in Learn SPL command types: Efficient search execution order and how to investigate them and Optimizing search.

Search Runtime Trend (DMA, correlation, saved, and ad hoc searches)

This chart shows the top 10 most expensive searches (in terms of time) and how their runtimes vary over time. These are the searches that will have the most impact when you aim to improve system efficiency.

You should look for any search times that are increasing over time. This could indicate that resources are becoming exhausted, search concurrency is too high, or searches are querying over all time.

A small percentage increase in search counts, at the wrong time, can lead to a large increase in runtimes. In the chart shown below, a 1% increase in search load led to a 15% increase in runtimes.

unnamed - 2025-09-04T125640.539.png

Search Performance

The search performance table allows you to compare search time averages and extremes. The table also shows scan counts and results counts.

The closer the runtime extremes are to the averages, the better. Large gaps can indicate potential performance problems. Look for times when searches were running longer and compare them with other system behaviors at that time. High scan counts, especially with low event counts, can indicate inefficiencies in your searches. When you see searches with high scan counts, try to lower them by adding more filters (particularly indexed fields) and tightening time spans where possible.
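
One way to shortlist candidates is to compare how many events a search scans with how many it returns. The sketch below assumes you have copied a few rows from the search performance table by hand. The field names (`name`, `scan_count`, `result_count`) and both thresholds are hypothetical and should be adjusted to your environment.

```python
def scan_efficiency(result_count, scan_count):
    """Fraction of scanned events that ended up in the results."""
    if scan_count == 0:
        return 1.0
    return result_count / scan_count


def inefficient_searches(rows, min_scan=1_000_000, max_efficiency=0.01):
    """Return names of searches that scan a lot but return little."""
    return [
        row["name"]
        for row in rows
        if row["scan_count"] >= min_scan
        and scan_efficiency(row["result_count"], row["scan_count"])
        <= max_efficiency
    ]


# Hypothetical rows copied by hand from the search performance table.
rows = [
    {"name": "auth_failures", "scan_count": 5_000_000, "result_count": 1_200},
    {"name": "web_errors", "scan_count": 800_000, "result_count": 790_000},
    {"name": "dns_rare", "scan_count": 20_000_000, "result_count": 15_000_000},
]
print(inefficient_searches(rows))  # ['auth_failures']
```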

If your event or result counts are unexpectedly low, you might be hitting configuration limits. Limits exist for saved searches, subsearches, and commands like join or top. To configure limits, edit the advanced search properties or create local *.conf files to override defaults.

Find more information in the Admin Manual on limits.conf and savedsearches.conf.

Search Resource Usage (DMA, correlation, saved, and ad hoc searches)

Search CPU and memory usage charts show which searches are consuming the most resources. Since all searches compete for the same resources, anything you can do to reduce resource usage positively affects the system as a whole.

For the most expensive searches, in terms of CPU and memory, look for efficiencies. For example, ensure filters on indexed fields use the indexed field equality operator (::), and avoid inefficient commands like join.

Searches consuming a lot of resources, like those shown in the table below, are good candidates for optimization.

unnamed - 2025-09-04T125733.611.png

Find more information in Splunk Help on Using summary indexing and Troubleshooting high memory usage.

Search Skip/Fail Rate/Details (DMA, correlation, and saved searches)

Skipped searches are never a good sign. Ideally, the skipped search rate should be zero. For any search type with a non-zero skip/fail rate, check the details to see the reason for the failure.

Too much search concurrency of one type of search, as shown in the table below, can cause other types to fail.

unnamed - 2025-09-04T125825.626.png

If you see that the maximum concurrent instances of that search have been exceeded, this usually indicates that the previous search is running too long and overlapping the scheduled time for the next search. You can make the search more efficient or, if your searches allow it, reduce the time range of the search or increase the time between scheduled searches.
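
The overlap described above can be sketched with a small simulation. It assumes, for illustration only, that a scheduled search may have one running instance at a time and that a run is skipped when the previous instance is still executing. The interval and runtimes are hypothetical.

```python
def count_skipped(runtimes, interval):
    """Count scheduled runs skipped because the previous run is still going.

    `runtimes[i]` is how long run i would take (seconds) if it started;
    runs are scheduled every `interval` seconds starting at time 0.
    """
    busy_until = 0
    skipped = 0
    for index, runtime in enumerate(runtimes):
        start = index * interval
        if start < busy_until:
            skipped += 1  # previous instance has not finished yet
        else:
            busy_until = start + runtime
    return skipped


# Hypothetical search scheduled every 5 minutes (300 seconds).
runtimes = [120, 350, 100, 100, 700, 100, 100, 100]
print(count_skipped(runtimes, interval=300))  # 3
```

Only two of the eight runs exceed the interval, yet three runs are skipped, because one very long run blocks more than one scheduled slot.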

If you see that the maximum concurrent searches for the cluster have been reached, you are likely experiencing other resource issues too. You can add admission rules or user limits on concurrency to avoid this error, but remember that this will cause more ad hoc searches to fail. A better approach is to try to flatten any search spikes by spreading scheduled searches out more. You might also need to consider that your system is undersized for the amount of search load you are generating.

Long-Running Searches (saved and ad hoc searches)

The Long-Running Search table provides insight into which searches could be causing performance issues on the system. It is not uncommon to have searches that take several minutes to complete, such as those covering large time ranges. Searches of that nature, while sometimes necessary, can negatively affect the system as a whole, as they consume resources and reduce capacity for other concurrent searches.

If you see a long-running search that is repeated frequently like the one in the table below, you can optimize the SPL, or run it less frequently or over a shorter time span.

unnamed - 2025-09-04T125918.150.png

If you see a search that has been running for an unexpectedly long time (for example, hours or days), check to ensure that your search limits and timeouts are being enforced. If possible, modify the SPL to avoid runaway searches in the future.

Frequently Run Searches (ad hoc searches)

Frequently run searches are a good area for optimization. Because these are ad hoc searches, you might not be able to change the SPL, but you can change how users access that data.

If you see expensive and frequently run ad hoc searches, consider creating a saved search with a macro to allow users to access the data without having to run the search themselves.

Resource Monitoring

Search Head CPU

Search head CPU is not often a problem area. Most of the heavy lifting in search is done by the indexers. The search heads are responsible for non-streaming and non-distributable streaming commands but often work on much smaller, filtered datasets.

If you see high CPU usage on search heads, see if you can restructure your searches to filter out more data and move distributable streaming commands ahead of any non-streaming or non-distributable commands. Search optimization is a large topic and can take time to master. Generally, you should follow these guidelines:

  • Decrease time ranges.
  • Use only necessary indexes.
  • Filter out as much data as possible before non-streaming or transforming operations.
  • Filter on indexed fields and use the :: operator.
  • Use the TERM() operator where possible for text searches.
  • Use tstats instead of stats if you don’t need the raw data.

Find more information in Learn SPL command types: Efficient search execution order and how to investigate them, Optimizing search, and Use CASE() and TERM() to match phrases.

Search Head Memory

Search head memory grows as search queues fill and with searches that return many results for non-streaming processing. Job retention times also affect search head memory.

Search head memory growth can indicate a potential problem and is often a symptom of high search concurrency. When searches are completed and their job lifetime has elapsed, memory is released. If memory is growing, then either search concurrency is increasing, or current searches are returning more results than usual.

If you see search head memory growth, try to restructure searches to do more filtering and aggregation on the indexers. You might also want to consider reducing the default job retention settings to allow memory to be freed sooner. As always, try to reduce the amount of concurrent search spikes.

The Key/Value Store (KV Store) is a possible exception to concerns about increasing memory. For performance reasons, the KV Store is configured to use a large percentage of system memory, and it is normal to see its memory use grow for days to weeks after a system starts. Check that memory use tapers off and eventually plateaus.

Search Head CPU and Memory By Process

By looking at the per-process values for CPU and memory, you can narrow down root causes. When investigating an issue, examining the resource usage for each process can show you what types of searches are problematic.

In a stable system, you should expect that CPU and memory are also stable. One exception would be when a system has recently started or undergone a major change; in those cases, you might expect to see either a step (up or down) in the graphs, or a gradual increase in KV Store memory.

An increase in KV Store memory might be caused by caches continuing to fill. The database that backs the KV Store is allowed to use a large amount of system memory. It is not uncommon for this memory to grow for days or even weeks after a system restart. If memory grows beyond 20% of existing RAM, monitor it more closely. You might have some lookups that are growing quickly in an unbounded way.

Rolling Restarts

Rolling restarts tell you when your system was intentionally restarted. System behaviors after a restart do not reflect the steady state, so knowing when a restart happened allows you to formulate more informed explanations for observed behaviors.

Look for a single bar on this chart at each time the search heads were intentionally restarted.

Running Search Heads

Sometimes system behavior might resemble behavior after a restart, even if no restart occurred. The running search heads chart looks for gaps in search head logs that might indicate an unexpected search head restart.

The chart should remain steady, showing the total count of search heads in the cluster. Any dip below that value might indicate a search head crash. It is possible, though rare, that this chart gives false positives showing a dip even when all servers were running. Use this chart to confirm suspicions of search head crashes.

Indexer CPU

Searches are split across all indexers. If indexes are well balanced, each part of the search will complete in a similar amount of time. While processing a single search part, the indexer dedicates one CPU fully to that task.

During times of high search concurrency, well-balanced indexers will all be running at or near 100% CPU. Your goal is to minimize the time spent in this state to allow more searches to run, thus increasing search throughput. Minimizing time spent on the indexers involves reducing the time range over which the search runs and using as many indexed fields as possible for filtering. Ensure your searches do not query for excess data by searching beyond the range you need.

Spikes in CPU usage could indicate too many scheduled searches starting at the same time (for example, scheduled searches running at the top of the hour). If searches do not need to start at a particular time, try to spread their start times out.

Searches that run on the same schedule can cause CPU spikes, as shown in the chart below, where a small amount of extra load can cause significant increases in search times and lower success rates.

unnamed - 2025-09-04T130648.473.png

If any indexer differs drastically in CPU usage, you might have unbalanced indexes. Ensure that your forwarders send data to all indexers instead of targeting a particular indexer; in practice, this means not pointing forwarders at individual indexer IP addresses. You might also consider performing a data rebalance, but be aware that this is a lengthy process.
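
A simple way to spot an outlier is to compare each indexer with the median of the group. The sketch below is a minimal example: the indexer names and CPU values are hypothetical, and the 20-point tolerance is an assumption, not a recommended threshold.

```python
from statistics import median


def unbalanced_indexers(cpu_by_indexer, tolerance=20.0):
    """Return indexers whose CPU % is more than `tolerance` points
    away from the median of all indexers."""
    typical = median(cpu_by_indexer.values())
    return sorted(
        name
        for name, cpu in cpu_by_indexer.items()
        if abs(cpu - typical) > tolerance
    )


# Hypothetical average CPU utilization (%) per indexer.
cpu_by_indexer = {"idx1": 62.0, "idx2": 58.0, "idx3": 95.0, "idx4": 60.0}
print(unbalanced_indexers(cpu_by_indexer))  # ['idx3']
```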

Indexer Memory

Indexer memory is directly related to the combined ingest and search load.

If indexer memory is nearing its limits, ensure that you need (or might need in the future) all the data you are ingesting. Reduce the date range of your searches and use more filters when possible. Also consider an instance type with more RAM.

Indexer CPU and Memory By Process

By looking at the per-process values for CPU and memory, you can narrow down root causes. When investigating an issue, examining the resource usage for each process can show you what types of searches are problematic.

In a stable system, you should expect that CPU and memory are also stable. If CPU or memory is gradually increasing, your searches might be scanning more data over time. If you don’t need the full history, try changing searches over "all time" to a fixed time period instead.

User Limits

The user limits page allows you to see values that are often quota-restricted, either system-wide or per user. Sometimes users will experience failures and not know why, or possibly even be unaware of a failure at all. This can often be the result of a quota limit causing a silent failure behind the scenes.

Match user usage against user quotas to see if any quotas are being exceeded. Adjust quotas if necessary, or consider running saved searches under a different account and giving users access to the data via a dashboard or consolidated view.

You can modify the chart SPL to include actual quotas as an overlay.

Splunk Features

Bundle Replication Size/Time

The charts in the Bundle Replication section show how knowledge objects are transferred between search peers. Replication times are usually fast (milliseconds), but an occasional replication taking hundreds of milliseconds is not uncommon. Changing lookups, modifying configurations, or installing apps will trigger longer replications.

Stability is key here. If replication times or bundle sizes are increasing, check to ensure your lookups are not growing unexpectedly. While lookups can be dynamic, be careful not to allow unlimited growth. For example, if you add 20 bytes each to 10 fields in a lookup four times an hour, after a year, you will be replicating over 7MB every 15 minutes for this lookup alone. While 7MB can replicate quickly, it increases your overall replication times, and during peak times, missed replications can cause other usability issues.
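
The arithmetic behind the 7MB figure can be checked as follows. This sketch assumes the lookup starts empty, ignores any file format overhead, and uses decimal megabytes.

```python
BYTES_PER_FIELD = 20
FIELDS = 10
UPDATES_PER_HOUR = 4
HOURS_PER_YEAR = 24 * 365

growth_per_update = BYTES_PER_FIELD * FIELDS           # 200 bytes
updates_per_year = UPDATES_PER_HOUR * HOURS_PER_YEAR   # 35,040 updates
lookup_bytes_after_year = growth_per_update * updates_per_year

print(lookup_bytes_after_year)  # 7008000
print(lookup_bytes_after_year / 1_000_000, "MB")
```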

Modifying lookups with tens of thousands of entries can cause bundle replication sizes to balloon from hundreds of bytes to megabytes, as shown in the chart below. Try to keep lookups as small and as static as possible.

unnamed - 2025-09-04T130951.516.png

Increased bundle replication size leads to longer replication times, as shown in the chart below. Larger bundles can also increase data transfer costs.

unnamed - 2025-09-04T131011.650.png

Erratic bundle replication times could indicate network saturation, especially if many search heads are involved (since each bundle needs to be replicated to every search peer).

SmartStore Performance

If using SmartStore, these values show your data transfer rates and warm bucket eviction rates.

Slow data transfer speeds will affect performance. Work with your traffic engineers to ensure data transfer is configured optimally. Slow eviction speeds might indicate disk resource contention on the local disk.

Cache Hit Rates

A high cache hit ratio means that you aren’t having to fetch data from the SmartStore very often.
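
For reference, a hit ratio is conventionally the share of requests served from the local cache. The sketch below uses hypothetical hit and miss counts and might not match exactly how PerfInsights derives its value.

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no requests."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Hypothetical counts for one time span.
print(cache_hit_ratio(9_500, 500))  # 0.95
```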

Bucket/Data Model Statistics

Bucket and data model statistics show your upload, download, eviction, and removal rates over time.

Upload and removal charts primarily relate to your ingest rate and retention settings. Pay attention to download and eviction rates. If those are too high, you might be searching back over too large a date range, so adjust your searches to look back over a smaller range, or increase your cache limits. If they are low, you might be using more local storage than intended and you should decrease your cache limits.

Assets and Identities Statistics

The assets and identities section shows the size of those lookups.

Large lookups can cause several performance issues, including slow searches and slow bundle replication. Although these lookups can support hundreds of thousands of entries, consider splitting any lookup that exceeds 50K entries into smaller ones where that makes sense.

Notable Events

Generating and processing notable events triggers many actions behind the scenes. An upward trend in notables can lead to performance issues.

Unless there is an active incident, correlation searches should be tuned so that notables remain steady over time. If there is an upward trend, or you suspect notables are too high, use the table to determine which correlation searches might need higher thresholds.

Environment Diagnose

Log Trends Overview, Distribution, and Top 100

Log trends are a good indication of potential performance issues, even if the logged event itself has nothing to do with performance.

An upward trend in warning and error logging clearly indicates a problem. Events that cause errors and warnings can often come with retry logic that consumes system resources. You should strive to eliminate all errors and as many warnings as possible. While zero is unrealistic, the goal should be as few as possible. Focus on repeated errors that occur in close proximity, as these are likely retries.
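
To illustrate what "repeated errors in close proximity" can look like, the sketch below groups identical messages and flags those that occur several times in quick succession. The event tuples, the five-second gap, and the minimum of three repeats are all hypothetical choices, not values used by PerfInsights.

```python
from collections import defaultdict


def likely_retries(events, max_gap_seconds=5, min_repeats=3):
    """Return messages repeated at least `min_repeats` times in a row with
    no more than `max_gap_seconds` between consecutive occurrences.

    `events` is an iterable of (epoch_seconds, message) tuples.
    """
    times_by_message = defaultdict(list)
    for timestamp, message in events:
        times_by_message[message].append(timestamp)

    flagged = []
    for message, times in times_by_message.items():
        times.sort()
        run = 1      # length of the current burst of closely spaced events
        longest = 1  # longest burst seen for this message
        for previous, current in zip(times, times[1:]):
            run = run + 1 if current - previous <= max_gap_seconds else 1
            longest = max(longest, run)
        if longest >= min_repeats:
            flagged.append(message)
    return sorted(flagged)


# Hypothetical (timestamp, message) pairs taken from error logs.
events = [
    (100, "connection refused"), (102, "connection refused"),
    (104, "connection refused"), (106, "connection refused"),
    (100, "disk quota warning"), (900, "disk quota warning"),
    (1800, "disk quota warning"),
]
print(likely_retries(events))  # ['connection refused']
```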

Additional resources

Download the app on Splunkbase to get started.

In addition, these resources might help you understand and implement this guidance:

Topic | Link
Bucket size | Indexes.conf
Bundle replication | Troubleshoot knowledge bundle replication
Configuration files | List of configuration files
Efficient SPL |
Indexer health | Restart the entire indexer cluster or a single peer node
Report scheduling |
Search head health | Use the CLI to view information about a search head cluster
Search windowing and skewing | Schedule windows vs. skewing
User limits | Authorize.conf