
Splunk Lantern

Maximizing SVC usage in Splunk Cloud Platform

If you are new to Splunk Cloud Platform, or even if you've been a customer for a while, you might not be sure what Splunk Virtual Compute is. The official definition is “A Splunk Virtual Compute (SVC) is a unit of capabilities in Splunk Cloud Platform that includes compute, memory, and I/O resources.” But what’s more important is to understand the three different ways in which SVCs are used:

  • A purchasing SKU.
  • A number that Splunk uses to determine how much infrastructure to provision for your Splunk Cloud Platform environment. 
  • A measure of entitlement usage, reported as SVC usage. However, SVC usage is not a measure of health, capacity, or performance. 

This last point is especially important and is the focus of this article, which looks at actual measures of health, usage, and capacity. Optimizing your search workload and ingest workload will allow you to do more with the SVCs you already have.

This article describes an experiment (presented in the .conf25 PLA1033 talk) conducted by two Splunk Enterprise Solutions Architects to optimize SVC usage. By replicating some or all of their actions, you can improve your Splunk Cloud Platform environment as well. 

The experiment

The two architects had a theory that if a workload (in terms of data being ingested and searches being run) running in a Splunk Cloud Platform environment were optimized, additional capacity would be freed to run additional workloads (more data being ingested or searches being run). Although this additional capacity might not be reflected in SVC usage, it would be reflected in other, more meaningful metrics related to health, capacity, or performance. 

The design of the experiment involved two Splunk instances with 100 SVCs each. They both used the same data, and throughout the experiment, none of the data streams were turned off and nothing was changed about the data sent to each environment. What did change was the quality of data ingested into Splunk Cloud Platform, and the quality of the searches run.

The questions addressed in the next three sections are:

  • Define metrics that matter: Which measures and metrics really matter for Splunk Cloud Platform environment health?
  • Optimization efforts: How did the architects optimize the existing ingest and search workload?
  • Results: How much better do those meaningful metrics and measures look now?

Metrics that matter

The metrics sections are split between leading and lagging indicators.

Ingest performance

When we mention ingest performance, we really mean ingesting and indexing all data with no backpressure to the client and minimal latency. 

Leading indicator:

  • Ingestion queue fill percentage: Queues buffer data, and this is a measure of how full that buffer is. If the queue regularly fills, it will likely lead to backpressure being applied to the source in the form of blocked ingestion queues (primarily for S2S) and/or HEC 503 errors. Queue fill percentage should be kept as low as possible. Ingestion queue fill percentage can be found on the “Indexing performance” dashboard in the Cloud Monitoring Console. 
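
    If you prefer a search over the dashboard, queue fill can also be approximated from indexer metrics.log data. This is a sketch rather than an official CMC search; the indexer host naming pattern is an assumption and may differ in your environment:
      index=_internal host=idx* source=*metrics.log sourcetype=splunkd group=queue
      | eval fill_pct = round((current_size_kb / max_size_kb) * 100, 2)
      | timechart span=1h perc95(fill_pct) AS p95_fill_pct by name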

Lagging indicator: 

  • Smartbus latency: Smartbus is a highly scalable, durable, and resilient queuing and buffering system on the Splunk Cloud Platform Victoria environment and Splunk Enterprise 10.2 and later that sits between the typing and indexing pipelines of the Splunk platform. The Splunk platform uses this as a buffer when the indexers don't have enough resources to index all the data they need to, or when they are unable to roll buckets from hot to warm. This metric should also be minimized. Smartbus latency isn’t shown in the Cloud Monitoring Console at the time of writing this article, but it can be found using this base search:
    index=_internal source=/opt/splunk/var/log/splunk/metrics.log host=idx* group=smartbus name=indexer series=per_message_stats
    | eval lifecycle_time_s = lifecycle_time_ms / 1000
    | timechart span=1h avg(lifecycle_time_s) AS avg_lifecycle_time, p99(lifecycle_time_s) AS p99_lifecycle_time

Search performance

Leading indicators:

  • Search latency: The time difference between when a search was scheduled or requested and when Splunk Cloud Platform actually starts running it. Search latency can be found on the “Scheduler activity” dashboard in the Cloud Monitoring Console. Some amount of latency is expected, especially when search skew is configured, but usually this number should be kept as low as possible.
  • High search concurrency: This is a count of how many searches are running in parallel. Spikes in concurrency can drive CPU and I/O contention, and can lead to skipped or slow search results. Concurrency can also be found on the “Scheduler activity” dashboard in the Cloud Monitoring Console. Ideally, this measure is as flat over time as possible, with Splunk Cloud Platform doing the same amount of work over time instead of having to contend with a spiky workload.

Lagging indicators:

  • Skipped searches: Searches are skipped when concurrency capacity is fully utilized and no additional searches can be queued, or when individual scheduled searches overlap and prevent new instances from starting. This can result in stale reports, dashboards, and data in data models. Skipped searches can be monitored on the “Skipped scheduled searches” dashboard in the Cloud Monitoring Console. Ideally, there are no skipped searches.
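
    As a supplement to the dashboard, skipped searches can also be listed directly from the scheduler logs. This is a sketch based on standard scheduler.log fields:
      index=_internal sourcetype=scheduler status=skipped
      | stats count AS skipped_count by savedsearch_name, reason
      | sort - skipped_count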

Miscellaneous

These are metrics that can be used to determine system capacity, performance, and health, but that are not tied strictly to ingest or search performance. 

  • Workload descriptors
    • Daily ingest amounts: The amount of data being ingested by Splunk Cloud Platform. This can be found on the Overview dashboard in the Cloud Monitoring Console.
    • Daily search counts: The number of searches the environment runs each day. This can also be found on the Overview dashboard in the Cloud Monitoring Console. 
  • Resource usage
    • Average CPU usage: Ideally, this value is below 80%. This can be monitored on the indexers using this base search: 
      index::_introspection sourcetype::splunk_resource_usage source::/opt/splunk/var/log/introspection/resource_usage.log host::idx*.splunkcloud.com TERM(Hostwide)
      | eval cpu_used_pct = ('data.cpu_user_pct' + 'data.cpu_system_pct')
      | timechart span=1h avg(cpu_used_pct)
    • Average memory usage: Ideally this value is below 60%. This can be found on the Overview dashboard in the Cloud Monitoring Console.
    • CPU seconds by workload category: This is a measure of how much compute time is spent indexing data and separately, searching the data. This measure can help give you an idea of where your compute resources are being consumed in your Splunk Cloud Platform environment. This can be monitored using this search:
      index::_introspection (host::sh* OR host::idx*) sourcetype=splunk_resource_usage TERM(PerProcess) TERM(normalized_pct_cpu)
      | rename data.workload_pool_type as workload_pool_type
      | timechart span=1h sum(data.normalized_pct_cpu) as sum_normalized_pct_cpu by workload_pool_type 
    • Indexer normalized load average: This is a measure of how many tasks on average each CPU thread is being assigned. A value over 1 means the CPU threads are oversubscribed, so ideally this value is below 0.8 to allow for bursts in workloads without oversubscription. This can be measured using this search:
      index::_introspection sourcetype::splunk_resource_usage source::/opt/splunk/var/log/introspection/resource_usage.log host::idx*.splunkcloud.com TERM(Hostwide)
      | timechart span=1h avg(data.normalized_load_avg_1min)

The optimizations

These two sections describe the actions taken on the environment during the experiment to optimize both the data being ingested and searches being run. These are the steps you can take to improve performance in your environment. 

Ingest optimizations

These are the changes the architects made to improve ingest performance.

  • Data quality (aka Great 8): There is another Lantern article that explains the Great 8 configurations, so this point won't go into much detail. However, the settings in question are:
    • Event breaking: SHOULD_LINEMERGE, LINE_BREAKER, TRUNCATE, EVENT_BREAKER_ENABLE, EVENT_BREAKER
    • Time parsing: TIME_PREFIX, MAX_TIMESTAMP_LOOKAHEAD, TIME_FORMAT
  • Index-time field extractions: These are additional fields added to each event at index time, and they require Splunk Cloud Platform to do more work during ingestion. Ideally, these are used sparingly. Specifically, the INDEXED_EXTRACTIONS setting in props.conf controls this (and for sourcetypes that use indexed extractions, KV_MODE should usually be set to none so the same fields are not extracted again at search time).
  • Reduce ingested data amounts by using technologies like Edge Processor, Ingest Processor, and ingest actions. These allow you to optimize or reduce the amount of data being ingested into the Splunk platform to include only necessary data. Edge Processor and ingest actions have no additional charge and can also be used with Splunk Enterprise 10 and later, whereas Ingest Processor has a free tier of up to 500 GB/day in Splunk Cloud Platform.
  • Modify data formats to write less to disk at index time: This can involve converting data from verbose formats such as JSON to more compact formats such as CSV. This is an advanced change and needs to be done with caution.
  • Optimize ingest actions rules. Ingest actions rulesets (especially those using regular expressions) can be written inefficiently, unnecessarily increasing the amount of work Splunk Cloud Platform needs to do while ingesting data. 
  • Offload ingest actions and/or props/transforms work to Edge Processor or Ingest Processor: Because ingest actions rulesets run on Splunk indexers hosted in Splunk Cloud Platform, moving the work being done by ingest actions to an external data processor (such as Edge Processor or Ingest Processor) is an effective way to reduce the amount of work being done by the indexers. 
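
To illustrate the Great 8 settings mentioned above, here is a props.conf sketch for a hypothetical single-line JSON sourcetype. The stanza name, timestamp field, and time format are assumptions and must be adapted to your actual data:

    [my_custom:json]
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)
    TRUNCATE = 10000
    EVENT_BREAKER_ENABLE = true
    EVENT_BREAKER = ([\r\n]+)
    TIME_PREFIX = \"timestamp\":\s*\"
    MAX_TIMESTAMP_LOOKAHEAD = 30
    TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z

Setting these explicitly avoids the expensive automatic line merging and unbounded timestamp scanning that the Splunk platform falls back to when they are missing.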

Search optimizations

During the experiment, these were the settings the two architects changed to optimize the search workload being run. 

  • Search time ranges: Generally, the larger the time range a search looks over, the longer it will run, and the more resources the Splunk platform will consume to complete the search. This is especially true in a SmartStore-backed environment, like Splunk Cloud Platform, because searches with long time ranges can cause a high number of cache misses (contributing more to indexer cache churn). Therefore, optimizing (reducing) the time ranges of searches can be an effective way to reduce Splunk platform resource usage.
  • Pre-compute results: Pre-computing and storing repeatedly-viewed search results into a lookup (such as a KV Store or CSV) and then loading those stored search results can be an effective way to reduce the number of searches being run in the environment. For example, if a team of developers all tend to independently load the same dashboard every day, consider configuring the dashboard to load data from a lookup table, KV Store, or saved search results, as opposed to running the searches every time the dashboard is loaded. 
  • Data model accelerations: Turn off any accelerated data models that aren’t being used regularly.  
  • Use tstats and accelerated data models: Consider rewriting searches to use tstats and/or accelerated data models. Both of these options can drastically reduce search runtime because they allow Splunk Cloud Platform to search only TSIDX-style files rather than raw events. 
  • Real-time searches: Real-time searches consume a significant amount of resources because each one requires the Splunk platform to continuously run an expensive search across the environment, occupying one concurrent search slot indefinitely. The reverse is also true: removing a real-time search from the environment is like removing a very expensive, perpetually-running search. Therefore, real-time searches should be avoided whenever possible. 
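
As an example of the tstats rewrite described above, a raw-event search like this (the index, sourcetype, and field names are hypothetical):

    index=web sourcetype=access_combined status=404
    | stats count by host

can often be rewritten against an accelerated data model so that only TSIDX files are scanned, for example:

    | tstats count FROM datamodel=Web WHERE Web.status=404 BY Web.dest

The data model and field names here are illustrative; substitute an accelerated data model and fields that exist in your environment.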

The results

This section provides a summary of the results in the optimized environment compared to the unoptimized. To see the full details with visuals, watch the full .conf25 talk, Maximize Your SVCs: Optimize Performance, Reduce Costs, and Improve Efficiency in Splunk.

  • Ingest performance metrics
    • Ingest queue fill: Almost completely eliminated
    • Smartbus latency: Reduced from a peak of 1500 seconds to less than 110 seconds
  • Search performance metrics
    • Skipped searches: Reduced from 87.67% down to 0% 
    • Search average run time: Reduced from 102.69 seconds to 2.46 seconds
    • Indexer cache churn: Reduced from 84% to 2%
  • Miscellaneous
    • Indexer normalized load average: Reduced from a peak of 3.75 to a peak below 0.56
    • Indexer average CPU usage: Reduced from 98% down to 57%
    • Indexer average memory usage: Reduced from 24.5% to 15%
    • Total CPU seconds by category: Reduced from about 240,000 seconds down to about 115,000 seconds consumed per hour

Most importantly, note that after all these optimizations and results that clearly demonstrate the system is running more efficiently while consuming fewer resources, SVC usage did not change meaningfully. Again, this is because SVC usage is not a measure of health, capacity, or performance. 

The outcome that the architects wanted, and that they achieved, is that using the capacity freed up by optimization efforts, the Splunk Cloud Platform environment can accomplish more with the same SVCs by ingesting additional data or running additional searches.

Next steps

Check out the new Cloud Monitoring Console (CMC). It better reflects real measures of health, capacity, and performance than it has in the past, and includes many of the metrics discussed in this article.

Written by Danial Zaki and Paul Reeves, Staff Cloud Solutions Architects at Splunk.