Tuning SOAR to optimize performance
As you create more use cases for Splunk SOAR, the platform must scale to meet your automation demands. On the left side of the SOAR architecture diagram below are the five SOAR microservices. These run as daemons on the platform. When it comes to performance tuning, two critical microservices can be tuned through SOAR Administration settings: DECIDED and ACTIOND. On the right side are additional platform services that utilize open-source tools for various functions. For performance tuning, you can tune Postgres and nginx/uwsgi (web server gateway interface), updating their configuration through the SOAR command line. This article explains how to complete all these tuning options for better SOAR performance.
Playbook runners
The DECIDED daemon spawns Python runners to run playbook code, except for app actions, which are handled by ACTIOND.
- Before version 6.3.0, the default number of Python runners is 4, and you can increase the number to 10 through the administration web UI.
- In version 6.3.0 and later, SOAR automatically scales the number of Python runners for playbook execution between a minimum (4 by default) and a maximum (20 by default).
With customer-managed platforms (CMPs), customers can set the value through the administration web UI (Administration > Administration Settings > Playbook Execution). For SOAR Cloud, DECIDED performance is managed by Splunk.
To understand how these values affect performance, say that your deployment ingests 100 events/minute that trigger an active playbook for each event. How long does it take to run all the playbooks?
With only four playbook runners, no more than four playbook executions can occur simultaneously. This means that the 100 playbook runs can only happen in batches of four. By increasing the number of playbook runners, you effectively increase the batch size, allowing for more simultaneous runs.
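The batching effect can be illustrated with a rough calculation. The following is a minimal sketch, not a model of how DECIDED actually schedules work: it assumes every playbook run takes the same amount of time, and the 3-second run time and the function name `drain_time_seconds` are hypothetical values chosen for illustration.

```python
import math


def drain_time_seconds(playbook_runs: int, runners: int, seconds_per_run: float) -> float:
    """Estimate the time to finish all runs when at most `runners` execute at once.

    Simplified model (an assumption): every run takes the same time, so the
    work proceeds in ceil(playbook_runs / runners) back-to-back batches.
    """
    if runners < 1:
        raise ValueError("runners must be at least 1")
    batches = math.ceil(playbook_runs / runners)
    return batches * seconds_per_run


# 100 playbook runs, each assumed to take 3 seconds (hypothetical figure).
for runners in (4, 10, 20):
    print(runners, "runners:", drain_time_seconds(100, runners, 3.0), "seconds")
```

Under these assumptions, four runners need 75 seconds to clear one minute's worth of events, so a backlog builds up, while 10 or 20 runners clear the same work in 30 or 15 seconds.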
Concurrent actions
ACTIOND is responsible for running application actions, such as checking IP reputations or creating tickets in service management tools. SOAR provides two configurable variables that influence how action runs are managed.
- The global action concurrency limit sets the maximum number of concurrent actions across all assets. The default is 150, which means that no more than 150 action blocks can run simultaneously.
- The asset concurrency limit sets the maximum number of concurrent actions for a particular asset. Starting with SOAR version 6.4.0, the default value is 50, allowing up to 50 concurrent actions for a particular asset at any one time.
Concurrent action settings can be changed in both CMP and SOAR Cloud implementations.
To understand how these values affect performance, say that your deployment ingests 100 events/minute, each triggering an active playbook that runs an action block that queries a Splunk instance. How long does it take to run all the playbooks?
With an asset concurrency limit of five, for example, this could create a bottleneck depending on the latency of the action block, because only five queries to the Splunk platform can be processed simultaneously. Raising the concurrency limit increases the number of actions that can run at the same time. However, it's crucial to consider the load on the external system being accessed by the asset. If this system is already under heavy load, increasing the number of SOAR calls might not improve performance and could overwhelm the external system, slowing its responses to SOAR and degrading overall SOAR performance. Additionally, the external system could be configured to limit the number of incoming connections from a single host. If that connection limit is lower than the asset concurrency limit in SOAR, the extra concurrency brings no benefit and could hurt performance.
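To judge whether a concurrency limit is likely to be a bottleneck, you can compare the event rate against the action latency. The following is a back-of-the-envelope sketch based on Little's law (average work in flight equals arrival rate times latency). The 6-second action latency and both function names are assumptions for illustration, not SOAR settings or APIs.

```python
def required_concurrency(events_per_minute: float, action_latency_s: float) -> float:
    """Average number of actions in flight needed to keep up with the event rate."""
    return events_per_minute * action_latency_s / 60.0


def max_actions_per_minute(concurrency_limit: int, action_latency_s: float) -> float:
    """Upper bound on completed actions per minute under a concurrency limit."""
    return concurrency_limit * 60.0 / action_latency_s


# 100 events/minute, each running one action assumed to take 6 seconds.
needed = required_concurrency(100, 6.0)
print("concurrent actions needed:", needed)
for limit in (5, 50):
    print("limit", limit, "->", max_actions_per_minute(limit, 6.0), "actions/minute at most")
```

With these assumed numbers, about 10 actions need to be in flight at once. A limit of five caps throughput at 50 actions per minute, half the incoming rate, while a limit of 50 leaves ample headroom, provided the external system can handle that many simultaneous requests.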
Database size
The PostgreSQL database is the backbone of SOAR. It stores all critical information such as ingested events and records of playbook runs. Most SOAR functions query the database, so the performance of PostgreSQL directly impacts the overall performance of SOAR.
To maintain an optimal database size, we recommend the following strategies for CMP customers:
- Minimize data storage: Ensure that action calls return only the data necessary for analysts and playbook runs. Storing excessive data can lead to database bloat, while large action call results may consume SOAR memory, potentially causing performance issues.
- Implement data retention strategies: Use data retention policies for containers, action runs, audit logs, and other data objects. These can be configured using native SOAR command line tools to maintain an optimal database size.
- Remove unnecessary data objects: Utilize command line tools to remove containers, indicators, and audit logs that are no longer needed. Confirm that the objects are not required for active analysis, and perform a dry run to ensure only the desired objects are removed.
- Event forwarding: If event forwarding is configured, container information is stored in the Splunk platform, which can help manage database size.
- Use diagnostic tools: The SOAR diagnostic file provides insights into database size and table usage, which can be shared with Splunk support for further troubleshooting if needed.
For SOAR Cloud, database performance is managed by Splunk to ensure optimal performance.
To understand how the database affects performance, say that your deployment ingests 100 events/minute, each triggering an active playbook that runs a query: index=security_data "indicator" | fields *. This query searches the security_data index for specific indicators and returns all fields. If there are hundreds of fields per event, the database could quickly fill up, especially if these fields aren't necessary for analysis or playbook logic. Therefore, it's advisable to exclude unnecessary fields from the action call's return value.
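The simplest fix is usually to narrow the query itself, for example by replacing `fields *` with an explicit field list. The same idea can be sketched in Python to show how much smaller a trimmed result is. This is an illustration only: the field names `src_ip` and `indicator`, the synthetic event, and the helper `keep_fields` are all hypothetical and are not part of any SOAR API.

```python
import json


def keep_fields(events, wanted):
    """Return copies of the events containing only the wanted fields."""
    return [{key: event[key] for key in wanted if key in event} for event in events]


# Synthetic event with 200 filler fields plus the two fields an analyst needs.
event = {f"field_{i}": "x" * 20 for i in range(200)}
event.update({"src_ip": "10.0.0.5", "indicator": "evil.example.com"})

trimmed = keep_fields([event], ["src_ip", "indicator"])

# Compare the serialized size of the full result against the trimmed one.
print("full result bytes:   ", len(json.dumps([event])))
print("trimmed result bytes:", len(json.dumps(trimmed)))
```

At 100 events per minute, the difference between the two sizes accumulates quickly in stored action results.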
UI configuration
NGINX serves as the web server for SOAR, managing all incoming web requests. uWSGI acts as the web server gateway interface between NGINX and the SOAR application, handling all SOAR web requests. These requests can originate from active users on the platform as well as from various SOAR functions that make REST calls back to the instance.
By default, the system is set to use 20 uWSGI workers. However, starting with SOAR version 6.2.0, the number of uWSGI workers is dynamically adjusted, allowing for automatic scaling based on demand. With customer-managed platforms (CMPs), customers can configure the minimum and maximum number of workers to optimize performance. For SOAR Cloud, web performance is managed by Splunk.
To understand how this value affects performance, say you have 50 concurrent users along with active automation running on the platform. If these users are highly interactive with the user interface, increasing the number of uWSGI workers could improve performance by better managing the increased load.
Next steps
In addition to the specific optimizations described above, here are some recommended best practices for performance tuning with SOAR.
- It is very important to monitor system utilization for SOAR nodes, focusing on both memory and CPU usage.
- For SOAR version 6.2.0 and later, configure the appropriate forwarding settings to ensure optimal performance. For versions earlier than 6.2.0, use the Splunk Add-On for Unix and Linux to achieve the same goal.
- Establishing a baseline for system utilization is crucial. By understanding the typical usage patterns, you can effectively monitor for any significant increases in resource consumption that could indicate potential issues or the need for optimization.
- As a general guideline, if system utilization is below 75%, there is room to adjust settings further to improve efficiency and responsiveness. Continue to monitor utilization as you make changes.
- It's important to consider the utilization of external services that interact with SOAR. These services can impact overall performance, so monitoring their resource usage is equally vital.
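The baseline and 75% guidelines above can be combined into a simple check. The following is a minimal sketch: the 75% threshold comes from the guideline above, while the `spike_ratio` of 1.5 (flagging utilization more than 50% above baseline) and the function name `assess_utilization` are assumptions chosen for illustration. How you collect the utilization figures is up to your monitoring setup.

```python
def assess_utilization(current_pct: float, baseline_pct: float,
                       headroom_threshold: float = 75.0,
                       spike_ratio: float = 1.5) -> dict:
    """Compare current CPU or memory utilization with a recorded baseline.

    headroom_threshold: below this, there is room for further tuning.
    spike_ratio: assumed multiplier over baseline that counts as a
    significant increase worth investigating.
    """
    return {
        "has_headroom": current_pct < headroom_threshold,
        "above_baseline": current_pct > baseline_pct * spike_ratio,
    }


# Example: baseline CPU of 50%, current reading of 82%.
print(assess_utilization(current_pct=82.0, baseline_pct=50.0))
```

In this example the node has no headroom left and is well above its baseline, which suggests investigating before raising any concurrency or runner settings.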
By following these guidelines, you can ensure that your SOAR environment runs smoothly and efficiently, providing the best possible performance.