Preparing for failures in the Splunk utility tier
Splunk provides product features to increase availability and recovery options for the search tier (search head clustering) and the indexing tier (indexer clusters and index replication). The administrative functions known as the utility tier (for example, the deployment server, deployer, and license server) have no equivalent built-in features, so they rely on best practices for resiliency and recoverability.
Impact of failures on the utility tier
The components of the Splunk utility tier are used for Splunk administration. If any of these components is unavailable or destroyed, the functions and resources it provides become unavailable. The utility tier does not include search heads, indexers, or forwarders.
|Component||Impact if offline||Impact if destroyed|
|Deployment server||No impact to search and indexing functions.||Source of truth for the environment's configuration is destroyed.|
|Deployer||No impact to search and indexing functions.||Default configuration for the search head cluster is lost, but can be mostly rebuilt from a search head cluster (SHC) member.|
|Manager node||No data redundancy requirements. For more information, see Managing Indexers and Clusters of Indexers.||Default configuration for indexer cluster members is lost, but can be mostly rebuilt from a member.|
|License server||No impact to indexing functions. Search functions shut down after 72 consecutive hours offline. For more information, see Violations due to broken connections between license manager and peers.||System would need to be rebuilt. No impact to end users if the rebuild happens within 72 hours.|
| ||No impact to search and indexing functions. Lost health and performance visibility and monitoring for search and indexing functions.||System would need to be rebuilt. Risk to operations if health and performance visibility and monitoring for search and indexing functions is offline for a long period. Built-in summary data showing insights and long-term patterns would be lost. No lasting impact to end users or the overall Splunk platform.|
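The 72-hour window in the license server row can be turned into a concrete rebuild deadline. The following is a minimal sketch, not a Splunk API: the function names and the example outage time are hypothetical, and the only value taken from the table is the 72-hour limit.

```python
from datetime import datetime, timedelta, timezone

# Window taken from the table above: search is affected after 72 hours.
LICENSE_GRACE = timedelta(hours=72)


def rebuild_deadline(outage_start: datetime) -> datetime:
    """Return the latest time a replacement license server can come online."""
    return outage_start + LICENSE_GRACE


def hours_remaining(outage_start: datetime, now: datetime) -> float:
    """Return the hours left before the window closes (never negative)."""
    remaining = rebuild_deadline(outage_start) - now
    return max(remaining.total_seconds() / 3600, 0.0)


# Hypothetical outage start time, used only for illustration.
start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
print(rebuild_deadline(start).isoformat())
print(hours_remaining(start, start + timedelta(hours=30)))
```

A calculation like this could feed an alert so that the rebuild is escalated well before search functions are affected.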
If any of these components are destroyed, it takes time and effort to rebuild a new instance and update references to the new host's information throughout the environment. You can avoid this by applying the following best practices.
Preserve component state
Many customers run utility-tier components on virtual machines (VMs) instead of bare-metal hardware because VMs provide two features that are valuable for these components:
- Dynamic resource sizing. You can change the hardware specifications of a VM as load increases.
- State preservation and transition. VMs provide host snapshots that preserve an image of the instance. Some virtualization features, such as vMotion from VMware, enable you to instantiate the host image on a new virtual machine.
If you are unable to leverage these benefits from virtual machines, consider putting a configuration backup plan in place. For more information about configuration backups, see Managing backup and restore processes.
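As an illustration of what a simple configuration backup might look like, the sketch below archives a configuration directory into a timestamped `.tar.gz` file using only the Python standard library. It is an assumption-laden example, not the documented Splunk backup procedure: the function name `backup_config`, the archive naming scheme, and the paths in the usage note are all hypothetical.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_config(etc_dir: Path, dest_dir: Path) -> Path:
    """Archive a configuration directory into a timestamped .tar.gz file."""
    if not etc_dir.is_dir():
        raise FileNotFoundError(f"configuration directory not found: {etc_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest_dir / f"splunk-etc-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the directory's own name, not its absolute path.
        tar.add(etc_dir, arcname=etc_dir.name)
    return archive
```

For example, assuming the instance's configuration lives in `/opt/splunk/etc`, a scheduled job could call `backup_config(Path("/opt/splunk/etc"), Path("/backups/splunk"))` and copy the resulting archive to storage on a different host.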
Preserve networking using DNS entries
When a utility instance fails or is destroyed, you need to update networking details, such as the host name and IP address, on every client that refers to it. This can be impractical in large and distributed data center environments. To avoid this, you can try to rebuild the utility component with the same networking details as the previous one, but this is usually not possible. Instead, a best practice is to use DNS CNAME (canonical name) records as a translation service.
When you establish DNS CNAME records for your utility instances, you can direct all clients to those DNS entries, so clients never rely on the true host name and IP address of the underlying host. If you have to replace the host, you do not have to reuse the same host name and IP address. This also allows you to build new utility instances in parallel with the old ones and cut over with a single DNS change.
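The cutover described above can be illustrated with a toy model of a DNS zone. This is a sketch for reasoning about the approach, not a DNS client: the zone is a plain dictionary, and every host name and address in it is hypothetical.

```python
# Toy DNS zone: aliases (CNAME) point at host names, host names (A) point at IPs.
# All names and addresses are hypothetical.
zone = {
    "deployment-server.example.com": ("CNAME", "splunk-util-01.example.com"),
    "splunk-util-01.example.com": ("A", "10.0.0.11"),
    "splunk-util-02.example.com": ("A", "10.0.0.12"),
}


def resolve(name: str, records: dict, max_hops: int = 8) -> str:
    """Follow CNAME records until an A record is found and return its address."""
    for _ in range(max_hops):
        rtype, value = records[name]
        if rtype == "A":
            return value
        name = value  # CNAME: continue with the canonical name
    raise RuntimeError("CNAME chain too long or circular")


alias = "deployment-server.example.com"
before = resolve(alias, zone)

# Cutover: repoint the alias at the replacement host; clients keep using the alias.
zone[alias] = ("CNAME", "splunk-util-02.example.com")
after = resolve(alias, zone)
print(before, "->", after)
```

Only the alias record changes during the cutover. Clients that were configured with the alias need no changes, although in a real environment they pick up the new target only after cached DNS answers expire.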
Applications for load balancing
You can use a similar practice for load balancing on the data collection tier or search tier. In such scenarios, a DNS A record or hardware-based load balancer distributes traffic to multiple hosts, which gives you an easy way to scale. Even if a single instance acts as your search head or data collection tier, this type of networking makes later scaling and management easier.
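To show how one DNS name backed by several A records spreads clients across hosts, the sketch below simulates simple rotation over a list of addresses. It is an idealized model with hypothetical addresses; real DNS resolvers and caches do not guarantee an even distribution.

```python
from collections import Counter
from itertools import cycle

# Hypothetical search heads published under one DNS name as multiple A records.
search_head_ips = ["10.0.1.21", "10.0.1.22", "10.0.1.23"]


def round_robin(addresses):
    """Return an iterator that hands out addresses in rotation."""
    return cycle(addresses)


picker = round_robin(search_head_ips)
# Simulate nine client connections and count how many land on each host.
assignments = Counter(next(picker) for _ in range(9))
print(assignments)
```

Adding a host to the list is the only change needed to scale out in this model, which mirrors adding another A record under the shared name.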
For load balancing the indexing tier, however, Splunk's native load balancing feature is the best practice for forwarding data to indexers. For more information, see Set up load balancing.
Partner with someone who oversees networking at your organization and make sure they understand the goal and the technical details. Draft a disaster recovery plan and verify it with a non-impacting/non-production environment before implementing it in production.