Running a Splunk platform health check
The Splunk platform health check is designed to provide a comprehensive technical evaluation of your Splunk deployment. The goal is to uncover opportunities for configuration enhancements and performance optimizations, ensuring you get the most out of your Splunk investment. This process will help you gain valuable insights into best practices and optimization strategies so that you are well-equipped to maintain and enhance your Splunk deployment.
This Splunk platform health check is available as an engagement with Splunk Professional Services. If you do not feel comfortable completing this process on your own, or would like hands-on training with any of the concepts and processes included in this article, contact our Professional Services experts.
Stakeholders
- Sponsor: The sponsor ultimately determines the success or failure of the Splunk implementation. Your sponsor must be on board with your goals and objectives, as others might have misconceptions about what the Splunk platform is and how it can help. The sponsor is frequently the individual who decides what will be done with the Splunk platform.
- Business owner: The business owner will help the architect identify the services, customer SMEs, business metrics, and processes used for implementing the Splunk platform.
- SMEs: SMEs are the people who understand how the services and systems are deployed within your environment. They are needed to identify processes and procedures and can assist with determining how to group and handle alerts if the business owner does not do this.
- NOC manager: Network operations center (NOC) managers are responsible for the timely execution of business improvement initiatives to ensure that the networking system runs efficiently without interruption. They manage the business process in the organization and typically report to top management. They are responsible for ensuring their team is meeting or exceeding stated service level agreements (SLAs).
- NOC analyst: Network operations center (NOC) analysts provide technical support for end users of a computer network or program. Their job duties include monitoring systems operations, troubleshooting systems problems, network outages, and software issues, and reporting to development teams when persistent, unfixable problems occur. They are responsible for triaging incidents and escalating them to the appropriate teams as necessary. They are consumers/users of the Splunk platform and rely on others to perform the necessary setup and configuration.
- Project manager: Larger Splunk customers frequently have project managers (PMs) assigned to engagements. If available, the PM will track progress of your Splunk platform implementation and coordinate meetings with various customer staff as needed.
- Splunk administrator: Splunk administrators manage the day-to-day operations of their on-prem or BYOC Splunk Enterprise deployments.
Types of health checks
- Reactive: Provide a targeted assessment and analysis in response to specific incidents, failures, or observed inefficiencies within a system. This check is particularly beneficial for addressing concerns related to indexing and indexer performance, the forwarding tier, general search performance, and Splunk Enterprise Security performance.
- Preemptive: Evaluate your current Splunk environment, identifying any accumulated technical debt and developing strategic guidance for remediation. This check is key for organizations that plan to upgrade their existing systems or implement premium Splunk products such as Splunk ITSI (ITSI) or Splunk Enterprise Security (ES).
- Periodic: This enhanced health check is designed to optimize Splunk environments, with a special emphasis on data governance and adherence to Center of Excellence (CoE) procedures and best practices.
The actions involved in each of these are described in more detail in the following sections.
Reactive health check
- Incident-driven analysis: This approach is tailored based on the specific incident or issue reported, ensuring a focused and efficient resolution.
- Provide a high-level overview of the specific incident or issue reported using the Cloud Monitoring Console (CMC) or Monitoring Console (MC).
- Data point analysis:
  - Evaluate queue depths in the CMC or MC panel to identify transient onboarding issues.
  - Cross-reference queue depths with specific log data to determine if onboarding issues result from significant failures.
  - Monitor the presence and volume of quarantine buckets relative to other hot buckets.
- Real-time data onboarding assessment:
  - Run real-time evaluations to identify anomalies in data onboarding and gauge its impact on indexing.
- Indexing performance insights:
  - Utilize the iowait time in the MC dashboard to estimate indexing performance.
  - Analyze wait times for data as an indicator of potential storage layer issues.
  - Analyze object storage layer for evidence of cache thrash.
- Search performance evaluation:
  - Examine dispatch times, run times for scheduled searches, skip ratios, and other indicators to determine if the issue lies in overall search performance or individual search performance.
  - Utilize the job inspector for detailed analysis of individual search executions.
  - Assess data distribution, time spent on lookups, kv extractions, and other metrics to diagnose search-related issues.
  - Evaluate high consumption dashboards for best practices and identify efficiencies.
- Knowledge object analysis:
  - Review the number and scope of knowledge objects that influence searches.
  - Examine bundle sizes, frequency of bundle replications, and search patterns to identify potential inefficiencies.
- Index and volume review:
  - Inspect index and volume definitions.
  - Evaluate index distribution, bucket sizes, and retention policies to ensure optimal performance and capacity planning.
- Log analysis:
  - Conduct a thorough review of splunkd.log and related logs for insights.
  - Execute searches to identify recurring errors and potential areas of concern.
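As a rough illustration of the log analysis step, the following Python sketch counts recurring ERROR and WARN entries per component in splunkd.log-style lines that have been exported to a file or list. The line layout encoded in the regular expression (date, time, UTC offset, level, component) is an assumption about a typical splunkd.log entry, and the sample lines are made up for the example. In practice you would more likely run an equivalent search inside the Splunk platform itself.

```python
import re
from collections import Counter

# Assumed layout of a splunkd.log line:
#   MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL Component - message
LINE_RE = re.compile(
    r"^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} "
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+)"
)


def count_recurring(lines, levels=("ERROR", "WARN")):
    """Count (level, component) pairs for the requested log levels."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in levels:
            counts[(match.group("level"), match.group("component"))] += 1
    return counts


# Illustrative sample lines, not taken from a real deployment.
SAMPLE_LINES = [
    "01-15-2024 10:00:01.123 +0000 ERROR TcpOutputProc - example message one",
    "01-15-2024 10:00:02.456 +0000 WARN  TailReader - example message two",
    "01-15-2024 10:00:03.789 +0000 ERROR TcpOutputProc - example message three",
    "01-15-2024 10:00:04.000 +0000 INFO  Metrics - example message four",
]

if __name__ == "__main__":
    # Most frequent (level, component) pairs first.
    for (level, component), n in count_recurring(SAMPLE_LINES).most_common():
        print(f"{level:5} {component:15} {n}")
```

Grouping by component rather than by full message keeps the output short; components that dominate the counts are candidates for a closer look at the individual messages.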
Deliverable
By the end of these steps, you should be able to produce a comprehensive report detailing the findings, potential areas of improvement, and recommended actions to optimize system performance.
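One figure such a report might include is the skip ratio mentioned under search performance evaluation. The sketch below computes it per scheduled search from a list of (search name, status) records. The record shape, the search names, and the "skipped" status label are assumptions for illustration; adapt them to however you export scheduler activity from your deployment.

```python
from collections import defaultdict


def skip_ratios(runs):
    """Map each search name to skipped runs divided by total scheduled runs."""
    totals = defaultdict(int)
    skipped = defaultdict(int)
    for name, status in runs:
        totals[name] += 1
        if status == "skipped":
            skipped[name] += 1
    return {name: skipped[name] / totals[name] for name in totals}


# Hypothetical scheduler records: (search name, run status).
SAMPLE_RUNS = [
    ("search_a", "success"),
    ("search_a", "skipped"),
    ("search_a", "success"),
    ("search_a", "success"),
    ("search_b", "skipped"),
    ("search_b", "skipped"),
]

if __name__ == "__main__":
    # Worst offenders first.
    for name, ratio in sorted(skip_ratios(SAMPLE_RUNS).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {ratio:.0%} skipped")
```

A high ratio spread across many searches points toward overall search capacity, while a high ratio on one or two searches points toward those individual searches or their schedules.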
Preemptive health check
- Inventory and configuration assessment: Conduct an exhaustive evaluation of your current Splunk environment, configurations, and customizations.
- Technical debt and inefficiency identification: Pinpoint areas of technical debt, inefficiencies, and deviations from best practices, with a special focus on data model constraints and memory utilization on indexers.
- Data model and DMA analysis: Analyze run times for Data Model Accelerations (DMAs) both in absolute terms and as a function of DMA size, identifying potential bottlenecks and areas for optimization.
- Memory utilization monitoring: Utilize tools such as 'sar' to monitor memory utilization on indexers over time, ensuring optimal performance.
- Search and dashboard usage analysis: Run searches to identify usage patterns of 'tstats' and other search commands, analyze access logs to see dashboard interactions, and provide insights on how to adjust DMA windows and optimize dashboard usage.
- Remediation roadmap: Deliver a clear and actionable plan to address identified issues, optimize configurations, and align the environment with Splunk best practices.
- Knowledge transfer and best practices: Equip your team with essential knowledge and best practices to maintain a healthy Splunk environment.
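To make the data model and DMA analysis step more concrete, here is a minimal Python sketch that looks at run time both in absolute terms and relative to summary size. It assumes you have already collected, for each accelerated data model, the summary size and the most recent acceleration run time. The data model names, the numbers, and the "twice the median" threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class DmaRun:
    name: str         # data model name (hypothetical)
    size_mb: float    # acceleration summary size on disk, in MB
    runtime_s: float  # run time of the latest acceleration search, in seconds


def seconds_per_gb(run):
    """Run time normalized by summary size."""
    return run.runtime_s / (run.size_mb / 1024)


def flag_slow(runs, factor=2.0):
    """Return names whose seconds-per-GB exceeds `factor` times the median."""
    rates = {run.name: seconds_per_gb(run) for run in runs}
    baseline = median(rates.values())
    return sorted(name for name, rate in rates.items() if rate > factor * baseline)


# Illustrative figures only.
SAMPLE = [
    DmaRun("model_a", size_mb=2048, runtime_s=60),
    DmaRun("model_b", size_mb=1024, runtime_s=40),
    DmaRun("model_c", size_mb=512, runtime_s=150),
]

if __name__ == "__main__":
    for run in SAMPLE:
        print(f"{run.name}: {run.runtime_s:.0f} s total, {seconds_per_gb(run):.0f} s/GB")
    print("Slow relative to size:", flag_slow(SAMPLE))
```

Normalizing by size separates data models that are slow simply because they are large from those that are slow for their size, which are usually the better candidates for constraint or window tuning.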
Deliverable
By the end of these steps, you should be able to produce a comprehensive report detailing the current state of your Splunk environment, identifying areas of technical debt, and providing a step-by-step remediation plan. This proactive approach ensures that the system is fully prepared for upgrades and the seamless integration of premium products, ultimately leading to enhanced performance, reduced risks, and a stronger foundation for future growth and innovation.
Periodic health check
- Data governance analysis: Examine how well the organization tracks forwarder and data source ownership, the workflow for identifying bad/missing forwarders or data sources, and asset management practices.
- Workflow and process optimization: Invest in improving workflows outside of Splunk, including ticketing systems, data forwarding to data lakes, and logging best practices.
- Data model and DMA analysis: Analyze DMA run times in absolute terms and as a function of DMA size, identifying potential bottlenecks and areas for optimization.
- Memory utilization monitoring: Utilize tools like 'sar' to monitor memory utilization on indexers over time, ensuring optimal performance.
- Search and dashboard usage analysis: Analyze search and dashboard interactions to optimize usage patterns and adjust DMA windows as necessary.
- Assets and identities review: Conduct a thorough review of assets and identities, providing recommendations for improvements and ensuring alignment with best practices.
- Remediation roadmap and CoE/best practice alignment: Deliver a clear and actionable plan to address identified issues, optimize configurations, and align the environment with Splunk best practices and CoE procedures.
- Knowledge transfer and best practices: Equip your team with essential knowledge, best practices, and CoE procedures to maintain a healthy Splunk environment.
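For the memory utilization step, one possible approach is to capture 'sar' memory reports from each indexer and summarize them offline. The Python sketch below finds the %memused column by its header name and reports peak and mean values. The sample text is illustrative, and the exact column layout of 'sar' output depends on the sysstat version installed, so treat the parsing as an assumption to verify against your own output.

```python
def parse_memused(sar_text):
    """Return %memused values from sar-style text, located by column header."""
    values = []
    col = None
    for line in sar_text.splitlines():
        tokens = line.split()
        # Skip blank lines and the trailing summary row.
        if not tokens or tokens[0] == "Average:":
            continue
        if "%memused" in tokens:
            col = tokens.index("%memused")  # header row: remember the column
            continue
        if col is not None and len(tokens) > col:
            try:
                values.append(float(tokens[col]))
            except ValueError:
                pass  # not a data row
    return values


def summarize(values, threshold=90.0):
    """Summarize samples; `threshold` is a hypothetical alerting level."""
    if not values:
        raise ValueError("no %memused samples found")
    return {
        "samples": len(values),
        "peak": max(values),
        "mean": sum(values) / len(values),
        "over_threshold": sum(1 for v in values if v > threshold),
    }


# Illustrative sample, trimmed to a few columns.
SAMPLE = """\
00:00:01    kbmemfree   kbavail kbmemused  %memused
00:10:01      4000000   6000000  12000000     75.00
00:20:01      1000000   2000000  15000000     93.75
00:30:01      2000000   3000000  14000000     87.50
Average:      2333333   3666667  13666667     85.42
"""

if __name__ == "__main__":
    print(summarize(parse_memused(SAMPLE)))
```

Comparing these summaries across periodic health checks shows whether memory pressure on the indexers is trending upward between reviews.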
Deliverable
By the end of these steps, you should be able to produce a detailed report outlining the current state of your Splunk environment, areas of technical debt, and a step-by-step remediation plan, all aligned with Center of Excellence standards.
Timeline
Regardless of the type of health check you perform, the expected timeline is the same:
- Day 1 - 3 Discovery. Perform the appropriate health check, as described above, to get a detailed understanding of the overall condition of your deployment. Regardless of which type you choose, look for patterns, anomalies, or areas that deviate from best practices. In addition to reviewing configuration and servers, gather a good understanding of your administrators' abilities and offer technical guidance where appropriate.
- Day 4 Write-up. Produce a detailed document outlining all the findings discovered during the discovery phase.
  - Include a prioritized list of recommended actions to remediate and optimize each item of note. When prioritizing, take into account your team's original reason for requesting the health check, the severity of each finding, and ease of resolution.
  - Ensure that the plan is actionable, providing clear and concise steps that your team can follow to address identified issues.
  - Validate that the proposed remediation plan aligns with your overarching business objectives and operational needs.
  - Ensure that any recommendations made are practical, feasible, and provide tangible value.
- Day 5 - 10 Action items. Remediate or optimize the findings identified during discovery.
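The prioritization described for the Day 4 write-up can be sketched as a simple scoring exercise. The Python example below ranks findings by severity, ease of resolution, and whether they relate to the reason the health check was requested. The 1 to 5 scales, the weights, and the sample findings are hypothetical; the point is only to show one way of making the ordering explicit and repeatable.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    severity: int       # 1 (low) to 5 (critical)
    ease: int           # 1 (hard to fix) to 5 (easy to fix)
    matches_goal: bool  # relates to the original reason for the health check


def priority(finding):
    """Hypothetical weighting: severity counts double, goal alignment adds 3."""
    return finding.severity * 2 + finding.ease + (3 if finding.matches_goal else 0)


def prioritize(findings):
    """Return findings ordered from highest to lowest priority score."""
    return sorted(findings, key=priority, reverse=True)


# Illustrative findings only.
SAMPLE = [
    Finding("Oversized knowledge bundle", severity=3, ease=4, matches_goal=False),
    Finding("Skipped scheduled searches", severity=4, ease=3, matches_goal=True),
    Finding("Retention policy mismatch", severity=2, ease=2, matches_goal=False),
]

if __name__ == "__main__":
    for rank, finding in enumerate(prioritize(SAMPLE), start=1):
        print(f"{rank}. {finding.title} (score {priority(finding)})")
```

Writing the weights down, even roughly, makes it easier to explain the ordering to the sponsor and to adjust it if business priorities change before remediation starts.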
Additional resources
The following checklists can help you further understand the steps involved in some of the tasks above.
- Environment discovery and server review
- Data source review
- Indexing performance audit
- Search activity and usage patterns audits
- Review of Splunk app and technology add-ons
- Dashboard performance review
Splunk Professional Services can assist with any of the health checks outlined in this article. Click here to learn more about working with Professional Services.