The applications you monitor in Splunk Infrastructure Monitoring use databases. Those databases are monitored by detectors that trigger alerts when the databases negatively impact application performance, availability, or reliability. As a CloudOps Engineer, SRE, Service Developer, or DBA, when you receive an alert, you need to quickly navigate to Splunk APM to determine which services are contributing to high infrastructure resource usage or a performance issue.
As part of DevOps process improvement strategy, you might also be interested in answering the following questions:
- How can we correlate our database performance monitoring with application performance monitoring so when a database incident occurs, the impact to the services can quickly be visualized in the context of a service map and transactions being impacted?
- How can we break down the operational silos and enrich the collaboration between the infrastructure and application teams as part of continuous process improvement?
You can resolve this scenario with Splunk APM’s Database Query Performance capability.
This article is part of the Splunk Use Case Explorer for Observability, which is designed to help you identify and implement prescriptive use cases that drive incremental business value. It explains the solution using a fictitious example company, called CSCorp, that hosts a cloud native application called Online Boutique. In the AIOps lifecycle described in the Use Case Explorer, this article is part of Application monitoring.
In this scenario, CSCorp’s leadership wants a proactive operational workflow that starts with a database instance issue, such as high CPU consumption, operations/sec falling below normal, or low cache hit rate percentages, followed by the ability to quickly review and correlate the applications being impacted. This will ensure that appropriate incident prioritization can be applied, and collaboration with the impacted SREs can be initiated. CSCorp believes this will drive overall service and release quality improvements.
CSCorp has engineered the use of Redis enterprise clusters for global caching services in support of their microservice architecture and application development framework. CSCorp’s leadership wants to assure their tier one application portfolio, including the company's flagship sales portal Online Boutique, has no observable blind spots as it relates to the new content caching database architecture.
Overall, CSCorp wants to:
- Improve the ability of a CloudOps engineer, SRE or DBA to quickly identify and correlate database performance with application transactional performance and impact.
- Improve the DevOps process collaboration between database operations teams and the application SREs, reducing the number of war rooms required.
- Improve incident urgency and prioritization quality based upon application importance and impact radius, in short, knowing what to work on first.
CSCorp’s Online Boutique sales portal consumes a microservice, called cartservice, where a new version has been deployed that targets Redis database caching optimizations. The service development team just introduced the new version as part of the DevOps CI/CD pipeline using their canary deployment methodology.
A Splunk detector has been deployed and has alerted on a sudden increase in the Redis database instance’s CPU utilization (%). You are the CloudOps Engineer who receives the notification.
Here is how the detector has been set up. The detector looks at the signal history for Redis CPU Utilization (%) and alerts when the CPU % rises 50% above the statistical mean, or norm.
The detector triggers an alert, in this case, when the Redis database instance CPU utilization (%) increases 20% above the statistical mean.
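The comparison behind a detector like this can be sketched in a few lines of Python. This is only an illustration of the "percentage above the statistical mean" rule, not the detector's actual implementation or SignalFlow. The function name `cpu_spike`, its parameters, and the sample values are hypothetical.

```python
from statistics import mean


def cpu_spike(history, current, threshold_pct):
    """Return True when `current` is more than `threshold_pct` percent
    above the mean of `history`.

    `history` is a list of past CPU utilization readings (%), and
    `threshold_pct` is the allowed rise above the mean, in percent.
    """
    if not history:
        # No baseline to compare against, so do not alert.
        return False
    baseline = mean(history)
    return current > baseline * (1 + threshold_pct / 100)


# Hypothetical readings; the threshold is a parameter, not a Splunk default.
history = [10.0, 12.0, 11.0, 9.0, 13.0]
print(cpu_spike(history, 12.5, 20))
print(cpu_spike(history, 30.0, 20))
```

With these sample readings the mean is 11.0, so a 20% threshold alerts only above 13.2. The first call does not alert and the second does.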
- In Splunk Infrastructure Monitoring, navigate to the Redis database impacted to review the dashboard.
- Here you can see there is a spike in CPU Utilization (%), as well as spikes in Operations/sec and Network Bytes/sec.
- Click the Map for redis tab at the bottom of the UI to open the Service Map, where you can see the services that are consuming the Redis database instance resources.
- In the Service Map, you can see that cartservice, which makes database calls to the Redis database, is experiencing latency of 2.49s. Click the redis inferred database in the Service Map to access Database Query Performance information.
- Latency appears normal for the Redis commands. To drill down deeper, open Database Query Performance by using the expand icon < > at the top-right of this section.
- Here you can see that the number of SCAN command requests, 96.5k, is unusually high compared to the other commands. The Total Time of 22.3min is also unusually high. From experience, you know that SCANs are not normally used in production services.
- To drill down deeper, open Tag Spotlight: Request Latency by using the expand icon < > at the top-right of this section.
- Looking at the Operation pane in Tag Spotlight you can see the high number of SCAN requests. Double-click SCAN to filter for only SCAN spans.
- On the chart showing the filtered SCAN spans, click a high peak.
- This brings up a number of traces. Click a trace to examine it further.
- Click the trace UI’s Span Performance tab. Here you can see 128 SCAN spans for a single trace, and one SCAN is taking over a second to complete, which is high. The SCAN is also consuming 89.7% of the total workload. Given this information, you conclude that the SCAN is the probable cause of the spikes in the Redis database instance's CPU Utilization (%), Operations/sec, and Network Bytes/sec, and that it is potentially impacting performance.
- At this point, you can notify the cartservice development team so they can roll back the cartservice version. The team can then identify what led up to this problem, for example, whether the SCAN was introduced for testing and should not have been part of the new build. In an instance like this, the team can remove the SCAN and redeploy, allowing the database instance and cartservice latency metrics to trend back to normal.
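The per-command figures used in the steps above (request count, total time, and share of the total workload) are simple aggregations over spans. The sketch below shows that arithmetic in Python. It does not use a Splunk API, and it assumes spans have already been exported as `(operation, duration_ms)` pairs. The function name `summarize_by_operation` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_by_operation(spans):
    """Aggregate spans by Redis command.

    `spans` is an iterable of (operation, duration_ms) pairs. Returns a
    dict mapping each operation to its span count, total duration, and
    percentage share of the total duration across all operations.
    """
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for operation, duration_ms in spans:
        totals[operation]["count"] += 1
        totals[operation]["total_ms"] += duration_ms

    grand_total = sum(entry["total_ms"] for entry in totals.values())
    summary = {}
    for operation, entry in totals.items():
        # Guard against division by zero when there are no spans
        # or every duration is zero.
        share = 100.0 * entry["total_ms"] / grand_total if grand_total else 0.0
        summary[operation] = {**entry, "share_pct": share}
    return summary


# Illustrative data only, not taken from the trace in this article.
spans = [("SCAN", 300.0), ("SCAN", 500.0), ("GET", 100.0), ("HGET", 100.0)]
for operation, stats in summarize_by_operation(spans).items():
    print(operation, stats["count"], stats["total_ms"], stats["share_pct"])
```

In this sample, SCAN accounts for two of the four spans but 80% of the total time, which is the kind of imbalance the Span Performance tab surfaces.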
Quickly identifying the services that impacted the Redis database instance's performance, along with the root cause, helps reduce MTTI (Mean Time to Identify) and MTTR (Mean Time to Recover), with minimal customer impact.
You might also be interested in Troubleshooting a service latency issue related to a database query.
To fully unlock the power of Splunk, we strongly recommend our comprehensive Splunk training. At this stage in your journey, we recommend you explore the Splunk Observability Training Courses.