Understand the Impact of Changes
Site reliability engineers (SREs) need to ensure the performance of business-critical workflows (for example, checkout, login, or submitting a product review) running in microservices for important customer segments (for example, traffic coming from iOS apps). When a planned or unplanned change takes place, SREs need to know how it affects the experience of these customer segments or other key business metrics, but in most cases their monitoring solution cannot give them that visibility. As a result, issues are missed.
Without the ability to zoom in on the performance of specific segments of traffic, SREs have to rely on aggregated telemetry, such as average latency or the overall number of errors. With aggregated telemetry, performance issues that are specific to important customer segments get buried in the massive amounts of data flowing in from other parts of the application. For instance, overall average latency might be acceptable, but if the organization cares about latency from an iOS app, and that latency is high, SREs will be unaware of the issue because it is averaged out with latency from all other traffic.
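The averaging effect described above can be illustrated with a small arithmetic sketch. The sample values, the segment names, and the 300 ms threshold are all hypothetical and chosen only to make the effect visible; they are not taken from any real system.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds, keyed by client platform.
# 90 fast web requests and 10 slow iOS requests.
latencies_ms = {
    "web": [120] * 90,
    "ios": [900] * 10,
}

# Hypothetical latency threshold an SRE team might alert on.
THRESHOLD_MS = 300

# Aggregated view: every sample is pooled before averaging.
all_samples = [v for samples in latencies_ms.values() for v in samples]
overall_avg = mean(all_samples)

# Segmented view: one average per customer segment.
per_segment_avg = {seg: mean(samples) for seg, samples in latencies_ms.items()}

# Segments whose own average breaches the threshold.
breaching = [seg for seg, avg in per_segment_avg.items() if avg > THRESHOLD_MS]

print(f"overall average: {overall_avg} ms (breach: {overall_avg > THRESHOLD_MS})")
print(f"per-segment averages: {per_segment_avg}")
print(f"segments over threshold: {breaching}")
```

In this made-up data set the pooled average stays under the threshold even though every iOS request is slow, which is the situation the paragraph describes.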
On top of that, the aggregated telemetry data and its respective alerts are often tied only to application golden signals (such as latency, traffic, errors, and saturation) or infrastructure metrics (such as memory and CPU utilization). Without tying telemetry data to key business metrics, SREs will have difficulty prioritizing the alerts. For example, if you have 20 hosts out of memory, which one should you fix first? SREs need to be aware of and prioritize issues that are happening to customer segments that they care about most.
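As a rough illustration of the prioritization problem, the sketch below orders a set of host alerts by the business workflow each host serves. The `Alert` class, the host names, and the `WORKFLOW_PRIORITY` weights are hypothetical; they stand in for whatever mapping between infrastructure and business metrics an organization defines, and do not represent a Splunk API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    host: str       # host that raised the out-of-memory alert
    workflow: str   # business workflow the host serves (hypothetical tag)


# Hypothetical business weights: a higher number means more important.
WORKFLOW_PRIORITY = {"checkout": 3, "login": 2, "product-review": 1}


def prioritize(alerts):
    """Return alerts ordered by business priority, most important first.

    Alerts for workflows without a known weight sort last.
    """
    return sorted(
        alerts,
        key=lambda a: WORKFLOW_PRIORITY.get(a.workflow, 0),
        reverse=True,
    )


alerts = [
    Alert("host-07", "product-review"),
    Alert("host-12", "checkout"),
    Alert("host-03", "login"),
]

ordered_hosts = [a.host for a in prioritize(alerts)]
print(ordered_hosts)
```

Without the workflow field, all three alerts look identical; with it, the host serving checkout is handled first.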
How can the Splunk platform, Splunk Infrastructure Monitoring, and Splunk Application Performance Monitoring help with understanding the impact of changes?
Monitor the performance of key workflows
With Business Workflows, you can group together any combination of microservices that align to key functions performed by the backend, such as checkout or login. After a workflow is created, you can filter the service map to view the performance of that workflow, and set up alerts if performance degrades.
Isolate performance for specific segments aligned with changes
Using OpenTelemetry, you can add any tag you need to your service. You can then point and click to use those tags as filters and view only the traffic that you care about.
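A minimal sketch of adding such a tag with the OpenTelemetry Python API is shown below. It assumes the `opentelemetry-api` package is available and that a span is already active, for example one created by auto-instrumentation. The attribute name `client.platform` and the User-Agent rule in `segment_attributes` are hypothetical choices for illustration; use whatever tag names your team has agreed on.

```python
try:
    # Provided by the opentelemetry-api package.
    from opentelemetry import trace
except ImportError:
    # Guard so the sketch still runs where the package is not installed.
    trace = None


def segment_attributes(user_agent: str) -> dict:
    """Derive a segment tag from a User-Agent header (hypothetical rule)."""
    ua = user_agent.lower()
    if "iphone" in ua or "ipad" in ua:
        platform = "ios"
    elif "android" in ua:
        platform = "android"
    else:
        platform = "web"
    return {"client.platform": platform}


def tag_current_span(attributes: dict) -> None:
    """Attach the given attributes to the currently active span, if any."""
    if trace is None:
        return
    span = trace.get_current_span()
    for key, value in attributes.items():
        span.set_attribute(key, value)


# Example: tag the active span for a request coming from an iOS app.
attrs = segment_attributes("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)")
tag_current_span(attrs)
print(attrs)
```

When no span is active, `trace.get_current_span()` returns a non-recording span, so the call to `set_attribute` has no effect rather than raising an error.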
Isolate which service is at fault when there’s an issue
By combining Business Workflows and tag filtering, you can use the service map and Tag Spotlight to zoom in on the performance of key workflows that impact segments of important traffic. Then, when the change occurs, the service map uses color coding to show whether everything is working as intended, or uses a red dot to highlight which service caused an issue.
Engage developers with shared context
With Splunk Infrastructure Monitoring or logs-in-context, Splunk software brings together all the infrastructure metrics and logs that are relevant to the issue being investigated, and filters out the data that’s not relevant. After you have a good understanding of the issue, you can send a data link through any on-call system to the developers who own the service in question. When developers click the data link, their browser loads a dashboard with the exact view that you used, including the related logs and metrics. Armed with all the relevant data, developers can then quickly resolve the issue.
Use case guidance
- Optimizing performance in canary development environments with Splunk APM's custom MetricSets
- You can use Splunk APM MetricSets to identify and respond to frequent microservice code releases, helping you to optimize your APM operations.
- Prescriptive Adoption Motion - Application Monitoring
- APM solves problems faster in monolith and microservice application architectures by detecting problems from deployments, troubleshooting the source of an issue, and optimizing service performance.