The challenge for engineers troubleshooting in cloud native environments is to quickly scope and isolate problems amid the increased complexity of a distributed system. The engineer investigating often didn't write the code and lacks context across the dozens of services involved. Scoping and identifying problems in cloud native environments can require sifting through microservices and Kubernetes environments that have hundreds of dependencies, APIs, serverless functions, and third-party components. Teams often use multiple monitoring tools, or solutions that sample data and might miss the root cause entirely. All of this can slow troubleshooting and require more engineering resources and larger war rooms to isolate issues, which reduces the time engineers have to build and deploy new code.
Engineers using observability solutions that focus on only one type of telemetry (metrics, traces, or logs), or on only backend or frontend visibility, must piece together telemetry data across thousands of transactions that span backend services and the end-user experience. This adds time and complexity, which can slow troubleshooting. While most observability suites connect metrics, traces, and logs, they often sample data and rely on proprietary agents, so they might miss an issue or delay troubleshooting.
How can Splunk help?
Splunk Observability Cloud provides a number of capabilities that help customers isolate problems, and multiple ways to solve them. For example, imagine that you have identified an error in charging customers in an online retail environment.
On the back end, Splunk APM provides a view that includes the microservices that contribute to the charge workflow. You can use Tag Spotlight to look more deeply into what your erroneous transactions have in common. The Related Content feature helps you find the logs behind a trace, which might tell you even more about these errors. Splunk Observability Cloud surfaces logs coming from your Splunk platform, whether on-premises or in the cloud.
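To illustrate the idea of looking for what erroneous transactions have in common, here is a minimal Python sketch that groups spans by a tag and computes the error rate for each tag value. It is not the Splunk APM API or how Tag Spotlight is implemented. The span dictionaries, the service name, and the tag names (version, region) are hypothetical sample data used only to show the concept.

```python
from collections import defaultdict


def error_rate_by_tag(spans, tag):
    """Group spans by the value of one tag.

    Returns {tag_value: (errors, total, error_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # tag value -> [errors, total]
    for span in spans:
        value = span["tags"].get(tag, "<missing>")
        counts[value][1] += 1
        if span["error"]:
            counts[value][0] += 1
    return {v: (e, t, e / t) for v, (e, t) in counts.items()}


# Hypothetical spans from a "charge" workflow (illustrative data only).
spans = [
    {"service": "payment", "error": True,  "tags": {"version": "v2", "region": "us-east"}},
    {"service": "payment", "error": True,  "tags": {"version": "v2", "region": "eu-west"}},
    {"service": "payment", "error": False, "tags": {"version": "v1", "region": "us-east"}},
    {"service": "payment", "error": False, "tags": {"version": "v1", "region": "eu-west"}},
]

by_version = error_rate_by_tag(spans, "version")
by_region = error_rate_by_tag(spans, "region")

for value, (errors, total, rate) in sorted(by_version.items()):
    print(f"version={value}: {errors}/{total} errors ({rate:.0%})")
```

In this sample data, every error carries the same version tag while errors are spread evenly across regions, which points at the version as the thing the failing transactions have in common.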
On the front end, you can track down this issue using Splunk Real User Monitoring to look into your user sessions. Splunk Observability Cloud brings in every single trace, so you can find a single transaction based on a session ID or other identifier. In this case, you might look for a long transaction and investigate what caused the transaction to be very slow. You can look at the transaction workflow in a map, as you can in Splunk APM, or you can look at this specific trace's waterfall.
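The lookup described above can be sketched in a few lines of Python: filter traces by a session ID, pick the slowest one, and lay its spans out as waterfall rows. This is a conceptual sketch under stated assumptions, not the Splunk Real User Monitoring API. The trace structure, field names (session_id, start_ms, end_ms), and sample values are all hypothetical.

```python
def traces_for_session(traces, session_id):
    """Return every trace that belongs to the given session."""
    return [t for t in traces if t["session_id"] == session_id]


def trace_duration_ms(trace):
    """Total duration: earliest span start to latest span end."""
    start = min(s["start_ms"] for s in trace["spans"])
    end = max(s["end_ms"] for s in trace["spans"])
    return end - start


def waterfall(trace):
    """Return (offset_ms, duration_ms, name) rows ordered by start time."""
    start = min(s["start_ms"] for s in trace["spans"])
    ordered = sorted(trace["spans"], key=lambda s: s["start_ms"])
    return [(s["start_ms"] - start, s["end_ms"] - s["start_ms"], s["name"])
            for s in ordered]


# Hypothetical unsampled traces (illustrative data only).
traces = [
    {"session_id": "abc123", "spans": [
        {"name": "POST /checkout", "start_ms": 1000, "end_ms": 4200},
        {"name": "charge",         "start_ms": 1100, "end_ms": 4100},
        {"name": "db.query",       "start_ms": 1150, "end_ms": 1200},
    ]},
    {"session_id": "abc123", "spans": [
        {"name": "GET /cart", "start_ms": 500, "end_ms": 620},
    ]},
    {"session_id": "zzz999", "spans": [
        {"name": "GET /home", "start_ms": 0, "end_ms": 80},
    ]},
]

session_traces = traces_for_session(traces, "abc123")
slowest = max(session_traces, key=trace_duration_ms)
rows = waterfall(slowest)

for offset, duration, name in rows:
    print(f"{offset:>5} ms  +{duration:>5} ms  {name}")
```

A lookup like this only finds the transaction you are after if that transaction was retained, which is why it matters that every trace is kept rather than sampled.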
Watch the following video to see a demo of how Splunk APM and Splunk Real User Monitoring can both be used to solve the same problem.
With Splunk Observability Cloud, engineers detect issues in real time and get end-to-end visibility across their entire stack as it experiences high latency, errors, and anomalies. Engineers can quickly scope an issue's impact on services, customers, and workflows with time-series metrics, understand which components in their microservices environment are involved in the issue, and finally pinpoint the source of the issue with detailed, granular log data. Engineers can be confident they've isolated the issue because their observability solution connects and correlates all of their telemetry data from every service and dependency.