Debug Problems in Microservices
Engineering teams need to ensure the effective development and performance of revenue-generating, microservices-based applications. When a planned or unplanned change causes an issue, these teams need to know how it impacts the experience of important customer segments or other key business metrics, but in most cases their monitoring solution simply cannot give them that kind of visibility. As a result, identifying and fixing issues becomes a lengthy, difficult process.
Engineers often rely on aggregated telemetry, such as average latency or overall error counts. With aggregated telemetry, performance issues that affect important customer segments get buried in the mass of data flowing in from other parts of the application. For instance, overall average latency might look acceptable, but if the organization cares about latency from its iOS app, and that latency is high, SREs remain unaware of the problem because it gets averaged out.
On top of that, aggregated telemetry data and the alerts built on it are often tied only to application golden signals (like latency, traffic, errors, and saturation) or infrastructure metrics (like memory and CPU utilization). Without tying telemetry data to key business metrics, it is extremely hard to prioritize those alerts.
How can the Splunk platform, Splunk Infrastructure Monitoring, and Splunk Application Performance Monitoring help with debugging problems in microservices?
Receive granular and accurate alerts on service issues
Developers often instrument custom metrics to improve detection and isolation of problems in their service. The more detailed and granular a metric is, the more it helps developers understand an issue, but handling detailed metrics at scale is difficult. The metrics engine in Splunk software is designed from the ground up for large-scale deployments. Developers can use SignalFlow to program their own alerts and to smooth noisy signals caused by normal business fluctuations, so the alerts they receive are accurate. With SLO monitoring, developers get more proactive alerts when their service is at risk of violating its SLA.
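As a minimal sketch of what such a SignalFlow program might look like, the snippet below smooths a hypothetical checkout.errors metric with a rolling mean so that short-lived business fluctuations do not fire alerts, and only alerts when the smoothed signal stays elevated. The metric name, the customer_tier dimension, and the thresholds are illustrative, and the optional streaming step assumes the open source signalfx-python client; the program text can also be pasted directly into a detector's SignalFlow editor.

```python
# A minimal sketch, not a production detector. The metric name ('checkout.errors'),
# the 'customer_tier' dimension, and the thresholds below are all hypothetical.
PROGRAM = """
errors = data('checkout.errors', filter=filter('customer_tier', 'premium')).sum()
smoothed = errors.mean(over='15m')   # smooth out short-lived business fluctuations
detect(when(smoothed > 50, lasting='10m')).publish('Premium checkout errors elevated')
"""

# Optional: stream the computation with the open source signalfx-python client
# (assumed here; the same program works when pasted into a detector in the UI).
import signalfx

with signalfx.SignalFx().signalflow('YOUR_ORG_TOKEN') as flow:
    computation = flow.execute(PROGRAM)
    for msg in computation.stream():
        print(msg)
```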
Monitor performance of key workflows
With Business Workflows, engineers can group together any combination of microservices that align with key functions performed by the backend, such as checkout or login. Once a Workflow is created, engineers can filter the Service Map to view that workflow's performance and set up alerts that fire when it degrades.
Isolate whether application code, infrastructure, or business logic is causing a problem
Using OpenTelemetry, engineers can add any tag to their service's telemetry, capturing business or application attributes such as location, code version, or Kubernetes cluster. With a point-and-click interface, they can then use that tag to filter traffic or view it in Tag Spotlight, which groups traces by the attributes they have in common and visualizes errors and latency for each group. With this global view, engineers can more easily identify the cause of a problem, because they can see at a glance what the problematic traces have in common.
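For illustration only, here is a minimal Python sketch of adding such tags with the OpenTelemetry API; the tracer name, attribute keys, and order fields are hypothetical. Each span attribute set here becomes a tag that can be grouped and filtered on.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service/tracer name

def process_checkout(order: dict):
    # Each attribute set on the span becomes a tag, so traces can later be
    # grouped and filtered by customer tier, client platform, or code version.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("app.customer_tier", order["tier"])
        span.set_attribute("app.platform", order["platform"])   # e.g. "iOS"
        span.set_attribute("app.code_version", "2024.06.1")     # illustrative value
        ...  # checkout business logic runs inside the span
```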
In cases where engineers learn of a specific issue from another source, for example a customer complaint, they can use Trace Analyzer to search across all traces for just the ones relevant to that issue, and drill down through the waterfall view to better understand the problem.
Have unified telemetry and visibility for each service
Splunk Observability Cloud brings together all the telemetry data that you need to debug issues in your services. Splunk provides out-of-the-box RED metrics (rate, errors, duration) alongside infrastructure dashboards. For each service, service-centric views provide an out-of-the-box dashboard with all the relevant performance data that engineers need for that particular service.
Beyond that, Related Content provides a way to toggle between different telemetry types with a single click while keeping context. For example, when viewing a trace, Related Content brings up all the logs related to that trace; when viewing a span, it shows the metrics for the Kubernetes node running that span.
Accurately diagnose root cause within a service
Through Splunk Infrastructure Monitoring, engineers can understand issues caused by the infrastructure (such as low host memory) or the network. As part of that, Kubernetes Navigator is designed to bubble up poor performance in Kubernetes clusters, pods, and hosts.
Within the waterfall view, engineers can understand the impact of upstream and downstream services on their own service, and can identify poor database query performance. And with AlwaysOn Profiling, they can see how much memory and CPU each line of code consumes (in Java, .NET, and Node.js) to identify problematic code.
Use logs from the Splunk platform for advanced troubleshooting use cases
Splunk Log Observer Connect automatically pulls relevant logs from the Splunk platform, so engineering teams can send logs once to a single vendor and use them for multiple use cases. Logs in dashboards then give teams logs in context with the metrics, infrastructure, and traces they are viewing, so they can more easily understand the root cause of issues.
Instrument for the last time
Developers often hesitate to change observability vendors because each vendor requires its own proprietary instrumentation. OpenTelemetry is the de facto open standard for instrumentation, and Splunk Observability Cloud is OpenTelemetry-native. With Splunk Observability Cloud, developers have the peace of mind of knowing that after they instrument their code with OpenTelemetry, they can send their data to any observability vendor without re-instrumenting, whether they change tools or build new applications.
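As a rough Python sketch of that portability, assuming the standard opentelemetry-sdk and OTLP exporter packages, the only backend-specific detail is the OTLP endpoint the data is shipped to; the instrumentation in application code stays the same regardless of which observability backend receives it.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Instrument once against the vendor-neutral OpenTelemetry SDK. Changing backends
# is a configuration change (the OTLP endpoint), not a re-instrumentation effort.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))  # e.g. a local OTel Collector
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("startup-check"):
    pass  # spans are emitted the same way no matter which backend receives them
```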
Access a comprehensive view of your applications regardless of architecture
Some business transactions rely on a combination of microservices and 3-tier applications, but most observability tools are optimized for one architecture or the other, so troubleshooting issues in such an environment requires a lot of mental context switching. The Splunk platform, together with Splunk Log Observer Connect, provides a shared context for Splunk Observability Cloud (optimized for microservices) and Splunk AppDynamics (optimized for 3-tier applications), so developers can use a single solution to debug problems that span both microservices and 3-tier apps.
Use case guidance
- Creating SLOs and tracking error budgets with SignalFlow: How to use SignalFlow to better understand your service-level objective needs and performance.
- Maintaining *nix systems with Infrastructure Monitoring: How to monitor *nix systems running critical applications or services, with Splunk searches that you can save and run on a schedule.
- Maintaining Microsoft Windows systems with Infrastructure Monitoring: Use Windows data with your Splunk deployment to monitor patch management, software deployment, inventory tracking, remote access availability, and more.