Implementing distributed tracing
You work for an organization that prioritizes monitoring workloads to ensure a successful customer experience. However, as your organization's applications become more distributed and cloud-native, monitoring is becoming more complex. A single user transaction fans out to interact with tens or hundreds of microservices, each requesting data from backend data stores or calling other services and other parts of your infrastructure. Determining exactly what it takes to fulfill a user's request is increasingly challenging.
Implement distributed tracing
Instrumenting your applications to generate spans and traces can help you easily find bottlenecks in your systems and clearly understand where time is spent.
A trace is a collection of spans representing a single user or API transaction handled by an application and its constituent services. One trace represents one user interaction. Each span records a single operation and carries a start and end time, a trace ID that correlates it to the user transaction it belongs to, and identifiers (or tags) that add information about the request, such as the version of the microservice that generated the span.
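To make that anatomy concrete, here is a minimal sketch of the data a single span carries. The field names and values are illustrative only, not a specific wire format:

```python
# Illustrative only: the fields a span typically carries, not an exact schema.
span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # shared by every span in the trace
    "span_id": "00f067aa0ba902b7",                   # unique to this single operation
    "name": "GET /customer/{id}",                    # the operation this span measures
    "start_time": "2024-05-01T12:00:00.000Z",
    "end_time": "2024-05-01T12:00:00.045Z",
    "attributes": {                                  # tags adding request context
        "service.version": "1.4.2",
    },
}
```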
Because the spans that make up a trace are generated across different services, the industry refers to this process of tagging spans and correlating them as distributed tracing. Distributed tracing follows a request (transaction) as it moves between multiple services within a microservices architecture, letting you trace the request from where it originates (typically a user-facing frontend application) through every downstream service it touches.
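For this to work, the trace context has to travel with the request. As a minimal sketch using the OpenTelemetry Python API (the service URL and span names are hypothetical), a client can inject the W3C `traceparent` header into an outgoing HTTP call so the receiving service can join the same trace:

```python
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("frontend")

with tracer.start_as_current_span("call-customer-service"):
    headers = {}
    inject(headers)  # copies the active trace context into a 'traceparent' header
    requests.get("http://customer-service:8080/customer/42", headers=headers)
```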
Because customer experience is so vital and modern architecture is so complex (one user transaction can involve services hosted on-premises, in multiple clouds, or even serverless function calls), access to this telemetry is essential. You'll have better visibility into where your application spends the most time and can quickly identify the bottlenecks that affect application performance.
For example, within a simple client-server application, the client sends the server a request for a specific customer. The server processes the request and sends the response back to the client. Within the context of the client, a single action has occurred: it sent a request and got a response. Each server operation performed as a result of this client request can be observed as a span. As the client performs more transactions with the server, more spans are generated, and you can correlate them together within a trace context. The trace context is the glue that holds the spans together, for example:
- Client sends a customer name request to Server at time: X (Trace Context: customerrequest1, SpanID: 1, timestamp: X)
- Server receives customer name request from Client at time: Y (Trace Context: customerrequest1, SpanID: 2, timestamp: Y)
- Server parses the request from the Client at time: Z (Trace Context: customerrequest1, SpanID: 3, timestamp: Z)
Here, the trace context remains the same, tying each span together and letting the infrastructure know that each span belongs to the same transaction.
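The following sketch shows the same idea in code, using the OpenTelemetry Python SDK with a console exporter so you can see that every printed span carries the same trace ID (the tracer and span names mirror the steps above and are hypothetical):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout so the shared trace ID is visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("customer-request-demo")

# Spans started inside the outer span inherit its trace ID, which is
# exactly the "glue" the trace context provides.
with tracer.start_as_current_span("client: send customer name request"):
    with tracer.start_as_current_span("server: receive request"):
        pass
    with tracer.start_as_current_span("server: parse request"):
        pass
```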
Generate, collect, and export traces
To gather traces, applications must be instrumented. Instrumenting an application means using a framework like OpenTelemetry to generate traces and measure application performance, so you can discover where time is spent and locate bottlenecks quickly. With applications written in different languages, distributed across microservices, and built by teams around the world, it helps to have an open, vendor-agnostic framework you can use to instrument all of them.
For many languages, OpenTelemetry provides automatic instrumentation of your application, while others must be instrumented manually. Because OpenTelemetry is the industry standard for observability data, you only have to do the instrumentation work once.
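As a sketch of what automatic instrumentation can look like in Python (assuming a Flask application and the opentelemetry-instrumentation-flask package), a single call patches the framework so every incoming request produces a span without any per-route tracing code:

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # every HTTP request now generates a span

@app.route("/customer/<name>")
def get_customer(name):
    return f"customer: {name}"
```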
After your application has been instrumented, you can use the Splunk OpenTelemetry Collector, which provides a unified way to receive, process, and export application telemetry to an analysis tool like Splunk APM, where you can create dashboards and business workflows and identify critical metrics.
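A minimal sketch of the export side, assuming a collector listening on the default OTLP/gRPC port and a hypothetical service name: the application batches finished spans and ships them to the collector, which forwards them to the analysis backend:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Batch spans and send them over OTLP/gRPC to a local collector, which
# then forwards the telemetry to the backend (e.g. Splunk APM).
provider = TracerProvider(
    resource=Resource.create({"service.name": "customer-api"})  # hypothetical name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```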
One example is Splunk APM's Dynamic Service Map. It provides a quick and easy way to see how your application's microservices are performing in your environment, helping you understand your application's microservice dependencies and gain a clear view of the latency between each service. The service map changes dynamically to reflect your microservices' performance, so you can easily spot performance bottlenecks and error propagation. This lets you sift through trace data in seconds and immediately highlight which microservice may be responsible for errors.
Splunk APM also offers Tag Spotlight, which quickly correlates events like increases in latency or errors with tag values, providing a one-stop shop for understanding how traces behave across your entire application.
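Tag-based views like this depend on spans actually carrying tags. A short sketch of adding span attributes in Python follows; the tracer name and attribute keys are illustrative, and any dimension you record this way becomes something you can later slice latency and errors by:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("process-order") as span:
    # Attribute keys here are illustrative; record whatever dimensions
    # you want to correlate latency and errors against later.
    span.set_attribute("customer.tier", "premium")
    span.set_attribute("deployment.version", "2024.1")
```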
Next steps
The content in this guide comes from a previously published blog, one of the thousands of Splunk resources available to help users succeed. In addition, these Splunk resources might help you understand and implement this use case: