Getting traces into Splunk APM
Splunk Application Performance Monitoring (APM) tracks transactional events that occur between your application and any downstream services or endpoints. These events are called traces, and each trace is a collection of operations (spans). Not only does APM provide full-fidelity spans of your application's interactions, it also lets you view latency, performance, the number of requests sent, and how many transactions succeed or fail. You can also track the performance of your application before and after critical releases and build detectors to alert you when thresholds aren't being met.
Getting data in with OpenTelemetry
You have three main options for getting data into APM. Read the following brief descriptions, then follow the linked documentation for instructions.
- Splunk Distribution of the OpenTelemetry Collector. Traces are most often produced by a few key tracing libraries for the individual programming languages your applications are built in. The Collector installs on standalone hosts (Linux, Windows Server, EC2, and so on) to collect telemetry data. After installation, there are two ways to get the data into Splunk Observability Cloud:
- You can send them directly by giving the tracing library the ingest endpoint URL and an access token, which provides the lowest latency for traces. However, by doing so, you lose the ability to see the correlated logs and metrics covered in the video at the end of this article.
- You can send them through an OTel Collector that collects logs and metrics on the same host (or through a remote OTel Collector, if you prefer). By doing this, you gain the ability to see the related/correlated logs and metrics from the same time period. You can also aggregate, sample, filter, or adjust your spans and traces before they reach Splunk Observability Cloud; with the first option, you lose the ability to perform these actions. A minimal sketch of both export paths follows this list.
- Splunk OTel Collector Chart. This is an OTel Collector distribution specific to Kubernetes distributions, including OpenShift, AKS, EKS, GKE, and vanilla Kubernetes. It installs an underlying agent within a Kubernetes cluster that can monitor all the pods or nodes within the cluster, or you can specify individual pods or nodes.
- OpenTelemetry Collector Contrib (upstream). Lastly, there's the upstream, open source version of the OTel Collector, which offers additional features and functionality for ingesting and exporting data. This is useful for traces from languages not covered by the Splunk tracing library distributions. However, some data correlation isn't available out of the box with the upstream Collector, and not all exporters and receivers are vetted by Splunk, so performance isn't guaranteed.
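For illustration, here is a minimal Python sketch of the two export paths described above, assuming the OpenTelemetry Python SDK. The service name, realm, token, and direct-ingest endpoint are placeholders and assumptions; confirm the actual values against your own organization's settings and the linked documentation.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "checkout",              # placeholder service name
    "deployment.environment": "production",  # drives the APM environment filter
})

# Option 1 (hypothetical endpoint/token): export directly to Splunk Observability Cloud ingest.
# Lowest latency, but no log/metric correlation from a local Collector.
direct_exporter = OTLPSpanExporter(
    endpoint="https://ingest.<realm>.signalfx.com",   # replace <realm> with your realm
    headers=(("x-sf-token", "<YOUR_INGEST_TOKEN>"),),
)

# Option 2: export to a local OTel Collector, which can forward traces alongside
# the host's logs and metrics so they can be correlated.
collector_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(collector_exporter))  # or direct_exporter
trace.set_tracer_provider(provider)
```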
Common OpenTelemetry issues
If you are experiencing issues with OTel, always start by enabling debug logging and checking the logs for clues about the core issue. Only info-level logging is enabled by default; debug logging is more verbose. It can show you the log lines recorded right before an exporter error, which gives you an indication of what OTel was trying to do and what might have gone wrong.
Next, verify that the appropriate ports are open on the host or Kubernetes cluster so that the data can be sent to Splunk Observability Cloud (80, 443, 8006, 4317, 4318, etc).
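As a quick sanity check, a sketch like the following can confirm those ports are reachable from the host running the instrumented application. The hostnames and ports are placeholders; adjust them to wherever your Collector or ingest endpoint actually lives.

```python
import socket

# Hypothetical targets: the Collector ports called out above, on the local host.
targets = [("localhost", 4317), ("localhost", 4318), ("localhost", 8006)]

for host, port in targets:
    try:
        with socket.create_connection((host, port), timeout=2):
            print(f"{host}:{port} is reachable")
    except OSError as err:
        print(f"{host}:{port} is NOT reachable: {err}")
```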
Next, verify that your endpoints are correct: the Splunk Cloud endpoint, trace URL, API URL, and so on. These values should match your Splunk Cloud Platform instance or your Splunk Observability Cloud realm.
Make sure your access token and realm are correct and that the token has the correct permissions and scope. Access tokens have two types of scopes: API and ingest. To send any data to Splunk Observability Cloud, the token must have the ingest scope; you cannot ingest data with an API-only token. Also make sure that your token isn't being throttled. You can try a different token to confirm.
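If you want to verify a token's ingest scope outside of OTel, one option is to post a test datapoint to the ingest API. The sketch below assumes the SignalFx-compatible /v2/datapoint endpoint; the realm and token are placeholders, and this is illustrative only.

```python
import requests

REALM = "us1"                  # placeholder: your Splunk Observability Cloud realm
TOKEN = "<YOUR_INGEST_TOKEN>"  # placeholder: token that should have the ingest scope

resp = requests.post(
    f"https://ingest.{REALM}.signalfx.com/v2/datapoint",
    headers={"X-SF-Token": TOKEN, "Content-Type": "application/json"},
    json={"gauge": [{"metric": "token.scope.check", "value": 1}]},
    timeout=10,
)

# A 200 suggests the token can ingest; a 401/403 points at a scope or token problem.
print(resp.status_code, resp.text)
```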
Common APM issues
Instrumentation issues
Splunk Application Performance Monitoring is a complex product: an OTel Collector is usually installed, and tracing libraries are installed on top of it. Those tracing libraries are specific to the application language itself, so the most common issues involve instrumentation. In these cases you might see missing sf_environment/deployment.environment values, or the value shows up as unknown. You might also see missing service names. Typically this means the OTEL_RESOURCE_ATTRIBUTES environment variable needs to be set.
It could also mean that the environment variable is being overridden by the underlying host. For example, take a Java application running on a Linux box. If that Linux box sets the same environment variable for some reason, the host's value overrides the one supplied to the application. You then end up with either a missing value, which shows up as unknown, or a value skewed to whatever the Linux box itself sets.
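Here is a minimal sketch, assuming the Python OpenTelemetry SDK and placeholder attribute values, of how OTEL_RESOURCE_ATTRIBUTES feeds the resource the SDK builds. In practice you would set the variable in the shell, a systemd unit, or the container spec rather than in code; printing the merged resource is an easy way to see exactly which values the host is really passing in.

```python
import os

# Normally set outside the process, for example in the shell or service definition:
#   OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,service.name=checkout
# Setting a default here only mirrors that behavior for illustration.
os.environ.setdefault(
    "OTEL_RESOURCE_ATTRIBUTES",
    "deployment.environment=production,service.name=checkout",  # placeholder values
)

from opentelemetry.sdk.resources import Resource

# Resource.create() merges OTEL_RESOURCE_ATTRIBUTES from the environment,
# so the printed attributes show what the instrumented service will report.
resource = Resource.create()
print(resource.attributes)
```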
Missing traces
If you don’t see traces coming in from your services and you are sending directly to Splunk Observability Cloud, verify that the tracing library has the correct exporter settings; you need to provide an access token and a realm for that. Note that there is no way to rotate tokens in the UI; it must be done via the API.
More commonly, however, the library is pointed at your OTel Collector instance so you can see metrics for an individual host. In this case, you need to have the correct ports open (4317, 4318, and 8006).
Missing metrics
Lastly, customers sometimes report missing MMS (Monitoring MetricSet) or TMS (Troubleshooting MetricSet) metrics, which are generated from APM traces and track how many requests are made and what the latency is. First, make sure that you aren't hitting your MMS or TMS limits. Your account team can help you run a health check script for that information.
If you aren’t being throttled, next check the traces and spans and make sure they have a span.kind value of either Server or Consumer, as that is a prerequisite for TMS or MMS metrics. This is a dimension within the span itself that Splunk APM creates internally.
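For reference, this is roughly what setting the span kind looks like when creating spans manually with the Python OpenTelemetry SDK (auto-instrumentation normally sets it for you); the tracer name, span name, and attribute are illustrative placeholders.

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("example.instrumentation")

# Spans created with SpanKind.SERVER (or SpanKind.CONSUMER) are the ones
# APM uses to generate MMS and TMS metrics for a service.
with tracer.start_as_current_span("GET /checkout", kind=SpanKind.SERVER) as span:
    span.set_attribute("http.method", "GET")  # illustrative attribute
```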
Next steps
Now that you have data flowing into APM, watch this product walkthrough from Splunk Senior On-Demand Observability Consultant, Justin Thurston, to see how to get started using all that data to keep your applications running at peak performance.