Monitoring third-party API calls using the OpenTelemetry spanmetrics connector
In modern microservices architectures, businesses often rely on third-party APIs for critical functionalities like payment processing and data retrieval. However, monitoring these third-party API calls presents several challenges.
One Splunk customer had a situation where their internal microservices communicated with external APIs, such as www.punchh.com and https://www.olo.com/pay, for payment processing. If these third-party services became unresponsive or started failing, the customer had alternative ways to process payments, but still needed to be notified about the failures.
They engaged with Splunk Professional Services to find a solution to this problem. This article explains the problem and solution, including specific steps you can take if you have the same problem.
Challenges identified
- Lack of direct instrumentation: Since third-party services are external, they cannot be instrumented with OpenTelemetry (OTel) directly.
- Custom error classification: The customer was specifically interested in tracking only HTTP 5xx errors, not other response codes such as 4xx. Splunk Application Performance Monitoring (APM) classifies a span as an error span based on the OTel schema, which means any 4xx response on a span is treated as a client-side error that the customer was not interested in. To learn more about these codes, see How OpenTelemetry handles HTTP status codes.
- Error classification concerns: While Splunk APM provides an sf_error attribute to identify errors, the customer only wanted to track specific error codes (HTTP 500), whereas sf_error includes both 4xx and 5xx errors.
- Concerns with synthetic monitoring: The customer was unwilling to use Splunk Synthetic Monitoring as it would require frequent pings to third-party services, which they wanted to avoid.
- Log-based monitoring concerns: Although logging HTTP status codes in Splunk Cloud Platform was an option, the customer wanted to avoid splitting their monitoring between Splunk Cloud Platform and Splunk Observability Cloud.
Exploring solutions
Several approaches were suggested, evaluated, and rejected:
- Using sf_error for inferred spans monitoring.
  - This was not viable as the customer only wanted to track 5xx errors, whereas sf_error considers both 4xx and 5xx errors.
- Synthetic monitoring of third-party services.
  - This was ruled out because the customer did not want to actively ping the third-party services and preferred passive monitoring.
- Instrumentation of third-party services.
  - This was not an option because the third-party services were external and not under the customer's control.
- Log-based monitoring in Splunk Cloud Platform.
  - It is possible to send additional fields (status code and third-party service name) to Splunk Cloud Platform for dashboarding and alerting. However, the customer was hesitant about splitting their monitoring between Splunk Cloud Platform and Splunk Observability Cloud.
The solution the customer opted for was to use the spanmetrics connector within the OpenTelemetry pipeline to capture inferred spans and extract the necessary dimensions.
- The idea was to generate span-based metrics where dimensions like http.status_code and net.peer.name could be preserved.
- This allowed tracking outbound client spans (SPAN_KIND=CLIENT) directed at third-party services.
- The metrics generated were used to create alerts and dashboards in Splunk Observability Cloud.
Implementation details
Since outbound requests to third-party services are SPAN_KIND=CLIENT, we configured the spanmetrics connector to capture these spans and extract relevant dimensions such as http.status_code, net.peer.name, and http.method. The process is described in the next few sections. This configuration ensured that HTTP 500 errors from third-party services could be specifically monitored.
The spanmetrics data was then visualized in Splunk Observability Cloud to provide real-time insights into third-party service failures.
Spanmetrics connector configuration
We configured the spanmetrics connector to capture outgoing requests to third-party services. We added the following dimensions to span.metrics.calls to ensure we could filter based on error conditions:
connectors:
  spanmetrics:
    namespace: span.metrics
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host
The http.status_code dimension was critical for filtering only 5xx errors.
Trace exporter configuration
To ensure traces are exported correctly, we added the spanmetrics connector as part of the OpenTelemetry traces pipeline:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp]
Metrics receiver configuration
The spanmetrics connector also needed to be included in the metrics pipeline:
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [signalfx]
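For reference, here is a minimal sketch of how these pieces might fit together in a single collector configuration. The otlp receiver protocols, the otlp exporter endpoint, and the signalfx exporter access token and realm are illustrative placeholders, not values taken from the customer's environment:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

connectors:
  spanmetrics:
    namespace: span.metrics
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host

exporters:
  otlp:
    endpoint: "my-otlp-backend:4317"   # placeholder endpoint for traces
    tls:
      insecure: true                   # assumes a non-TLS placeholder backend
  signalfx:
    access_token: "${SPLUNK_ACCESS_TOKEN}"  # placeholder token
    realm: "us1"                             # placeholder realm

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp]
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [signalfx]

In the traces pipeline the connector acts as an exporter, and in the metrics pipeline it acts as a receiver. This is how the generated span.metrics.calls metric flows from the trace data to the signalfx exporter.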
SignalFlow query for monitoring
We used SignalFlow queries to count inferred service failures based on http.status_code=5xx:
A = data('span.metrics.calls',
         filter=filter('span.kind', 'SPAN_KIND_CLIENT')
                and filter('sf_service', 'my-node-service')
                and filter('http.status_code', '5**'),
         extrapolation='zero').sum(by=['net.peer.name', 'http.status_code']).publish(label='A')
Validation and observations
After configuration, we initially faced an issue where the metrics chart in Splunk Observability Cloud showed continuous 500 errors even when no active requests were being made. This was due to the AggregationTemporality being set to Cumulative by default, leading to non-expiring metrics. Changing it to Delta resolved the issue, ensuring only real-time counts were displayed.
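In the spanmetrics connector this behavior is controlled by the aggregation_temporality setting. A minimal sketch of the adjusted connector configuration, assuming the same dimensions as above, looks like this:

connectors:
  spanmetrics:
    namespace: span.metrics
    # Emit delta counts so idle periods stop reporting stale cumulative totals
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host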
Below is the dashboard widget that shows some CLIENT calls (going from an instrumented service to the third-party service punchh.com) failing with an HTTP 500 code. These are the calls the customer wanted to monitor.
Key learnings
- Using OpenTelemetry for enhanced visibility: The spanmetrics connector provided a viable way to transform trace data into metrics, allowing better monitoring.
- Customer-centric monitoring: By leveraging this solution, the customer could track failed third-party API calls without actively pinging services or relying solely on logs.
Conclusion
By utilizing the OpenTelemetry spanmetrics connector, we successfully provided a way for the customer to monitor third-party API calls based on HTTP status codes. This approach ensured accurate tracking of service failures while aligning with the customer's preference for passive monitoring within Splunk Observability Cloud. The solution not only helped them gain visibility into inferred spans but also provided actionable insights to maintain their payment processing services effectively.