Monitoring third-party API calls using the OpenTelemetry spanmetrics connector
In modern microservices architectures, businesses often rely on third-party APIs for critical functionalities like payment processing and data retrieval. However, monitoring these third-party API calls presents several challenges.
One Splunk customer had a situation where their internal microservices communicated with external APIs, such as www.punchh.com and https://www.olo.com/pay, for payment processing. If these third-party services became unresponsive or started failing, the customer had alternative ways to process payments, but still needed to be notified about the failures.
They engaged with Splunk Professional Services to find a solution to this problem. This article explains the problem and solution, including specific steps you can take if you have the same problem.
Challenges identified
- Lack of direct instrumentation: Since third-party services are external, they cannot be instrumented with OpenTelemetry (OTel) directly.
- Custom error classification: The customer was specifically interested in tracking only HTTP 5xx errors, not other response codes such as 4xx. Splunk Application Performance Monitoring (APM) classifies a span as an error span based on the OTel schema, which means any 4xx response on a span is treated as a client-side error that the customer was not interested in. To learn more about these codes, see How OpenTelemetry handles HTTP status codes.
- Error classification concerns: While Splunk APM provides an sf_error attribute to identify errors, the customer only wanted to track specific error codes (HTTP 500), whereas sf_error includes both 4xx and 5xx errors.
- Concerns with synthetic monitoring: The customer was unwilling to use Splunk Synthetic Monitoring as it would require frequent pings to third-party services, which they wanted to avoid.
- Log-based monitoring concerns: Although logging HTTP status codes in Splunk Cloud Platform was an option, the customer wanted to avoid splitting their monitoring between Splunk Cloud Platform and Splunk Observability Cloud.
Exploring solutions
Several approaches were suggested, evaluated, and rejected:
- Using sf_error for inferred spans monitoring.
  - This was not viable as the customer only wanted to track 5xx errors, whereas sf_error considers both 4xx and 5xx errors.
- Synthetic monitoring of third-party services.
  - This was ruled out because the customer did not want to actively ping the third-party services and preferred passive monitoring.
- Instrumentation of third-party services.
  - This was not an option because the third-party services were external and not under the customer's control.
- Log-based monitoring in Splunk Cloud Platform.
  - It is possible to send additional fields (status code and third-party service name) to Splunk Cloud Platform for dashboarding and alerting. However, the customer was hesitant about splitting their monitoring between Splunk Cloud Platform and Splunk Observability Cloud.
The solution the customer opted for was to use the spanmetrics connector within the OpenTelemetry pipeline to capture inferred spans and extract the necessary dimensions.
- The idea was to generate span-based metrics where dimensions like http.status_code and net.peer.name could be preserved.
- This allowed tracking outbound client spans (SPAN_KIND=CLIENT) directed at third-party services.
- The metrics generated were used to create alerts and dashboards in Splunk Observability Cloud.
Implementation details
Since outbound requests to third-party services are SPAN_KIND=CLIENT, we configured the spanmetrics connector to capture these spans and extract relevant dimensions such as http.status_code, net.peer.name, and http.method. The process is described in the next few sections. This configuration ensured that HTTP 500 errors from third-party services could be specifically monitored.
The spanmetrics data was then visualized in Splunk Observability Cloud to provide real-time insights into third-party service failures.
Spanmetrics connector configuration
We configured the spanmetrics connector to capture outgoing requests to third-party services. We added the following dimensions to span.metrics.calls to ensure we could filter based on error conditions:
connectors:
  spanmetrics:
    namespace: span.metrics
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host
The http.status_code dimension was critical for filtering only 5xx errors.
Trace exporter configuration
To ensure traces are exported correctly, we added the spanmetrics connector as part of the OpenTelemetry traces pipeline:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp]
Metrics receiver configuration
The spanmetrics connector also needed to be included in the metrics pipeline:
service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [signalfx]
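For reference, here is a minimal sketch of how these pieces might fit together in a single collector configuration. The otlp receiver protocols, the otlp exporter endpoint, and the signalfx exporter access token and realm are illustrative placeholders, not values taken from the customer's environment:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

connectors:
  spanmetrics:
    namespace: span.metrics
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host

exporters:
  otlp:
    endpoint: "my-otlp-backend:4317"   # placeholder endpoint for traces
    tls:
      insecure: true                   # assumes a non-TLS placeholder backend
  signalfx:
    access_token: "${SPLUNK_ACCESS_TOKEN}"  # placeholder token
    realm: "us1"                             # placeholder realm

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp]
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [signalfx]

In the traces pipeline the connector acts as an exporter, and in the metrics pipeline it acts as a receiver. This is how the generated span.metrics.calls metric flows from the trace data to the signalfx exporter.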
SignalFlow query for monitoring
We used SignalFlow queries to count inferred service failures based on http.status_code=5xx:
A = data('span.metrics.calls',
         filter=filter('span.kind', 'SPAN_KIND_CLIENT')
                and filter('sf_service', 'my-node-service')
                and filter('http.status_code', '5**'),
         extrapolation='zero').sum(by=['net.peer.name', 'http.status_code']).publish(label='A')
Validation and observations
After configuration, we initially faced an issue where the metrics chart in Splunk Observability Cloud showed continuous 500 errors even when no active requests were being made. This was due to the AggregationTemporality being set to Cumulative by default, leading to non-expiring metrics. Changing it to Delta resolved the issue, ensuring only real-time counts were displayed.
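In the spanmetrics connector this behavior is controlled by the aggregation_temporality setting. A minimal sketch of the adjusted connector configuration, assuming the same dimensions as above, looks like this:

connectors:
  spanmetrics:
    namespace: span.metrics
    # Emit delta counts so idle periods stop reporting stale cumulative totals
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    dimensions:
      - name: http.status_code
      - name: http.url
      - name: http.method
      - name: net.peer.name
      - name: http.host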
Below is the dashboard widget that shows some CLIENT calls (going from an instrumented service to the third-party service punchh.com) failing with an HTTP 500 code. These are the calls the customer wanted to monitor.
Key learnings
- Using OpenTelemetry for enhanced visibility: The spanmetrics connector provided a viable way to transform trace data into metrics, allowing better monitoring.
- Customer-centric monitoring: By leveraging this solution, the customer could track failed third-party API calls without actively pinging services or relying solely on logs.
Conclusion
By utilizing the OpenTelemetry spanmetrics connector, we successfully provided a way for the customer to monitor third-party API calls based on HTTP status codes. This approach ensured accurate tracking of service failures while aligning with the customer's preference for passive monitoring within Splunk Observability Cloud. The solution not only helped them gain visibility into inferred spans but also provided actionable insights to maintain their payment processing services effectively.