You might need to troubleshoot a high error rate alert on a checkout endpoint in APM. This scenario uses the following:
- Product: Splunk APM
- Feature: API integration
- Function: Error alerting
You are a site reliability engineer working in a cloud-native environment that runs several microservices on containerized infrastructure. Your Splunk APM deployment just issued a high error rate alert on the `/checkout` endpoint of the `api` service, and you need to investigate the issue.
- From the alert dialog box, click Troubleshoot to navigate directly to troubleshooting in APM, with time, service, endpoint, and environment context carried over.
- Check whether, in the service map, the circle inside the `api:/checkout` endpoint has hashed lines. This indicates that the error is rooted in a service further downstream.
- Click the Requests and Errors card to get more insight. Information on the error sources is displayed at the bottom of the docked card in an error stack. An error stack identifies the full path of the error. In this example, there is one error stack, identified by the name of the service the error is rooted in (`payment`).
- Click the error stack `payment` to display the full error path. The errors originate in the `payment` service, propagate to the `checkout` service, and finally reach the `api` service.
- Filter on the whole path. In the service map, double-click the `checkout` service and then the `payment` service to see the full error path. The circle inside the `payment` service is solid, which indicates that the error originates in that service (root cause error).
- Check for trends in the errors observed in the `payment` service. Top tags in error spans surfaces the indexed tags that have the highest error counts in the selected service (`payment`). It looks like the problem is with a particular Kubernetes node (`node6`), because every request to it results in an error.
- Explore further. You know that `node6` has problems. From the Breakdown drop-down menu in the service map, select kubernetes_node to validate that the issue is limited to `node6`.
- Find out whether, within `node6`, there is a particular tenant that has issues. Select tenant from the Breakdown drop-down menu to further confirm that all tenants on the node, including `silver`, are having the same issue. All the problems are rooted in a particular node (`node6`), and that node was surfaced by the tag analysis.
- Look at an example trace. Click on a point that corresponds with high errors in the Request Rate chart to display a list of example traces to choose from. Click a trace ID to see the trace.
- Click `/payment/execute` (the most downstream span with errors) to display the metadata on that span. You can see all the tags, including the `kubernetes_node` tag, which identifies the node the problematic span is running on.
- Finally, explore `node6` by navigating to the Kubernetes Navigator. In the node details, you can see the containers that are running in `node6`. Notice that a container (`robot`) is taking approximately 90 percent of the memory in this node, which puts memory pressure on the pod that the `payment` service is running on. Click `robot` to open the sidebar and drill down into details without losing context. In this case, the container has no memory limit, which is probably why it is using all of the memory on this node.
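A common remediation for this kind of root cause is to set a memory request and limit on the noisy container, so the kubelet constrains it instead of letting it starve neighboring pods. The fragment below is an illustrative sketch only; the container name `robot` comes from this scenario, and the image name and sizes are assumptions you would tune for your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: robot
spec:
  containers:
    - name: robot
      image: robot:latest   # hypothetical image name
      resources:
        requests:
          memory: "256Mi"   # what the scheduler reserves for this container
        limits:
          memory: "512Mi"   # container is OOM-killed if it exceeds this
```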
In summary, a "noisy neighbor" put memory pressure on the pod that the `payment` service was running on, causing errors that then propagated all the way upstream to the `api` service, which triggered a high error rate alert.
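Conceptually, the tag analysis that pinpointed `node6` is an aggregation of error counts by indexed tag value. The sketch below illustrates that idea only; the span fields and values are hypothetical and do not reflect the actual Splunk APM data model:

```python
from collections import Counter

# Hypothetical spans from a trace: each records its service, whether it
# errored, and the value of an indexed tag (kubernetes_node).
spans = [
    {"service": "payment", "error": True, "kubernetes_node": "node6"},
    {"service": "payment", "error": True, "kubernetes_node": "node6"},
    {"service": "payment", "error": False, "kubernetes_node": "node3"},
    {"service": "checkout", "error": True, "kubernetes_node": "node2"},
]

def top_tags_in_error_spans(spans, service, tag):
    """Count error spans per tag value for one service, highest first."""
    counts = Counter(
        s[tag] for s in spans if s["service"] == service and s["error"]
    )
    return counts.most_common()

print(top_tags_in_error_spans(spans, "payment", "kubernetes_node"))
# [('node6', 2)]
```

A tag value that dominates the error count, as `node6` does here, is a strong hint that the failure correlates with infrastructure rather than with the service's code.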
These additional Splunk resources might help you understand and implement these recommendations: