
Finding the root cause of a problem

Applicability

  • Product: Splunk APM
  • Feature: API integration
  • Function: Error alerting

Problem

You are a site reliability engineer working in a cloud-native environment that runs several microservices on containerized infrastructure. Your Splunk APM deployment has just issued a high error rate alert on the /checkout endpoint of the api service. You need to investigate the issue.
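
If your services are instrumented with OpenTelemetry, an error rate alert like this one ultimately traces back to error spans emitted by the service. The following is a minimal, hypothetical sketch of how the api service's /checkout handler might record those spans; the service, endpoint, attribute, and helper names (for example, call_checkout_service) are illustrative assumptions, not details from this scenario.

```python
# Hypothetical sketch: instrumenting the api service's /checkout handler with
# OpenTelemetry so that failures show up as error spans in Splunk APM.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("api")

def handle_checkout(request):
    # Each request to /checkout produces a span; calls to downstream services
    # (checkout, payment) become child spans in the same trace.
    with tracer.start_as_current_span("/checkout") as span:
        span.set_attribute("tenant", request.get("tenant", "unknown"))
        try:
            return call_checkout_service(request)  # hypothetical downstream call
        except Exception as exc:
            # Recording the exception and marking the span as an error is what
            # makes it count toward the endpoint's error rate.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```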

Sample Solution

  1. From the alert dialog box, click Troubleshoot to navigate directly to troubleshooting in APM, with time, service, endpoint, and environment context carried over.
  2. In the service map, check whether the circle inside the api:/checkout endpoint has hashed lines. Hashed lines indicate that the error is rooted in a service further downstream.
  3. Click the Requests and Errors card for more insight. Information about the error sources is displayed at the bottom of the docked card in an error stack, which identifies the full path of the error. In this example, there is one error stack, identified by the name of the service the error is rooted in (payment).
  4. Click the error stack payment to display the full error path. The errors originate in the payment service, propagate to the checkout service, and finally to the api service.
  5. Filter on the whole path. In the service map, double-click on the checkout service and then on the payment service to see the full error path. The circle inside the payment service is solid, which indicates that the error originates with that service (root cause error).
  6. Check for trends in the errors observed in the payment service. The Top tags in error spans panel surfaces the indexed tags with the highest error counts in the selected service (payment). It looks like the problem is with a particular Kubernetes node (node6), because every request is resulting in an error. (The first sketch after this list illustrates this kind of tag analysis.)
  7. Explore further. You know that node6 has problems. From the Breakdown drop-down menu in the service map, select kubernetes_node to confirm that the issue is limited to node6.
  8. Find out whether a particular tenant within node6 has issues. Select tenant from the Breakdown drop-down menu to confirm that the gold, platinum, and silver tenants are all affected equally. This confirms that the problem is rooted in a particular node (node6), which the tag analysis already surfaced.
  9. Look at an example trace. Click on a point that corresponds with high errors in the Request Rate chart to display a list of example traces to choose from. Click a trace ID to see the trace.
  10. Click /payment/execute (the most downstream span with errors) to display that span's metadata. You can see all of its tags, including the kubernetes_node tag identifying the node the problematic span ran on.
  11. Finally, explore node6 by navigating to the Kubernetes Navigator. In the node details, you can see the containers running in node6. Notice that one container (robot) is consuming approximately 90 percent of the memory on this node, which puts memory pressure on the payment pod. Click robot to open the sidebar and drill down to details without losing context. In this case, the container has no memory limit, which is probably why it is using all of the memory on this node. (The second sketch after this list shows one way to check a node for containers without memory limits.)
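
The tag analysis in step 6 boils down to counting errors per indexed tag value and seeing which value dominates. Here is a minimal sketch of that idea; the span dictionaries and tag names are assumptions for illustration, not Splunk APM's internal data format.

```python
# Minimal sketch of the tag analysis behind step 6: given spans from the
# payment service, count error spans per value of an indexed tag to see which
# value (here, a kubernetes_node) dominates.
from collections import Counter

spans = [
    {"error": True,  "tags": {"kubernetes_node": "node6", "tenant": "gold"}},
    {"error": True,  "tags": {"kubernetes_node": "node6", "tenant": "silver"}},
    {"error": False, "tags": {"kubernetes_node": "node2", "tenant": "platinum"}},
]

def top_tag_values(spans, tag):
    """Count error spans per value of the given tag, most common first."""
    counts = Counter(
        span["tags"].get(tag, "unknown")
        for span in spans
        if span["error"]
    )
    return counts.most_common()

print(top_tag_values(spans, "kubernetes_node"))  # [('node6', 2)]
print(top_tag_values(spans, "tenant"))           # [('gold', 1), ('silver', 1)]
```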
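
To double-check the finding from step 11 outside the UI, you could list the containers scheduled on node6 and flag any that have no memory limit. The sketch below uses the official Kubernetes Python client; the node name and kubeconfig-based access are assumptions for illustration.

```python
# Hedged sketch: find containers on a given node that have no memory limit,
# using the official Kubernetes Python client.
from kubernetes import client, config

def containers_without_memory_limits(node_name):
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
    offenders = []
    for pod in pods.items:
        for container in pod.spec.containers:
            limits = container.resources.limits or {}
            if "memory" not in limits:
                offenders.append((pod.metadata.namespace, pod.metadata.name, container.name))
    return offenders

for namespace, pod, container in containers_without_memory_limits("node6"):
    print(f"{namespace}/{pod}: container '{container}' has no memory limit")
```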

In summary, a “noisy neighbor” put memory pressure on the pod that the payment service was running on, causing errors that then propagated all the way upstream to the api service, which triggered a high error rate alert.
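
The fix this summary points toward is to give the noisy container a memory limit so it can no longer starve the payment pod. A sketch of that remediation with the Kubernetes Python client follows; the deployment name, namespace, and limit values are illustrative assumptions.

```python
# Hedged sketch: patch the hypothetical "robot" deployment so its container
# gets a memory request and limit. Names and values are illustrative.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "robot",
                        "resources": {
                            "requests": {"memory": "256Mi"},
                            "limits": {"memory": "512Mi"},
                        },
                    }
                ]
            }
        }
    }
}

# The default strategic-merge patch keys containers by name, so only the
# robot container's resources are updated.
apps.patch_namespaced_deployment(name="robot", namespace="default", body=patch)
```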
