
Troubleshooting checkout latency issues

 

You are a site reliability engineer working in a cloud-native environment that runs several microservices to support an online store. Your customer support team has received a number of complaints from customers who are experiencing checkout delays, so you need to troubleshoot and fix the issue.

Data required

Business service data

How to use Splunk software for this use case

  1. Open Splunk Observability Cloud, navigate to the main menu, and click Splunk Real User Monitoring.
  2. The Splunk Real User Monitoring home screen shows you data covering load times, activity, and front-end errors. Scan through this information and look for problems. In this example, you can see that the /cart/checkout node latency is high. Click it to investigate further.
  3. Looking at the /cart/checkout node, you can see there has been a latency jump. To look more closely, click SESSIONID.
  4. From the trace view, you can identify requests with latency issues. In this instance, you can see a 1.73-second delay. To access more information in Splunk Application Performance Monitoring, click APM in the center row of this area.
  5. Here, you can see the backend view of the trace. In this example, there are errors in cartservice, checkoutservice, and paymentservice. At the bottom of this backend view, click the frontend/cart/checkout link under Workflow Name to access the service map.
  6. On the service map, check for long response times. In this example, latency issues run from the external client, through the frontend and checkoutservice, to the paymentservice. It's not immediately obvious whether errors stemming from the paymentservice are causing this problem, so you'll need to investigate further.
  7. Click paymentservice, then click one of the errors in the Request Rate area to get some example traces. Click the Trace ID of one of the traces to see more detail.
  8. The waterfall view of this trace shows several errors. In this example, it looks like checkoutservice is retrying with an exponential backoff, which produces the progressively longer latency bars you can see at the top of this area. (For how this retry pattern creates that latency, see the first sketch after these steps.)
  9. At this stage, you might want to double-check which environment, or environments, the errors are originating in. In the drop-down box under paymentservice, click Environment. This shows you the breakdown of errors across the environments you are running.
  10. Next to the drop-down field you just changed, click Spotlight to access Tag Spotlight. Tag Spotlight shows you all the tags coming in on these spans so you can look for problems within them. In this example, you can see that version 350.10 is correlated with all the errors that are happening, indicating that this deployment might need to be rolled back. (The second sketch after these steps shows how a version tag ends up on spans in the first place.)
  11. Open Splunk Log Observer and click the + in the top toolbar. Click Fields, find label_app in this area, and click paymentservice. Then, filter for error logs in the bar chart. In this example, you can see multiple serialization failures occurring.
  12. To be completely sure that the new deployment is causing these errors, click Visual Analysis at the top of the screen and sort by version. Here you can see that version 350.10 is causing these serialization failures. (Grouping by version works only if each log record carries a version field; the third sketch after these steps shows one way that field gets emitted.)
  13. (Optional) Click one of the serialization failure events to copy the stack trace and send it to your team so you can merge a fix for the issue later.
  14. After you've rolled back the deployment, you can use Splunk Observability Cloud to double-check that the errors have stopped. Using the time picker at the top left of the screen, click Live Tail. This mode lets you see any errors that arise in real time. As the rollback rolls out, you should see these errors stop, meaning that the issue is mitigated.
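For context on step 8: retry-with-exponential-backoff is what produces a stack of progressively longer spans in a trace waterfall. The sketch below is a minimal, hypothetical illustration of the pattern; the function names, delays, and error type are assumptions for illustration, not taken from the services in this example.

```python
import random
import time

def call_payment_service(request):
    # Hypothetical stand-in for the real RPC to paymentservice;
    # assumed to raise on failure.
    raise ConnectionError("paymentservice unavailable")

def checkout_with_backoff(request, max_attempts=5, base_delay=0.1):
    """Retry a failing downstream call, doubling the wait each time.

    Each retry waits roughly twice as long as the last, so a run of
    failures adds whole seconds of latency to the calling span --
    the progressively longer bars seen in the waterfall view.
    """
    for attempt in range(max_attempts):
        try:
            return call_payment_service(request)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Waits of roughly 0.1s, 0.2s, 0.4s, 0.8s, ... plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))
```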
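For context on step 10: Tag Spotlight can only break errors down by version because the instrumentation attaches that tag to every span. Below is a minimal sketch using the OpenTelemetry Python SDK; the service name and version mirror this example, the payment.provider attribute is hypothetical, and exporter configuration (which would actually send the spans to Splunk Observability Cloud) is omitted.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes are set once and inherited by every span this
# service emits, which is what makes a version tag available for
# Tag Spotlight to correlate against errors.
resource = Resource.create({
    "service.name": "paymentservice",
    "service.version": "350.10",  # the deployment under suspicion
})
trace.set_tracer_provider(TracerProvider(resource=resource))

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge") as span:
    span.set_attribute("payment.provider", "card")  # hypothetical span tag
    # ... handle the payment ...
```

Note that the exact tag name shown in Tag Spotlight (for example, version versus service.version) depends on how the instrumentation and backend map these attributes.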
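For context on steps 11 and 12: sorting log events by version only works if each record carries a version field. One common way to get there is structured JSON logging, sketched below; the field names and log message are assumptions for illustration and may not match the demo application's actual log schema.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so each field can be indexed separately."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "paymentservice",  # mirrors the label_app filter in step 11
            "version": "350.10",          # lets step 12 group failures by version
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("paymentservice")
logger.addHandler(handler)
logger.setLevel(logging.ERROR)

# Hypothetical error message, for illustration only.
logger.error("serialization failure: unable to decode payment token")
```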

Next steps

Latency and errors in eCommerce environments can cause customers to abandon their planned purchases, with a financial impact on your organization. Quickly identifying and fixing these issues means fewer lost dollars and fewer frustrated customers.

The content in this guide comes from a .conf2020 talk, Logging for Observability, one of the thousands of Splunk resources available to help users succeed. In addition, these Splunk resources might help you understand and implement this use case:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at ondemand@splunk.com if you would like assistance.