Splunk Lantern

Troubleshooting critical application performance issues

 

As an application owner at your organization, you are constantly on the lookout for performance issues. It's vital to your users that applications work, but you know they are vulnerable to any of the following:

  • slow response times
  • application errors and exceptions
  • infrastructure resource constraints
  • bottlenecks in application components
  • network or client-side issues

Your traditional approach to monitoring infrastructure with Splunk software has been the following:

  1. Bring the logs from your application's backend services into a heavy forwarder.
  2. Send the logs to the appropriate index, based on context, such as infrastructure or security.
  3. Write searches with SPL to analyze the data and create reports.
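
As a rough illustration of the third step, the following Python sketch performs the kind of aggregation a reporting search might do: it groups log events by service and reports the event count, average response time, and server error count. The log format, field names, and sample lines are hypothetical and not taken from any real application. In SPL, a `stats` command grouped by service would play a similar role.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<service> status=<code> response_ms=<n>"
LINE_RE = re.compile(
    r"^(?P<service>\S+) status=(?P<status>\d{3}) response_ms=(?P<ms>\d+)$"
)


def summarize(lines):
    """Return {service: {"count", "avg_ms", "errors"}} for parseable lines."""
    totals = defaultdict(lambda: {"count": 0, "sum_ms": 0, "errors": 0})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        entry = totals[match["service"]]
        entry["count"] += 1
        entry["sum_ms"] += int(match["ms"])
        if int(match["status"]) >= 500:  # count server-side errors only
            entry["errors"] += 1
    return {
        service: {
            "count": e["count"],
            "avg_ms": e["sum_ms"] / e["count"],
            "errors": e["errors"],
        }
        for service, e in totals.items()
    }


# Made-up sample events for illustration.
SAMPLE_LINES = [
    "checkout status=200 response_ms=120",
    "checkout status=500 response_ms=4800",
    "catalog status=200 response_ms=80",
]

report = summarize(SAMPLE_LINES)
```

A report like this answers the question you already know to ask. It does not tell a less experienced analyst which question to ask, which is the gap the rest of this article addresses.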

This has worked well for you because you have been in the industry a long time. However, more junior employees don't always know what they are looking for. Sometimes they need more assistance in order to complete an effective response plan, which is based on the following principles:

  • Observe: Describe the right problems within the right scope and timeframe.
  • Remediate: Set up an alerting model and actions. Eliminate problems before they can impact you.
  • Understand: Determine the event’s impact to your organization. Correlate your application performance with business metrics.

How to use Splunk software for this use case

Your team is now going to use Splunk AppDynamics for application troubleshooting. Splunk AppDynamics deploys agents that have two important properties:

  • They know what to look for. They gather traces, events, and metrics, and return information about performance issues, load, and errors.
  • They know about each other's existence, which enables them to correlate information.
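
To make the second property concrete, here is a minimal sketch of correlation: each tier reports its own timing, tagged with a shared identifier, and the reports are stitched into one end-to-end view per request. The `correlation_id` field, the tier names, and the timings are illustrative assumptions. This does not describe how Splunk AppDynamics agents are implemented internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    correlation_id: str  # hypothetical ID shared by every tier handling one request
    tier: str
    duration_ms: int


def correlate(segments):
    """Group segments reported by different tiers by their shared correlation ID."""
    transactions = defaultdict(list)
    for segment in segments:
        transactions[segment.correlation_id].append(segment)
    return dict(transactions)


def slowest_tier(segments):
    """Return the tier that spent the most time within one transaction."""
    return max(segments, key=lambda s: s.duration_ms).tier


# Made-up reports arriving out of order from several tiers.
SEGMENTS = [
    Segment("req-1", "web", 40),
    Segment("req-2", "web", 35),
    Segment("req-1", "checkout-service", 90),
    Segment("req-1", "mysql", 4200),
    Segment("req-2", "catalog-service", 60),
]

transactions = correlate(SEGMENTS)
```

Calling `slowest_tier(transactions["req-1"])` on this sample data points at the `mysql` tier. Without the shared identifier, the three reports for `req-1` would be unrelated events.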

Let's see how these help with a realistic scenario. Imagine you work for an online bookseller. A customer selects several books, adds them to his cart, and selects Check out now. Instead of advancing to a shipping and payment screen, the customer sees a message that simply says "Something went wrong". Frustrated, he closes the browser tab and walks away from his computer. Your employer is losing money, so you need to fix the problem quickly.

  1. First, review the Application Flow Map. Splunk AppDynamics creates this map automatically when you deploy an agent.
  2. In the Response Time panel, you notice a spike in response time. Select the time range and click Set Time Range to narrow the time range shown in the Application Flow Map.
  3. Now you need to narrow down the scope. On the right of the Application Flow Map, look for red in the health indicators. In this case, click Business Transaction Health. Business transactions are tasks that an application performs, such as a checkout procedure. They can include server requests, web services, REST services, MVC actions, and more.
  4. Open the Transactions page. Sort the transactions by health. Click a service that has a problem, as indicated by the red health badge.
  5. Now follow the same pattern for the specific transaction. Look at the time range again for a spike and set it by dragging over the graph and clicking Set Time Range.
  6. In the top menu bar, click Transaction Snapshots. Each snapshot records what happened to a specific request from a specific user while the application was being monitored. Filter by response time to see what is slow.
  7. Click a slow transaction (ideally with a blue file icon indicating a “Full Transaction Snapshot”) to see that specific transaction flow.
  8. The panel on the left lists potential problems, and the color coding in the diagram also shows where the problems are. In this case, the MySQL database is coded red. You likely already have the information you need at this point, but you can also select a node to drill into it and see all the code related to that business transaction. In this case, the code reveals that an SQL query is the specific problem, so you know who to talk to in order to fix it. You can send them the message provided by Splunk AppDynamics for context.
  9. Finally, in the top menu bar, select Data Collectors. Data collectors in the transaction information provide business context, like the user name or the revenue that was lost when the checkout procedure didn't work. Note that this information isn't enabled automatically by agents and requires some configuration.
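
The narrowing pattern in the steps above (find the spike, restrict the time range, filter for slow transactions, locate the slow component, then attach business context) can be sketched in plain Python. This is not the Splunk AppDynamics API. The data structures, field names, thresholds, and sample values below are all hypothetical and exist only to show the reasoning.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Snapshot:
    minute: int          # minute offset within the observed period
    transaction: str     # business transaction name
    response_ms: int     # end-to-end response time
    component_ms: dict   # time spent in each component
    cart_value: float    # business context, as a data collector might supply


def spike_window(series, factor=3.0):
    """Return (first, last) minute whose response time exceeds factor x median."""
    baseline = median(series.values())
    hot = sorted(m for m, ms in series.items() if ms > factor * baseline)
    return (hot[0], hot[-1]) if hot else None


def slow_snapshots(snapshots, window, threshold_ms):
    """Keep snapshots inside the window at or above the threshold, slowest first."""
    start, end = window
    hits = [
        s for s in snapshots
        if start <= s.minute <= end and s.response_ms >= threshold_ms
    ]
    return sorted(hits, key=lambda s: s.response_ms, reverse=True)


def bottleneck(snapshot):
    """Return the component where this snapshot spent the most time."""
    return max(snapshot.component_ms, key=snapshot.component_ms.get)


# Made-up average response time (ms) per minute.
RESPONSE_TIMES = {0: 210, 1: 190, 2: 205, 3: 4900, 4: 5100, 5: 220}

# Made-up snapshots.
SNAPSHOTS = [
    Snapshot(1, "checkout", 200, {"web": 50, "checkout-service": 100, "mysql": 50}, 42.50),
    Snapshot(3, "checkout", 4900, {"web": 60, "checkout-service": 140, "mysql": 4700}, 61.00),
    Snapshot(4, "checkout", 5100, {"web": 55, "checkout-service": 145, "mysql": 4900}, 18.25),
    Snapshot(4, "search", 180, {"web": 40, "catalog-service": 140}, 0.0),
]

window = spike_window(RESPONSE_TIMES)                          # steps 2 and 5
slow = slow_snapshots(SNAPSHOTS, window, threshold_ms=1000)    # step 6
culprit = bottleneck(slow[0])                                  # steps 7 and 8
revenue_at_risk = sum(s.cart_value for s in slow)              # step 9
```

Each step shrinks the search space before the next one runs, which is why the order matters: locating the bottleneck only makes sense once you are looking at a transaction that was actually slow during the spike.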

Next steps

Now that you have an idea of how Splunk AppDynamics can help you troubleshoot your applications, watch the full .conf25 talk, Mastering strategies for troubleshooting critical application performance issues. In the talk, you'll learn more about a good triage process and ways that AI can help you troubleshoot faster in Splunk AppDynamics.

In addition, you might find these Splunk resources helpful:

Written by Łukasz Pokrzywka and Patryk Widz, Splunk Expert Engineers at Splunk