Splunk Application Performance Monitoring (APM) provides seamless correlation between cloud infrastructure and the microservices running on top of it. If an application service functions outside of expected norms because of a memory leak, a noisy neighbor container, or any other infrastructure-related issue, Splunk APM will let you know. To complete the picture, in-context access to logs and events enables deeper troubleshooting and root-cause analysis.
Splunk APM was built to solve problems faster in monolithic and microservice application architectures by immediately detecting problems from new deployments, troubleshooting the source of an issue, and optimizing service performance. Whether you’re spending time on call, debugging, or optimizing service performance, Splunk APM helps you quickly understand the “why” for your applications and services.
The benefits of application monitoring include:
- Pinpointing the source of application problems quickly.
- Acting before application performance affects end-user experience.
- Instantly detecting and accurately alerting on dynamic thresholds, multiple conditions, and complex rules to eliminate alert storms and dramatically reduce MTTD/MTTR.
- Answering business-critical questions in context and monitoring service-level objectives and indicators instantly.
- Troubleshooting across thousands of microservices and billions of events without missing anything.
- Quickly investigating end-point performance problems.
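To make the idea of alerting on a dynamic threshold more concrete, here is a minimal Python sketch. It is not how Splunk APM detectors are implemented. It assumes a simple rule in which a value is flagged when it exceeds the mean plus `k` standard deviations of the preceding `window` values. The function name `dynamic_threshold_alerts` and its parameters are hypothetical.

```python
from statistics import mean, stdev


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return the indices of values that exceed a threshold derived from
    the preceding `window` values (mean + k * standard deviation).

    Illustrative assumption only, not the Splunk APM detector logic.
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if values[i] > threshold:
            alerts.append(i)
    return alerts


# Hypothetical latency samples in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(dynamic_threshold_alerts(latency_ms))  # prints [6]
```

Because the threshold follows recent history, a static limit does not need to be tuned per service. A production rule would typically also set a minimum band so that a perfectly flat history does not alert on tiny changes.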
Aim and strategy
Platform teams provide software engineers with observability tools so that all engineering teams have a single source of truth, which enables them to share best practices across the org and collaborate to minimize MTTR. At the same time, the platform engineers can monitor and maintain access and cost control of these observability tools so that everyone operates within budget. Teams using Splunk Observability Cloud can easily enrich their data with custom metadata in their traces as well as custom metrics. From key business metrics to infrastructure, applications, or the end user experience, teams can compare business performance alongside system health, and immediately detect and investigate problems.
Engineers detect any abnormal change or spike in results and can quickly scope the severity of an issue by comparing their infrastructure and application performance metrics alongside their business KPIs. To prioritize and isolate issues more efficiently, you can use AI-directed troubleshooting that recommends the causes for errors or slowness that impact services and customers the most.
Splunk Observability Cloud helps you speed up troubleshooting and identify the biggest sources of slowness, errors, or anomalies that impact services and customers. Because all metrics, traces, and logs are collected, engineers can quickly scope an issue’s severity, narrow down problematic components, and pinpoint the exact issue. For troubleshooting microservices and cloud native environments, Splunk APM lets you detect, troubleshoot, and optimize with more context in application environments so you can:
- Monitor deployments for latency, errors, or anomalies.
- Visualize interdependencies between services in your workflows.
- Review services and workflows by error rate and latency.
- Resolve service bottlenecks due to poor CPU and memory allocation.
- Pinpoint the source of an issue down to detailed, granular log data.
Common use cases
Splunk Observability Cloud Admin
Configure the Splunk APM solution in Splunk Observability Cloud.
Make decisions around global process definition and solution requirements definition and approval.
Provide self-service tooling for developers to improve productivity and create consistency across teams.
Site Reliability Engineer (SRE)
Deploy and manage apps and cloud infrastructure, and ensure reliability.
Work with the business and end users to understand and document business requirements.
Quality Assurance (QA)
Responsible for functionality testing.
To get data into Splunk Observability Cloud for application monitoring:
- You must be an administrator in Splunk Observability Cloud.
- You must have an access token for the Splunk Observability Cloud organization you want to get data into.
- If you are connecting to Amazon Web Services, the same access token requirement applies for the Splunk Observability Cloud organization you want to get data into.
Recommended training
- Introduction to Splunk Observability Cloud - 15 minutes, free
- Introduction to application performance management - 15 minutes, free
- Using Splunk application performance management - 4.5 hours, virtual
- Configuring tracing and profiling for Splunk APM - 6 hours, virtual
- Manual instrumentation with Splunk APM - 6 hours, virtual
- Splunk Lantern: Comprehensive help guide
Splunk APM is part of Splunk Observability Cloud. To get started with Splunk Observability Cloud, follow the instructions in the Splunk Docs topic, Set up and administer Splunk Observability Cloud. New users might also be interested in an overview of important terms and concepts.
Observability is a broad concept, and organizations can have difficulty determining what to monitor. Many organizations find that monitoring four key capabilities helps them cover their systems in a way that meets their needs: availability monitoring, error conditions monitoring, performance monitoring, and capacity monitoring. When a full 360-degree view of an end-to-end production system is needed, observability should be instrumented for all four capabilities. Failing to monitor all four elements of a live production system increases the risk of business or operational disruption and significantly increases the time to restore service should a disruption occur.
Whether your objective is to horizontally monitor every instance of a common server type (for example, all Linux servers or Kubernetes pods) or to monitor only the infrastructure that supports a critical business application (for example, the Linux or Windows databases), Splunk recommends that you employ the capabilities in the following sequence to achieve that objective.
- Availability monitoring. Monitoring availability is the foundational observability element. You must monitor for the very existence of the object being observed because if something doesn’t exist, nothing else matters. You must know if an application is running and if the intended users can use it. In the physical world, your favorite store must be open so that you can walk in and buy something. If it is closed (that is, unavailable), nothing else about its operations matters.
- Error conditions monitoring. Error conditions are meant to be a catch-all concept. Countless situations can indicate that a system has an issue that is currently disrupting the business or will imminently cause disruption without intervention. This capability generally involves looking at logs for messages that indicate a problem has occurred or is about to occur. The error conditions element is also where ‘state monitoring’ fits. State monitoring tracks when something goes from one state to another, such as from on to off, or from up to down.
- Performance monitoring. After you have a good implementation of availability monitoring, it is important to understand how well the system is performing. Think again about your favorite physical store. When the store is available and shoppers are shopping, you might want to understand how long it takes a shopper to proceed through the checkout line. The same is true for systems; you want to know how long it takes for a user to receive a checkout complete message after clicking the pay now button. This is just one of many different types of performance measures, but they are all about cycle time, or how long it takes to complete a task.
- Capacity monitoring. Capacity refers to the resources that are consumed for a particular function. Capacity can come in many different measurements, such as bandwidth, storage space, memory utilization, number of purchases executed simultaneously, and more. If a physical store only has one cash register and a marketing promotion drives a large number of people to shop there at the same time, the time to proceed through the checkout line could increase dramatically. The same is true in systems. Many people using the system at the same time can cause very slow response times or possibly crash the system.
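The four capabilities can be illustrated with a small Python sketch that derives one summary value per capability from a handful of samples. This is only an illustration under stated assumptions. The `Sample` record, its field names, and the choice of summary statistics are hypothetical and are not part of Splunk APM.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sample:
    up: bool           # availability: did the health check succeed?
    error: bool        # error condition: did the request fail?
    latency_ms: float  # performance: time taken to complete the request
    cpu_pct: float     # capacity: CPU utilization when the sample was taken


def summarize(samples):
    """Reduce raw samples to one indicator per monitoring capability."""
    n = len(samples)
    return {
        "availability_pct": 100.0 * sum(s.up for s in samples) / n,
        "error_rate_pct": 100.0 * sum(s.error for s in samples) / n,
        "median_latency_ms": median(s.latency_ms for s in samples),
        "peak_cpu_pct": max(s.cpu_pct for s in samples),
    }


# Hypothetical samples: the last one represents a timed-out health check.
samples = [
    Sample(up=True, error=False, latency_ms=120.0, cpu_pct=40.0),
    Sample(up=True, error=False, latency_ms=130.0, cpu_pct=55.0),
    Sample(up=True, error=True, latency_ms=900.0, cpu_pct=92.0),
    Sample(up=False, error=True, latency_ms=5000.0, cpu_pct=95.0),
]
print(summarize(samples))
```

Reading the result in the recommended sequence (availability first, then errors, performance, and capacity) mirrors the order in which the capabilities are described above.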
You might choose to fully implement one capability before moving to the next. Some organizations have employed an iterative strategy in which they do partial instrumentation for each capability and then cycle back through two or three more times, adding more metrics, logs, and traces each time. Base your approach on the particular needs of your organization.
Full-stack versus individual element strategy
On a copy machine there are many moving parts, and a single component can impact the performance of the entire machine. For example, if the toner is low or a roller is sticky, the end result is a copy that doesn’t meet your needs. The same is true in IT systems. For example, a database queuing issue can cause users to experience slow response times.
As you instrument monitoring for your environment, keep this analogy in mind. Just as you would not want to only monitor for low toner, you don't want to monitor only a single component of a full-stack service. You can certainly start monitoring a single component (for example, all Linux servers or AWS EC2 environments) but don't forget to instrument monitoring for the rest of the components that work together and can impact the customer experience.
Splunk APM has system limits that help ensure good performance, stability, and reliability. These limits also protect the application monitoring multi-tenant environment. Exceeding these limits might degrade your application monitoring experience.
To help you avoid problems when you use application monitoring, consider the system limit information presented in Splunk APM system limits, which includes the following:
- The name and value of each system limit.
- If available, the organization metrics associated with the limit.
- The impact you observe when you exceed the limit.
The following process steps provide guidance for your Splunk APM implementation.
- Identify. Identify a good cloud native microservice-based application candidate within your company. Use this to validate your implementation and to set appropriate enterprise-wide standards for future application onboarding.
  - Select an application of low to medium complexity with clear problem, challenge, or situation statements. This assures value realization as you move forward on your APM journey with Splunk.
  - Make sure you agree on a measurable objective in support of the goal. This makes sure you can measure success and evangelize within your company. Some examples are:
    - Improve deployment frequency from every 3 days to every 2 days.
    - Improve application availability from 96% to 98.5%.
    - Reduce customer order drop-out rates from 15% to 5%.
- Design. Design and instrument your application according to how you plan to monitor, investigate, and diagnose issues to drive the goals and objectives in the Identify step. Defining and adhering to enterprise standards, such as tag definitions and naming conventions with appropriate governance, is critical to success.
- Operate. Make sure you have good operational alerts (detectors) that use proactive monitoring and data analytics to notify and engage operations according to priority and severity. This should be tightly coupled with your company's existing DevOps processes, event management, incident, change, and problem management processes. These interlocks provide improved visibility and continuity.
- Improve. Your solution always requires attention to sustain and provide incremental value back to the business. You should aim to iteratively improve the processes you put in place in the Operate step, specifically problem management processes, to ensure that the APM instrumentation remains relevant and drives operational excellence. Some examples are:
  - Add detectors or improve existing ones.
  - Create additional tags to segment data from spans and traces.
  - Create additional business workflows to interconnect business KPIs with critical business transactions.
- Stakeholders. Always make sure you have designated stakeholders that cross development, operations, and shared infrastructure process boundaries. They help keep the project on course and assist you in addressing any bottlenecks in achieving initiative goals and objectives.
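The design guidance above on tag definitions and naming conventions can be enforced with a simple automated check. The following Python sketch assumes a hypothetical convention (lowercase, dot-separated tag keys and a small set of required tags). The pattern, the required set, and the `check_tags` helper are all illustrative assumptions and should be replaced with your own enterprise standard.

```python
import re

# Hypothetical convention: lowercase, dot-separated keys
# such as "deployment.environment".
TAG_KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$")

# Hypothetical set of tags that every service must carry.
REQUIRED_TAGS = {"service.name", "deployment.environment"}


def check_tags(tags):
    """Return a list of human-readable problems found in a tag dictionary."""
    problems = []
    for key in sorted(tags):
        if not TAG_KEY_PATTERN.match(key):
            problems.append(f"invalid key: {key}")
    for key in sorted(REQUIRED_TAGS - set(tags)):
        problems.append(f"missing required tag: {key}")
    return problems


print(check_tags({"service.name": "checkout", "Team": "payments"}))
# prints ['invalid key: Team', 'missing required tag: deployment.environment']
```

Running a check like this in a build pipeline is one way to apply governance before new services are onboarded, rather than cleaning up inconsistent tags afterward.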
Here are some best practice articles to assist with implementing the above steps:
- Getting started with APM - Understand some core concepts and the value Splunk APM aims to deliver.
- Getting data in APM - The first step to driving value from Splunk Observability Cloud is getting data in.
- Extracting service insights from APM - The Splunk APM homepage provides a high density view at a service/workflow level with historical context.
- Implementing features and use case with Splunk APM - High value features you should configure and optimize in order to get the most out of Splunk APM.
- Administering APM - Know how to best manage the tool in order to optimize usage throughout your organization.
As you implement the guidance in this adoption guide, you should see improvements in the following:
- Application availability
- Customer order drop-out rates
- Deployment frequency
- Mean Time to Detect (MTTD)
- Mean Time to Recovery (MTTR)
- Service Level Objectives (SLO)
- Service Level Agreements (SLA)
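To track several of these measures consistently, it helps to agree on how they are calculated. The following Python sketch shows one possible calculation from incident records. It assumes that MTTD is measured from incident start to detection, that MTTR is measured from incident start to resolution, and that availability is the share of the period without an open incident. The record fields and the `incident_metrics` helper are hypothetical, and your organization might define these measures differently.

```python
from datetime import datetime, timedelta


def mean_minutes(deltas):
    """Average a list of timedeltas and express the result in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def incident_metrics(incidents, period):
    """Return (MTTD minutes, MTTR minutes, availability percent)."""
    mttd = mean_minutes([i["detected"] - i["started"] for i in incidents])
    mttr = mean_minutes([i["resolved"] - i["started"] for i in incidents])
    downtime = sum((i["resolved"] - i["started"] for i in incidents), timedelta())
    availability = 100.0 * (1 - downtime / period)
    return mttd, mttr, availability


# Hypothetical incident records for a 30-day reporting period.
incidents = [
    {"started": datetime(2024, 1, 5, 10, 0),
     "detected": datetime(2024, 1, 5, 10, 12),
     "resolved": datetime(2024, 1, 5, 11, 0)},
    {"started": datetime(2024, 1, 19, 14, 0),
     "detected": datetime(2024, 1, 19, 14, 4),
     "resolved": datetime(2024, 1, 19, 14, 30)},
]
mttd, mttr, availability = incident_metrics(incidents, timedelta(days=30))
print(mttd, mttr, round(availability, 2))  # prints 8.0 45.0 99.79
```

Recording the baseline with the same calculation before the implementation starts makes it possible to show the improvement against the measurable objectives agreed on earlier.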