During the Observe stage, the benefits customers target fall into one or more of the following areas:
- Customer experience. Preventing user or customer frustration due to slow response times and outages is a major benefit of observability. Being able to deliver error-free software releases faster can improve customer experience as well.
- Business continuity. Being able to keep critical business operations running is vital to survival for all organizations. Companies that are non-operational for too long might never recover.
- Revenue enablement. Ensuring that infrastructure and applications allow businesses to market and sell their products and services is another important benefit. Companies must monitor to spot issues before they happen, when they happen, and after they happen. Responding rapidly, both proactively and reactively, prevents or minimizes disruptions to sales.
- Brand reputation protection. Repeated outages give a company a reputation among its customers for being risky.
- Competitiveness. In many industries, availability and response times are benchmarked against those of competitors. Monitoring helps organizations measure quality of service in order to improve it. Being able to deliver error-free software releases faster can also improve an organization’s competitiveness.
- Mission attainment. Most government organizations, such as the military, are not revenue generators. Yet, with their life-or-death missions, systems must be ready to use at the levels of quality necessary whenever they are called upon.
- Operational efficiency and scale. Organizations looking to do more with less or attain economies of scale utilize observability solutions to automate operations such as monitoring and remediation of disruptions.
The following products in the Splunk Observability Cloud help customers reach these benefits: Splunk APM, Splunk Infrastructure Monitoring, Splunk Synthetic Monitoring, Splunk Real User Monitoring, and Splunk Log Observer.
If you're just starting out with observability, click the following link to learn more about observability best practices. Otherwise, jump to the links at the bottom of the page to start exploring use cases.
- Observability best practices
Observability is a broad concept, and organizations can have difficulty determining what to monitor at the Observe stage. Thinking about the four capabilities of Observe helps you monitor your systems in a way that meets the needs of your organization. These capabilities are availability monitoring, error conditions monitoring, performance monitoring, and capacity monitoring. When a full 360-degree view of an end-to-end production system is needed, observability should be instrumented for all four capabilities. Failing to monitor all four elements of a live production system increases the risk of business or operational disruption and significantly increases the time to restore service should a disruption occur.
In the development environment, however, it is common to focus primarily on the performance and error conditions elements for testing purposes. Software developers need to know that the code they are developing will be performant before it moves to the production environment. Likewise, looking for errors in real time during testing can speed up remediation of defects.
Whether your objective is to monitor horizontally across all servers of a common type (for example, all Linux servers or Kubernetes pods) or to monitor only the infrastructure that supports a critical business application (for example, the Linux or Windows databases), Splunk recommends that you employ the capabilities in the following sequence to achieve that objective.
You may choose to fully implement one capability before moving to the next. Some organizations have employed an iterative strategy in which they do partial instrumentation for each capability and then cycle back through two or three more times, adding metrics, logs, and traces each time. Choose your approach based on the particular needs of your organization.
Availability monitoring. Monitoring availability is the foundational observability element. You must monitor for the very existence of the object being observed because if something doesn’t exist, nothing else matters. You must know if an application is running and if the intended users can use it. In the physical world, your favorite store must be open so that you can walk in and buy something. If it is closed (that is, unavailable), nothing else about its operations matters.
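As a rough illustration of the most basic availability question ("does it exist and can it be reached?"), the following minimal sketch attempts a TCP connection to a service. The function name `is_available`, the default timeout, and the choice of a TCP check are assumptions made for this example; they are not part of any Splunk product, and a real availability check would usually also confirm that intended users can complete a transaction.

```python
import socket


def is_available(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. A successful connection shows
    that something is listening, not that the application behaves correctly.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the endpoint cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduler could call a check like this at a fixed interval and record each result, so that the share of successful checks over time becomes an availability measure.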
Error conditions monitoring. Error conditions is meant to be a catch-all concept. There are countless situations that can indicate that a system has an issue that is currently disrupting the business or will imminently cause disruption if there is no intervention. This capability generally involves looking at logs for messages that indicate a problem has occurred or is about to occur. The error conditions element is where ‘state monitoring’ fits. State monitoring detects when something goes from one state to another, such as from on to off, or from up to down.
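The two ideas in this capability, scanning logs for problem messages and watching for state changes, can be sketched as follows. The severity keywords in `ERROR_PATTERN`, the function names, and the `(timestamp, state)` sample format are assumptions chosen for illustration; real log formats and state sources vary widely.

```python
import re
from typing import Iterable, List, Tuple

# Assumed severity keywords; adjust to match your own log format.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain one of the assumed error keywords."""
    return [line for line in lines if ERROR_PATTERN.search(line)]


def state_changes(samples: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, old_state, new_state) for every transition.

    samples is an ordered sequence of (timestamp, state) pairs,
    for example ("12:00", "up").
    """
    changes = []
    previous = None
    for timestamp, state in samples:
        if previous is not None and state != previous:
            changes.append((timestamp, previous, state))
        previous = state
    return changes
```

Reporting only the transitions, rather than every sample, keeps the signal focused on the moment something went from up to down or back again.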
Performance monitoring. After you have a good implementation of availability monitoring, it is important to understand how well the system is performing. Think again about your favorite physical store. When the store is available and shoppers are shopping, you might want to understand how long it takes a shopper to proceed through the checkout line. The same applies to systems: you want to know how long it takes for a user to receive a checkout complete message after clicking the pay now button. This is just one of many different types of performance measures, but they are all about cycle time, or how long it takes to complete a task.
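Cycle time can be captured by timing a task from start to finish and then summarizing many such timings. The sketch below is one possible approach, assuming a hypothetical `timed` context manager and a nearest-rank `percentile` helper; neither name comes from a Splunk product.

```python
import math
import time
from contextlib import contextmanager
from typing import List, Sequence


@contextmanager
def timed(durations: List[float]):
    """Append the elapsed wall-clock seconds of the wrapped block to durations."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations.append(time.perf_counter() - start)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, wrapping the checkout step in `with timed(checkout_durations):` and then reading `percentile(checkout_durations, 95)` shows the time within which 95 percent of checkouts completed. A high percentile is often more revealing than an average because a few very slow checkouts are easy to hide in a mean.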
Capacity monitoring. Capacity refers to the resources that are consumed for a particular function. Capacity can be measured in many different ways, such as bandwidth, storage space, memory utilization, number of purchases executed simultaneously, and more. If a physical store has only one cash register and a marketing promotion drives a large number of people to shop there at the same time, the time needed to get through the checkout line could increase dramatically. The same is true in systems. Many people using the system at the same time can cause very slow response times or possibly crash the system.
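A capacity check typically compares how much of a resource is consumed against how much exists and flags the result when it crosses a threshold. The following minimal sketch uses disk space as the resource. The 80 and 95 percent thresholds and all function names are assumptions for illustration; suitable thresholds depend on the resource and the workload.

```python
import shutil


def utilization_pct(used: float, total: float) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def capacity_status(pct: float, warn: float = 80.0, critical: float = 95.0) -> str:
    """Classify a utilization percentage against assumed thresholds."""
    if pct >= critical:
        return "critical"
    if pct >= warn:
        return "warning"
    return "ok"


def disk_status(path: str = "/") -> str:
    """Classify disk utilization for the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return capacity_status(utilization_pct(usage.used, usage.total))
```

The same two-step pattern (compute utilization, then classify it) applies to bandwidth, memory, or concurrent purchases; only the source of the `used` and `total` values changes.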
Full-stack versus individual element strategy
A copy machine has many moving parts, and a single component can impact the entire machine. For example, if the toner is low or a roller is sticky, the end result is a copy that doesn’t meet your needs. The same is true in IT systems. For example, a database queuing issue can cause users to experience slow response times.
As you instrument monitoring for your environment, keep this analogy in mind. Just as you would not want to only monitor for low toner, you don't want to monitor only a single component of a full-stack service. You can certainly start monitoring a single component (for example, all Linux servers or AWS EC2 environments) but don't forget to instrument monitoring for the rest of the components that work together and can impact the customer experience.
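One way to keep the full stack in view is to roll component statuses up into a single service status while also listing the components that are not yet monitored. The sketch below assumes a hypothetical three-level severity scale and hypothetical component names; it illustrates the idea and is not a description of how any Splunk product computes service health.

```python
from typing import Dict, List, Tuple

# Assumed severity ordering; higher numbers are worse.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def service_status(expected: List[str], reported: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (worst reported status, components with no monitoring data).

    expected lists every component the service depends on.
    reported maps component name to its current status.
    """
    unmonitored = [name for name in expected if name not in reported]
    observed = [reported[name] for name in expected if name in reported]
    worst = max(observed, key=SEVERITY.__getitem__) if observed else "unknown"
    return worst, unmonitored


# Monitoring only the web tier looks healthy but leaves gaps in coverage.
status, gaps = service_status(["web", "app", "database"], {"web": "ok"})
```

In this example `status` is `"ok"` while `gaps` is `["app", "database"]`, which makes the blind spots explicit: the service appears healthy only because two of its three components are not being observed.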
Explore Observe focal areas and find your use cases
If you're at the Observe stage of your journey, explore the following focal areas to find use cases you should apply.
- Application monitoring overview
- Splunk APM insights and observations provide the ability to quickly identify root causes and drive improvements in release frequencies, MTTD, MTTR, and service availability.
- Business service insights overview
- Business service insights regarding processes and service level expectations are important to driving customer satisfaction, brand loyalty, and service quality improvements.
- Digital experience monitoring overview
- Splunk Synthetic Monitoring is a synthetic web performance monitoring system that helps teams see the speed and reliability of websites, web apps, and resources over time.
- Infrastructure monitoring overview
- Splunk Infrastructure Monitoring monitors and observes system metrics for physical and virtual components across enterprise hybrid and multi-cloud environments.