Splunk Lantern

Prescriptive Adoption Motion - Infrastructure Monitoring

 

Infrastructure monitoring is an essential part of managing the lower layers of the service stack. It covers crucial elements such as servers, databases, container systems, app servers, network, and storage, with the goal of ensuring the availability and optimal performance of the applications built on top of them. An error in any of these components can have a negative impact on users. For example, if a server runs short of memory, users trying to make purchases on an online store hosted on that server might experience delays. When problems like this occur, a key goal of an organization's support team is to reduce the mean time to recovery (MTTR) in order to minimize user impact or avoid disruptions altogether.

The infrastructure tier forms the foundation of the Observability Full Stack. Notably, the infrastructure tier is not monitored in isolation; it is seamlessly connected to other tiers in the Observability Full Stack in terms of metrics, logs, and traces. This integration provides complete observability and context for efficient monitoring.

The benefits of infrastructure monitoring include: 

  • Acting before infrastructure performance affects end-user experience. 
  • Instantly detecting and accurately alerting on dynamic thresholds, multiple conditions, and complex rules to eliminate alert storms and dramatically reduce MTTD/MTTR.
  • Answering business-critical questions in context and monitoring service-level objectives and indicators instantly.
  • Supporting a range of needs, from tracking custom metrics for business KPIs to token-based access and usage controls.
  • Troubleshooting across thousands of microservices and billions of events without missing anything. 
  • Aggregating metrics before they are ingested or dropping any unused metric time series so that you can focus on scaling your apps.

Aim and strategy

Platform engineers provide software engineers with observability tools so that all the engineering teams have a single source of truth, which enables them to share best practices across the org and collaborate to minimize MTTR. At the same time, the platform engineers can monitor and maintain access and cost control of these observability tools so that everyone operates within budget. 

Teams using Splunk Observability Cloud can easily enrich their data with custom metrics. From key business metrics to infrastructure, applications, or the end user experience, teams can compare business performance alongside system health, and immediately detect and investigate problems. 

Engineers detect any abnormal change or spike in results, and can quickly scope the severity of an issue by comparing their infrastructure and application performance metrics alongside their business KPIs. To prioritize and isolate issues more efficiently, the Splunk platform provides AI-directed troubleshooting that recommends the causes for errors or slowness that impact services and customers the most.

Splunk Observability Cloud helps speed troubleshooting and identification of the biggest causes of slowness, errors, or anomalies that impact services and customers. Because all metrics, traces, and logs are collected, engineers can quickly scope an issue’s severity, narrow down problematic components, and pinpoint the exact issue. For troubleshooting microservices and cloud native environments, Splunk Infrastructure Monitoring helps teams detect, isolate, and resolve issues faster.

Splunk Infrastructure Monitoring is the solution in the Splunk Observability Cloud platform that monitors and observes system metrics for physical and virtual components across enterprise hybrid and multi-cloud environments. It also offers support for a broad range of integrations for collecting full-fidelity data, from system metrics for infrastructure components to custom data from your applications. When an issue is detected with Splunk Infrastructure Monitoring, it triggers the incident management lifecycle. In addition to detecting an issue, Splunk Infrastructure Monitoring is also used in the investigation step of the incident management cycle.

Common use cases

Container orchestration platform monitoring

Server and operating system monitoring

Database monitoring

Other use cases to consider

Use cases are specific to each organization, so treat the following list as a starting point for identifying new use cases of your own. Be sure to monitor availability, performance, capacity, and error conditions for any new components.

  • Hosted services monitoring
  • Database monitoring 
  • Network monitoring
  • Serverless function monitoring
  • Middleware monitoring
  • Virtualization platform monitoring
  • Storage monitoring
  • Identity and access management (IAM) monitoring

User roles 

Role: Responsibilities

  • Splunk Observability Cloud Admin: Configure the Splunk Infrastructure Monitoring solution in Splunk Observability Cloud.
  • Process Owner: Hold decision-making authority for global process definition and for solution requirements definition and approval.
  • Engineering Team: Provide self-service tooling for developers to improve productivity and create consistency across teams.
  • Site Reliability Engineer (SRE): Deploy and manage apps and cloud infrastructure, and ensure reliability.
  • Business Analyst: Work with the business and end users to understand and document business requirements.
  • Quality Assurance (QA): Perform functionality testing.

Preparation

1. Prerequisites

To get data into Splunk Observability Cloud for infrastructure monitoring:

  • You must be an administrator in Splunk Observability Cloud.
  • You must have an access token for the Splunk Observability Cloud organization you want to get data into.
  • If you are connecting to Amazon Web Services, you must have an access token for the Splunk Observability Cloud organization you want to get data into.
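As a rough illustration of how the access token is used, the following Python sketch builds (but does not send) a request that carries a single gauge datapoint. The realm `us0`, the metric name `demo.memory.used_pct`, the `host` dimension, and the token placeholder are hypothetical values. The URL shape and the `X-SF-Token` header are assumptions based on the public ingest API, so verify them against Splunk Docs for your organization before relying on them.

```python
import json
import urllib.request


def build_datapoint_request(realm, token, metric, value, dimensions):
    """Build (but do not send) an HTTP request carrying one gauge datapoint."""
    # Assumed URL shape for the ingest endpoint; confirm the realm and path
    # for your organization in Splunk Docs before using this.
    url = f"https://ingest.{realm}.signalfx.com/v2/datapoint"
    payload = {
        "gauge": [{"metric": metric, "value": value, "dimensions": dimensions}]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # The access token identifies the organization the data belongs to.
        headers={"Content-Type": "application/json", "X-SF-Token": token},
        method="POST",
    )


# All values below are placeholders for illustration only.
request = build_datapoint_request(
    realm="us0",
    token="YOUR_ACCESS_TOKEN",
    metric="demo.memory.used_pct",
    value=72.5,
    dimensions={"host": "web-01"},
)
# Sending is left out on purpose; urllib.request.urlopen(request) would transmit it.
```

In practice, most data arrives through the supported integrations and agents rather than hand-built requests. The sketch only shows where the token fits.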

4. Considerations

Splunk Infrastructure Monitoring is part of Splunk Observability Cloud. To get started with Splunk Observability Cloud, follow the instructions in the Splunk Docs topic, Set up and administer Splunk Observability Cloud. New users might also be interested in an overview of important terms and concepts.

Observability is a broad concept, and organizations can have difficulty determining what to monitor. Many organizations find that monitoring four key capabilities covers their systems in a way that meets their needs: availability monitoring, error conditions monitoring, performance monitoring, and capacity monitoring. When a full 360-degree view of an end-to-end production system is needed, observability should be instrumented for all four capabilities. Failing to monitor all four elements of a live production system increases the risk of business or operational disruption and significantly increases the time to restore service should a disruption occur.

Whether your objective is to horizontally monitor every instance of a common component type (for example, all Linux servers or Kubernetes pods) or to monitor only the infrastructure that supports a critical business application (for example, the Linux or Windows databases), Splunk recommends that you employ the capabilities in the following sequence to achieve that objective.

  1. Availability monitoring. Monitoring availability is the foundational observability element. You must monitor for the very existence of the object being observed because if something doesn’t exist, nothing else matters. You must know if an application is running and if the intended users can use it. In the physical world, your favorite store must be open so that you can walk in and buy something. If it is closed (that is, unavailable), nothing else about its operations matters.
  2. Error conditions monitoring. Error conditions are meant to be a catch-all concept. There are countless situations that can indicate that a system has an issue that is currently disrupting the business or will imminently cause disruption without intervention. This capability generally involves looking at logs for messages that indicate a problem has occurred or is about to occur. The error conditions element is also where ‘state monitoring’ fits. State monitoring detects when something goes from one state to another, such as from on to off, or from up to down.
  3. Performance monitoring. After you have a good implementation of availability monitoring, it is important to understand how well the system is performing. Think again about your favorite physical store. When the store is available and shoppers are shopping, you might want to understand how long it takes a shopper to get through the checkout line. The same is true for systems; you want to know how long it takes for a user to receive a checkout complete message after clicking the pay now button. This is just one of many different types of performance measures, but they are all about cycle time, or how long it takes to complete a task.
  4. Capacity monitoring. Capacity is the amount of resources consumed for a particular function. Capacity can be measured in many ways, such as bandwidth, storage space, memory utilization, number of purchases executed simultaneously, and more. If a physical store only has one cash register and a marketing promotion drives a large number of people to shop there at the same time, the time it takes to get through the checkout line could increase dramatically. The same is true in systems. Many people using the system at the same time can cause very slow response times or possibly crash the system.
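To make the sequence above concrete, the following Python sketch evaluates one host against all four capabilities in the recommended order. It is a minimal illustration only: the `HostSnapshot` fields, the `evaluate` function, and the threshold values are hypothetical and are not part of Splunk Infrastructure Monitoring.

```python
from dataclasses import dataclass


@dataclass
class HostSnapshot:
    """Hypothetical point-in-time readings for one monitored host."""
    reachable: bool           # availability: did the host respond at all?
    error_log_lines: int      # error conditions: matching log lines in the window
    checkout_seconds: float   # performance: time to complete a key transaction
    memory_used_pct: float    # capacity: share of memory currently consumed


def evaluate(snapshot, max_checkout_seconds=2.0, max_memory_pct=90.0):
    """Return the capabilities that currently show a problem, in sequence order."""
    if not snapshot.reachable:
        # If the host is unavailable, the remaining readings are not meaningful.
        return ["availability"]
    problems = []
    if snapshot.error_log_lines > 0:
        problems.append("error conditions")
    if snapshot.checkout_seconds > max_checkout_seconds:
        problems.append("performance")
    if snapshot.memory_used_pct > max_memory_pct:
        problems.append("capacity")
    return problems


healthy = HostSnapshot(True, 0, 0.8, 55.0)
overloaded = HostSnapshot(True, 3, 4.2, 96.0)
down = HostSnapshot(False, 0, 0.0, 0.0)

print(evaluate(healthy))
print(evaluate(overloaded))
print(evaluate(down))
```

Checking availability first mirrors the store analogy: if the host does not respond, nothing else about it is worth evaluating.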

You might choose to fully implement one capability before moving to the next. Some organizations instead employ an iterative strategy in which they partially instrument each capability and then cycle back through two or three more times, adding metrics, logs, and traces each time. Choose the approach that best fits the particular needs of your organization.

Full-stack versus individual element strategy

On a copy machine there are many moving parts, and a single component can impact the performance of the entire machine. For example, if the toner is low or a roller is sticky, the end result is a copy that doesn’t meet your needs. The same is true in IT systems. For example, a database queuing issue can cause users to experience slow response times. 

As you instrument monitoring for your environment, keep this analogy in mind. Just as you would not want to only monitor for low toner, you don't want to monitor only a single component of a full-stack service. You can certainly start monitoring a single component (for example, all Linux servers or AWS EC2 environments) but don't forget to instrument monitoring for the rest of the components that work together and can impact the customer experience.  

Splunk Infrastructure Monitoring has system limits that help ensure good performance, stability, and reliability. These limits also protect the infrastructure monitoring multi-tenant environment. Exceeding these limits might degrade your infrastructure monitoring experience.

To help you avoid problems when you use infrastructure monitoring, consider the information presented in System limits for Splunk Infrastructure Monitoring, which includes the following:

  • The name and value of each system limit
  • If available, the organization metrics associated with the limit
  • The impact you observe when you exceed the limit

Implementation guide

Splunk Infrastructure Monitoring helps customers answer the question “Do I have a problem?” and then alerts them to the problem in real time through detectors, which are powered by streaming analytics.

To adopt the infrastructure monitoring use cases and gain value, you must send data into Splunk Infrastructure Monitoring through integrations with cloud providers and services or through agents, create dashboards and visualizations to make sense of all your data, and then create detectors so that you are alerted on this data at the thresholds that are important to your use cases.
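The idea behind a threshold-based detector can be sketched in a few lines of Python. This is not how detectors are defined in Splunk Infrastructure Monitoring; it is a simplified illustration, and the `detect` function, its parameters, and the sample CPU values are hypothetical. It fires once each time a value stays above a threshold for a set number of consecutive datapoints, which helps avoid alerting on a single brief spike.

```python
def detect(values, threshold, lasting):
    """Return the indexes at which an alert would fire.

    An alert fires when the value has been above `threshold` for `lasting`
    consecutive datapoints, and fires only once per sustained breach.
    """
    fired = []
    run = 0  # length of the current run of datapoints above the threshold
    for index, value in enumerate(values):
        run = run + 1 if value > threshold else 0
        if run == lasting:
            fired.append(index)
    return fired


# Hypothetical CPU utilization samples, one per interval.
cpu_pct = [40, 95, 50, 92, 93, 97, 99, 60, 91, 94, 96]
alerts = detect(cpu_pct, threshold=90, lasting=3)
print(alerts)
```

The single spike at the second datapoint does not trigger an alert, while each of the two sustained breaches triggers exactly one.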

Here are some best practice articles to assist with implementing the above steps: 

Success measurement

After you implement the guidance in this adoption guide, you should see improvements in the following:

  • Service performance
  • Customer experience 
  • Developer productivity 
  • Mean Time To Detect (MTTD)
  • Mean Time To Recovery (MTTR)
  • Service Level Objectives (SLO)
  • Service Level Agreements (SLA)
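To track MTTD and MTTR over time, you need consistent timestamps for each incident. The following Python sketch shows one way to compute both measures. The incident records are made-up sample data, and the sketch assumes that both measures are counted from the moment the problem started. Some organizations measure MTTR from detection instead, so adjust the keys to match your own definition.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; in practice these would come from your
# incident management tooling.
incidents = [
    {
        "started": datetime(2024, 1, 10, 10, 0),
        "detected": datetime(2024, 1, 10, 10, 4),
        "resolved": datetime(2024, 1, 10, 10, 40),
    },
    {
        "started": datetime(2024, 1, 12, 14, 0),
        "detected": datetime(2024, 1, 12, 14, 10),
        "resolved": datetime(2024, 1, 12, 15, 0),
    },
]


def mean_duration(records, start_key, end_key):
    """Average the time between two timestamps across all incident records."""
    total = sum((r[end_key] - r[start_key] for r in records), timedelta())
    return total / len(records)


mttd = mean_duration(incidents, "started", "detected")
mttr = mean_duration(incidents, "started", "resolved")
print(f"MTTD: {mttd}, MTTR: {mttr}")
```

Comparing these values before and after adoption gives a simple baseline for the improvements listed above.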