Prescriptive Adoption Motion - Infrastructure Monitoring


Infrastructure monitoring is an essential aspect of managing the lower layers of the service stack. It involves monitoring crucial elements such as servers, databases, container systems, app servers, networks, and storage. The goal is to ensure the availability and optimal performance of the applications built on top of them. Any error in these components could have a negative impact on users. For example, if a server runs short of memory, it may cause delays for users trying to make purchases on an online store hosted on that server. When problems like this occur, a key goal of an organization's support team is to reduce the mean time to recovery (MTTR) in order to minimize user impact or avoid disruptions altogether.

The infrastructure tier forms the foundation of the Observability Full Stack. Notably, the infrastructure tier is not monitored in isolation; it is seamlessly connected to other tiers in the Observability Full Stack in terms of metrics, logs, and traces. This integration provides complete observability and context for efficient monitoring.

The benefits of infrastructure monitoring include:

Acting before infrastructure performance affects end-user experience.
Instantly detecting and accurately alerting on dynamic thresholds, multiple conditions, and complex rules to eliminate alert storms and dramatically reduce MTTD/MTTR.
Answering business-critical questions in context and monitoring service-level objectives and indicators instantly.
Tracking everything from custom metrics for business KPIs to token-based access and usage controls.
Troubleshooting across thousands of microservices and billions of events without missing anything.
Aggregating metrics before they are ingested or dropping any unused metric time series so that you can focus on scaling your apps.

Aim and strategy

Platform engineers provide software engineers with observability tools so that all the engineering teams have a single source of truth, which enables them to share best practices across the org and collaborate to minimize MTTR. At the same time, the platform engineers can monitor and maintain access and cost control of these observability tools so that everyone operates within budget.

Teams using Splunk Observability Cloud can easily enrich their data with custom metrics. From key business metrics to infrastructure, applications, or the end user experience, teams can compare business performance alongside system health, and immediately detect and investigate problems.

Engineers detect any abnormal change or spike in results, and can quickly scope the severity of an issue by comparing their infrastructure and application performance metrics alongside their business KPIs. To prioritize and isolate issues more efficiently, the Splunk platform provides AI-directed troubleshooting that recommends the causes for errors or slowness that impact services and customers the most.

Splunk Observability Cloud helps speed troubleshooting and identification of the biggest causes of slowness, errors, or anomalies that impact services and customers. Because all metrics, traces, and logs are collected, engineers can quickly scope an issue’s severity, narrow down problematic components, and pinpoint the exact issue. For troubleshooting microservices and cloud native environments, Splunk Infrastructure Monitoring helps teams detect, isolate, and resolve issues faster.

Splunk Infrastructure Monitoring is the solution in the Splunk Observability Cloud platform that monitors and observes system metrics for physical and virtual components across enterprise hybrid and multi-cloud environments. It also offers support for a broad range of integrations for collecting full-fidelity data, from system metrics for infrastructure components to custom data from your applications. When an issue is detected with Splunk Infrastructure Monitoring, it triggers the incident management lifecycle. In addition to detecting an issue, Splunk Infrastructure Monitoring is also used in the investigation step of the incident management cycle.

Common use cases

Container orchestration platform monitoring

Monitoring Kubernetes pods

Server and operating system monitoring

AWS Elastic Compute Cloud monitoring using Splunk Infrastructure Monitoring

Database monitoring

Other use cases to consider

Use cases are specific to each organization, so also consider these thought starters as a way to help with your ideation of new use cases. Be sure to monitor availability, performance, capacity, and error conditions for any new components.

Hosted services monitoring
Database monitoring
Network monitoring
Serverless function monitoring
Middleware monitoring
Virtualization platform monitoring
Storage monitoring
Identity and access management (IAM) monitoring

User roles

Role	Responsibilities
Splunk Observability Cloud Admin	Configure Splunk Infrastructure Monitoring solution in Splunk Observability Cloud.
Process Owner	Hold decision-making authority for global process definition and for the definition and approval of solution requirements.
Engineering Team	Provide self-service tooling for developers to improve productivity and create consistency across teams.
Site Reliability Engineer (SRE)	Deploy and manage apps and cloud infrastructure, and ensure reliability.
Business Analyst	Work with the business and end users to understand and document business requirements.
Quality Assurance (QA)	Perform functionality testing.

Preparation

1. Prerequisites

To get data into Splunk Observability Cloud for infrastructure monitoring:

You must be an administrator in Splunk Observability Cloud.
You must have an access token for the Splunk Observability Cloud organization you want to get data into.
If you are connecting to Amazon Web Services, you must have an access token for the Splunk Observability Cloud organization you want to get data into.

2. Recommended Training

Introduction to Splunk Observability - free
Introduction to Splunk Infrastructure Monitoring - free
Visualizing and alerting in Splunk Infrastructure Monitoring - 4.5 hour virtual or onsite
Automation using the REST and SignalFlow API - 2 day virtual or onsite
Using the Splunk IM Terraform provider - 2 day virtual or onsite
Ingesting application metrics in Splunk Observability Cloud - 4.5 hour virtual
Kubernetes monitoring with Splunk Observability Cloud - 4.5 hour virtual
Network performance monitoring with Splunk Network Explorer - free

3. Resources

Self-service resources:

Splunk Lantern: Comprehensive help guide
Splunk Docs: Quick start tutorial for Splunk Infrastructure Monitoring

4. Considerations

Splunk Infrastructure Monitoring is part of Splunk Observability Cloud. To get started with Splunk Observability Cloud, follow the instructions in the Splunk Docs topic, Set up and administer Splunk Observability Cloud. New users might also be interested in an overview of important terms and concepts.

Observability is a broad concept, and organizations can have difficulty determining what to monitor. Many organizations find that focusing on four key capabilities covers the aspects of their systems that matter most to them: availability monitoring, error conditions monitoring, performance monitoring, and capacity monitoring. When a full 360-degree view of an end-to-end production system is needed, observability should be instrumented for all four capabilities. Failing to monitor all four elements of a live production system increases the risk of business or operational disruption and significantly increases the time to restore service should a disruption occur.

Whether your objective is to horizontally monitor every instance of a common server type (for example, all Linux servers or Kubernetes pods) or to monitor only the infrastructure that supports a critical business application (for example, the Linux or Windows databases), Splunk recommends that you employ the capabilities in the following sequence to achieve that objective.

Availability monitoring. Monitoring availability is the foundational observability element. You must monitor for the very existence of the object being observed because if something doesn’t exist, nothing else matters. You must know if an application is running and if the intended users can use it. In the physical world, your favorite store must be open so that you can walk in and buy something. If it is closed (that is, unavailable), nothing else about its operations matters.
Error conditions monitoring. Error conditions are meant to be a catch-all concept. Countless situations can indicate that a system has an issue that is currently disrupting the business or will imminently cause disruption without intervention. This capability generally involves examining logs for messages that indicate a problem has occurred or is about to occur. The error conditions element is also where ‘state monitoring’ fits. State monitoring tracks when something goes from one state to another, such as from on to off or from up to down.
Performance monitoring. After you have a good implementation of availability monitoring, it is important to understand how well the system is performing. Think again about your favorite physical store. When the store is available and shoppers are shopping, you might want to understand how long it takes a shopper to proceed through the checkout line. The same applies to systems: you want to know how long it takes for a user to receive a checkout complete message after they click the pay now button. This is just one of many different types of performance measures, but they are all about cycle time, or how long it takes to complete a task.
Capacity monitoring. Capacity is the amount of resources consumed for a particular function. Capacity can be measured in many ways, such as bandwidth, storage space, memory utilization, number of purchases executed simultaneously, and more. If a physical store has only one cash register and a marketing promotion drives a large number of people to shop there at the same time, the time to proceed through the checkout line could increase dramatically. The same is true in systems. Many people using the system at the same time can cause very slow response times or even crash the system.
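
The recommended sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HostSnapshot` fields, the `first_failing_capability` helper, and the threshold values are hypothetical and are not part of Splunk Infrastructure Monitoring.

```python
from dataclasses import dataclass


@dataclass
class HostSnapshot:
    """Hypothetical point-in-time view of one monitored component."""
    reachable: bool            # availability: does it exist and respond?
    error_log_lines: int       # error conditions: problem messages seen in logs
    checkout_seconds: float    # performance: cycle time for one task
    memory_used_pct: float     # capacity: resource consumption


def first_failing_capability(snapshot, max_checkout_seconds=2.0, max_memory_pct=90.0):
    """Check the four capabilities in the recommended order and
    return the name of the first one that fails, or "healthy"."""
    if not snapshot.reachable:
        return "availability"
    if snapshot.error_log_lines > 0:
        return "error conditions"
    if snapshot.checkout_seconds > max_checkout_seconds:
        return "performance"
    if snapshot.memory_used_pct > max_memory_pct:
        return "capacity"
    return "healthy"


if __name__ == "__main__":
    slow_host = HostSnapshot(reachable=True, error_log_lines=0,
                             checkout_seconds=5.3, memory_used_pct=97.0)
    print(first_failing_capability(slow_host))
```

Availability is checked first because, as described above, nothing else matters if the component is not there. In practice a capacity problem often shows up as a performance symptom first, which is why the example host reports "performance" even though its memory is also nearly exhausted.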

You might choose to fully implement one capability before moving to the next. Some organizations have employed an iterative strategy in which they do partial instrumentation for each and then cycle back through two or three more times, adding more metrics, logs, and traces each time. Choose your approach based on the particular needs of your organization.

Full-stack versus individual element strategy

On a copy machine there are many moving parts, and a single component can impact the performance of the entire machine. For example, if the toner is low or a roller is sticky, the end result is a copy that doesn’t meet your needs. The same is true in IT systems. For example, a database queuing issue can cause users to experience slow response times.

As you instrument monitoring for your environment, keep this analogy in mind. Just as you would not want to only monitor for low toner, you don't want to monitor only a single component of a full-stack service. You can certainly start monitoring a single component (for example, all Linux servers or AWS EC2 environments) but don't forget to instrument monitoring for the rest of the components that work together and can impact the customer experience.

Splunk Infrastructure Monitoring has system limits that help ensure good performance, stability, and reliability. These limits also protect the infrastructure monitoring multi-tenant environment. Exceeding these limits might degrade your infrastructure monitoring experience.

To help you avoid problems when you use infrastructure monitoring, consider the system limit information presented in System limits for Splunk Infrastructure Monitoring, which includes the following:

The name and value of each system limit
If available, the organization metrics associated with the limit
The impact you observe when you exceed the limit

Implementation guide

Splunk Infrastructure Monitoring helps customers answer the question “Do I have a problem?” and then alerts them to the problem in real time through detectors, which are powered by streaming analytics.
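
To make the idea of a detector concrete, here is a minimal Python sketch of a dynamic threshold: a value is flagged when it exceeds the mean of the preceding window by more than `k` standard deviations. This is a generic illustration, not the algorithm that Splunk detectors use; the function name, window size, and sample data are assumptions.

```python
from statistics import mean, stdev


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return the indices of values that exceed
    mean + k * stdev of the `window` values before them."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if values[i] > threshold:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent); index 6 is a spike.
    cpu = [41, 43, 40, 42, 44, 43, 95, 42]
    print(dynamic_threshold_alerts(cpu))
```

Because the threshold is recomputed from recent history, the same rule adapts to hosts with different baselines, which is the practical advantage of a dynamic threshold over a fixed one.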

To adopt the infrastructure monitoring use cases and gain value, you must send data into Splunk Infrastructure Monitoring via integrations or agents with cloud providers and services, create dashboards and visualizations to make sense of all your data, and then create detectors so that you are alerted on this data at the thresholds important to your use cases.
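
As a rough sketch of the "send data in" step, the following Python builds (but does not send) an HTTP request carrying one custom gauge datapoint. It assumes the Splunk Observability Cloud ingest endpoint has the form `https://ingest.<realm>.signalfx.com/v2/datapoint` and authenticates with an `X-SF-Token` header; verify both against the current Splunk Docs before relying on them. The realm, token placeholder, metric name, and dimensions are hypothetical.

```python
import json
import urllib.request


def build_gauge_request(realm, token, metric, value, dimensions):
    """Build a POST request for a single gauge datapoint.
    The URL shape and header name are assumptions; check Splunk Docs."""
    url = f"https://ingest.{realm}.signalfx.com/v2/datapoint"
    body = {"gauge": [{"metric": metric, "value": value, "dimensions": dimensions}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-Token": token},
        method="POST",
    )


if __name__ == "__main__":
    req = build_gauge_request(
        realm="us1",                          # hypothetical realm
        token="YOUR_ACCESS_TOKEN",            # placeholder, not a real token
        metric="checkout.duration_seconds",   # hypothetical custom metric
        value=1.8,
        dimensions={"host": "web-01"},
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually send the datapoint
```

Most deployments send data through the integrations and agents described above rather than hand-built requests; the sketch is only meant to show what a datapoint with a metric name, value, and dimensions looks like.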

Review the Quick start tutorial for the steps, including prerequisites and configuration instructions.

Here are some best practice articles to assist with implementing the above steps:

Getting started with Infrastructure Monitoring - Understand some core concepts and the value this product/technology aims to deliver.
Getting data into Infrastructure Monitoring - The first step to driving value from Splunk Observability Cloud is getting data in.
Extracting insights from Infrastructure Monitoring - Built-in content provides you with immediate visibility and value right out of the box.
Implementing use cases in Infrastructure Monitoring - Start with high-value foundational use cases.
Administering Splunk Infrastructure Monitoring - Know how to best manage the tool in order to optimize usage throughout your organization.

Success measurement

When implementing the guidance in this adoption guide, you should see improvements in the following:

Service performance
Customer experience
Developer productivity
Mean Time to Detect (MTTD)
Mean Time to Recovery (MTTR)
Service Level Objectives (SLO)
Service Level Agreements (SLA)
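
To track two of these measures over time, MTTD and MTTR can be computed from incident timestamps. The sketch below is a minimal Python example under stated assumptions: the incident records are made-up sample data, and MTTR is measured from the start of the incident to its resolution (some teams measure from detection instead).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with ISO 8601 timestamps.
INCIDENTS = [
    {"started": "2024-01-05T10:00", "detected": "2024-01-05T10:04", "resolved": "2024-01-05T10:34"},
    {"started": "2024-01-12T14:00", "detected": "2024-01-12T14:10", "resolved": "2024-01-12T15:00"},
]


def minutes_between(start, end):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mttd_minutes(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean(minutes_between(i["started"], i["detected"]) for i in incidents)


def mttr_minutes(incidents):
    """Mean time to recovery: average of (resolved - started)."""
    return mean(minutes_between(i["started"], i["resolved"]) for i in incidents)


if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(INCIDENTS):.1f} min")  # MTTD: 7.0 min
    print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")  # MTTR: 47.0 min
```

Comparing these averages before and after adopting the guidance gives a simple baseline for judging whether detection and recovery are actually improving.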