Maintaining *nix systems with Infrastructure Monitoring

 

In your organization, you have lots of *nix systems running critical applications or services. To keep those apps and services healthy, you need to monitor these systems and their components, such as basic configuration, system diagnostics, file systems, and packages. You also need to ensure that the appropriate technical staff are notified as quickly as possible when problems arise. With all these different concerns, you need monitoring and alerting that you can configure once and rely on to keep your users up and running.

You can use Splunk software to manage patches and updates to ensure all connected systems and related processes are running after the patch or update is complete. You can also use Splunk software for a number of other maintenance tasks, such as watching out for connectivity issues.

Prerequisites

Technologies: Splunk Infrastructure Monitoring, Splunk OpenTelemetry Collector

Data: CPU, memory, and process metrics collected from the *nix hosts you want to monitor

How to use Splunk software for this use case

For each of the procedures below, ensure that you have the Splunk OpenTelemetry Collector installed on the host you want to monitor.

*Nix CPU utilization nearing capacity

In Splunk Infrastructure Monitoring, use the following SignalFlow to search the cpu.utilization streaming metric and filter down to the host(s) you want to check.

A = data('cpu.utilization', filter=filter('host.name', '<name of host to check>')).publish(label='A')

To alert when CPU utilization is nearing max capacity for the selected host(s), use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

  • Alert when: Above
  • Threshold: 95
  • Trigger sensitivity: Duration
  • Duration: 5m
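If you prefer to review or manage the detector as code, the same rule can be expressed directly in SignalFlow. The following is a minimal sketch, assuming the built-in detect() and when() functions and an illustrative rule label; compare it with the SignalFlow shown when you edit the detector in your own organization.

A = data('cpu.utilization', filter=filter('host.name', '<name of host to check>')).publish(label='A')
# Fire when CPU utilization stays above 95 for 5 minutes
detect(when(A > 95, lasting='5m')).publish('CPU utilization above 95 for 5m')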

*Nix memory utilization nearing capacity

In Splunk Infrastructure Monitoring, use the following SignalFlow to search the memory.utilization streaming metric and filter down to the desired host(s).

A = data('memory.utilization', filter=filter('host.name', '<name of host to check>'), rollup='latest').publish(label='A')

To alert when memory utilization is nearing max capacity for the selected host(s), use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

  • Alert when: Above
  • Threshold: 95
  • Trigger sensitivity: Duration
  • Duration: 5m
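The memory detector follows the same pattern in SignalFlow. The sketch below again assumes the built-in detect() and when() functions; the additional rule at 85 percent is only an illustrative warning threshold, not part of the settings above.

A = data('memory.utilization', filter=filter('host.name', '<name of host to check>'), rollup='latest').publish(label='A')
# Critical: memory utilization stays above 95 for 5 minutes
detect(when(A > 95, lasting='5m')).publish('Memory utilization above 95 for 5m')
# Optional warning rule at a lower, illustrative threshold
detect(when(A > 85, lasting='5m')).publish('Memory utilization above 85 for 5m')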

Expected *Nix process not running

  1. Update the receivers section of the OTEL agent config file on the host to collect procstat metrics for each process.
    …
    receivers:
    …
      # The following configuration collects process metrics for all processes. You can adjust the pattern parameter to filter down to a subset of processes.
      smartagent/procstat:
        type: telegraf/procstat
        pattern: ".*"
  2. Update the services.pipelines.metrics.receivers section of the OTEL agent config file to include the procstat receiver.
    …
    service:
      extensions: …
      pipelines:
        traces: 
          …
        metrics:
          receivers: [..., smartagent/procstat]
          …
  3. In Splunk Infrastructure Monitoring, use the following SignalFlow to search the procstat.cpu_usage streaming metric, filter down to the desired hosts and processes, and summarize results by counting the total number of processes found per host.
    A = data('procstat.cpu_usage', filter=filter('host.name', '<name of host to check>') and filter('process_name', '<name of process to check>')).count(by=['host.name']).publish(label='A')

To alert when no process data is flowing in for the selected host(s) and process(es), use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and alert settings of "Hasn't reported for: 15m".
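To define the heartbeat rule in SignalFlow instead of in the detector UI, you can typically use the not_reporting module from the SignalFlow detector library. The sketch below assumes that module and its detector(stream, duration) signature, so verify it against the SignalFlow you see when you edit a Heartbeat Check detector in your own organization.

from signalfx.detectors.not_reporting import not_reporting
A = data('procstat.cpu_usage', filter=filter('host.name', '<name of host to check>') and filter('process_name', '<name of process to check>')).count(by=['host.name']).publish(label='A')
# Fire when the per-host process count stops reporting for 15 minutes
not_reporting.detector(stream=A, duration='15m').publish('Expected process not running')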

*Nix host stops reporting data

In Splunk Infrastructure Monitoring, use the following SignalFlow to search for hosts not reporting metrics after a fixed period of time.

A = data('cpu.utilization', filter=filter('host.name', '<HOSTS-TO-CHECK>')).publish(label='A')

The metric cpu.utilization is fundamental and should be present on all hosts. To alert when a host stops reporting data, use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and alert settings of "Hasn't reported for: 5m".
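The same not_reporting module sketched in the previous procedure can express this check in SignalFlow. In the sketch below, the resource_identifier argument (assumed here to accept a dimension name) scopes the check to each host.name value, so each host that stops reporting raises its own alert.

from signalfx.detectors.not_reporting import not_reporting
A = data('cpu.utilization', filter=filter('host.name', '<HOSTS-TO-CHECK>')).publish(label='A')
# Fire when a host's cpu.utilization stream has not reported for 5 minutes
not_reporting.detector(stream=A, resource_identifier='host.name', duration='5m').publish('Host stopped reporting')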

Next steps

To maximize their benefit, the procedures in this article likely need to tie into existing processes at your organization or become new standard processes. These processes commonly impact success with this use case:

  • Running regular backups
  • Maintaining tooling for software provisioning
  • Maintaining tooling for configuration management
  • Site reliability engineering processes

Measuring impact and benefit is critical to assessing the value of IT operations. The following are example metrics that can be useful to monitor when implementing this use case:

  • Mean time to resolution
  • Mean time to root cause
  • Reduction in defects

This use case is also included in the IT Essentials Learn app, which provides more information about how to implement the use case successfully in your IT maturity journey. In addition, these Splunk resources might help you understand and implement this use case:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants who can deliver a variety of technical services from a predefined catalog. Most customers have OnDemand Services as part of their license support plan. Engage the ODS team at ondemand@splunk.com if you require assistance.