Splunk Lantern

Expected *Nix process not running

Many critical IT applications and services on *nix operating systems run as processes. You want to detect when an expected process is missing from a host's process list so that you can proactively manage potential stability issues.

Procedure

Option 1

  1. Ensure that you have installed the Splunk Add-on for Unix and Linux on your Splunk search head, indexer, and the Splunk universal forwarders on the monitored systems. Click here for an example inputs.conf file that can be deployed to the universal forwarder on the *nix host to collect process data and store the results in a metrics index.
  2. In Splunk Enterprise or Splunk Cloud Platform, run the following search. You can optimize it by specifying an index and adjusting the time range.
| mstats count WHERE index="<name of *nix metrics index>" AND metric_name=ps_metric* host="<name of host to check>" BY host, COMMAND span=15m
| rename COMMAND AS process
| search process!=\[*
| eval expected_process_list=mvappend("<name of process to check>", "<name of process to check>") 
| eval expected_process_count="<total number of processes expected per host>"
| eval expected_process_regex="(?i)".mvjoin(expected_process_list, "|")
| eval expected_process_found=if(match(process,expected_process_regex),1,0)
| stats values(expected_process_list) AS expected_processes values(expected_process_count) AS expected_process_count 
  values(eval(if(expected_process_found>0,process,null()))) AS processes_found sum(expected_process_found) AS processes_found_count BY _time host
| eval count_of_missing_processes=expected_process_count - processes_found_count
| dedup host
| rename expected_processes AS "Expected Processes", expected_process_count AS "# of Expected Processes per Host",
         processes_found AS "Expected Processes Found on Host", processes_found_count AS "# of Expected Processes Found on Host", count_of_missing_processes AS "Expected Processes Missing"

Search explanation

Each part of the search is listed below, followed by an explanation of what it achieves. You can adjust this query based on the specifics of your environment.

| mstats count WHERE index="<name of *nix metrics index>" AND metric_name=ps_metric* host="<name of host to check>" BY host, COMMAND span=15m
Search the metrics index(es) where process data is being collected and filter down to the desired host(s) to check.

| rename COMMAND AS process
| search process!=\[*
Rename the field as shown for better readability, then exclude process names that begin with [ (on Linux, ps typically displays kernel threads this way).

| eval expected_process_list=mvappend("<name of process to check>", "<name of process to check>")
| eval expected_process_count="<total number of processes expected per host>"
Capture the list of expected processes to check and the total expected process count per host. Add as many processes as you need. You can use regex syntax here.

| eval expected_process_regex="(?i)".mvjoin(expected_process_list, "|")
| eval expected_process_found=if(match(process,expected_process_regex),1,0)
Join the expected process list into a single case-insensitive regular expression, then flag each process on each host that matches it.

| stats values(expected_process_list) AS expected_processes values(expected_process_count) AS expected_process_count
  values(eval(if(expected_process_found>0,process,null()))) AS processes_found sum(expected_process_found) AS processes_found_count BY _time host
Compute the number of matching processes per host over time.

| eval count_of_missing_processes=expected_process_count - processes_found_count
Return the total number of expected processes that are not currently running on the host.

| dedup host
Remove duplicate hosts.

| rename expected_processes AS "Expected Processes", expected_process_count AS "# of Expected Processes per Host",
         processes_found AS "Expected Processes Found on Host", processes_found_count AS "# of Expected Processes Found on Host", count_of_missing_processes AS "Expected Processes Missing"
Rename the fields as shown for better readability.
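
The matching logic of the search can be sketched outside Splunk. The following Python is only an illustration of the regex-and-count steps, not part of the Splunk procedure: the function name count_missing and the sample process names are hypothetical, and it assumes the input is the set of distinct process names reported for one host in one time bucket.

```python
import re


def count_missing(processes, expected_patterns, expected_count):
    """Return (matching process names, number of expected processes missing).

    processes: distinct process names seen on one host in one time bucket.
    expected_patterns: process names or regex fragments to look for.
    expected_count: how many expected processes should be running.
    """
    # Mirrors "(?i)".mvjoin(expected_process_list, "|"): one
    # case-insensitive alternation over all expected patterns.
    regex = re.compile("|".join(expected_patterns), re.IGNORECASE)

    # Mirrors `search process!=\[*`: ignore names that start with "[".
    candidates = {p for p in processes if not p.startswith("[")}

    # Mirrors match(process, regex): an unanchored search, so a pattern
    # also matches longer names that merely contain it.
    found = sorted(p for p in candidates if regex.search(p))

    return found, expected_count - len(found)


if __name__ == "__main__":
    seen = {"sshd", "[kthreadd]", "cron", "bash"}
    found, missing = count_missing(seen, ["sshd", "nginx"], 2)
    print(found, missing)
```

Because the match is unanchored, a short pattern can match several distinct process names, which is one way the missing count can become negative. Anchoring a pattern (for example, ^sshd$) narrows it to an exact name.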

Next steps

The Expected Processes Missing field indicates the total number of processes expected but missing from the most recent host process data. Any positive number indicates that one or more expected processes are missing. Zero indicates that the number of running processes matches what is expected. A negative number indicates that more processes were found than expected.
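
As a small illustration of how to read that value, the three cases can be written as a function. This is a sketch only; the function name describe_missing and the message wording are hypothetical and do not come from the search itself.

```python
def describe_missing(expected_count: int, found_count: int) -> str:
    """Interpret expected_count - found_count the way the field is read."""
    missing = expected_count - found_count
    if missing > 0:
        # Fewer matches than expected: candidate for an alert.
        return f"{missing} expected process(es) missing"
    if missing == 0:
        return "all expected processes found"
    # Negative: more matching processes than the expected count.
    return f"{-missing} more matching process(es) than expected"


if __name__ == "__main__":
    for expected, found in [(3, 1), (2, 2), (1, 3)]:
        print(expected, found, "->", describe_missing(expected, found))
```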

To alert when a host is not running one or more critical processes, use one of the following approaches:

  • Use the SPL from this procedure to configure a Core Splunk alert.
  • Build a new Vital Metric for the Unix/Linux entity type in IT Essentials Work and configure vital metric alerting. Click here for an example SPL search that can be used for the vital metric search. Once the vital metric has been created, configure it to alert when the number of expected processes not running is greater than zero.

Finally, you might be interested in other processes associated with the Maintaining *nix systems use case.

Option 2

  1. Ensure that you have the Splunk OTEL Collector installed on the host you want to monitor.
  2. Update the receivers section of the OTEL agent config file on the host to collect procstat metrics for each process.
    …
    receivers:
    …
      #The following config will collect process metrics for all processes. You can adjust the pattern parameter to filter down to a subset of processes
      smartagent/procstat:
        type: telegraf/procstat
        pattern: ".*"
  3. Update the service.pipelines.metrics.receivers section of the OTEL agent config file to include the procstat receiver.
    …
    service:
      extensions: …
      pipelines:
        traces: 
          …
        metrics:
          receivers: [..., smartagent/procstat]
          …
  4. In Splunk Infrastructure Monitoring, use the following SignalFlow to search the procstat.cpu_usage streaming metric, filter down to the desired hosts and processes, and summarize results by counting the total number of processes found per host.
    A = data('procstat.cpu_usage', filter=filter('host.name', '<name of host to check>') and filter('process_name', '<name of process to check>')).count(by=['host.name']).publish(label='A')
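
Steps 2 and 3 have to agree: the receiver defined under receivers must also be listed in the metrics pipeline, or it will not collect anything. The following Python sketch checks that consistency. It assumes the agent config has already been parsed into a dictionary (for example, by a YAML library); the function name unreferenced_receivers and the sample config values, including the hostmetrics entry, are illustrative assumptions.

```python
def unreferenced_receivers(config: dict, pipeline: str = "metrics") -> list:
    """List receivers that are defined but not used by the given pipeline."""
    defined = set(config.get("receivers", {}))
    referenced = set(
        config.get("service", {})
        .get("pipelines", {})
        .get(pipeline, {})
        .get("receivers", [])
    )
    return sorted(defined - referenced)


if __name__ == "__main__":
    # Hypothetical parsed config: procstat is defined (step 2) but was
    # never added to the metrics pipeline (step 3 was skipped).
    parsed = {
        "receivers": {
            "hostmetrics": {},
            "smartagent/procstat": {"type": "telegraf/procstat", "pattern": ".*"},
        },
        "service": {
            "pipelines": {"metrics": {"receivers": ["hostmetrics"]}},
        },
    }
    print(unreferenced_receivers(parsed))
```

An empty result means every defined receiver is wired into the metrics pipeline.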

Next steps

To alert when no process data is flowing in for the selected host(s) and process(es), use the SignalFlow from this procedure to configure a detector with the "heartbeat" alert condition, set to trigger when no data has been received for 15 minutes.
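
The idea behind a heartbeat condition can be sketched as follows. This is not how Splunk Infrastructure Monitoring implements its detectors; it is a minimal Python illustration that assumes you have the timestamp of the last data point per host. The function name stale_hosts and the host names are hypothetical.

```python
from datetime import datetime, timedelta


def stale_hosts(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return hosts whose last data point is older than max_silence."""
    return sorted(
        host for host, ts in last_seen.items() if now - ts > max_silence
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    last_seen = {
        "web-01": now - timedelta(minutes=2),   # still reporting
        "web-02": now - timedelta(minutes=20),  # silent for too long
    }
    print(stale_hosts(last_seen, now))
```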
