Host availability is a critical aspect of IT operations monitoring. You want to monitor and alert on hosts that have become unavailable either because they have gone down, or have otherwise lost the ability to send data to your Splunk deployment.
In Splunk Enterprise or Splunk Cloud Platform, this procedure can operate on any event data that is consistently received from the host, including data from the Splunk Add-on for Unix and Linux.
- Ensure that you have installed the Splunk Add-on for Unix and Linux on your Splunk search head, indexer, and the Splunk universal forwarders on the monitored systems. An example inputs.conf file can be deployed to the universal forwarder on the *nix host to collect memory utilization data and store the results in a metrics index.
- In Splunk Enterprise or Splunk Cloud Platform, run the following search. You can optimize it by specifying an index and adjusting the time range.
|tstats dc(host) AS val max(_time) AS _time WHERE index="<index to check>" host="<hosts to check>" BY host |append [|metadata type=hosts index="<index to check>" | table host lastTime | rename lastTime AS _time | where _time>now()-(60*60*12) | eval val=0] |stats max(val) AS val max(_time) AS _time by host | where val=0 | rename val AS "Has Data" | eval "Missing Duration"=tostring(now()-_time, "duration") | table host "Has Data" "Missing Duration"
The following table explains what each part of this search does. You can adjust the search to suit your environment.
||tstats dc(host) AS val max(_time) AS _time WHERE index="<index to check>" host="<hosts to check>" BY host||Obtain a list of all hosts from which data has been received within the search time range.|
||append [|metadata type=hosts index="<index to check>" | table host lastTime | rename lastTime AS _time | where _time>now()-(60*60*12) | eval val=0]||Obtain a list of all hosts that have sent data into the environment in the last 12 hours and add the results onto the previous results table.|
||stats max(val) AS val max(_time) AS _time by host||Create a table with a val column where val=1 if the data was seen for the host, and val=0 if not. Include a _time column that contains the timestamp of the most recently seen event for that host, and group by host.|
|| where val=0||Filter the results to only hosts not currently sending data.|
|| rename val AS "Has Data"||Rename the field as shown for better readability.|
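|| eval "Missing Duration"=tostring(now()-_time, "duration")||Calculate how long it has been since the host last sent data, and format the result as a duration string (HH:MM:SS). The field name is quoted because it contains a space.|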
|| table host "Has Data" "Missing Duration"||Display the results in a table with columns in the order shown.|
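As an illustration, here is the same search laid out one command per line with the placeholders filled in. The index name `os` and the wildcard `host="*"` are assumptions made for this sketch, not values from your environment, so substitute your own. It also assumes that you run the search over a short time range, such as the last 15 minutes, so that "currently sending data" means "seen within that window".

```
| tstats dc(host) AS val max(_time) AS _time WHERE index="os" host="*" BY host
| append
    [| metadata type=hosts index="os"
    | table host lastTime
    | rename lastTime AS _time
    | where _time>now()-(60*60*12)
    | eval val=0]
| stats max(val) AS val max(_time) AS _time BY host
| where val=0
| rename val AS "Has Data"
| eval "Missing Duration"=tostring(now()-_time, "duration")
| table host "Has Data" "Missing Duration"
```

A host appears in the results only if it sent data within the last 12 hours but not within the search time range. A host that has been silent for more than 12 hours drops out of the list unless you widen the `60*60*12` lookback.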
Create an alert based on this search so you can proactively manage potential stability issues. To alert when a host is no longer sending data, use one of the following two options:
- Use the SPL from this procedure to configure a Core Splunk alert.
- Build a new Vital Metric in IT Essentials Work for the desired entity type and configure vital metric alerting.
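For the first option, one possible shape for the alert search is sketched below. It assumes a hypothetical index named `os` and a 15-minute detection window, and it sets that window inside the `tstats` clause with `earliest` and `latest` rather than relying on the time range picker. The field name `missing_seconds` is also an assumption, chosen so that a numeric value is available to the alert action or message.

```
| tstats dc(host) AS val max(_time) AS _time WHERE index="os" earliest=-15m latest=now BY host
| append
    [| metadata type=hosts index="os"
    | table host lastTime
    | rename lastTime AS _time
    | where _time>now()-(60*60*12)
    | eval val=0]
| stats max(val) AS val max(_time) AS _time BY host
| where val=0
| eval missing_seconds=now()-_time
| table host missing_seconds
```

Saved as a scheduled alert that triggers when the number of results is greater than zero, a search like this would fire whenever at least one recently active host has been silent for the whole window. Tune the schedule and window to how often your hosts normally report.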
Finally, you might be interested in other processes associated with the Maintaining *nix systems use case.