Network device down
It is crucial to detect and alert on any lost networking host in your environment. By using the presence of syslog data as a “heartbeat” of the host’s presence, you can configure Splunk software to alert when a host that was previously sending data is no longer reporting.
To collect SNMP traps in Splunk, run an snmptrapd server on a Linux or Windows machine to receive the traps and write them to a file. After the traps are written to disk, you can configure a universal forwarder to monitor those files and forward them to Splunk; this configuration is outlined in the Splunk documentation.
- Ensure that you have configured either Splunk Connect for Syslog (SC4S) or Windows performance metrics.
- If you are switching to Splunk software from another vendor, front SC4S with the same IP address that your previous software used to collect syslog traffic. Doing so avoids having to reconfigure every network device and firewall rule to send syslog traffic to a new syslog receiver.
- Run the following search. You can optimize it by specifying an index and adjusting the time range.
index IN (*) sourcetype IN (*) sc4s_vendor_product=* | stats sparkline(count, 1m) AS trend count min(_time) AS earliest_time max(_time) AS latest_time BY host | eval minutes_since_last_reported_event = round((now()-latest_time) / 60, 0) | eval alive = if(minutes_since_last_reported_event < 11, "Yes", "No") | convert ctime(earliest_time) AS earliest_event_last60m ctime(latest_time) AS latest_event_last60m | table host alive minutes_since_last_reported_event trend count | where alive="No"
The following table explains what each part of this search does. You can adjust the query to match the specifics of your environment.
|index IN (*) sourcetype IN (*) sc4s_vendor_product=*||Search the syslog data in the Splunk Connect for Syslog app. If you are not using the Splunk Connect for Syslog app, you can use Windows performance metrics instead. Replace this first line of the search with the following: | mstats count WHERE metric_name="Process.Elapsed_Time" AND index IN (*) host IN (*) instance IN ("listCriticalProcessesHere") BY host instance span=1m|
|| stats sparkline(count, 1m) AS trend count min(_time) AS earliest_time max(_time) AS latest_time BY host||Add a sparkline showing activity for each host.|
|| eval minutes_since_last_reported_event = round((now()-latest_time) / 60, 0)||Calculate the time, in minutes, since the host last reported activity within the search time range (the last 60 minutes in this example).|
|| eval alive = if(minutes_since_last_reported_event < 11, "Yes", "No")||Set the field named alive to "Yes" if the host reported activity within the last 10 minutes, and to "No" otherwise.|
|| convert ctime(earliest_time) AS earliest_event_last60m ctime(latest_time) AS latest_event_last60m||Convert the first and last times the event was seen into an easily readable format.|
|| table host alive minutes_since_last_reported_event trend count||Display the results in a table with columns in the order shown.|
|| where alive="No"||Filter the results to show only the hosts that have not reported activity in the last 10 minutes.|
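As one example of adjusting the query, devices that log infrequently might need a longer silence threshold than 10 minutes. The following is a minimal sketch, not a recommended setting: the 30-minute threshold, the 5-minute sparkline buckets, and the inline `earliest=-60m` time range are assumed values chosen for illustration only.

```spl
index IN (*) sourcetype IN (*) sc4s_vendor_product=* earliest=-60m
| stats sparkline(count, 5m) AS trend count max(_time) AS latest_time BY host
| eval minutes_since_last_reported_event = round((now() - latest_time) / 60, 0)
| eval alive = if(minutes_since_last_reported_event < 31, "Yes", "No")
| table host alive minutes_since_last_reported_event trend count
| where alive="No"
```

Keep the search time range longer than the threshold. A host that sent nothing at all during the time range produces no row, so it cannot be flagged by this search.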
To further restrict your search, include only the source types associated with your networking devices. When you find devices that are not reporting, you can take appropriate steps to get them running again and, later, determine the cause of the shutdown to reduce the likelihood of losing visibility in the future.
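A restricted first line of the search might look like the following sketch. The index names `netops` and `netfw` and the source types `cisco:ios` and `cisco:asa` are assumptions used for illustration; substitute the indexes and source types that your own networking devices write to.

```spl
index IN (netops, netfw) sourcetype IN ("cisco:ios", "cisco:asa") sc4s_vendor_product=*
```

The rest of the search, from the `stats` command onward, stays unchanged. Naming the indexes explicitly instead of using a wildcard also reduces the amount of data the search has to scan.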
You might be interested in other processes associated with the Recovering lost visibility of IT infrastructure use case.