Windows availability problems
Windows uptime is extremely important to everyone at your organization. When basic Windows resources aren't functioning, productivity declines dramatically. You need to be able to quickly identify systems with availability issues due to unexpected shutdowns, application crashes, and hangs.
Data required
Microsoft: Windows event logs
Procedure
- In Splunk Enterprise or Splunk Cloud Platform, verify that you deployed the Splunk Add-on for Microsoft Windows to your search heads, indexers, and the universal forwarders on the monitored systems. For more information, see About installing Splunk add-ons.
- Verify that data is being ingested with the `WinEventLog` source type.
- Run the following search. You can optimize it by specifying an index and adjusting the time range.
source=WinEventLog* "EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002" Type=Error |rex field=Message "(?m)(?<cause>.*)$" |rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g" |stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode
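As a minimal sketch of that optimization, the same search can be scoped to one index and a fixed time window. The index name `your_windows_index` is a placeholder, not a value from this procedure, and the 24-hour window is only an example. Substitute the index that holds your Windows event data and a range that suits your environment. The parentheses around the event codes are added for readability.

```spl
index=your_windows_index source=WinEventLog* earliest=-24h latest=now
    ("EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002") Type=Error
| rex field=Message "(?m)(?<cause>.*)$"
| rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
| stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode
```

Restricting the index and time range up front reduces the number of events the search has to read before the `rex` and `stats` commands run.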
Search explanation
The following table explains what each part of this search does. You can adjust the search based on the specifics of your environment.
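Splunk Search | Explanation |
---|---|
`source=WinEventLog*` | Search only Windows event logs. |
`"EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002"` | Search for unexpected shutdowns and application hangs or crashes. |
`Type=Error` | Search for error events. If no results are found, you might need to omit this from the search. |
`\|rex field=Message "(?m)(?<cause>.*)$"` | Copy the first line of the text in the `Message` field of the event into a new field named `cause`, so that only the first line shows for better readability. |
`\|rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"` | Replace the specific time in the `cause` text with the placeholder `<time of event>`, so that otherwise identical causes are grouped as one value. |
`\|stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode` | Count the number of availability errors and group them by host and event code. |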
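To see what the two `rex` steps do without waiting for real events, you can run them against a synthetic event. This is a sketch only: the message text below is made up for illustration and is not taken from your data, and `makeresults` simply generates a single empty event to hold it.

```spl
| makeresults
| eval Message="The previous system shutdown at 10:15:42 AM on 1/2/2024 was unexpected."
| rex field=Message "(?m)(?<cause>.*)$"
| rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
| table Message cause
```

For this sample, the `cause` field should come back with the timestamp replaced by `<time of event>`, while `Message` keeps the original text. Because every such event then produces the same `cause` string, `values(cause)` in the main search lists it once per host and event code instead of once per occurrence.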
Next steps
The search returns a results table with the following columns: the `host`, the `EventCode`, the `total_availability_issues` count, and the `cause` values, which are descriptive text pulled out of the long message field in the original event.
host | EventCode | total_availability_issues | cause |
---|---|---|---|
The `cause` field is rich in information. The event also captures a similar value in the `Event_Name` field, but it's not as informative as what the `cause` field shows. The `cause` field is a result of the inline `rex` commands in the SPL for the search.
A good next step is to append the following to the search so that the hosts with the most issues appear at the top of the list.
|sort - total_availability_issues
Enriching the event with asset priority information from a lookup would also be a valuable next step in prioritizing mitigation efforts.
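One way to sketch that enrichment is with the `lookup` command, appended to the end of the search. Everything named here is an assumption: `your_asset_priority_lookup` is a hypothetical lookup definition that you would need to create, and it is assumed to contain a `host` field that matches the event host names and a numeric `priority` field where a higher number means a more important asset.

```spl
| lookup your_asset_priority_lookup host OUTPUT priority
| sort - priority, - total_availability_issues
```

With a lookup shaped like this, the most important assets with the most availability issues rise to the top of the results. If your lookup stores priority as text, such as high, medium, and low, you would need to map it to a number first for the sort to be meaningful.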
Finally, you might be interested in other processes associated with the Maintaining Microsoft Windows systems use case.