Splunk Lantern

Windows availability problems

 

Windows uptime is extremely important to everyone at your organization. When basic Windows resources aren't functioning, productivity declines dramatically. You need to be able to quickly identify systems with availability issues due to unexpected shutdowns, application crashes, and hangs.

Data required 

Microsoft: Windows event logs

Procedure

  1. Verify that you deployed the Splunk Add-on for Microsoft Windows to your search heads, indexers, and the Splunk Universal Forwarders on the monitored systems. For more information, see About installing Splunk add-ons.
  2. Verify that data is being ingested with the WinEventLog sourcetype.
  3. Run the following search. You can optimize it by specifying an index and adjusting the time range.
source=WinEventLog* "EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002" Type=Error
|rex field=Message "(?m)(?<cause>.*)$" 
|rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
|stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode
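
As a minimal sketch of the optimization mentioned in step 3, the following variant restricts the search to one index and to the last 24 hours. The index name example_windows_index is a placeholder assumed for illustration, not a name from your environment; substitute the index that holds your Windows event logs. The parentheses around the EventCode terms only make the grouping explicit.

```spl
index=example_windows_index earliest=-24h source=WinEventLog* ("EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002") Type=Error
|rex field=Message "(?m)(?<cause>.*)$"
|rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
|stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode
```

You can also set the time range with the time range picker instead of the earliest modifier.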

Search explanation

The table provides an explanation of what each part of this search achieves. You can adjust this query based on the specifics of your environment.

Splunk Search Explanation

source=WinEventLog* 

Search only Windows event logs.

"EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002"

Search for unexpected shutdowns and application hangs or crashes.

Type=Error

Search for error events. If no results are found, this might need to be omitted from the search. 

|rex field=Message "(?m)(?<cause>.*)$" 

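Extract the first line of the Message field into a new field named “cause”, so that only that line shows for better readability.

|rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"

Replace the timestamp in the cause text with the placeholder "<time of event>", so that messages that differ only by their timestamp collapse into a single cause value.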

|stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode

Count the number of availability errors and group them by host and event code.
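
To see what the two rex commands do in isolation, the following sketch builds a single synthetic event with makeresults instead of searching real data. The sample Message text is made up for illustration and only approximates the wording of an unexpected-shutdown event.

```spl
| makeresults
| eval Message="The previous system shutdown at 10:15:30 AM on 1/1/2020 was unexpected." . urldecode("%0A") . "Additional detail on a second line."
| rex field=Message "(?m)(?<cause>.*)$"
| rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
| table Message cause
```

Under these assumptions, the cause field should hold only the first line of Message, with the timestamp replaced by the placeholder text.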

Next steps

The following table shows sample results. You see the host, the EventCode, the total_availability_issues count, and the cause values, which are descriptive text pulled out of the long message field in the original event. 

| host | EventCode | total_availability_issues | cause |
|---|---|---|---|
| busdev-001 | 1001 | 80 | Detection of product '20130613', feature 'SetReceiver' failed during request for component '{4E76FF7E-AEBA-4C87-B788-CD47E5425B9D}' |
| | | | Detection of product 'League of Legends.exe', feature 'SetReceiver' failed during request for component '{F3B1321E-2472-4211-8735-E1239BE41D9F}' |
| | | | Detection of product 'webex.exe', feature 'SetReceiver' failed during request for component '{17BC5B75-6692-40E6-A347-849F595BC802}' |
| | | | Event Name: AVSubmit |
| | | | Event Name: WindowsWcpOtherFailure3 |
| | | | Fault bucket -734962412 |
| | | | Fault bucket 91467906712 |
| | | | Not Available |
| coredev-002 | 1001 | 56 | Detection of product 'spytech-spyagent.exe', feature 'SetReceiver' failed during request for component '{FD33EC178-D1B1-3396-99ED-G0BE1B0AA521}' |
| | | | Fault bucket 124914201808 |
| | | | Fault bucket 125796201882 |
| | | | Fault bucket 125822201825 |
| | | | Fault bucket 128886201823 |
| | | | Not Available |
| dc-cup-01 | 1002 | 3 | The DFS Replication service is starting. |

As you can see, the cause field is rich in information. The Event_Name field in the event captures similar detail, but it is less informative than the cause field. The cause field is a result of the inline rex commands in the SPL for the search.

A good next step is to append the following to the search so that the hosts with the most issues appear at the top of the list:

|sort - total_availability_issues

Enriching the event with asset priority information from a lookup would also be a valuable next step in prioritizing mitigation efforts.  
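
A minimal sketch of that enrichment, combined with the sort, might look like the following. The lookup name example_asset_priority and its output field priority are hypothetical; the sketch assumes a lookup keyed on host already exists in your environment, so replace both names with your own.

```spl
source=WinEventLog* "EventCode=1076" OR "EventCode=6008" OR "EventCode=1001" OR "EventCode=1002" Type=Error
|rex field=Message "(?m)(?<cause>.*)$"
|rex field=cause mode=sed "s/(at \d{1,2}:\d{1,2}:\d{1,2}.+was)/at <time of event> was/g"
|stats count(EventCode) AS total_availability_issues values(cause) AS cause BY host, EventCode
|lookup example_asset_priority host OUTPUT priority
|sort - total_availability_issues
```

Running the lookup after stats means it is applied once per result row rather than once per raw event.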

Finally, you might be interested in other processes associated with the Maintaining Microsoft Windows systems use case.