Splunk Lantern

Failed Windows updates

 

Windows Update is the service Microsoft provides to automatically download and install security patches and system fixes, but it’s not without occasional malfunctions of its own. You want to identify failed updates faster so you can take the necessary actions to resolve them.

Data required  

Microsoft: Windows update logs

Procedure 

  1. In Splunk Enterprise or Splunk Cloud Platform, verify that you deployed the Splunk Add-on for Microsoft Windows to your search heads, indexers, and the universal forwarders on the monitored systems. For more information, see About installing Splunk add-ons.
  2. Verify that data is being ingested with the WindowsUpdateLog source type.
  3. Run the following search. You can optimize it by specifying an index and adjusting the time range.
eventtype="*Update_Failed*" package=* 
|dedup host package 
|stats count, max(_time) AS latest_failure_time BY host,package 
|sort - latest_failure_time 
|convert ctime(latest_failure_time) 
|eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")" 
|stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host
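
As a sketch of the optimization mentioned in step 3, the search below adds an index and a time range to the first line. The index name windows and the 24-hour window are assumptions for illustration only. Substitute the index that holds your Windows update data and a window that matches your patch cycle.

```spl
index=windows earliest=-24h latest=now eventtype="*Update_Failed*" package=*
| dedup host package
| stats count, max(_time) AS latest_failure_time BY host, package
| sort - latest_failure_time
| convert ctime(latest_failure_time)
| eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")"
| stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host
```

Restricting the index and time range in the first line lets the Splunk platform discard non-matching events before the later commands run.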

Search explanation

The table provides an explanation of what each part of this search achieves. You can adjust this query based on the specifics of your environment.

Splunk Search Explanation

eventtype="*Update_Failed*" package=*

Search for failed update events.

If you did not install the Splunk App for Windows Infrastructure, replace this search string with the following: ((source="*:System" "Installation Failure") OR (sourcetype="WindowsUpdateLog" Failure "Content Install" "Installation Failure"))

|dedup host package

Remove duplicate combinations of host and package.

|stats count, max(_time) AS latest_failure_time BY host,package

Return the count and latest time of failed updates by host and package. Because the preceding dedup keeps one event per host and package, count is 1 for every pair.

|sort - latest_failure_time

Sort the results with the most recent failure time first.

|convert ctime(latest_failure_time)

Convert epoch time to a calendar format.

|eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")"

kb_details evaluates to a string that concatenates the prefix KB and the package value, followed by (Total Fails=count) and (Last Failure at:calendar date). This string is displayed in the rightmost column of the results.

|stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host

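Sum the failure counts by host to populate the middle column (total_fails), and collect the kb_details strings for each host into the rightmost column (latest_fail_details).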
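
If you would rather have Total Fails count every failure event instead of one per host and package, one possible variant (a sketch, not part of the original procedure) drops the dedup command so that stats counts all matching events:

```spl
eventtype="*Update_Failed*" package=*
| stats count, max(_time) AS latest_failure_time BY host, package
| sort - latest_failure_time
| convert ctime(latest_failure_time)
| eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")"
| stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host
```

With this variant, total_fails is the total number of failure events per host, so a package that fails repeatedly raises the number.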

Next steps

This procedure can decrease the mean time to identify (MTTI) and mean time to resolve (MTTR) failed updates. Hosts with failed updates can be restarted as soon as this search reports them, rather than after the patching tools report the failure. This is possible because the Splunk platform captures the failure information as soon as the source system logs it, and it can alert you.

The search produces a helpful table that shows the failure details grouped by host. It may be useful to add a descending sort on the total_fails field so you can see which host has the most failures and use that information to prioritize your response.

host             total_fails  latest_fail_details
coredev-002      1            KB3175024 (Total Fails=1) (Last Failure at:09/14/2020 18:11:45.086)
dc-san-01        1            KB2267602 (Total Fails=1) (Last Failure at:09/14/2020 19:14:45.086)
exch-cas-san-01  3            KB3012702 (Total Fails=1) (Last Failure at:09/14/2020 17:40:45.086)
                              KB3021674 (Total Fails=1) (Last Failure at:09/14/2020 19:31:45.086)
                              KB3041857 (Total Fails=1) (Last Failure at:09/14/2020 20:08:45.086)
iis-srv-167      1            KB2868626 (Total Fails=1) (Last Failure at:09/14/2020 20:31:45.086)
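
A minimal sketch of the descending sort suggested above: append one more command to the end of the search. total_fails is the field produced by the final stats command.

```spl
eventtype="*Update_Failed*" package=*
| dedup host package
| stats count, max(_time) AS latest_failure_time BY host, package
| sort - latest_failure_time
| convert ctime(latest_failure_time)
| eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")"
| stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host
| sort - total_fails
```

With the sample data, exch-cas-san-01 would move to the top of the table because it has three failures.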

A recommended next step is to save this search as a scheduled report and have it run every few minutes during the maintenance window when the updates are being applied. That allows you to see the status of updates on a continuous basis instead of waiting for the patch system tools to report. You can then investigate a failed update soon after it occurs, possibly leaving time to remedy the problem and restart the update. This near-real-time functionality is what could reduce MTTR.

Another next step is to enrich the results with priority information for the hosts. In the sample data above, the host dc-san-01 might be high priority because many production systems depend on it. Update failures on that host would be considered more severe than the same failures on a print server. The enrichment information could come from an asset list that is a simple key-value pair with host name as the key and priority as the value. Save this list as a lookup file in the Splunk platform and use it to filter on priority and take action first on high-priority assets.
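
The enrichment described above could look like the following sketch. It assumes a lookup named asset_priority (a hypothetical name) with the columns host and priority, where high-priority hosts carry the value high. Adjust the names and values to match your own asset list.

```spl
eventtype="*Update_Failed*" package=*
| dedup host package
| stats count, max(_time) AS latest_failure_time BY host, package
| sort - latest_failure_time
| convert ctime(latest_failure_time)
| eval kb_details="KB".package." (Total Fails=".tostring(count).") (Last Failure at:".latest_failure_time.")"
| stats sum(count) AS total_fails, values(kb_details) AS latest_fail_details BY host
| lookup asset_priority host OUTPUT priority
| where priority="high"
| sort - total_fails
```

Hosts that are missing from the lookup get no priority value and are dropped by the where command, so keep the asset list complete or relax the filter if you also want to see unclassified hosts.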

Finally, you might be interested in other processes associated with the Maintaining Microsoft Windows systems use case.