Maintaining Microsoft Windows systems with Infrastructure Monitoring

Your organization hosts many business-critical applications and services on Microsoft Windows. Because workers and management rely on them, you need to monitor their availability and performance so that they function when needed. To do this, you need to search application and infrastructure logs, which are often spread across disparate sources, for key indicators of failures and potential performance degradation. Because it is easy to get data into Splunk and then search and alert on key indicators, you are motivated to onboard data. After the data is available, you want to develop and save searches that help you monitor efficiently. You can use Splunk software to monitor a large number of Windows system management tasks and events, such as patch management, software deployment, inventory tracking, and remote access availability.

Prerequisites

Splunk Infrastructure Monitoring
Splunk OpenTelemetry Collector

Data required

Microsoft: Windows event and update logs

How to use Splunk software for this use case

For each of the procedures below, ensure that you have the Splunk OpenTelemetry Collector installed on the host you want to monitor.

Windows host stops reporting data

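Use the following SignalFlow to stream the cpu.utilization metric for the hosts you want to check. A host that sends no datapoints for this metric over a fixed period of time is treated as not reporting.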

A = data('cpu.utilization', filter=filter('host.name', '<HOSTS-TO-CHECK>')).publish(label='A')

To alert when a host stops reporting data, use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and Alert Settings of "Hasn't reported For: 5m".
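The heartbeat condition can be pictured as a comparison between each host's most recent datapoint and the current time. The following Python sketch illustrates that idea only. It is not how Splunk Infrastructure Monitoring implements the check, and the function name `hosts_not_reporting` and the sample host names are hypothetical.

```python
from datetime import datetime, timedelta


def hosts_not_reporting(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the hosts whose latest datapoint is older than max_silence.

    last_seen maps a host name to the timestamp of its latest datapoint.
    """
    return sorted(
        host for host, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Hypothetical sample data: one host reported 2 minutes ago,
# the other reported 9 minutes ago.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = {
    "win-app-01": now - timedelta(minutes=2),
    "win-app-02": now - timedelta(minutes=9),
}
silent_hosts = hosts_not_reporting(last_seen, now)
print(silent_hosts)
```

With a 5-minute limit, only the host that has been silent for 9 minutes is returned. Raising `max_silence` makes the check more tolerant of short gaps.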

Expected Windows process not running

Update the receivers section of the OTEL agent config file on the host to collect procstat metrics for each process.

…
receivers:
…
  # The following config collects process metrics for all processes. Adjust the pattern parameter to filter down to a subset of processes.
  smartagent/procstat:
    type: telegraf/procstat
    pattern: ".*"

Update the service.pipelines.metrics.receivers section of the OTEL agent config file to include the procstat receiver.

…
service:
  extensions: …
  pipelines:
    traces: 
      …
    metrics:
      receivers: [..., smartagent/procstat]
      …

Use the following SignalFlow to search the procstat.cpu_usage streaming metric, filter down to the desired hosts and processes, and summarize results by counting the total number of processes found per host.
```
A = data('procstat.cpu_usage', filter=filter('host.name', '<name of host to check>') and filter('process_name', '<name of process to check>')).count(by=['host.name']).publish(label='A')
```

To alert when no process data is flowing in for the selected hosts and processes, use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and alert settings of "Hasn't reported for: 15m".
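When the same check is needed for several host and process pairs, the SignalFlow text can be generated from a template. The following Python sketch assumes a hypothetical helper named `procstat_count_program` and hypothetical host and process names. It only builds the program string shown in this procedure and does not call any Splunk API.

```python
def procstat_count_program(host, process):
    """Build the SignalFlow text that counts matching processes per host.

    The values are inserted as-is, so they must not contain single quotes.
    """
    return (
        "A = data('procstat.cpu_usage', "
        f"filter=filter('host.name', '{host}') "
        f"and filter('process_name', '{process}'))"
        ".count(by=['host.name']).publish(label='A')"
    )


# Hypothetical host and process names used for illustration.
program = procstat_count_program("win-app-01", "w3wp")
print(program)
```

The resulting string can be pasted into the detector's SignalFlow editor in place of the hand-edited version.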

Windows memory utilization nearing capacity

Use the following SignalFlow to search the memory.utilization streaming metric and filter down to the desired hosts.

A = data('memory.utilization', filter=filter('host.name', '<name of host to check>'), rollup='latest').publish(label='A')

To alert when memory utilization is nearing max capacity for the selected hosts, use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

Alert when: Above
Threshold: 95
Trigger sensitivity: Duration
Duration: 5m
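The combination of a static threshold and a duration means that a single spike does not trigger the alert; the value has to stay above the threshold for the whole period. The following Python sketch shows one way to read those settings, assuming samples arrive as (timestamp in seconds, value) pairs. The function name `breaches_for_duration` is hypothetical, and the real detector may treat missing datapoints differently.

```python
def breaches_for_duration(samples, threshold=95.0, duration_s=300):
    """Return True if every sample in the trailing window is above threshold.

    samples is a list of (timestamp_seconds, value) pairs in ascending
    time order.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    # Require data that actually spans the full duration.
    if latest_ts - samples[0][0] < duration_s:
        return False
    window = [value for ts, value in samples if ts >= latest_ts - duration_s]
    return all(value > threshold for value in window)


# Hypothetical one-minute samples covering five minutes.
steady_high = [(0, 96), (60, 97), (120, 98), (180, 96), (240, 99), (300, 97)]
with_dip = [(0, 96), (60, 97), (120, 94), (180, 96), (240, 99), (300, 97)]
print(breaches_for_duration(steady_high))
print(breaches_for_duration(with_dip))
```

In this sketch the first series would alert, while the second would not because one sample dips to 94 inside the window.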

Windows CPU utilization nearing capacity

Use the following SignalFlow to search the cpu.utilization streaming metric and filter down to the desired hosts.

A = data('cpu.utilization', filter=filter('host.name', '<name of host to check>')).publish(label='A')

To alert when CPU utilization is nearing max capacity for the selected host(s), use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

Alert when: Above
Threshold: 95
Trigger sensitivity: Duration
Duration: 5m

Windows disk drive utilization nearing capacity

Use the following SignalFlow to search the disk.utilization streaming metric and filter down to the desired hosts and mountpoints.

A = data('disk.utilization', filter=filter('host', '<name of host to check>') and filter('mountpoint', '<name of disk to check>')).publish(label='A')

To alert when disk utilization is nearing capacity on the specified hosts and mountpoints, use the SignalFlow from this procedure to configure a detector with an alert condition of "Resource Running Out" and alert settings of:

Alert when nearing: Capacity
Capacity: 100
Trigger Sensitivity: Medium
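A "running out" style alert looks at the trend of the metric and not just its current value. The source does not describe how the "Resource Running Out" condition computes that trend, so the following Python sketch is only an illustration of the general idea using a simple average growth rate. The function name `hours_until_capacity` and the sample values are hypothetical.

```python
def hours_until_capacity(samples, capacity=100.0):
    """Estimate the hours until capacity is reached at the average growth rate.

    samples is a list of (hours_elapsed, utilization_percent) pairs in
    ascending time order. Returns None when there is too little data or
    utilization is flat or falling.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # percent per hour
    if rate <= 0:
        return None
    return max(capacity - u1, 0.0) / rate


# Hypothetical disk: 80% full at the start, 86% full 24 hours later.
estimate = hours_until_capacity([(0, 80.0), (24, 86.0)])
print(estimate)
```

A disk growing by 6 percentage points per day with 14 points left would fill in roughly 56 hours under this linear assumption.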

Next steps

To maximize their benefit, the procedures in the previous section likely need to tie into existing processes at your organization or become new standard processes. These processes commonly impact success with this use case:

Active Directory administration, which is closely related to Windows maintenance
The use of cloud services, such as Azure, to cover Windows maintenance requirements
Integration with ticketing systems used for the service desk
The use of any other related applications, such as MS SQL Server, IIS, Exchange, and O365, which can all affect a Windows environment

Measuring impact and benefit is critical to assessing the value of IT operations. The following are example metrics that can be useful to monitor when implementing this use case:

Availability of service: Percentage of agreed service time during which the service was available
Maintainability of service: Mean time to repair (MTTR) and mean time between failures (MTBF)
Additional metrics: Page load times, average response time, and operations per second
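The availability and maintainability metrics above can be computed from a list of outage durations. The following Python sketch assumes common definitions: availability is uptime divided by agreed service time, MTTR is total downtime divided by the number of failures, and MTBF is total uptime divided by the number of failures. Your organization might define these differently. The function name `service_metrics` and the sample figures are hypothetical.

```python
def service_metrics(agreed_hours, outage_hours):
    """Return (availability_percent, mttr_hours, mtbf_hours).

    agreed_hours is the agreed service time for the period.
    outage_hours is a list with the length of each outage in hours.
    MTTR and MTBF are None when there were no failures.
    """
    downtime = sum(outage_hours)
    failures = len(outage_hours)
    uptime = agreed_hours - downtime
    availability = 100.0 * uptime / agreed_hours
    mttr = downtime / failures if failures else None
    mtbf = uptime / failures if failures else None
    return availability, mttr, mtbf


# Hypothetical 30-day month (720 hours) with two outages of 2 and 4 hours.
availability, mttr, mtbf = service_metrics(720, [2.0, 4.0])
print(round(availability, 2), mttr, mtbf)
```

Tracking these values per reporting period makes it easier to see whether the monitoring described in this article is reducing downtime.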

This use case is also included in the IT Essentials Learn app, which provides more information about how to implement the use case successfully in your IT maturity journey. In addition, these Splunk resources might help you understand and implement this use case:

Splunk Help: Monitoring Windows event log data
Splunk Tech Talk: My start will go on: Splunk's TA for Windows Part 1
Splunk Tech Talk: My start will go on: Splunk's TA for Windows Part 2
Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their Success Plan. Engage the ODS team at ondemand@cisco.com if you would like assistance.