Splunk Lantern

Maintaining Microsoft Windows systems with Infrastructure Monitoring

 

Your organization hosts many applications and services on Microsoft Windows that are critical to the business. Because workers and management rely on them, you need to monitor their availability and performance to make sure they function when needed. To do this, you need to search application and infrastructure logs, which are often disparate, for key indicators of failure and potential performance degradation. Because it is easy to get data into Splunk and then search and alert on key indicators, you are motivated to onboard data. After the data is available, you want to develop and save searches that help you monitor efficiently. You can use Splunk software to monitor a large number of Windows system management tasks and events, such as patch management, software deployment, inventory tracking, and remote access availability.

Prerequisites

Technologies:

Data:

How to use Splunk software for this use case

For each of the procedures below, ensure that you have the Splunk OpenTelemetry Collector installed on the host you want to monitor.

Windows host stops reporting data

Use the following SignalFlow to search for hosts not reporting metrics after a fixed period of time.

A = data('cpu.utilization', filter=filter('host.name', '<HOSTS-TO-CHECK>')).publish(label='A')

To alert when a host stops reporting data, use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and Alert Settings of "Hasn't reported For: 5m".
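
To make the "Hasn't reported For: 5m" setting concrete, the following Python sketch illustrates the idea behind a heartbeat check: a host is flagged when its most recent datapoint is older than the allowed silence window. This is only an illustration of the condition, not how the detector is implemented; the function name `hosts_not_reporting` and the host names are hypothetical.

```python
from datetime import datetime, timedelta


def hosts_not_reporting(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return hosts whose latest datapoint is older than max_silence.

    last_seen maps a host name to the timestamp of its latest datapoint.
    """
    return sorted(host for host, ts in last_seen.items() if now - ts > max_silence)


# Hypothetical sample data: one healthy host and one silent host.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = {
    "win-app-01": now - timedelta(minutes=1),
    "win-app-02": now - timedelta(minutes=7),
}
print(hosts_not_reporting(last_seen, now))
```

In this sketch a host that last reported exactly five minutes ago is not flagged, because the comparison is strictly greater than the window.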

Expected Windows process not running

  1. Update the receivers section of the OTEL agent config file on the host to collect procstat metrics for each process.
    …
    receivers:
    …
      # The following config collects process metrics for all processes. Adjust the pattern parameter to filter down to a subset of processes.
      smartagent/procstat:
        type: telegraf/procstat
        pattern: ".*"
  2. Update the service.pipelines.metrics.receivers section of the OTEL agent config file to include the procstat receiver.
    …
    service:
      extensions: …
      pipelines:
        traces: 
          …
        metrics:
          receivers: [..., smartagent/procstat]
          …
  3. Use the following SignalFlow to search the procstat.cpu_usage streaming metric, filter down to the desired hosts and processes, and summarize results by counting the total number of processes found per host.
    A = data('procstat.cpu_usage', filter=filter('host.name', '<name of host to check>') and filter('process_name', '<name of process to check>')).count(by=['host.name']).publish(label='A')

To alert when no process data is flowing in for the selected host(s) and process(es), use the SignalFlow from this procedure to configure a detector with an alert condition of "Heartbeat Check" and Alert Settings of "Hasn't reported For: 15m".
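
If you need the same SignalFlow for several hosts or processes, a small helper can render the program text for you. The Python sketch below is one way to do that, assuming that SignalFlow's filter() accepts more than one value for a dimension; the function name `procstat_program` and the sample host and process names are hypothetical.

```python
def procstat_program(hosts, processes):
    """Render the procstat SignalFlow text for the given hosts and processes."""

    def quoted(values):
        # Wrap each value in single quotes, as in the SignalFlow above.
        return ", ".join("'{}'".format(v) for v in values)

    return (
        "A = data('procstat.cpu_usage', "
        "filter=filter('host.name', {hosts}) and filter('process_name', {procs}))"
        ".count(by=['host.name']).publish(label='A')"
    ).format(hosts=quoted(hosts), procs=quoted(processes))


# Hypothetical host and process names.
print(procstat_program(["win-app-01", "win-app-02"], ["w3wp.exe"]))
```

The helper does not escape quotes inside names, so it is only suitable for simple host and process names.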

Windows memory utilization nearing capacity

Use the following SignalFlow to search the memory.utilization streaming metric and filter down to the desired hosts.

A = data('memory.utilization', filter=filter('host.name', '<name of host to check>'), rollup='latest').publish(label='A')

To alert when memory utilization is nearing max capacity for the selected hosts, use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

  • Alert when: Above
  • Threshold: 95
  • Trigger sensitivity: Duration
  • Duration: 5m
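
The combination of a threshold and a duration means that a brief spike does not fire the alert; the value has to stay above the threshold for the whole window. The following Python sketch illustrates that logic on sample data. It is an approximation for explanation only, not the detector's implementation, and the function name `breaches_threshold` is hypothetical.

```python
from datetime import datetime, timedelta


def breaches_threshold(samples, threshold=95.0, duration=timedelta(minutes=5)):
    """Return True if values stay above threshold for at least `duration`.

    samples is a list of (timestamp, value) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first datapoint of a new breach
            if ts - breach_start >= duration:
                return True
        else:
            breach_start = None  # a dip below the threshold resets the window
    return False


# Hypothetical one-minute datapoints: sustained at 97 percent for five minutes.
start = datetime(2024, 1, 1, 12, 0)
sustained = [(start + timedelta(minutes=i), 97.0) for i in range(6)]
print(breaches_threshold(sustained))
```

The same reasoning applies to any metric that uses the "Static Threshold" condition with a duration-based sensitivity.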

Windows CPU utilization nearing capacity

Use the following SignalFlow to search the cpu.utilization streaming metric and filter down to the desired hosts.

A = data('cpu.utilization', filter=filter('host.name', '<name of host to check>')).publish(label='A')

To alert when CPU utilization is nearing max capacity for the selected host(s), use the SignalFlow from this procedure to configure a detector with an alert condition of "Static Threshold" and alert settings of:

  • Alert when: Above
  • Threshold: 95
  • Trigger sensitivity: Duration
  • Duration: 5m

Windows disk drive utilization nearing capacity

Use the following SignalFlow to search the disk.utilization streaming metric and filter down to the desired hosts and mountpoints.

A = data('disk.utilization', filter=filter('host', '<name of host to check>') and filter('mountpoint', '<name of disk to check>')).publish(label='A')

To alert when disk utilization is nearing capacity on the specified host(s) and mountpoint(s), use the SignalFlow from this procedure to configure a detector with an alert condition of "Resource Running Out" and alert settings of:

  • Alert when nearing: Capacity
  • Capacity: 100
  • Trigger Sensitivity: Medium
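
A "Resource Running Out" condition is about trend rather than the current value: it asks how soon utilization will reach capacity. The Python sketch below shows the general idea with a simple least-squares line over recent samples. The built-in condition does its own forecasting, which this article does not describe, so treat this as a rough illustration; the function name `hours_until_full` is hypothetical.

```python
def hours_until_full(samples, capacity=100.0):
    """Estimate hours until utilization reaches capacity using a fitted line.

    samples is a list of (hours_elapsed, utilization_percent) tuples.
    Returns None when utilization is flat or decreasing, or when there
    are too few samples to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope in percentage points per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    _, latest_u = samples[-1]
    return max(0.0, (capacity - latest_u) / slope)


# Hypothetical samples: disk utilization growing 2 points per hour from 80 percent.
print(hours_until_full([(0, 80.0), (1, 82.0), (2, 84.0)]))
```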

Next steps

To maximize their benefit, the procedures in the previous section likely need to tie into existing processes at your organization or become new standard processes. These processes commonly impact success with this use case:

  • Active Directory administration, which is closely related to Windows maintenance
  • The use of cloud services, such as Azure, to cover Windows maintenance requirements
  • Integration with ticketing systems used for the service desk
  • The use of any other related applications, such as MS SQL Server, IIS, Exchange, and O365, which can all affect a Windows environment

Measuring impact and benefit is critical to assessing the value of IT operations. The following are example metrics that can be useful to monitor when implementing this use case:

  • Availability of service: Percentage of agreed service time during which the service is up rather than down
  • Maintainability of service: Mean time to repair (MTTR) and mean time between failures (MTBF)
  • Additional metrics: Page load times, average response time, and operations per second
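
As a worked example of the first two metrics, the Python sketch below computes availability, MTTR, and MTBF for one reporting period. It assumes common definitions (availability as uptime divided by agreed service time, MTTR as downtime per failure, MTBF as uptime per failure); your organization might define them differently. The function name `service_metrics` and the sample figures are hypothetical.

```python
def service_metrics(agreed_hours, outage_hours):
    """Return (availability_percent, mttr_hours, mtbf_hours) for one period.

    agreed_hours is the agreed service time; outage_hours lists the
    duration of each outage in hours.
    """
    downtime = sum(outage_hours)
    failures = len(outage_hours)
    uptime = agreed_hours - downtime
    availability = 100.0 * uptime / agreed_hours
    mttr = downtime / failures if failures else 0.0
    mtbf = uptime / failures if failures else float("inf")
    return availability, mttr, mtbf


# Hypothetical month: 720 agreed hours with two outages of 2 and 4 hours.
availability, mttr, mtbf = service_metrics(720, [2, 4])
print(round(availability, 2), mttr, mtbf)
```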

This use case is also included in the IT Essentials Learn app, which provides more information about how to implement the use case successfully in your IT maturity journey. In addition, these Splunk resources might help you understand and implement this use case:

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at OnDemand-Inquires@splunk.com if you require assistance.