Splunk Lantern

ESXi hosts with sustained high swapping

You might need to monitor ESXi hosts for sustained high swapping when doing the following:

Prerequisites 

In order to execute this procedure in your environment, the following data, services, or apps are required:

Example

When an ESXi host can't reclaim enough memory through ballooning, it begins to swap memory to disk. Swapping on the host strongly indicates that the host is overprovisioned and under significant memory pressure, and the latency it introduces noticeably degrades the performance of the virtual machines running on that host. You want to monitor and investigate hosts with high memory swapping.

Option 1

To optimize the search shown below, you should specify an index and a time range.

  1. In Splunk Enterprise or Splunk Cloud Platform, ensure that you have installed the IT Essentials Work app to onboard VMware data and provide the various VMware entity type configurations and dashboards.
  2. Ensure that you are collecting VMware data through one or more Data Collection Nodes, which are essentially Splunk heavy forwarders with specific VMware collection configurations.
  3. Run the following search: 
| mstats max(vsphere.esxihost.mem.llSwapUsed) AS vsphere.esxihost.mem.llSwapUsed WHERE (index=vmware-perf-metrics) BY name moid
| stats max(vsphere.esxihost.mem.llSwapUsed) AS max_p_average_mem_llSwapUsed_kiloBytes BY name moid
| eval is_high_swapping = if(max_p_average_mem_llSwapUsed_kiloBytes > 5000, 1, 0)
| eventstats mean(max_p_average_mem_llSwapUsed_kiloBytes) AS mean_host_population stdev(max_p_average_mem_llSwapUsed_kiloBytes) AS stdev_host_population
| eval stdev_from_host_population = (max_p_average_mem_llSwapUsed_kiloBytes - mean_host_population) / stdev_host_population
| sort - stdev_from_host_population
| table name moid max_p_average_mem_llSwapUsed_kiloBytes is_high_swapping stdev_from_host_population mean_host_population stdev_host_population

Search explanation

The table provides an explanation of what each part of this search achieves. You can adjust this query based on the specifics of your environment.

Splunk Search Explanation
| mstats max(vsphere.esxihost.mem.llSwapUsed) AS vsphere.esxihost.mem.llSwapUsed WHERE (index=vmware-perf-metrics) BY name moid
| stats max(vsphere.esxihost.mem.llSwapUsed) AS max_p_average_mem_llSwapUsed_kiloBytes BY name moid

Find the maximum value of llSwapUsed, which is the amount of space used for caching swapped pages in the host cache, in kilobytes, for each host name and managed object ID (MOID).

| eval is_high_swapping = if(max_p_average_mem_llSwapUsed_kiloBytes > 5000, 1, 0)

Create the is_high_swapping field and set it to 1 for hosts where more than 5,000 kilobytes of space are used, or to 0 otherwise.

| eventstats mean(max_p_average_mem_llSwapUsed_kiloBytes) AS mean_host_population stdev(max_p_average_mem_llSwapUsed_kiloBytes) AS stdev_host_population

Calculate the mean and standard deviation of that value across all hosts in the results.

| eval stdev_from_host_population = (max_p_average_mem_llSwapUsed_kiloBytes - mean_host_population) / stdev_host_population

Calculate how many standard deviations each host is from the mean amount of space used across all hosts, and put the result in a field called stdev_from_host_population.

| sort - stdev_from_host_population

Sort the results so that the hosts furthest above the mean appear first.

| table name moid max_p_average_mem_llSwapUsed_kiloBytes is_high_swapping stdev_from_host_population mean_host_population stdev_host_population

Display the results in a table with columns in the order shown.

Result

Sample results for this search are shown in the table below. None of the hosts has crossed the swapping threshold set in the search. The table shows how much memory each host has swapped, in KB, so you can see which hosts are under more memory pressure than others. From that information, you can decide which hosts to move load away from, and which hosts to move it to, to better balance the load.

moid     Avg Mem swapped (KB)  Swapping  Mean Host Pop.  Stdev of Host Pop.  Stdev Across Host Pop.
host-26  1000                  No        325             471.699             1.431
host-11  300                   No        325             471.699             -0.053
host-10  0                     No        325             471.699             -0.689
host-20  0                     No        325             471.699             -0.689
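
To check the arithmetic behind these sample results, you can reproduce the threshold flag and the standard deviation columns outside of Splunk. The following Python sketch is only an illustration, not part of the procedure. It assumes the four sample values from the table and uses the sample standard deviation, which is what the SPL stdev function computes. The variable names are chosen for this example.

```python
from statistics import mean, stdev

# Swapped memory (KB) per host, copied from the sample results table.
max_swap_kb = {"host-26": 1000, "host-11": 300, "host-10": 0, "host-20": 0}

THRESHOLD_KB = 5000  # same cutoff as the is_high_swapping eval in the search

values = list(max_swap_kb.values())
population_mean = mean(values)    # corresponds to mean_host_population
population_stdev = stdev(values)  # sample standard deviation, corresponds to stdev_host_population

rows = []
for moid, swap_kb in max_swap_kb.items():
    rows.append({
        "moid": moid,
        "max_swap_kb": swap_kb,
        "is_high_swapping": 1 if swap_kb > THRESHOLD_KB else 0,
        # Number of standard deviations this host is from the mean.
        "stdev_from_host_population": (swap_kb - population_mean) / population_stdev,
    })

# Equivalent of "| sort - stdev_from_host_population": largest value first.
rows.sort(key=lambda row: row["stdev_from_host_population"], reverse=True)

for row in rows:
    print(
        f'{row["moid"]:8} {row["max_swap_kb"]:6} {row["is_high_swapping"]} '
        f'{row["stdev_from_host_population"]:7.3f}'
    )
```

Because host-26 is about 1.4 standard deviations above the mean while still far below the 5,000 KB threshold, it is the first host to look at if swapping continues to grow.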

Option 2

  1. Ensure that you have the Splunk OTEL Collector installed on the host you want to monitor. After installation, complete the following additional steps:
    1. Configure the receiver in agent_config.yaml. Note the extra groups.
      receivers:
        smartagent/vsphere:
          type: vsphere
          host: "vcenterexample.local"
          username: "administrator"
          password: "mypassword"
          insecureSkipVerify: true
          extraGroups:
               - cpu
               - mem
      
    2. Add smartagent/vsphere to the metrics receivers of the service pipeline in agent_config.yaml.
      service:
        ...
          metrics:
            receivers: [hostmetrics, otlp, signalfx, smartagent/signalfx-forwarder, smartagent/vsphere]
      
    3. Restart the agent after configuration changes:
       systemctl restart splunk-otel-collector
  2. In Splunk Infrastructure Monitoring, use the following SignalFlow to query the vsphere.mem_swapin_rate_kbs and vsphere.mem_swapout_rate_kbs streaming metrics, add them together, and calculate the mean by host.
    A = data('vsphere.mem_swapin_rate_kbs', filter=filter('object_type', 'HostSystem')).publish(label='A', enable=False)
    B = data('vsphere.mem_swapout_rate_kbs', filter=filter('object_type', 'HostSystem')).publish(label='B', enable=False)
    C = (A+B).mean(by=['esx_ip']).publish(label='C')
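
To make the aggregation in stream C concrete, the following Python sketch mimics it for a single point in time. It is a rough illustration only. The IP addresses, rates, and the mean_swap_rate_by_host helper are hypothetical, and SignalFlow itself performs this calculation continuously on streaming data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical time series values at one instant:
# (esx_ip, swap-in rate in KB/s, swap-out rate in KB/s).
series = [
    ("10.0.0.1", 120.0, 80.0),
    ("10.0.0.1", 100.0, 60.0),
    ("10.0.0.2", 0.0, 0.0),
]


def mean_swap_rate_by_host(series):
    """Sum swap-in and swap-out per series, then average per esx_ip."""
    combined = defaultdict(list)
    for esx_ip, swap_in, swap_out in series:
        combined[esx_ip].append(swap_in + swap_out)  # A + B
    # Equivalent in spirit to .mean(by=['esx_ip'])
    return {esx_ip: mean(rates) for esx_ip, rates in combined.items()}


result = mean_swap_rate_by_host(series)
print(result)
```

A host whose combined rate stays at zero is not swapping. A sustained nonzero value is what the detector described in the result section is meant to catch.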
    

Result

To alert when the swap rate suddenly changes on an ESXi host, you can use the SignalFlow from this procedure to configure a detector with the following configurations:

  • Alert condition: Sudden change
  • Alert when: Too high
  • Trigger sensitivity: Medium