Monitoring VMware components with Infrastructure Monitoring
You work in the IT department for a large software development company that makes heavy use of virtual machines for testing during development.
It is crucial that the virtual machines that rely on the VMware ESXi hypervisor and vSphere remain available so that the software release cycle isn't disrupted. Part of your job is to monitor all VMware infrastructure and respond to any issues that might arise.
You can use Splunk Infrastructure Monitoring to monitor virtual machine and virtual machine host resource usage, watch for key events that might require troubleshooting, and obtain useful inventories of your VMware environment.
How to use Splunk software for this use case
This procedure depends on data that is collected through the Splunk OpenTelemetry Collector. Follow these instructions to install the Splunk OTel Collector on the system where you will connect to your VMware vSphere instances. Additional OpenTelemetry agent configurations are required to collect VMware metrics:
- Configure the receiver in the agent_config.yaml file:
    receivers:
      smartagent/vsphere:
        type: vsphere
        host: "vcenterexample.local"
        username: "administrator"
        password: "mypassword"
        insecureSkipVerify: true
        extraGroups:
          - cpu
          - mem
- Configure the service pipeline in the agent_config.yaml file for Smart Agent or vSphere:
    service:
      ...
        metrics:
          receivers: [hostmetrics, otlp, signalfx, smartagent/signalfx-forwarder, smartagent/vsphere]
- Restart the agent after configuration changes:
systemctl restart splunk-otel-collector
Identify ESXi hosts with high CPU sum ready
CPU Sum Ready indicates that a virtual machine needs access to CPU resources to continue processing, but the underlying host has no remaining CPU resources to allocate. This metric can be calculated as summation or percentage.
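The summation and percentage forms are related through the length of the sampling interval. The following is a minimal sketch of that conversion, assuming the commonly cited formula (ready time divided by interval length) and a 20-second real-time sampling interval. Neither the interval nor the function name `cpu_ready_percent` comes from this article, so treat both as assumptions to verify for your environment.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (milliseconds) to a percentage.

    Assumption (not from this article): interval_s defaults to a
    20-second real-time sampling interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Ready time as a share of the total time in the sampling interval.
    return ready_ms / (interval_s * 1000.0) * 100.0


# 1,000 ms of ready time within a 20-second interval.
print(cpu_ready_percent(1000))
```

With these assumptions, 1,000 ms of ready time corresponds to 5 percent of the interval.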
When many virtual machines on an ESXi host have high sum ready metrics, the host might be experiencing CPU pressure. You want to monitor your environment for this type of problem so you can take mitigating action.
This specific procedure requires additional metrics to be collected for the CPU group.
In Splunk Infrastructure Monitoring, use the following SignalFlow to search the vsphere.cpu_ready_ms streaming metric and calculate the mean by ESXi host.
A = data('vsphere.cpu_ready_ms').mean(by=['esx_ip']).publish(label='A')
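To make the aggregation concrete, the sketch below mimics in plain Python what grouping by the esx_ip dimension and taking the mean does to a set of per-VM samples. It is an illustration only, not SignalFlow: the function name `mean_by_host` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_host(samples):
    """Average sample values per ESXi host, keyed by the esx_ip dimension."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["esx_ip"]].append(sample["value"])
    return {host: mean(values) for host, values in grouped.items()}


# Hypothetical per-VM CPU ready samples (milliseconds).
samples = [
    {"esx_ip": "10.0.0.1", "vm": "vm-a", "value": 400.0},
    {"esx_ip": "10.0.0.1", "vm": "vm-b", "value": 600.0},
    {"esx_ip": "10.0.0.2", "vm": "vm-c", "value": 50.0},
]
print(mean_by_host(samples))
```

Each host ends up with a single averaged series, which is what makes a host with many high-ready virtual machines stand out from its peers.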
To alert when CPU Sum Ready suddenly changes on an ESXi host, you can use SignalFlow to configure a detector with the following configurations:
- Alert Condition: Sudden change
- Alert Settings:
- Alert when: Too high
- Trigger sensitivity: Medium
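The idea behind a "sudden change, too high" condition can be sketched as comparing recent values against a baseline built from the preceding window. The example below is a simplified illustration under stated assumptions, not the actual detector logic in Splunk Infrastructure Monitoring: the function name, the mean-plus-standard-deviations baseline, and the `num_stddevs` parameter (a stand-in for trigger sensitivity) are all hypothetical.

```python
from statistics import mean, stdev


def sudden_change_too_high(history, current, num_stddevs=3.0):
    """Return True when the recent mean exceeds the historical baseline.

    Illustrative baseline: mean(history) + num_stddevs * stdev(history).
    """
    if len(history) < 2 or not current:
        raise ValueError("need at least two historical points and one current point")
    threshold = mean(history) + num_stddevs * stdev(history)
    return mean(current) > threshold


# Hypothetical CPU ready values (milliseconds) for one ESXi host.
history = [100, 110, 95, 105, 90, 100]
print(sudden_change_too_high(history, [102, 98]))
print(sudden_change_too_high(history, [400, 450]))
```

In this sketch, a lower `num_stddevs` makes the check fire more readily, which is roughly the trade-off a sensitivity setting controls.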
Use a pre-built Terraform template from GitHub to automatically build this detector in your environment. To build the detector, issue the following commands in the folder containing the Terraform script and enter the appropriate Terraform variable values from your environment when prompted:
terraform init
terraform apply
Identify ESXi hosts with sustained high swapping
When an ESXi host can't reclaim necessary memory through ballooning, the host begins to swap memory to disk. Memory swapping on the host is a strong indication that the host is overprovisioned and experiencing significant memory pressure. The latency introduced by the swapping has a noticeable performance impact on the virtual machines running on the host. You want to monitor and investigate hosts with high memory swapping.
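To separate sustained swapping from a brief spike, one option is to require the swap rate to stay above a threshold for several consecutive samples. The following is a minimal sketch of that check. The function name, the threshold, and the sample values are hypothetical and not taken from this article.

```python
def sustained_above(values, threshold, min_consecutive):
    """Return True if values exceed threshold for min_consecutive samples in a row."""
    run = 0
    for value in values:
        # Extend the current run on a breach; reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical host swap rates; the unit and threshold are assumptions.
brief_spike = [0, 120, 0, 130, 0]
sustained = [0, 0, 120, 130, 0, 140, 150, 160]
print(sustained_above(brief_spike, threshold=100, min_consecutive=3))
print(sustained_above(sustained, threshold=100, min_consecutive=3))
```

The first series never breaches the threshold three times in a row, so only the second series is flagged.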
This specific procedure requires additional metrics to be collected for the memory group.
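In Splunk Infrastructure Monitoring, use the following SignalFlow to search a memory swap streaming metric and calculate the mean by ESXi host. The metric name shown here, vsphere.mem_swapinrate_kbs, is an assumption; confirm the swap metric names reported in your environment before using it.
A = data('vsphere.mem_swapinrate_kbs').mean(by=['esx_ip']).publish(label='A')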
To alert when swap rate suddenly changes on an ESXi host, you can use the SignalFlow from this procedure to configure a detector with the following configurations:
- Alert Condition: Sudden change
- Alert Settings:
- Alert when: Too high
- Trigger sensitivity: Medium
Use a pre-built Terraform template from GitHub to automatically build this detector in your environment. To build the detector, issue the following commands in the folder containing the Terraform script and enter the appropriate Terraform variable values from your environment when prompted:
terraform init
terraform apply
Next steps
To maximize their benefit, the procedures in this article likely need to tie into existing processes at your organization or become new standard processes. These processes commonly impact success with this use case:
- Capacity planning, compute hardware, storage, and network
- Native monitoring tools integrated with Splunk software for cross-domain visibility
- Tooling for software provisioning and configuration management
- Backups, security, and compliance
Measuring impact and benefit is critical to assessing the value of IT operations. The following are example metrics that can be useful to monitor when implementing this use case:
- Mean time to problem resolution
- Mean time to root cause analysis
- Reduction in system degradation, such as underperformance or unplanned downtime
This use case is also included in the IT Essentials Learn app, which provides more information about how to implement the use case successfully in your IT maturity journey. In addition, these Splunk resources might help you understand and implement this use case: