Scenario: In your organization, you have many *nix systems running critical applications or services. To ensure the health of those apps and services, you need to monitor each system's basic configuration, system diagnostics, file systems, and packages. You need to log and watch all these components, and ensure that the appropriate technical staff are notified as quickly as possible if problems arise. With all these different concerns, you need Splunk searches that you can save and easily run on a schedule or as needed to keep your users up and running.
How Splunk software can help
You can use Splunk software to manage patches and updates to ensure all connected systems and related processes are running after the patch or update is complete. You can also use Splunk software for a number of other maintenance tasks, such as watching out for connectivity issues.
What you need
To succeed in implementing this use case, you need the following dependencies, resources, and information.
The best person to implement this use case is a system administrator who is familiar with Linux or UNIX and its variants. This person might come from your team, a Splunk partner, or Splunk OnDemand Services.
Getting the data into Splunk to manage *nix systems can take up to a few hours.
The following technologies, data, and integrations are useful in successfully implementing this use case:
- Splunk Enterprise or Splunk Cloud
- Data sources onboarded:
  - Linux log data (/var/log/messages and similar)
  - Command line output (df, ps, iostat, etc.) via scripted inputs
- Splunk Add-on for Unix and Linux
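To illustrate what "command line output via scripted inputs" can look like, here is a minimal Python sketch of a script that runs df and prints one key="value" event per file system, a shape that Splunk software can typically extract fields from at search time. It is an illustration under stated assumptions, not the script that the Splunk Add-on for Unix and Linux ships; the function names (parse_df, format_event) and the field names are hypothetical, and it assumes a df that supports the POSIX -P option.

```python
import subprocess
import time


def parse_df(output):
    """Parse `df -Pk` output into a list of dicts, one per file system."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 5)  # the mount point (last field) may contain spaces
        if len(parts) != 6:
            continue  # ignore lines that do not match the six-column layout
        filesystem, blocks, used, available, capacity, mount = parts
        try:
            rows.append({
                "filesystem": filesystem,  # hypothetical field names
                "blocks_1k": int(blocks),
                "used_1k": int(used),
                "avail_1k": int(available),
                "use_pct": int(capacity.rstrip("%")),
                "mount": mount,
            })
        except ValueError:
            continue  # skip rows with non-numeric columns, such as a "-" capacity
    return rows


def format_event(row, timestamp):
    """Render one row as a single line of key="value" pairs."""
    fields = " ".join(f'{key}="{value}"' for key, value in row.items())
    return f"{timestamp} {fields}"


def main():
    # -P requests the portable one-line-per-file-system layout; -k uses 1024-byte blocks.
    result = subprocess.run(["df", "-Pk"], capture_output=True, text=True, check=True)
    now = time.strftime("%Y-%m-%dT%H:%M:%S%z")
    for row in parse_df(result.stdout):
        print(format_event(row, now))


if __name__ == "__main__":
    main()
```

A scripted input runs a script like this on an interval and indexes whatever it writes to standard output, so keeping each event on one line with a leading timestamp keeps event breaking simple.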
How to use Splunk software for this use case
You can run many searches with Splunk software to manage *nix systems. Depending on what information you have available, you might find it useful to identify some or all of the following:
- *Nix hosts with NFS connectivity issues
- Filesystem mounts after *nix patching event
- Processes running after *nix patching event
- Package installations and upgrades on a *nix server
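The mount and process checks in this list both come down to comparing a snapshot taken before the patching event with one taken after it. The sketch below shows that comparison in plain Python rather than as a Splunk search; the function name compare_snapshots and the sample mount points are hypothetical and chosen only for illustration.

```python
def compare_snapshots(before, after):
    """Return items that disappeared and items that newly appeared between two snapshots."""
    before_set, after_set = set(before), set(after)
    return {
        "missing": sorted(before_set - after_set),  # present before the patch, gone after
        "new": sorted(after_set - before_set),      # absent before the patch, present after
    }


# Hypothetical mount points captured before and after a patching event.
mounts_before = ["/", "/home", "/mnt/nfs_share"]
mounts_after = ["/", "/home"]

print(compare_snapshots(mounts_before, mounts_after))
# {'missing': ['/mnt/nfs_share'], 'new': []}
```

The same comparison applies to process names: anything listed under "missing" is a candidate for investigation after the patch completes.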
Other steps you can take
To maximize their benefit, the how-to articles linked in the previous section likely need to tie into existing processes at your organization or become new standard processes. These processes commonly impact success with this use case:
- Running regular backups
- Maintaining tooling for software provisioning
- Maintaining tooling for configuration management
- Site reliability engineering processes
These additional Splunk resources might help you understand and implement this use case:
- Blog: SAI Something Linux: Monitoring Linux with Splunk App for Infrastructure
- Conf Talk: Designing and Deploying a Splunk Hardware Monitoring Service at Workday
- Conf Talk: Getting the Most Out of Logs for IT Monitoring and Troubleshooting
Measuring impact and benefit is critical to assessing the value of IT operations. The following are example metrics that can be useful to monitor when implementing this use case:
- Mean time to resolution
- Mean time to root cause
- Reduction in defects