Troubleshooting slow Splunk platform restarts when running under systemd on Linux
Most of your servers respond promptly to a "splunk stop" or "splunk restart." However, a small subset might require around 10 minutes to shut down. Confusingly, the splunkd.log file clearly shows the shutdown occurring quickly, often within 30 seconds. The delay might also be flagged as potentially breaking the search head cluster rolling restart.
The affected instances are all running the same OS and Splunk platform versions, but a Splunk support case is unable to identify a clear cause for the behavior.
This article shows you how to trace and diagnose slow Splunk platform shutdown issues caused by systemd failing to cleanly kill orphaned child processes on specific Linux distributions. Documentation on how systemd works on shutdown is limited, so the techniques described here can help you diagnose similar problems.
The likelihood of being affected by this exact issue is low, but the tracing methods described here should prove useful if you encounter anything similar.
What happens when a "splunk stop" occurs?
Under normal circumstances, the Splunk platform starts a shutdown procedure, which causes the systemd service to enter a deactivating state. To check the current state, run this command:
systemctl status Splunkd
This outputs a current status such as:
$ systemctl status Splunkd
● Splunkd.service - Systemd service file for Splunk, generated by 'splunk enable boot-start'
   Loaded: loaded (/etc/systemd/system/Splunkd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop-sigterm) since Wed 2026-01-14 11:44:13 AEDT; 13s ago
The most useful line of text in this example is the state deactivating (stop-sigterm). On working servers, you should see a transition from this state to something like Active: inactive (dead) since Wed 2026-01-14 00:27:09 UTC; 2s ago, and then the restart proceeds as normal.
On problematic servers, you see this state instead:
$ systemctl status Splunkd
● Splunkd.service - Systemd service file for Splunk, generated by 'splunk enable boot-start'
   Loaded: loaded (/etc/systemd/system/Splunkd.service; enabled; vendor preset: disabled)
   Active: deactivating (final-sigkill) since Tue 2026-01-13 08:02:26 UTC; 17s ago
At this point, the splunkd log might show:
01-14-2026 14:10:07.665 +1100 INFO loader [4172766 MainThread] - All pipelines finished.
The restart will not proceed until the timeout is reached. The timeout itself is TimeoutStopSec, which is configured in the systemd unit file. If this is set to 600 seconds, you experience a 10-minute delay. Reducing this timeout does not solve the problem; it just shortens the wait time.
You can also run systemctl daemon-reload to reset the timer when the unit is in a deactivating (final-sigkill) state, allowing the Splunk platform to finish restarting within seconds. This can be useful for testing the issue in non-production environments and for bypassing slow restarts in production.
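When you test this in a non-production environment, it can help to pull just the state out of the status output so you can tell the two deactivating states apart at a glance. The following is a minimal sketch rather than a supported tool; the function names active_state and is_stuck_in_final_sigkill are made up for this illustration, and the parsing assumes the "Active:" line format shown in the examples above.

```shell
# Print the state portion of the "Active:" line from `systemctl status` output on stdin,
# for example "deactivating (final-sigkill)".
active_state() {
    sed -n 's/^[[:space:]]*Active:[[:space:]]*\([a-z-]* ([^)]*)\).*/\1/p' | head -n 1
}

# Return 0 when the status text on stdin shows the stuck state described above.
is_stuck_in_final_sigkill() {
    [ "$(active_state)" = "deactivating (final-sigkill)" ]
}

# Possible usage on a server that is shutting down:
#   systemctl status Splunkd | active_state
#   systemctl status Splunkd | is_stuck_in_final_sigkill && echo "waiting for TimeoutStopSec"
```

Both functions only read text from standard input, so you can also run them against status output that you saved earlier.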
The journalctl logs (journalctl -u Splunkd) can reveal additional hints about what might be happening. This information is also visible in the systemctl status output if you run it at the right time:
Jan 13 07:56:03 test-001 splunk[4068479]: 2026-01-13 07:56:03.186 +0000 splunkd started (build 75595d8f83ef) pid=4068479
Jan 13 08:02:26 test-001 systemd[1]: Stopping Systemd service file for Splunk, generated by 'splunk enable boot-start'...
Jan 13 08:02:26 test-001 splunk[4068479]: 2026-01-13 08:02:26.816 +0000 Interrupt signal received sent by PID 1, which is my parent, command="/usr/lib/systemd/systemd --switched->
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069436 (python3) with signal SIGKILL.
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069445 (metricator_read) with signal SIGKILL.
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069744 (nmon_x86_64_ol8) with signal SIGKILL.
The processes being killed can differ from server to server. In this example, the Technical Add-on for the Metricator Nmon application is having its processes killed. Note that the TA itself is not necessarily the root cause; it is used in these examples because a fix for it was tested on a development server.
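Because the killed processes differ from server to server, a short filter over the journal can summarize which ones systemd had to kill. This is a sketch under the assumption that the journal lines use the "Killing process <pid> (<name>) with signal SIGKILL." wording shown above; the function name list_sigkilled is hypothetical.

```shell
# Read journal lines on stdin and print "<pid> <name>" for every process
# that systemd reports killing with SIGKILL.
list_sigkilled() {
    sed -n 's/.*Killing process \([0-9][0-9]*\) (\(.*\)) with signal SIGKILL\.*$/\1 \2/p'
}

# Possible usage:
#   journalctl -u Splunkd --since '20 minutes ago' --no-pager | list_sigkilled
```

Comparing this list across a working and a problematic server is a quick way to see which add-ons leave processes behind.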
What issue occurs during the Splunk platform shutdown on problematic servers?
At this point, you can determine that systemd itself is killing processes on shutdown of the Splunk platform instance.
AI tools like Microsoft Copilot advise that the expected behavior for KillMode=mixed (as used by the splunkd.service file) is:
- Systemd sends the stop signal (default: SIGTERM) only to the main process.
- After the timeout (TimeoutStopSec=), systemd sends SIGKILL to all remaining processes in the service's cgroup, including:
  - children
  - grandchildren
  - anything else the service spawned
This description is consistent with the observation that the Splunk platform kills its own child processes after receiving the shutdown signal (SIGINT, as defined in the Splunkd.service unit file). However, documentation on how systemd is expected to handle service shutdown is quite limited, so this article reflects only observations made during investigation.
You might find that the TimeoutStopSec timeout is only reached on servers running Oracle Linux version 8; on Oracle Linux versions 7 and 9, the issue does not occur. The journal output shown earlier also indicates that systemd sends SIGKILL to the remaining processes immediately after the main process exits, not only after the timeout. Additionally, when systemd does not kill any processes on Oracle Linux version 8, the shutdown completes normally without any delay.
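To compare affected and unaffected servers, it can be useful to confirm which stop-related settings the unit file actually contains. The sketch below reads a directive straight from a unit file; the function name unit_setting is hypothetical, and it deliberately ignores drop-in override files, so treat the result as a starting point rather than the effective configuration.

```shell
# Print the value of one directive from a unit file.
# $1 = directive name (for example KillMode), $2 = path to the unit file.
# If the directive appears more than once, the last assignment is printed.
unit_setting() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Possible usage, with the unit file path taken from the status output above:
#   for name in KillMode KillSignal TimeoutStopSec; do
#       echo "$name=$(unit_setting "$name" /etc/systemd/system/Splunkd.service)"
#   done
```

For the values systemd has actually loaded, systemctl show Splunkd -p KillMode -p KillSignal -p TimeoutStopUSec reports the runtime view, which also reflects any drop-ins.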
Tracing the issue
The ExecProcessor component reports kill signals sent on shutdown. To enable this, run:
splunk set log-level ExecProcessor -level DEBUG
You can also use pstree to view the process hierarchy. Find the main splunkd PID (shown as Main PID) in the systemctl status Splunkd output, then run:
pstree -p <splunkd pid>
Example output (shortened for brevity):
$ pstree -p 2619542
splunkd(2619542)─┬─splunkd(2619608)─┬─compsup(2619897)─┬─identity(2619952)─┬─{identity}(2619955)
│ │ │ ├─{identity}(2619956)
│ │ │ ├─{identity}(2619957)
│ │ │ ├─{identity}(2619963)
│ │ │ ├─{identity}(2619964)
│ │ │ ├─{identity}(2620026)
│ │ │ └─{identity}(2690145)
│ │ ├─{compsup}(2619898)
│ │ ├─{compsup}(2619899)
│ │ ├─{compsup}(2619900)
│ │ ├─{compsup}(2619902)
│ │ ├─{compsup}(2619914)
│ │ ├─{compsup}(2619916)
│ │ ├─{compsup}(2619918)
│ │ └─{compsup}(2708546)
│ ├─mongod(2619892)─┬─{mongod}(2619904)
│ │ ├─{mongod}(2619905)
│ │ ├─{mongod}(2619906)
│ │ ├─{mongod}(2619969)
│ │ ├─{mongod}(2619970)
│ │ ├─{mongod}(2619971)
│ │ ├─{mongod}(2619972)
│ │ ├─{mongod}(2619973)
│ ├─python3.9(974591)
│ ├─python3.9(974592)
│ ├─python3.9(2620796)─┬─{python3.9}(2622282)
│ │ ├─{python3.9}(2622378)
│ │ ├─{python3.9}(2622384)
│ │ ├─{python3.9}(2622387)
│ │ ├─{python3.9}(2622390)
│ │ ├─{python3.9}(2622393)
│ ├─splunkd(799549)─┬─splunkd(799572)
│ │ ├─{splunkd}(799573)
│ │ ├─{splunkd}(799574)
│ │ ├─{splunkd}(799575)
│ │ ├─{splunkd}(799577)
│ │ ├─{splunkd}(799631)
│ │ ├─{splunkd}(809992)
│ │ └─{splunkd}(810097)
│ └─splunkd(2621237)─┬─{splunkd}(2621436)
│ ├─{splunkd}(2621437)
│ ├─{splunkd}(2621438)
│ ├─{splunkd}(2621439)
│ ├─{splunkd}(2621440)
│ ├─{splunkd}(2621441)
│ ├─{splunkd}(2621449)
│ ├─{splunkd}(2621450)
│ ├─{splunkd}(2621451)
│ ├─{splunkd}(2621452)
│ ├─{splunkd}(2621462)
│ └─{splunkd}(2621469)
├─{splunkd}(2619609)
├─{splunkd}(2619610)
├─{splunkd}(2619611)
├─{splunkd}(2619612)
This helps you confirm whether a process is a child of Splunkd.
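If you prefer a yes or no answer over reading the tree, you can walk the parent chain of a single PID. This is a minimal sketch that assumes a Linux /proc filesystem; the function names parent_pid and is_descendant_of are made up for this example, and the helper variables are global because the code sticks to plain POSIX shell.

```shell
# Print the parent PID of process $1 by reading /proc (Linux only).
parent_pid() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# Return 0 if process $1 has process $2 somewhere in its ancestry.
is_descendant_of() {
    pid=$1
    ancestor=$2
    # Walk upwards until the chain ends (PID 1, PID 0, or a process that has gone away).
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        pid=$(parent_pid "$pid")
        [ "$pid" = "$ancestor" ] && return 0
    done
    return 1
}

# Possible usage, with PIDs taken from the systemctl status output:
#   is_descendant_of 2619892 2619542 && echo "still a descendant of splunkd"
```

A process that appears in the service's cgroup but fails this check has been re-parented, which makes it a candidate for being killed by systemd rather than by splunkd.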
Running systemctl status Splunkd shows all processes in the cgroup, which can include processes that are not children of the Splunkd process:
$ systemctl status Splunkd
● Splunkd.service - Systemd service file for Splunk, generated by 'splunk enable boot-start'
Loaded: loaded (/etc/systemd/system/Splunkd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-01-14 12:05:52 AEDT; 4 days ago
Process: 2619545 ExecStartPost=/bin/bash -c chown -R splunk:splunk /sys/fs/cgroup/memory/system.slice/Splunkd.service; exit 0 (code=exited, status=0/SUCCESS)
Process: 2619543 ExecStartPost=/bin/bash -c chown -R splunk:splunk /sys/fs/cgroup/cpu/system.slice/Splunkd.service (code=exited, status=0/SUCCESS)
Main PID: 2619542 (splunkd)
Tasks: 267
Memory: 3.2G (limit: 3.5G)
CGroup: /system.slice/Splunkd.service
├─ 906141 [splunkd pid=2619542] [search-launcher]
├─ 906144 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 920829 [splunkd pid=2619542] [search-launcher]
├─ 920833 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 921264 [splunkd pid=2619542] [search-launcher]
├─ 921285 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 951548 [splunkd pid=2619542] [search-launcher]
├─ 951570 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 959280 python3 /opt/splunk/etc/apps/TA-metricator-for-nmon/bin/metricator_reader.py --fifo fifo1
├─ 959282 /bin/sh /opt/splunk/etc/apps/TA-metricator-for-nmon/bin/metricator_reader.sh /opt/splunk/var/log/metricator/var/nmon_repository/fifo1/nmon.fifo
├─ 959306 /opt/splunk/var/log/metricator/bin/linux/ol/nmon_x86_64_ol8 -F /opt/splunk/var/log/metricator/var/nmon_repository/fifo1/nmon.fifo -T -s 60 -c 1440 -d 1500 -g auto -D -p
├─2619542 splunkd --under-systemd --systemd-delegate=yes -p 8089 _internal_launch_under_systemd
├─2619608 [splunkd pid=2619542] splunkd --under-systemd --systemd-delegate=yes -p 8089 _internal_launch_under_systemd [process-runner]
├─2619892 mongod --dbpath=/opt/splunk/var/lib/splunk/kvstore/mongo --storageEngine=wiredTiger --wiredTigerCacheSizeGB=1.050000 --port=8191 --timeStampFormat=iso8601-utc --oplogSize=200 --keyFile=/opt>
├─2619897 compsup daemon
├─2619952 /opt/splunk/var/run/supervisor/pkg-run/pkg-identity722142497/identity
├─2620796 /opt/splunk/bin/python3.9 -O /opt/splunk/lib/python3.9/site-packages/splunk/appserver/mrsparkle/root.py --proxied=127.0.0.1,8065,8000
└─2621237 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089 --with-kvstore
Jan 14 12:05:55 dev-004 splunk[2619542]: Your indexes and inputs configurations are not internally consistent. For more information, run 'splunk btool check --de>
Jan 14 12:05:55 dev-004 splunk[2619592]: Bad regex value: '^[', of param: props.conf / [panther:apache:error] / TIME_PREFIX; why: missing terminating ] f>
Jan 14 12:05:56 dev-004 splunk[2619542]: One or more regexes in your configuration are not valid. For details, please see btool.log or directly above.
Jan 14 12:05:56 dev-004 splunk[2619542]: Checking filesystem compatibility... Done
Jan 14 12:05:56 dev-004 splunk[2619542]: Checking conf files for problems...
Jan 14 12:05:56 dev-004 splunk[2619542]: Done
Jan 14 12:05:56 dev-004 splunk[2619542]: Checking default conf files for edits...
Jan 14 12:05:56 dev-004 splunk[2619542]: Validating installed files against hashes from '/opt/splunk/splunk-9.3.3-75595d8f83ef-linux-2.6-x86_64-manifest'
Jan 14 12:05:56 dev-004 splunk[2619542]: PYTHONHTTPSVERIFY is set to 0 in splunk-launch.conf disabling certificate validation for the httplib and urllib libraries shipped with t>
Jan 14 12:05:56 dev-004 splunk[2619542]: 2026-01-14 12:05:56.925 +1100 splunkd started (build 75595d8f83ef) pid=2619542
The systemd-cgls command shows just the processes running under the Splunkd service, which is slightly easier to read:
systemd-cgls /system.slice/Splunkd.service --no-pager -l
Control group /system.slice/Splunkd.service:
├─ 799549 [splunkd pid=2619542] [search-launcher]
├─ 799572 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 906141 [splunkd pid=2619542] [search-launcher]
├─ 906144 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 920829 [splunkd pid=2619542] [search-launcher]
├─ 920833 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 921264 [splunkd pid=2619542] [search-launcher]
├─ 921285 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 951548 [splunkd pid=2619542] [search-launcher]
├─ 951570 [splunkd pid=2619542] [search-launcher] [process-runner]
├─ 959280 python3 /opt/splunk/etc/apps/TA-metricator-for-nmon/bin/metricator_reader.py --fifo fifo1
├─ 959282 /bin/sh /opt/splunk/etc/apps/TA-metricator-for-nmon/bin/metricator_reader.sh /opt/splunk/var/log/metricator/var/nmon_repository/fifo1/nmon.fifo
├─ 959306 /opt/splunk/var/log/metricator/bin/linux/ol/nmon_x86_64_ol8 -F /opt/splunk/var/log/metricator/var/nmon_repository/fifo1/nmon.fifo -T -s 60 -c 1440 -d 1500 -g auto -D -p
├─2619542 splunkd --under-systemd --systemd-delegate=yes -p 8089 _internal_launch_under_systemd
├─2619608 [splunkd pid=2619542] splunkd --under-systemd --systemd-delegate=yes -p 8089 _internal_launch_under_systemd [process-runner]
├─2619892 mongod --dbpath=/opt/splunk/var/lib/splunk/kvstore/mongo --storageEngine=wiredTiger --wiredTigerCacheSizeGB=1.050000 --port=8191 --timeStampFormat=iso8601-utc --oplogSize=200 --keyFile=/opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key --setParameter=enableLocalhostAuthBypass=0 --setParameter=oplogFetcherSteadyStateMaxFetcherRestarts=0 --replSet=AB96FBE1-FC8F-4BBE-894A-5F4C41A60B4C --bind_ip=0.0.0.0 --sslMode=requireSSL --sslAllowInvalidHostnames --sslPEMKeyFile=/opt/splunk/etc/auth/server.pem --sslPEMKeyPassword=xxxxxxxx --tlsDisabledProtocols=noTLS1_0,noTLS1_1 --sslCipherConfig=ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:AES256-GCM-SHA384:AES128-GCM-SHA256:AES128-SHA256 --nounixsocket --noscripting
├─2619897 compsup daemon
├─2619952 /opt/splunk/var/run/supervisor/pkg-run/pkg-identity722142497/identity
├─2620796 /opt/splunk/bin/python3.9 -O /opt/splunk/lib/python3.9/site-packages/splunk/appserver/mrsparkle/root.py --proxied=127.0.0.1,8065,8000
└─2621237 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089 --with-kvstore
To determine the parent of any pid, use ps -ef to confirm the parent ID in the third output column (process ID 1 for python3 in the example below):
$ ps -ef | grep nmon | grep -v grep
splunk 4139835 1 0 Jan18 ? 00:00:12 python3 /opt/splunk/etc/peer-apps/TA-metricator-for-nmon/bin/metricator_reader.py --fifo fifo2
splunk 4139837 4139835 0 Jan18 ? 00:00:17 /bin/sh /opt/splunk/etc/peer-apps/TA-metricator-for-nmon/bin/metricator_reader.sh /opt/splunk/var/log/metricator/var/nmon_repository/fifo2/nmon.fifo
splunk 4139885 1 0 Jan18 ? 00:00:50 /opt/splunk/var/log/metricator/bin/linux/ol/nmon_x86_64_ol8 -F /opt/splunk/var/log/metricator/var/nmon_repository/fifo2/nmon.fifo -T -s 60 -c 1440 -d 1500 -N -g auto -D -p
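To check every process owned by the Splunk user at once, you can filter the ps -ef output on the parent ID column. The following sketch assumes the standard ps -ef column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the function name orphans_of_user is hypothetical.

```shell
# Read `ps -ef` output on stdin and print "<pid> <command>" for every process
# owned by user $1 whose parent is PID 1.
orphans_of_user() {
    awk -v user="$1" '$1 == user && $3 == 1 {
        cmd = $8
        for (i = 9; i <= NF; i++) cmd = cmd " " $i
        print $2, cmd
    }'
}

# Possible usage:
#   ps -ef | orphans_of_user splunk
```

Under systemd the main splunkd process itself is started by PID 1, so expect it in the list; the entries of interest are the other processes that have been re-parented.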
Finally, use strace to watch for process kill signals:
strace -ff -tt -e trace=kill,tkill,tgkill,rt_sigqueueinfo,rt_tgsigqueueinfo,pidfd_send_signal -p <splunkd pid>
The auditd tool might also be an alternative option for tracing process kills.
Through this tracing, you can confirm whether the processes are being killed by the Splunk platform on shutdown or by systemd. You can also confirm whether the processes are children of the Splunk platform. If you have a TA with a shell script configured in its inputs.conf file, the script might spawn a new process (a grandchild process, which then spawns more processes). If the script then exits, the grandchild process ends up with a parent ID of 1.
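You can reproduce the re-parenting effect outside of the Splunk platform to see it for yourself. In this sketch, a short-lived shell stands in for an add-on's script: it starts a background process, reports the PIDs, and exits. The function names spawn_and_exit and ppid_of are made up for the example, and reading /proc assumes Linux. On most servers the new parent is PID 1, although a system that uses a child subreaper can show a different PID.

```shell
# Print the parent PID of process $1 by reading /proc (Linux only).
ppid_of() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

# Start a short-lived shell that launches `sleep` in the background, prints
# "<sleep pid> <pid of the short-lived shell>", and then exits immediately.
spawn_and_exit() {
    sh -c 'sleep "$1" </dev/null >/dev/null 2>&1 & echo "$! $$"' sh "${1:-30}"
}

# Possible usage:
#   set -- $(spawn_and_exit 30)
#   echo "child=$1 original_parent=$2 current_parent=$(ppid_of "$1")"
#   kill "$1"
```

The original parent no longer exists by the time you inspect the child, which is the same situation that a script-launched nmon process is in.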
The order of the kills can also be confusing. Compare the splunkd.log and journal entries for the same shutdown:
splunkd logs:
01-13-2026 08:02:37.557 +0000 INFO Shutdown [4068580 Shutdown] - Shutdown complete in 5.711 seconds
01-13-2026 08:02:37.559 +0000 INFO loader [4068479 MainThread] - All pipelines finished.
journal logs:
Jan 13 07:56:03 test-001 splunk[4068479]: 2026-01-13 07:56:03.186 +0000 splunkd started (build 75595d8f83ef) pid=4068479
Jan 13 08:02:26 test-001 systemd[1]: Stopping Systemd service file for Splunk, generated by 'splunk enable boot-start'...
Jan 13 08:02:26 test-001 splunk[4068479]: 2026-01-13 08:02:26.816 +0000 Interrupt signal received sent by PID 1, which is my parent, command="/usr/lib/systemd/systemd --switched->
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069436 (python3) with signal SIGKILL.
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069445 (metricator_read) with signal SIGKILL.
Jan 13 08:02:37 test-001 systemd[1]: Splunkd.service: Killing process 4069744 (nmon_x86_64_ol8) with signal SIGKILL.
In the above example, the shutdown completed at 08:02:37, and the journal also shows that the kills occurred during this same second. However, you cannot see whether the systemd kill occurred after the Splunk platform process had shut down.
To get more precise timing, use the -o short-precise option with the journal command:
journalctl -u Splunkd --since '20 minutes ago' --no-pager -o short-precise
Jan 13 07:56:03.186501 test-001 splunk[4068479]: 2026-01-13 07:56:03.186 +0000 splunkd started (build 75595d8f83ef) pid=4068479
Jan 13 08:02:26.816122 test-001 systemd[1]: Stopping Systemd service file for Splunk, generated by 'splunk enable boot-start'...
Jan 13 08:02:26.816932 test-001 splunk[4068479]: 2026-01-13 08:02:26.816 +0000 Interrupt signal received sent by PID 1, which is my parent, command="/usr/lib/systemd/systemd --switched-root --system --deserialize" (UID 0)
Jan 13 08:02:37.592258 test-001 systemd[1]: Splunkd.service: Killing process 4069436 (python3) with signal SIGKILL.
Jan 13 08:02:37.592574 test-001 systemd[1]: Splunkd.service: Killing process 4069445 (metricator_read) with signal SIGKILL.
Jan 13 08:02:37.592766 test-001 systemd[1]: Splunkd.service: Killing process 4069744 (nmon_x86_64_ol8) with signal SIGKILL.
From these logs, you can see the kill occurs after the Splunk platform has exited (.592 for systemd vs .559 for splunkd). This timing difference appears to cause the problem in certain systemd/OS versions.
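If you collect these timestamps from several servers, a small helper makes the gap easier to compare. This sketch computes the difference in milliseconds between two time-of-day values; the function name gap_ms is hypothetical, and it assumes both timestamps are from the same day and the same time zone (splunkd.log and the journal can be configured differently, so check the offsets first).

```shell
# Print the gap in milliseconds between two HH:MM:SS(.fraction) timestamps.
# $1 = earlier timestamp, $2 = later timestamp.
gap_ms() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":"); split(b, y, ":")
        ta = x[1] * 3600 + x[2] * 60 + x[3]   # seconds since midnight
        tb = y[1] * 3600 + y[2] * 60 + y[3]
        printf "%.3f\n", (tb - ta) * 1000
    }'
}

# Using the values from the logs above (splunkd finished, first systemd kill):
#   gap_ms 08:02:37.559 08:02:37.592258
```

For the example logs the result is 33.258, meaning the first systemd kill was logged about 33 milliseconds after splunkd reported that all pipelines had finished.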
Modifying TA Metricator to handle killing its own processes
This modification is only required if you have this add-on and an OS version that has this issue. These modifications also help test the theory that all processes must exit before Splunkd exits for a clean restart.
One option is to send a kill signal to the process group ID. However, all Splunkd child processes share the main Splunkd process group, meaning signaling the group would cause additional problems. Instead, you can modify the TA's scripts.
The changes involve:
- Recording any child pids as the script spawns them
- Keeping the parent script running until all children exit
- Setting up a "trap" to receive the kill signal from Splunkd
- Having the trap kill all child processes of the script (and their children)
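The pattern in the list above can be sketched on its own, independent of the add-on. This is an illustrative outline in plain POSIX shell, not the add-on's code: the names child_pids, start_child, cleanup, and install_cleanup_trap are invented for the example, and it assumes pgrep is available for finding the children of each recorded process.

```shell
child_pids=""

# Start a command in the background and record its PID.
start_child() {
    "$@" &
    child_pids="$child_pids $!"
}

# Signal each recorded process and its direct children, then reap what we can.
cleanup() {
    for pid in $child_pids; do
        for grandchild in $(pgrep -P "$pid" 2>/dev/null); do
            kill "$grandchild" 2>/dev/null || true
        done
        kill "$pid" 2>/dev/null || true
    done
    for pid in $child_pids; do
        wait "$pid" 2>/dev/null || true
    done
    child_pids=""
}

# Run cleanup when the parent (for example splunkd) sends INT or TERM.
install_cleanup_trap() {
    trap 'cleanup; exit 0' INT TERM
}

# Possible usage inside a scripted input:
#   install_cleanup_trap
#   start_child some_long_running_command   # hypothetical command name
#   wait                                     # keep the script alive until the children exit
```

Keeping the script in wait is what allows the trap to fire: a signal that arrives while the shell is waiting interrupts the wait and runs the handler.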
The nmon binary is a challenge because it detaches from the script that started it and becomes a child of pid 1. Therefore, use a while loop that polls for the nmon process, so that the script exits if nmon is killed or fails. This way, the Splunk platform can restart the script, and with it the nmon binary, as it did before this modification. The following diff output shows the required changes:
-#!/bin/sh
+#!/bin/bash
 # hostname
 HOST=`hostname`
+pids=()
+
 # format date output to strftime dd/mm/YYYY HH:MM:SS
 log_date () {
     date "+%d-%m-%Y %H:%M:%S"
 if [ -z "${SPLUNK_HOME}" ]; then
     exit 1
 fi
+cleanup() {
+    pids+=("$nmon_pid")
+    for pid in "${pids[@]}"; do
+        for child_pid in $(pgrep -P $pid); do
+            kill $child_pid # kill grandchildren
+        done
+        # Check if still running
+        kill -0 "$pid" 2>/dev/null && kill $pid 2>/dev/null || true
+    done
+    for pid in $pids; do
+        wait "$pid" 2>/dev/null || true
+    done
+}
+trap 'cleanup' INT TERM
+
 # Splunk Home variable: This should automatically defined when this script is being launched by Splunk
 # If you intend to run this script out of Splunk, please set your custom value here
 SPL_HOME=${SPLUNK_HOME}
 start_fifo_reader () {
     "perl")
         nohup $APP/bin/metricator_reader.pl --fifo fifo1 </dev/null >/dev/null 2>&1 &
     ;;
     "python"|"python3")
         nohup $INTERPRETER $APP/bin/metricator_reader.py --fifo fifo1 </dev/null >/dev/null 2>&1 &
     ;;
+        pids+=("$!")
+    ;;
     esac
     echo $! > ${APP_VAR}/var/fifo_reader_fifo1.pid
     fifo_started="fifo1"
 start_fifo_reader () {
     "perl")
         nohup $APP/bin/metricator_reader.pl --fifo fifo2 </dev/null >/dev/null 2>&1 &
     ;;
     "python"|"python3")
         nohup $INTERPRETER $APP/bin/metricator_reader.py --fifo fifo2 </dev/null >/dev/null 2>&1 &
     ;;
+        pids+=("$!")
+    ;;
     esac
     echo $! > ${APP_VAR}/var/fifo_reader_fifo2.pid
     fifo_started="fifo2"
 else
     # Relevant for Solaris Only
     write_pid
     remove_mutex
+    # we could just "wait" but if the nmon process exits it will not restart
+    # if nmon fails/terminates we should exit allowing the restart occur
+    # but we cannot use wait against a pid that is owned by parent id 1
+    while true; do
+        nmon_pid=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`;
+        # if we sleep too long we miss the signal and don't terminate fast enough...
+        sleep 3;
+        if [ -z $nmon_pid ]; then break; fi
+    done
+    #wait
     exit 0
 ;;
 else
The complete version of this file follows:
#!/bin/bash
#set -x
# Program name: metricator_helper.sh
# Purpose - nmon sample script to start collecting data with a 1mn interval refresh
# Author - Guilhem Marchand
# Version 2.0.2
# For AIX / Linux / Solaris
#################################################
## Your Customizations Go Here ##
#################################################
# hostname
HOST=`hostname`
pids=()
# format date output to strftime dd/mm/YYYY HH:MM:SS
log_date () {
date "+%d-%m-%Y %H:%M:%S"
}
if [ -z "${SPLUNK_HOME}" ]; then
echo "`log_date`, ${HOST} ERROR, SPLUNK_HOME variable is not defined"
exit 1
fi
cleanup() {
#echo "Adding nmon pid $nmon_pid" >> /tmp/kill.txt
pids+=("$nmon_pid")
for pid in "${pids[@]}"; do
#echo "`date` Will attempt to kill $pid" >> /tmp/kill.txt
for child_pid in $(pgrep -P $pid); do
#echo "`date` attempting to kill child pid $child_pid" >> /tmp/kill.txt
kill $child_pid # kill grandchildren
done
# Check if still running
kill -0 "$pid" 2>/dev/null && kill $pid 2>/dev/null || true
done
#echo "`date` Killing complete" >> /tmp/kill.txt
for pid in $pids; do
wait "$pid" 2>/dev/null || true
done
}
trap 'cleanup' INT TERM
# Splunk Home variable: This should automatically defined when this script is being launched by Splunk
# If you intend to run this script out of Splunk, please set your custom value here
SPL_HOME=${SPLUNK_HOME}
# Check SPL_HOME variable is defined, this should be the case when launched by Splunk scheduler
if [ -z "${SPL_HOME}" ]; then
echo "`log_date`, ${HOST} ERROR, SPL_HOME (SPLUNK_HOME) variable is not defined"
exit 1
fi
# APP path discovery
if [ -d "$SPLUNK_HOME/etc/apps/TA-metricator-for-nmon" ]; then
APP=$SPLUNK_HOME/etc/apps/TA-metricator-for-nmon
elif [ -d "$SPLUNK_HOME/etc/peer-apps/TA-metricator-for-nmon" ];then
APP=$SPLUNK_HOME/etc/peer-apps/TA-metricator-for-nmon
else
echo "`log_date`, ${HOST} ERROR, the APP directory could not be defined, is the TA-metricator-for-nmon installed ?"
exit 1
fi
# pre-action scripts, run any script available in bin/pre_action_scripts
pre_action_scripts=`find $APP/bin/pre_action_scripts -name "*.sh" -type f`
for pre_action_script in $pre_action_scripts; do
if [ -x $pre_action_script ]; then
echo "`log_date`, ${HOST} INFO, executing pre-action script: $pre_action_script"
$pre_action_script
fi
done
# Var directory for data generation
APP_VAR=$SPLUNK_HOME/var/log/metricator
# Create directory if not existing already
[ ! -d $APP_VAR ] && { mkdir -p $APP_VAR; }
# Mutex: avoid running metricator_helper.sh and metricator_consumer.sh concurrently
mutex="${APP_VAR}/mutex"
# Allow 10s mini to acquire mutex and break
count=0
while [ -f $mutex ]; do
sleep 2
count=`expr $count + 1`
if [ $count -gt 5 ]; then
break
fi
done
# acquire mutex
touch $mutex
remove_mutex () {
rm -f $mutex
}
# Which type of OS are we running
UNAME=`uname`
# Linux binaries are stored in the bin/linux.tgz archive file
# At first startup only, if the linux directory does not exist, extract the binaries archive file
case $UNAME in
Linux )
if [ ! -d ${APP}/bin/linux ]; then
cd ${APP}/bin
tar -xzpf linux.tgz
fi
;;
esac
# Manage sarmon binaries for Solaris
case $UNAME in
SunOS)
if [ ! -d ${APP}/bin/sarmon_bin_i386 ]; then
cd ${APP}/bin
gunzip sarmon_bin_i386.tgz
tar xf sarmon_bin_i386.tar
fi
if [ ! -d ${APP}/bin/sarmon_bin_sparc ]; then
cd ${APP}/bin
gunzip sarmon_bin_sparc.tgz
tar xf sarmon_bin_sparc.tar
fi
;;
esac
# Silently update bin content to run directory (see after this)
# Note: on some systems, cp is an alias to cp -i which would prevent this from working as expected
update_var_bin () {
cd ${APP}/bin
case $UNAME in
Linux )
tar -xzpf linux.tgz ;;
esac
\cp -pf ${APP}/default/app.conf ${APP_VAR}/app.conf > /dev/null 2>&1
\cp -rpf ${APP}/bin ${APP_VAR}/ > /dev/null 2>&1
}
# To prevents binaries overwrites during upgrades and sh cluster deployment issues, cache the bin directory
# Binaries will be launched from the cache directory
if [ -d ${APP_VAR}/bin ]; then
# the bin directory has been already cached, verify if an update is required
if [ -f ${APP_VAR}/app.conf ]; then
diff ${APP}/default/app.conf ${APP_VAR}/app.conf >/dev/null
# if return code does not equal to 0, update is required
if [ $? -ne 0 ]; then
update_var_bin
fi
else
# no app.conf found, force copy of app.conf and update
update_var_bin
fi
else
# the bin directory has not been cached already
update_var_bin
fi
# Remove stanza name, prevent any change from the SHC deployer
reformat_default_nmon_conf () {
# Retrieve category from first arg
nmon_conf=$1
case $UNAME in
"Linux")
sed -i 's/ = /=/g' ${nmon_conf}
sed -i 's/\[nmon\]//g' ${nmon_conf}
;;
*)
cat ${nmon_conf} | sed 's/ = /=/g' | sed 's/\[nmon\]//g' > /tmp/metricator_helper.tmp.$$
mv /tmp/metricator_helper.tmp.$$ ${nmon_conf}
;;
esac
}
###
### FIFO options:
###
# Using FIFO files (named pipe) are now used to minimize the CPU footprint of the technical addons
# As such, it is not required anymore to use short cycle of Nmon run to reduce the CPU usage
# You can still want to manage the volume of data to be generated by managing the interval and snapshot values
# as a best practice recommendation, the time to live of nmon processes writing to FIFO should be 24 hours
# value for interval: time in seconds between 2 performance measures
fifo_interval="60"
# value for snapshot: number of measure to perform
fifo_snapshot="1440"
# AIX common options default, will be overwritten by nmon.conf (unless the file would not be available)
# Note: Since the version 1.3.0, AIX uses fifo files to minimize the CPU footprint, this requires the -F option
# and is not compatible with the "-f" option that defines output to csv
# The -F option is implicitly added by the metricator_helper.sh script during processing
AIX_options="-T -A -d -K -L -M -P -O -W -S -^ -p"
# Linux max devices (-d option), default to 1500
Linux_devices="1500"
# Change the priority applied while looking at nmon binary
# by default, the metricator_helper.sh script will use any nmon binary found in PATH
# Set to "1" to give the priority to embedded nmon binaries
Linux_embedded_nmon_priority="0"
# Change the limit for processes and disks capture of nmon for Linux
# In default configuration, nmon will capture most of the process table by capturing main consuming processes
# You can set nmon to an unlimited number of processes to be captured, and the entire process table will be captured.
# Note this will affect the number of disk devices captured by setting it to an unlimited number.
# This will also increase the volume of data to be generated and may require more cpu overhead to process nmon data
# The default configuration uses the default mode (limited capture), you can set below the limit number of capture to unlimited mode
# Change to "1" to set capture of processes and disks to no limit mode
Linux_unlimited_capture="0"
# endtime_margin defines the time in seconds before a new nmon process will be started
# in default configuration, a new process will be spawned 240 seconds before the current process ends
# see nmon.conf (this value will be overwritten by nmon.conf)
endtime_margin="240"
# Linux disks extended statistics (see nmon.conf)
Linux_disk_dg_enable="1"
# Name of the DG group file
Linux_disk_dg_group="auto"
# nmon external generation, default is activated
nmon_external_generation="1"
# source default / local / per server nmon.conf
for nmon_conf_file in $APP/default/nmon.conf $APP/local/nmon.conf /etc/nmon.conf; do
if [ -f $nmon_conf_file ]; then
# Verify and reformat if required
grep '\[nmon\]' $nmon_conf_file >/dev/null
if [ $? -eq 0 ]; then
reformat_default_nmon_conf $nmon_conf_file
. $nmon_conf_file
else
. $nmon_conf_file
fi
fi
done
# Manage FQDN option
echo $nmonparser_options | grep '\-\-use_fqdn' >/dev/null
if [ $? -eq 0 ]; then
# Only relevant for Linux OS
case $UNAME in
Linux)
HOST=`hostname -f` ;;
AIX)
HOST=`hostname` ;;
SunOS)
HOST=`hostname` ;;
esac
else
HOST=`hostname`
fi
# Manage host override option based on Splunk hostname defined
case $override_sys_hostname in
"1")
# Retrieve the Splunk host value
HOST=`cat $SPLUNK_HOME/etc/system/local/inputs.conf | grep '^host =' | awk -F\= '{print $2}' | sed 's/ //g'`
;;
esac
# Nmon Binary
case $UNAME in
##########
# AIX #
##########
AIX )
# Use topas_nmon in priority
if [ -x /usr/bin/topas_nmon ]; then
NMON="/usr/bin/topas_nmon"
AIX_topas_nmon="true"
else
NMON=`which nmon 2>&1`
if [ ! -x "$NMON" ]; then
echo "`log_date`, ${HOST} ERROR, Nmon could not be found, cannot continue."
remove_mutex
exit 1
fi
AIX_topas_nmon="false"
fi
;;
##########
# Linux #
##########
# Nmon App comes with most of nmon versions available from http://nmon.sourceforge.net/pmwiki.php?n=Site.Download
Linux )
case $Linux_embedded_nmon_priority in
0)
# give priority to any nmon binary found in local PATH
# Nmon BIN full path (including bin name), please update this value to reflect your Nmon installation
which nmon >/dev/null 2>&1
if [ $? -eq 0 ]; then
NMON=`which nmon`
else
NMON=""
fi
;;
1)
# give priority to embedded binaries
# if none of embedded binaries can suit the local system, we will switch to local nmon binary, if it's available
NMON=""
;;
esac
if [ ! -x "$NMON" ];then
# No nmon found in env, so using prepackaged version
# First, define the processor architecture, use the arch command in priority, fall back to uname -m if not available
which arch >/dev/null 2>&1
if [ $? -eq 0 ]; then
ARCH=`arch`
else
ARCH=`uname -m`
fi
# Let's convert some of architecture names to more conventional names, specially used by the nmon community to name binaries (not that ppc32 is more or less clear than power_32...)
case $ARCH in
i686 )
ARCH_NAME="x86" ;; # x86 32 bits
x86_64 )
ARCH_NAME="x86_64" ;; # x86 64 bits
ia64 )
ARCH_NAME="ia64" ;; # Itanium 64 bits
ppc32* )
ARCH_NAME="power_32" ;; # powerpc 32 bits
ppc64* )
ARCH_NAME="power_64" ;; # powerpc 64 bits
s390 )
ARCH_NAME="mainframe_32" ;; # s390 32 bits mainframe
s390x )
ARCH_NAME="mainframe_64" ;; # s390x 64 bits mainframe
arm* )
ARCH_NAME="arm" ;; # arm architecture
* )
ARCH_NAME="${ARCH}" ;; # None of those!
esac
### PowerLinux specific ###
# On PowerLinux arch, some OS can run in Big Endian while most will run in Little Endian
# On a Little Endian system, the following command will return "1" for a Little Endian arch
# See this nice article: https://www.mainline.com/linux-on-power-to-be-or-not-to-be-why-should-i-care
# And specifically "Ubuntu is LE only; SLES 11 is BE only; SLES 12 is LE only; RedHat 6.x is BE only; RedHat 7.1 has two distributions – one LE, the other BE"
# For convenience, all powerLinux binaries are suffixed by "_le" or "_be"
case $ARCH in
ppc32* | ppc64* )
# Assign default to Little Endian in case of failure
BYTE_ORDER_STATUS="1"
BYTE_ORDER="le"
BYTE_ORDER_STATUS=`echo I | tr -d [:space:] | od -to2 | head -n1 | awk '{print $2}' | cut -c6`
case ${BYTE_ORDER_STATUS} in
0 )
# Big Endian
BYTE_ORDER="be" ;;
# Little Endian
1 )
BYTE_ORDER="le" ;;
esac
;;
esac
# Initialize linux_vendor
linux_vendor=""
linux_mainversion=""
linux_subversion=""
linux_fullversion=""
# Try to find the best embedded binary depending on the Linux version
# Most modern Linux distributions ship an /etc/os-release, which is by far the best source for system identification
OSRELEASE="/etc/os-release"
if [ -f $OSRELEASE ]; then
# Great, let's try to find the better binary for that system
linux_vendor=`grep '^ID=' $OSRELEASE | awk -F= '{print $2}' | sed 's/\"//g' | sed 's/ //g'` # The Linux distribution
linux_mainversion=`grep '^VERSION_ID=' $OSRELEASE | awk -F'"' '{print $2}' | awk -F'.' '{print $1}'` # The main release (eg. rhel 7)
# some distribution (eg. Fedora) seem to use a non standard format
case $linux_mainversion in
"")
linux_mainversion=`grep '^VERSION_ID=' $OSRELEASE | sed 's/ //g' | sed 's/\"//' | awk -F'=' '{print $2}'` ;;
esac
linux_subversion=`grep '^VERSION_ID=' $OSRELEASE | awk -F'"' '{print $2}' | awk -F'.' '{print $2}'` # The sub level release (eg. "1" from rhel 7.1)
linux_fullversion=`grep '^VERSION_ID=' $OSRELEASE | awk -F'"' '{print $2}' | sed 's/\.//g'` # Concatenated version of the release (eg. 71 for rhel 7.1)
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big Endian
"be" )
# Try the most accurate
if [ -f $APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion}_be ]; then
NMON="$APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion}_be"
# try the mainversion
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_be ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_be"
# try the linux_vendor
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}_be ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}_be"
fi
;;
# Little Endian
"le" )
# Try the most accurate
if [ -f $APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion}_le ]; then
NMON="$APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion}_le"
# try the mainversion
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_le ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_le"
# try the linux_vendor
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}_le ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}_le"
fi
;;
esac
;;
# All other arch
*)
# Try the most accurate
if [ -f $APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion} ]; then
NMON="$APP_VAR/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_fullversion}"
# try the mainversion
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion} ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
# try the linux_vendor
elif [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor} ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}"
fi
;;
esac
# No os-release found: this is probably an older Linux, so detection becomes a bit harder
# centOS, OS and version detection
elif [ -f /etc/centos-release ]; then
for version in 5 6 7; do
if grep "CentOS release $version" /etc/centos-release >/dev/null; then
linux_vendor="centos"
linux_mainversion="$version"
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
fi
done
# rhel, OS and version detection
elif [ -f /etc/redhat-release ]; then
# Redhat has some version for PowerLinux that can be Little or Big endian
for version in 4 5 6 7; do
# search for rhel
if grep "Red Hat Enterprise Linux Server release $version" /etc/redhat-release >/dev/null; then
linux_vendor="rhel"
linux_mainversion="$version"
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big endian
"be" )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_be"
;;
# Little endian
"le")
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_le"
;;
esac
;;
# Other arch
* )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
;;
esac
fi
done
# Second chance for sles and opensuse, /etc/SuSE-release is deprecated and should be removed in future version
elif [ -f /etc/SuSE-release ]; then
# sles
if grep "SUSE Linux Enterprise Server" /etc/SuSE-release >/dev/null; then
linux_vendor="sles"
# Get the main version only
linux_mainversion=`grep 'VERSION =' /etc/SuSE-release | sed 's/ //g' | awk -F= '{print $2}' | awk -F. '{print $1}'`
linux_subversion=`grep 'PATCHLEVEL =' /etc/SuSE-release | sed 's/ //g' | awk -F= '{print $2}' | awk -F. '{print $1}'`
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big endian
"be" )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_be"
;;
# Little endian
"le")
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_le"
;;
esac
;;
# Other arch
* )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
;;
esac
elif grep "openSUSE" /etc/SuSE-release >/dev/null; then
linux_vendor="opensuse"
# Get the main version only
linux_mainversion=`grep 'VERSION =' /etc/SuSE-release | sed 's/ //g' | awk -F= '{print $2}' | awk -F. '{print $1}'`
linux_subversion=`grep 'PATCHLEVEL =' /etc/SuSE-release | sed 's/ //g' | awk -F= '{print $2}' | awk -F. '{print $1}'`
# try the most accurate
if [ -f ${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}${linux_subversion} ]; then
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}${linux_subversion}"
else
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
fi
fi
elif [ -f /etc/issue ]; then
# search for debian (note: starting debian 7, the /etc/os-release should be available)
# This shall not be updated in the future as the /etc/os-release is now available by default
if grep "Debian GNU/Linux" /etc/issue >/dev/null; then
for version in 5 6 7; do
if grep "Debian GNU/Linux $version" /etc/issue >/dev/null; then
linux_vendor="debian"
linux_mainversion="$version"
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
fi
done
# Ubuntu is Little Endian only
elif grep "Ubuntu" /etc/issue >/dev/null; then
for version in 6 7 8 9 10 11 12 13 14 15; do
if grep "Ubuntu $version" /etc/issue >/dev/null; then
linux_vendor="ubuntu"
linux_mainversion="$version"
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big endian
"be" )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_be"
;;
# Little endian
"le")
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}_le"
;;
esac
;;
# Other arch
* )
NMON="${APP_VAR}/bin/linux/${linux_vendor}/nmon_${ARCH_NAME}_${linux_vendor}${linux_mainversion}"
;;
esac
fi
done
fi
fi
# Verify NMON is set and exists, if not, try falling back to generic builds
case $NMON in
"")
# Look for local binary in PATH
which nmon >/dev/null 2>&1
if [ $? -eq 0 ]; then
NMON=`which nmon 2>&1`
else
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big endian
"be" )
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}_be"
;;
# Little endian
"le")
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}_le"
;;
esac
;;
# Other arch
* )
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}"
;;
esac
fi
;;
*)
if [ ! -x ${NMON} ]; then
# Look for local binary in PATH
which nmon >/dev/null 2>&1
if [ $? -eq 0 ]; then
NMON=`which nmon 2>&1`
fi
fi
;;
esac
# Finally verify we have a binary that exists and is executable
if [ ! -x ${NMON} ]; then
if [ -x ${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH} ]; then
# Try switching to embedded generic
case $ARCH in
# PowerLinux
ppc32* | ppc64* )
# Manage Big / Little Endian arch
case ${BYTE_ORDER} in
# Big endian
"be" )
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}_be"
;;
# Little endian
"le")
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}_le"
;;
esac
;;
# Other arch
* )
NMON="${APP_VAR}/bin/linux/generic/nmon_linux_${ARCH_NAME}"
;;
esac
else
echo "`log_date`, ${HOST} ERROR, could not find an nmon binary suitable for this system, please install nmon manually and set it available in the user PATH"
remove_mutex
exit 1
fi
fi
fi
;;
##########
# SunOS #
##########
SunOS )
# Nmon BIN full path (including bin name), please update this value to reflect your Nmon installation
NMON=`which sadc 2>&1`
if [ ! -x "$NMON" ];then
# No nmon found in env, so using prepackaged version
sun_arch=`uname -a`
echo ${sun_arch} | grep sparc >/dev/null
case $? in
0 )
NMON="$APP_VAR/bin/sarmon_bin_sparc/sadc" ;;
* )
# arch is x86
NMON="$APP_VAR/bin/sarmon_bin_i386/sadc" ;;
esac
fi
;;
* )
echo "`log_date`, ${HOST} ERROR, Unsupported system ! Nmon is available only for AIX / Linux / Solaris systems, please check and deactivate nmon data collect"
remove_mutex
exit 2
;;
esac
# Nmon file final destination
# Default to nmon_repository of Nmon Splunk App
NMON_REPOSITORY=${APP_VAR}/var/nmon_repository
[ ! -d $NMON_REPOSITORY ] && { mkdir -p $NMON_REPOSITORY; }
#also needed -
[ -d ${APP_VAR}/var/csv_repository ] || { mkdir -p ${APP_VAR}/var/csv_repository; }
[ -d ${APP_VAR}/var/config_repository ] || { mkdir -p ${APP_VAR}/var/config_repository; }
# Nmon PID file
PIDFILE=${APP_VAR}/nmon.pid
# FIFO file 1
FIFO1_DIR=${NMON_REPOSITORY}/fifo1
FIFO1=${FIFO1_DIR}/nmon.fifo
# FIFO file 2
FIFO2_DIR=${NMON_REPOSITORY}/fifo2
FIFO2=${FIFO2_DIR}/nmon.fifo
# outdated net if state file: used to inform the script that parsers have detected an outdated definition of network
# interfaces. If the outdated state file exist, the current running process will be terminated
# If a network change has occurred and the list of interfaces have changed, network metrics will not be available
# until a new process is started
OUTDATED_NETIF_NMON_STATE=${APP_VAR}/var/outdated_network_int_nmon.state
# create dir
[ -d ${FIFO1_DIR} ] || { mkdir -p ${FIFO1_DIR}; }
[ -d ${FIFO2_DIR} ] || { mkdir -p ${FIFO2_DIR}; }
# ensure fifo files do not exist currently as regular files instead of named pipe
if [ -f $FIFO1 ]; then
rm -f $FIFO1
fi
if [ -f $FIFO2 ]; then
rm -f $FIFO2
fi
# create fifo files if required
if [ ! -p $FIFO1 ]; then
mkfifo $FIFO1
fi
if [ ! -p $FIFO2 ]; then
mkfifo $FIFO2
fi
# csv_repository
[ -d ${APP_VAR}/var/csv_repository ] || { mkdir -p ${APP_VAR}/var/csv_repository; }
#
# Interpreter choice
#
PYTHON=0
PYTHON2=0
PYTHON3=0
PERL=0
# Set the default interpreter
INTERPRETER="python"
# Get the version for both worlds
PYTHON2=`which python 2>&1`
PYTHON3=`which python3 2>&1`
PERL=`which perl 2>&1`
# Handle Python
PYTHON_available="false"
case $PYTHON3 in
*python*)
PYTHON_available="true"
INTERPRETER="python3" ;;
*)
case $PYTHON2 in
*python*)
PYTHON_available="true"
INTERPRETER="python" ;;
esac
;;
esac
# Handle Perl
case $PERL in
*perl*)
PERL_available="true"
;;
*)
PERL_available="false"
;;
esac
case `uname` in
# AIX priority is Perl
"AIX")
case $PERL_available in
"true")
INTERPRETER="perl" ;;
"false")
INTERPRETER="$INTERPRETER" ;;
esac
;;
# Other OS, priority is Python
*)
case $PYTHON_available in
"true")
INTERPRETER="$INTERPRETER" ;;
"false")
INTERPRETER="perl" ;;
esac
;;
esac
############################################
# functions
############################################
# create snap scripts for nmon_external
create_nmon_external () {
# fifo_started variable is exported by the function start_fifo_reader
case $fifo_started in
"fifo1")
cat ${APP}/bin/nmon_external_cmd/nmon_external_start.sh | sed "s|NMON_FIFO_PATH|$NMON_EXTERNAL_DIR|g" > "${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo1.sh"
chmod +x "${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo1.sh"
cat ${APP}/bin/nmon_external_cmd/nmon_external_snap.sh | sed "s|NMON_FIFO_PATH|$NMON_EXTERNAL_DIR|g" > "${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo1.sh"
chmod +x "${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo1.sh"
;;
"fifo2")
cat ${APP}/bin/nmon_external_cmd/nmon_external_start.sh | sed "s|NMON_FIFO_PATH|$NMON_EXTERNAL_DIR|g" > "${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo2.sh"
chmod +x "${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo2.sh"
cat ${APP}/bin/nmon_external_cmd/nmon_external_snap.sh | sed "s|NMON_FIFO_PATH|$NMON_EXTERNAL_DIR|g" > "${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo2.sh"
chmod +x "${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo2.sh"
;;
esac
}
# Verify that we don't spawn multiple instances of the nmon external snap script
# This unexpected issue has been reported in some cases on AIX
# If it occurs, prevent the processes from multiplying:
# any process running for more than 2 minutes will be killed
check_duplicated_external_snap () {
# get the list of occurrences
count="0"
count=`ps -ef | grep nmon_external_snap | grep -v grep | wc -l`
if [ $count -gt 0 ]; then
oldPidList=`ps -ef | grep nmon_external_snap | grep -v grep | awk '{print $2}'`
for pid in $oldPidList; do
pid_runtime=0
# only proceed if the process is still running
if [ -d /proc/${pid} ]; then
# get the process runtime in seconds
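# Worked example (illustrative value): an etime of "1-02:03:04" becomes "1:02:03:04" after tr,
# and awk sums the fields from right to left: 4 + 3*60 + 2*3600 + 1*86400 = 93784 seconds
# (the multiplier grows by 60 for seconds and minutes, then by 24 to go from hours to days)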
pid_runtime=`ps -p ${pid} -oetime= | tr '-' ':' | awk -F: '{ total=0; m=1; } { for (i=0; i < NF; i++) {total += $(NF-i)*m; m *= i >= 2 ? 24 : 60 }} {print total}'`
case ${pid_runtime} in
''|*[!0-9]*)
echo "`log_date`, ${HOST} WARN: run time identification of process with pid ${pid} failed, it has been probably terminated"
;;
*)
if [ ${pid_runtime} -gt 120 ]; then
echo "`log_date`, ${HOST} WARN: fifo nmon external snap script took long and will be killed (SIGTERM): `ps -p ${pid} -ouser,pid,command,etime,args | grep -v PID`"
kill $pid
# Allow some time for the process to end
sleep 1
# re-check the status
ps -p ${pid} -oetime= >/dev/null
if [ $? -eq 0 ]; then
echo "`log_date`, ${HOST} WARN, fifo nmon external snap script `ps -eo user,pid,command,etime,args | grep $pid | grep -v grep` failed to stop, killing (-9) process $pid"
kill -9 $pid
fi
fi
;;
esac
fi
done
fi
}
# For AIX / Linux, the -p option when launching nmon will output the instance pid in stdout
start_nmon () {
#
# Set Nmon command line
#
# NOTE:
# Collecting NFS Statistics:
# --> Since Nmon App Version 1.5.0, NFS activation can be controlled by the nmon.conf file in default/local directories
# - Linux: Add the "-N" option if you want to extract NFS Statistics (NFS V2/V3/V4)
# - AIX: Add the "-N" option for NFS V2/V3, "-NN" for NFS V4
# For AIX, the default command options line "-f -T -A -d -K -L -M -P -O -W -S -^" includes: (see http://www-01.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.cmds4/nmon.htm)
# AIX options can be managed using local/nmon.conf, do not modify options here
# -A Includes the Asynchronous I/O section in the view.
# -d Includes the Disk Service Time section in the view.
# -K Includes the RAW Kernel section and the LPAR section in the recording file. The -K flag dumps the raw numbers
# of the corresponding data structure. The memory dump is readable and can be used when the command is recording the data.
# -L Includes the large page analysis section.
# -M Includes the MEMPAGES section in the recording file. The MEMPAGES section displays detailed memory statistics per page size.
# -O Includes the Shared Ethernet adapter (SEA) VIOS sections in the recording file.
# -W Includes the WLM sections into the recording file.
# -S Includes WLM sections with subclasses in the recording file.
# -P Includes the Paging Space section in the recording file.
# -T Includes the top processes in the output and saves the command-line arguments into the UARG section. You cannot specify the -t, -T, or -Y flags with each other.
# -^ Includes the Fiber Channel (FC) sections.
# -p print pid in stdout
# For Linux, the default command options line "-f -T -d 1500" includes:
# -t include top processes in the output
# -T as -t plus saves command line arguments in UARG section
# -d <disks> to increase the number of disks [default 256]
# -p print pid in stdout
case $UNAME in
AIX )
# -p option is mandatory to get the pid of the launched instances, ensure it has been set
echo ${AIX_options} | grep '\-p' >/dev/null
if [ $? -ne 0 ]; then
AIX_options="${AIX_options} -p"
fi
# Since release 1.3.0, we use fifo files, -f option is prohibited
echo ${AIX_options} | grep '\-f' >/dev/null
if [ $? -eq 0 ]; then
AIX_options=`echo ${AIX_options} | sed 's/\-f //g'`
fi
# option -y is compatible and mandatory, ensure it has been set
echo ${AIX_options} | grep 'yoverwrite' >/dev/null
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST}, WARN, the -yoverwrite=1 option was not used while loading local settings (please review nmon.conf), option is mandatory and will be forced"
AIX_options="${AIX_options} -yoverwrite=1"
fi
# Manage NFS
if [ ${AIX_NFS23} -eq 1 ]; then
nmon_command="-N -s ${fifo_interval} -c ${fifo_snapshot}"
elif [ ${AIX_NFS4} -eq 1 ]; then
nmon_command="-NN -s ${fifo_interval} -c ${fifo_snapshot}"
else
nmon_command="-s ${fifo_interval} -c ${fifo_snapshot}"
fi
# Set the nmon command for AIX
nmon_command_fifo1="${NMON} -F ${FIFO1} ${AIX_options} ${nmon_command}"
nmon_command_fifo2="${NMON} -F ${FIFO2} ${AIX_options} ${nmon_command}"
;;
SunOS )
nmon_command="${NMON} ${fifo_interval} ${fifo_snapshot}"
;;
Linux )
# Since 1.2.47, Linux_unlimited_capture feature has changed
# For historical reason, and in case the old activation value (1) has been set in local/nmon.conf, manage it.
case ${Linux_unlimited_capture} in
"1")
Linux_unlimited_capture="-1" ;;
esac
# Set the default Linux minimal args list
Linux_nmon_args="-T -s ${fifo_interval} -c ${fifo_snapshot} -d ${Linux_devices}"
case ${Linux_NFS} in
"1" )
Linux_nmon_args="$Linux_nmon_args -N" ;;
esac
case ${Linux_unlimited_capture} in
"0" )
Linux_nmon_args="$Linux_nmon_args" ;;
"-1" )
Linux_nmon_args="$Linux_nmon_args -I ${Linux_unlimited_capture}" ;;
* )
if echo "${Linux_unlimited_capture}" | grep -E '^[0-9]+(\.[0-9]+)?$' >/dev/null; then
Linux_nmon_args="$Linux_nmon_args -I ${Linux_unlimited_capture}"
else
echo "`log_date`, ${HOST} ERROR, invalid value for Linux_unlimited_capture (${Linux_unlimited_capture} is not an integer or a floating number)"
remove_mutex
exit 2
fi
;;
esac
case ${Linux_disk_dg_enable} in
"1" )
Linux_nmon_args="$Linux_nmon_args -g auto -D" ;;
esac
# Set command lines
nmon_command_fifo1="${NMON} -F ${FIFO1} $Linux_nmon_args -p"
nmon_command_fifo2="${NMON} -F ${FIFO2} $Linux_nmon_args -p"
;;
esac
#
# Starting Nmon
#
case $UNAME in
AIX )
# on AIX, prevent error messages linked to /usr/opt/freeware/bin/rpm
unset LIBPATH
# global nmon_external
NMON_EXTERNAL_DIR="${APP_VAR}/var/nmon_repository/${fifo_started}"
export NMON_EXTERNAL_DIR
NMON_EXTERNAL_FIFO="${APP_VAR}/var/nmon_repository/${fifo_started}/nmon.fifo"
export NMON_EXTERNAL_FIFO
TIMESTAMP=0
export TIMESTAMP
NMON_ONE_IN=1
export NMON_ONE_IN
unset NMON_END
# fifo_started variable is exported by the function start_fifo_reader
case $fifo_started in
"fifo1")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo1.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo1.sh"
export NMON_SNAP
;;
esac
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command_fifo1} in ${NMON_EXTERNAL_DIR}"
${nmon_command_fifo1} 2>&1 > ${APP_VAR}/nmon_output.txt
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST} ERROR, nmon binary returned a non 0 code while trying to start, please verify error traces in splunkd log"
fi
# old topas-nmon version might not be compatible with the -y option, let's manage this
cat ${APP_VAR}/nmon_output.txt | grep -i 'invalid option[^y]*y' >/dev/null
if [ $? -eq 0 ]; then
# option -y is not compatible and not mandatory
echo "`log_date`, ${HOST}, ERROR, This system is running a topas-nmon version that does not support the -y option, you might need to consider an AIX upgrade: `cat ${APP_VAR}/nmon_output.txt`"
nmon_command_fifo1=`echo ${nmon_command_fifo1} | sed 's/\-yoverwrite=1//g'`
${nmon_command_fifo1} 2>&1 > ${APP_VAR}/nmon_output.txt
fi
# Store the PID file (very last line of nmon output)
if [ -f ${APP_VAR}/nmon_output.txt ]; then
tail -1 ${APP_VAR}/nmon_output.txt > ${PIDFILE}
fi
;;
"fifo2")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo2.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo2.sh"
export NMON_SNAP
;;
esac
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command_fifo2} in ${NMON_EXTERNAL_DIR}"
${nmon_command_fifo2} 2>&1 > ${APP_VAR}/nmon_output.txt
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST} ERROR, nmon binary returned a non 0 code while trying to start, please verify error traces in splunkd log"
fi
# old topas-nmon version might not be compatible with the -y option, let's manage this
cat ${APP_VAR}/nmon_output.txt | grep -i 'invalid option[^y]*y' >/dev/null
if [ $? -eq 0 ]; then
# option -y is not compatible and not mandatory
echo "`log_date`, ${HOST}, ERROR, This system is running a topas-nmon version that does not support the -y option, you might need to consider an AIX upgrade: `cat ${APP_VAR}/nmon_output.txt`"
nmon_command_fifo2=`echo ${nmon_command_fifo2} | sed 's/\-yoverwrite=1//g'`
${nmon_command_fifo2} 2>&1 > ${APP_VAR}/nmon_output.txt
fi
# Store the PID file (very last line of nmon output)
if [ -f ${APP_VAR}/nmon_output.txt ]; then
tail -1 ${APP_VAR}/nmon_output.txt > ${PIDFILE}
fi
;;
esac
;;
Linux )
# global nmon_external
NMON_EXTERNAL_DIR="${APP_VAR}/var/nmon_repository/${fifo_started}"
export NMON_EXTERNAL_DIR
NMON_EXTERNAL_FIFO="${APP_VAR}/var/nmon_repository/${fifo_started}/nmon.fifo"
export NMON_EXTERNAL_FIFO
TIMESTAMP=0
export TIMESTAMP
NMON_ONE_IN=1
export NMON_ONE_IN
unset NMON_END
# fifo_started variable is exported by the function start_fifo_reader
case $fifo_started in
"fifo1")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo1.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo1.sh"
export NMON_SNAP
;;
esac
nmon_command=${nmon_command_fifo1} ;;
"fifo2")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo2.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo2.sh"
export NMON_SNAP
;;
esac
nmon_command=${nmon_command_fifo2} ;;
esac
# Retrieve the nmon Linux version
# Nmon 16x or superior is required to run disk group statistics
NMON_VERSION=`$NMON -h | sed -n 's/.*[v|V]ersion[^0-9]*\([0-9][0-9]*\).*$/\1/p' | head -1`
# Assume we can fail
case $NMON_VERSION in
"")
# Set a default to 14 in case of identification failure
NMON_VERSION="14" ;;
esac
if [ $NMON_VERSION -ge "16" ]; then
# Activation of Linux disks extended stats generate a message in stdout
# We don't want this as we need to retrieve the pid from nmon output
# However, we also want to analyse the return code, so we can't filter out in only one operation
# Manage exceptions - these systems will not generate data but will start and output an lsblk message
linux_distrib="${linux_vendor}_${linux_mainversion}"
disk_group_option_exception="sles_11"
echo $disk_group_option_exception | grep -o $linux_distrib >/dev/null
if [ $? -eq 0 ]; then
nmon_command=`echo ${nmon_command} | sed "s/-g ${Linux_disk_dg_group} -D//g"`
fi
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command} in ${NMON_EXTERNAL_DIR}"
${nmon_command} > ${APP_VAR}/nmon_output.txt
#echo "${nmon_command}" >> /tmp/nmon_command.txt
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST} ERROR, nmon binary returned a non 0 code while trying to start, please verify error traces in splunkd log (missing shared libraries?)"
fi
# Store the PID file (very last line of nmon output)
if [ -f ${APP_VAR}/nmon_output.txt ]; then
awk 'END{print}' ${APP_VAR}/nmon_output.txt > ${PIDFILE}
fi
# old nmon versions might not be compatible with disks extended stats, or the group file does not exist
# In such a case, echo a WARN, remove the option and last chance start
if grep 'opening disk group file' ${APP_VAR}/nmon_output.txt >/dev/null; then
echo "`log_date`, ${HOST} WARN, nmon disks extended statistics cannot be collected, either this nmon version is not compatible or the disk group file does not exist, see ${APP_VAR}/nmon_output.txt"
nmon_command=`echo ${nmon_command} | sed "s/-g ${Linux_disk_dg_group} -D//g"`
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command} in ${NMON_EXTERNAL_DIR}"
${nmon_command} > ${PIDFILE}
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST} ERROR, nmon binary returned a non 0 code while trying to start, please verify error traces in splunkd log (missing shared libraries?)"
fi
fi
else
# This version is not compatible with the auto group disk
nmon_command=`echo ${nmon_command} | sed "s/-g ${Linux_disk_dg_group} -D//g"`
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command} in ${NMON_EXTERNAL_DIR}"
${nmon_command} > ${PIDFILE}
if [ $? -ne 0 ]; then
echo "`log_date`, ${HOST} ERROR, nmon binary returned a non 0 code while trying to start, please verify error traces in splunkd log (missing shared libraries?)"
fi
fi
;;
SunOS )
# global nmon_external
NMON_EXTERNAL_DIR="${APP_VAR}/var/nmon_repository/${fifo_started}"
export NMON_EXTERNAL_DIR
NMON_EXTERNAL_FIFO="${APP_VAR}/var/nmon_repository/${fifo_started}/nmon.fifo"
export NMON_EXTERNAL_FIFO
TIMESTAMP=0
export TIMESTAMP
NMON_ONE_IN=1
export NMON_ONE_IN
unset NMON_END
# fifo_started variable is exported by the function start_fifo_reader
case $fifo_started in
"fifo1")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo1.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo1.sh"
export NMON_SNAP
;;
esac
NMONOUTPUTFILE="${APP_VAR}/var/nmon_repository/${fifo_started}/nmon.fifo"
export NMONOUTPUTFILE
;;
"fifo2")
case $nmon_external_generation in
1)
# nmon_external
create_nmon_external
NMON_START="${APP_VAR}/bin/nmon_external_cmd/nmon_external_start_fifo2.sh"
export NMON_START
NMON_SNAP="${APP_VAR}/bin/nmon_external_cmd/nmon_external_snap_fifo2.sh"
export NMON_SNAP
;;
esac
NMONOUTPUTFILE="${APP_VAR}/var/nmon_repository/${fifo_started}/nmon.fifo"
export NMONOUTPUTFILE
;;
esac
NMONNOSAFILE=1 # Do not generate useless sa files
export NMONNOSAFILE
# Manage UARG activation, default is on (1)
NMONUARG_VALUE=${Solaris_UARG}
if [ ! -z ${NMONUARG_VALUE} ]; then
if [ ${NMONUARG_VALUE} -eq 1 ]; then
NMONUARG=1
export NMONUARG
fi
fi
# Manage VxVM volume statistics activation, default is off (0)
NMONVXVM_VALUE=${Solaris_VxVM}
if [ ! -z ${NMONVXVM_VALUE} ]; then
if [ ${NMONVXVM_VALUE} -eq 1 ]; then
NMONVXVM=1
export NMONVXVM
fi
fi
echo "`log_date`, ${HOST} INFO: starting nmon : ${nmon_command} in ${NMON_REPOSITORY}"
${nmon_command} >/dev/null 2>&1 &
;;
esac
}
verify_pid() {
givenpid=$1
# Verify proc fs before checking PID
if [ -d /proc/${givenpid} ]; then
case $UNAME in
AIX )
ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | grep $givenpid ;;
Linux )
ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | grep $givenpid ;;
SunOS )
/usr/bin/pwdx $givenpid ;;
esac
else
# Just return nothing
echo ""
fi
}
# Search for running process and write PID file
write_pid() {
# Only SunOS will look for running processes to identify nmon instances
# AIX and Linux will save the pid at launch time
case $UNAME in
SunOS)
# In main priority, use pgrep (no truncation trouble), pgrep should always be available
# whether running on Solaris 10 or 11
if [ -x /usr/bin/pgrep ]; then
PIDs=`pgrep -f ${NMON}`
# Second priority, use BSD ps command with the appropriate syntax (mainly for Solaris 10)
elif [ -x /usr/ucb/ps ]; then
PIDs=`/usr/ucb/ps auxww | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
# Last, use the ps command with BSD style syntax (no -) for Solaris 11 and later
# Solaris 10 cannot use BSD syntax with native ps, hopefully previous options should have been found !
else
if grep 'Solaris 10' /etc/release >/dev/null; then
PIDs=`ps -ef | grep sarmon | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
else
PIDs=`ps auxww | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
fi
fi
for p in ${PIDs}; do
verify_pid $p | grep -v grep | grep ${APP_VAR} >/dev/null
if [ $? -eq 0 ]; then
echo ${PIDs} > ${PIDFILE}
fi
done
;;
esac
}
# Just Search for running process
search_nmon_instances() {
case $UNAME in
Linux)
PIDs=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
;;
SunOS)
PIDs=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
for p in ${PIDs}; do
verify_pid $p | grep -v grep | grep ${APP_VAR} >/dev/null
done
;;
AIX)
case ${AIX_topas_nmon} in
true )
PIDs=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | grep ${NMON_REPOSITORY} | awk '{print $2}'`
;;
false)
PIDs=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
;;
esac
;;
esac
}
start_fifo_reader () {
# Check fifo readers, start if either fifo1 or fifo2 is free
fifo_started="none"
# be portable
running_fifo=`ps auxww | awk '/metricator_reader.py --fifo fifo1/ || /metricator_reader.py --fifo fifo2/ || /metricator_reader.pl --fifo fifo1/ || /metricator_reader.pl --fifo fifo2/' | grep -v awk`
# Initiate
fifo1_running=0
fifo2_running=0
# check fifo1
echo $running_fifo | grep 'fifo1' >/dev/null
if [ $? -eq 0 ]; then
fifo1_running=1
fi
# check fifo2
echo $running_fifo | grep 'fifo2' >/dev/null
if [ $? -eq 0 ]; then
fifo2_running=1
fi
# Start
if [ $fifo1_running -eq 0 ]; then
echo "`log_date`, ${HOST} INFO: starting the fifo_reader fifo1"
case $INTERPRETER in
"perl")
nohup $APP/bin/metricator_reader.pl --fifo fifo1 </dev/null >/dev/null 2>&1 & ;;
"python"|"python3")
nohup $INTERPRETER $APP/bin/metricator_reader.py --fifo fifo1 </dev/null >/dev/null 2>&1 &
pids+=("$!")
#echo "metricator_reader - $!" >> /tmp/pids.txt
;;
esac
echo $! > ${APP_VAR}/var/fifo_reader_fifo1.pid
fifo_started="fifo1"
export fifo_started
elif [ $fifo1_running -eq 1 ] && [ $fifo2_running -ne 1 ]; then
echo "`log_date`, ${HOST} INFO: starting the fifo_reader fifo2"
case $INTERPRETER in
"perl")
nohup $APP/bin/metricator_reader.pl --fifo fifo2 </dev/null >/dev/null 2>&1 & ;;
"python"|"python3")
nohup $INTERPRETER $APP/bin/metricator_reader.py --fifo fifo2 </dev/null >/dev/null 2>&1 &
pids+=("$!")
#echo "metricator_reader - $!" >> /tmp/pids.txt
;;
esac
echo $! > ${APP_VAR}/var/fifo_reader_fifo2.pid
fifo_started="fifo2"
export fifo_started
elif [ $fifo1_running -eq 1 ] && [ $fifo2_running -eq 1 ]; then
echo "`log_date`, ${HOST} INFO: The fifo_reader fifo1 and fifo2 are both running"
fi
}
####################################################################
############# Main Program ############
####################################################################
# Initialize PID variable
PIDs=""
# Initialize nmon status
nmon_isstarted=0
# Check nmon binary exists and is executable
if [ -z "${NMON}" ] || [ ! -x ${NMON} ]; then
echo "`log_date`, ${HOST} ERROR, could not find Nmon binary (${NMON}) or execution is unauthorized"
remove_mutex
exit 2
fi
# cd to root dir
cd ${NMON_REPOSITORY}
# Check PID file, if no PID file is found, start nmon
if [ ! -f ${PIDFILE} ]; then
# search for any App related instances
search_nmon_instances
case ${PIDs} in
"")
start_fifo_reader
sleep 1
start_nmon
sleep 5 # Let nmon time to start
write_pid
remove_mutex
exit 0
;;
*)
echo "`log_date`, ${HOST} INFO: found Nmon running with PID ${PIDs}"
# Retry to write pid file
write_pid
remove_mutex
exit 0
;;
esac
else
# PID file found
SAVED_PID=`cat ${PIDFILE} | awk '{print $1}'`
if [ ${endtime_margin} -gt 0 ]; then
# Initialize PIDAGE to 01 Jan 2000 00:00:00 GMT for later failure verification
EPOCHTEST="946684800"
PIDAGE=$EPOCHTEST
case ${INTERPRETER} in
"perl")
# Use Perl to get PID file age in seconds
perl -e "\$mtime=(stat(\"$PIDFILE\"))[9]; \$cur_time=time(); print \$cur_time - \$mtime;" > ${APP_VAR}/metricator_helper.sh.tmp.$$
;;
"python"|"python3")
# Use Python to get PID file age in seconds
$INTERPRETER -c "import os; import time; now = time.strftime(\"%s\"); print(int(int(now)-(os.path.getmtime('$PIDFILE'))))" > ${APP_VAR}/metricator_helper.sh.tmp.$$
;;
esac
PIDAGE=`cat ${APP_VAR}/metricator_helper.sh.tmp.$$`
rm ${APP_VAR}/metricator_helper.sh.tmp.$$
case $PIDAGE in
"")
echo "`log_date`, ${HOST} WARN: failed to eval the age of the current pid file, gaps may occur between nmon processes run."
PIDAGE=0
;;
esac
# Estimate the end time of current Nmon binary less 4 minutes (enough time for new nmon process to start collecting)
# Use expr for portability with sh
endtime=`expr ${fifo_interval} \* ${fifo_snapshot}`
endtime=`expr ${endtime} - ${endtime_margin}`
fi
case ${SAVED_PID} in
# PID file is empty
"")
echo "`log_date`, ${HOST} INFO: Removing stale pid file (empty file)"
rm -f ${PIDFILE}
# search for any App related instances
search_nmon_instances
case ${PIDs} in
"")
start_fifo_reader
sleep 1
start_nmon
sleep 5 # Let nmon time to start
# Relevant for Solaris Only
write_pid
remove_mutex
exit 0
;;
*)
echo "`log_date`, ${HOST} INFO: found Nmon running with PID ${PIDs}"
# Relevant for Solaris Only
write_pid
remove_mutex
exit 0
;;
esac
;;
# PID file is not empty
*)
case $UNAME in
Linux)
if [ -d /proc/${SAVED_PID} ]; then
istarted="true"
else
istarted="false"
fi
;;
SunOS)
verify_pid ${SAVED_PID} | grep -v grep | grep ${NMON_REPOSITORY} >/dev/null
if [ $? -eq 0 ]; then
istarted="true"
else
istarted="false"
fi
;;
AIX)
if [ -d /proc/${SAVED_PID} ]; then
istarted="true"
else
istarted="false"
fi
;;
esac
case $istarted in
"true")
# Check if outdated state file exists, if so the current instance needs to be terminated and next script execution will spawn a new process
# there is an acceptable risk of 1 minute gaps in metrics
if [ -f $OUTDATED_NETIF_NMON_STATE ]; then
echo "`log_date`, ${HOST} WARN: the current nmon process has an outdated definition of network interfaces. The current process with PID ${SAVED_PID} will be terminated and a new process will be spawned at next occurrence."
kill ${SAVED_PID}
rm -f $OUTDATED_NETIF_NMON_STATE
remove_mutex
exit 0
fi
if [ ${endtime_margin} -gt 0 ]; then
# If the current age of the Nmon process requires starting a new one to prevent data gaps between collections
# Note that the pidfile will be overwritten, so for a few minutes 2 Nmon binaries are running at the same time
# Data duplication will be managed by nmonparser files
# Prevent any failure in determining nmon process age
if [ $PIDAGE -gt $EPOCHTEST ]; then
echo "`log_date`, ${HOST} ERROR: Failed to determine age in seconds of current Nmon process, gaps may occur between Nmon collections"
else
case $PIDAGE in
"")
echo "`log_date`, ${HOST} ERROR: Failed to determine age in seconds of current Nmon process, gaps may occur between Nmon collections"
;;
*)
if [ $PIDAGE -gt $endtime ]; then
echo "`log_date`, ${HOST} INFO: To prevent data gaps between 2 Nmon collections, a new process will be started, its PID will be available on next execution"
start_fifo_reader
sleep 1
start_nmon
sleep 5 # Let nmon time to start
# Relevant for Solaris Only
write_pid
fi
;;
esac
fi
# Process found
echo "`log_date`, ${HOST} INFO: Nmon process is $PIDAGE sec old, a new process will be spawned when this value will be greater than estimated end in seconds ($endtime sec based on parameters)"
fi
# Prevent infinite spawn of nmon external snap processes (in case of unexpected issue)
check_duplicated_external_snap
echo "`log_date`, ${HOST} INFO: found Nmon running with PID ${SAVED_PID}"
remove_mutex
exit 0
;;
"false")
# Process not found, Nmon has terminated or is not yet started
echo "`log_date`, ${HOST} INFO: Removing stale pid file (process not found)"
rm -f ${PIDFILE}
start_fifo_reader
sleep 1
start_nmon
sleep 5 # Let nmon time to start
# Relevant for Solaris Only
write_pid
remove_mutex
# we could just "wait", but if the nmon process exits it will not restart
# if nmon fails/terminates we should exit, allowing the restart to occur
# but we cannot use wait against a pid that is owned by parent id 1
while true; do
nmon_pid=`ps -ef | grep ${NMON} | grep -v grep | grep -v metricator_helper.sh | awk '{print $2}'`
# if we sleep too long we miss the signal and don't terminate fast enough...
sleep 3
# quote the variable so the test still works when several PIDs are returned
if [ -z "$nmon_pid" ]; then break; fi
done
#wait
exit 0
;;
esac
;;
esac
fi
remove_mutex
exit 0
####################################################################
############# End of Main Program ############
####################################################################
Version-specific findings
While this modification can fix the issue on Oracle Linux version 8, be aware that the same unmodified TA might not have any restart-related issues on Oracle Linux version 7 or Oracle Linux version 9. You might notice similar problems with the Splunk Add-on for Unix and Linux.
If you review the journal logs on Oracle Linux 7 servers, you might find no indication of any processes being killed after Splunkd exits. However, the same TA can continue running and spawning processes that aren't children of Splunkd. This suggests that the older systemd version might be performing the kill silently without recording it.
On an Oracle Linux 9 server with systemd version 252, you might see the kill logged and the service moving on immediately (never lingering in the deactivating (final-sigkill) state):
splunkd.log:
01-14-2026 14:10:07.665 +1100 INFO loader [4172766 MainThread] - All pipelines finished.
journal logs:
Jan 14 14:08:45.527901 prod-001 splunk[4172766]: 2026-01-14 14:08:45.527 +1100 splunkd started (build 75595d8f83ef) pid=4172766
Jan 14 14:09:57.957844 prod-001 systemd[1]: Stopping Systemd service file for Splunk, generated by 'splunk enable boot-start'…
Jan 14 14:09:57.958089 prod-001 splunk[4172766]: 2026-01-14 14:09:57.957 +1100 Interrupt signal received sent by PID 1, which is my parent, command="/usr/lib/systemd/systemd rhgb - switched-root - system - deseria…" (UID 0)
Jan 14 14:10:07.686943 prod-001 systemd[1]: Splunkd.service: Killing process 4175265 (python3) with signal SIGKILL.
Jan 14 14:10:07.686977 prod-001 systemd[1]: Splunkd.service: Killing process 4175267 (metricator_read) with signal SIGKILL.
Jan 14 14:10:07.686995 prod-001 systemd[1]: Splunkd.service: Killing process 4175295 (nmon_x86_64_ol9) with signal SIGKILL.
Jan 14 14:10:07.703748 prod-001 systemd[1]: Splunkd.service: Deactivated successfully.
Jan 14 14:10:07.704050 prod-001 systemd[1]: Stopped Systemd service file for Splunk, generated by 'splunk enable boot-start'.
Jan 14 14:10:07.704202 prod-001 systemd[1]: Splunkd.service: Consumed 1min 6.535s CPU time, 1.9G memory peak.
In this example, the service goes from killing processes to Splunkd.service: Deactivated successfully in about 17 milliseconds (14:10:07.686943 to 14:10:07.703748). It even includes a Consumed line that reports CPU time and peak memory.
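To make the same comparison on your own hosts, a small filter along these lines might help. This is a minimal sketch, assuming the unit is named Splunkd.service as in the examples in this article. The stop_events function name and the one-hour window are illustrative choices, not part of any Splunk or systemd tooling.

```shell
#!/bin/sh
# Hypothetical helper (stop_events is an illustrative name): reads journal
# text on stdin and keeps only the lines that describe how systemd finished
# stopping the unit.
stop_events() {
    grep -E 'Stopping |Killing process|Deactivated successfully|Stopped |Consumed '
}

# Example usage on a live host (commented out so the sketch has no side effects):
#   systemctl --version | head -n 1
#   journalctl -u Splunkd.service -o short-precise --since "1 hour ago" | stop_events
```

Running the commented commands on a working host and on a slow host, then comparing the timestamps on the Killing process and Deactivated successfully lines, should show whether the service lingers after the kill.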
Conclusion
This issue appears to be specific to particular systemd versions, in this case the versions shipped with Oracle Linux version 8.
Your systems are only affected by this problematic systemd version if you have TAs or scripts that spawn child processes and do not kill them on termination. Using the Splunk Add-on for Unix and Linux as an example, an iostat.sh script can spawn an iostat command. If that command is running on shutdown of the Splunk platform and the iostat.sh script doesn't trap SIGTERM and stop the command, it can trigger the issue discussed in this article.
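The trap pattern described above can be sketched as follows. This is a minimal illustration, assuming a POSIX shell. The run_collector function is a hypothetical name, and this is not the code shipped in the Splunk Add-on for Unix and Linux or in any other TA.

```shell
#!/bin/sh
# Hypothetical wrapper (run_collector is an illustrative name, not taken from any TA).
# It starts a long-running command in the background and forwards SIGTERM/SIGINT
# to it, so the command does not outlive the script that spawned it.
run_collector() {
    child_pid=""
    # Install the handler before starting the child, so a signal that arrives
    # early still ends the script
    trap 'if [ -n "$child_pid" ]; then kill -TERM "$child_pid" 2>/dev/null; wait "$child_pid" 2>/dev/null; fi; exit 0' TERM INT
    "$@" &
    child_pid=$!
    # wait is interrupted when a trapped signal arrives, which lets the handler run
    wait "$child_pid"
}
```

A script could then call something like run_collector iostat 60 instead of invoking the command directly. Whether this matches how a given add-on launches its commands is something to verify against that add-on's own scripts.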

