Obtaining stacks from a Kubernetes instance
When collaborating with Splunk support about a performance issue - whether it’s related to CPU usage, memory, or general system behavior - they might ask you to supply them with thread call stacks.
In a non-Kubernetes environment, collecting stacks is relatively simple: install the packages that provide the pstack command. Doing the same inside a container is more challenging, especially because Splunk containers run as a non-root user, which prevents use of the package manager.
This article outlines a method you can use to create a container image capable of running the eu-stack command, leveraging a script provided by Splunk Support.
As an alternative method, you can access a running pod as root by combining lsns and nsenter on the worker node. However, any packages installed this way will be lost upon container restart.
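The lsns/nsenter alternative can be sketched as follows. This is a minimal, hypothetical sketch: the helper name build_nsenter_cmd and the PID 12345 are made up for illustration, and the real lsns and nsenter invocations (shown in comments) must be run as root on the worker node.

```shell
#!/usr/bin/env bash
# Sketch of the lsns + nsenter approach (assumption: run as root on the
# worker node hosting the pod; the target PID below is a placeholder).
#
# lsns lists namespaces with the PID that owns them, so you can find the
# host-side PID of splunkd:
#   lsns -t pid -o PID,COMMAND | grep splunkd
# nsenter then joins that PID's mount and PID namespaces, giving a root
# shell "inside" the container:
#   nsenter --target <pid> --mount --pid -- /bin/bash

# Helper that builds the nsenter command line for a given host-side PID.
build_nsenter_cmd() {
  local target_pid="$1"
  echo "nsenter --target $target_pid --mount --pid -- /bin/bash"
}

build_nsenter_cmd 12345
```

Remember that anything installed from this root shell lands in the container's writable layer and disappears when the container restarts.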
How to use Splunk software for this use case
Collect stacks script
This script was supplied by Splunk support. Save this file as collect_stacks.sh:
#!/usr/bin/env bash
SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}
set -e
set -u
#set -x
platform="$(uname)"
if [[ "$platform" != 'Linux' ]]; then
echo "ERROR: This script is only tested on Linux, \`uname\` says this is '$platform'" >&2
exit 1
fi
function usage()
{
echo "Usage: $0 [OPTION]"
echo "Collect stack dumps for splunk support. A good number of samples is in the hundreds, preferably 1000."
echo
echo " -b, --batch Non-interactive mode, doesn't ask questions."
echo " -c, --continuous Collect data continuously, keeping only the latest <samples>"
echo " dump files."
echo " -d, --docker=CONTAINER_ID Collect data from inside docker container CONTAINER_ID."
echo " Remember you must use the PID you see inside the container."
echo " -f, --freeze Freeze process during data collection to obtain consistent snapshots."
echo " This is very disruptive, avoid if possible."
echo " -h, --help Print this message."
echo " -i, --interval=INTERVAL Interval between samples, in seconds. Default is 0.5."
echo " -o, --outdir=PATH Output directory. Default is '/tmp/splunk'"
echo " -p, --pid=PID PID of process. Default is to use main splunk process if"
echo " SPLUNK_HOME is set or splunk is under '/opt/splunk'"
echo " -q, --quiet Silent mode, output no messages after parameter"
echo " confirmation (or none at all in batch mode)."
echo " -r, --rest Use Splunk's own '/services/server/pstacks' endpoint instead of an"
echo " external tool like eu-stack for data collection. This will require"
echo " entering valid username/password credentials for a user with the"
echo " 'request_pstacks' capability."
echo " -s, --samples=COUNT Number of samples. Aim for more than 100. Default 1000."
}
#
# Handling command-line arguments
#
batch=0
continuous=0
container=''
freeze=0
interval=0.5
outdir='/tmp/splunk'
pid=''
rest=0
quiet=0
samples=1000
while [ "$#" != "0" ] ; do
case "$1" in
-b|--batch) batch=1;;
-c|--continuous) continuous=1;;
-d|--docker)
shift
container="$1"
;;
-f|--freeze) freeze=1;;
-h|-\?|--help) usage; exit;;
--interval=*) interval=${1#*=};;
-i|--interval)
shift
if [ "$#" != "0" ]; then interval="$1"; else interval=''; fi
;;
--outdir=*) outdir=${1#*=};;
-o|--outdir)
shift
if [ "$#" != "0" ]; then outdir="$1"; else outdir=''; fi
;;
--pid=*) pid=${1#*=};;
-p|--pid)
shift
if [ "$#" != "0" ]; then pid="$1"; else pid=''; fi
;;
-q|--quiet) quiet=1;;
-r|--rest) rest=1;;
--samples=*) samples=${1#*=};;
-s|--samples)
shift
if [ "$#" != "0" ]; then samples="$1"; else samples=''; fi
;;
*)
echo "ERROR: invalid option '$1'" >&2
exit 1
;;
esac
if [ "$#" != "0" ]; then shift; fi
done
if [ "$rest" == "1" ]; then
if [ ! -x "$SPLUNK_HOME/bin/splunk" ]; then
echo "ERROR: can't execute $SPLUNK_HOME/bin/splunk" >&2
exit 1
fi
echo "Please provide credentials with the request_pstacks capability:"
if ! "$SPLUNK_HOME/bin/splunk" login; then
echo "ERROR: can't proceed without valid credentials, aborting" >&2
exit 1
fi
fi
nsenter_prefix=""
if ! [ -z "$container" ]; then
set +e
err="$(docker inspect --format {{.State.Pid}} "$container" 2>&1)"
if [ "$?" != "0" ]; then
echo "ERROR: invalid value for --docker option, are you sure your container ID is running? '$container': $err" >&2
exit 1
fi
nsenter_prefix="nsenter --target $err --mount --pid"
fi
if ! [[ "$interval" =~ ^[0-9]*(|\.[0-9]*)$ && "$interval" =~ [1-9] ]]; then
echo "ERROR: invalid value for --interval option, '$interval'" >&2
exit 1
fi
function run() {
if [ -z "$nsenter_prefix" ]; then
eval "$@"
else
$nsenter_prefix bash -c "$*"
fi
}
set +e
err="$(mkdir -p "$outdir" 2>&1)"
if [ "$?" != "0" ]; then
echo "ERROR: invalid value for --outdir option, '$outdir': $err" >&2
exit 1
fi
set -e
if [ -z "$pid" ]; then
set +e
pid="$(run head -n1 "$SPLUNK_HOME/var/run/splunk/splunkd.pid" 2>/dev/null)"
set -e
fi
if [ -z "$pid" ]; then
echo "ERROR: pid not specified and could not infer main splunkd server process id from SPLUNK_HOME='$SPLUNK_HOME'." >&2
exit 1
fi
if [[ ! "$pid" =~ ^[1-9][0-9]*$ ]]; then
echo "ERROR: pid must be a positive integer, not '$pid'." >&2
exit 1
fi
set +e
err="$(run kill -0 $pid 2>&1)"
if [ "$?" != "0" ]; then
echo "ERROR: not able to get data about PID $pid; wrong pid or missing sudo? Attempt to read returned '${err#*- }.'"
exit 1
fi
[[ ! "$(run readlink -f /proc/$pid/exe)" =~ splunkd$ ]]
readonly isSplunkd=$?
set -e
if [[ ! $samples =~ ^[1-9][0-9]*$ ]]; then
echo "ERROR: number of samples must be a positive integer, not '$samples'." >&2
exit 1
fi
#
# Now check what we'll use for stack collection
#
set +e
cmd=(eu-stack -lip PID)
if [ "$rest" == "1" ]; then
cmd=("$SPLUNK_HOME/bin/splunk" _internal call /services/server/pstacks -get:output_mode json)
elif [ "${FORCE_PSTACK:-0}" == "1" ]; then
cmd=(pstack PID)
elif [ "${FORCE_GDB:-0}" == "1" ]; then
cmd=(gdb -batch -n -ex "'thread apply all bt'" -p PID)
echo "*******************"
echo -e "WARNING: Use of GDB is being enforced because FORCE_GDB=$FORCE_GDB -- this is not recommended, please avoid if at all possible."
echo "*******************"
fi
if ! run type ${cmd[0]} > /dev/null; then
extra_help=''
if [[ ${cmd[0]} == "eu-stack" ]]; then
if [ -z "$nsenter_prefix" ]; then
extra_help=" Please install the 'elfutils' package in your system."
else
extra_help=" Please install the 'elfutils' package by running:\n docker exec $container bash -c 'sudo apt update && sudo apt install -y elfutils'."
fi
fi
if [ -z "$nsenter_prefix" ]; then
echo "ERROR: ${cmd[0]} is unavailable!$extra_help" >&2
else
echo -e "ERROR: ${cmd[0]} is unavailable in container with id=$container!$extra_help" >&2
fi
exit 7
fi
set -e
if [ "$batch" == "0" ]; then
echo "Parameters:"
echo " SPLUNK_HOME='$SPLUNK_HOME'"
echo " --batch=$batch"
echo " --continuous=$continuous"
echo " --docker=$container"
echo " --interval=$interval"
echo " --outdir='$outdir'"
echo " --pid=$pid"
echo " --samples=$samples"
echo
if [[ $samples -lt 100 ]]; then
read -p "Number of samples should really be at least 100 -- are you sure you want to continue? (y/n) " choice
else
read -p "Do you wish to continue? (y/n) " choice
fi
case "$choice" in
y|Y ) echo;;
* ) exit 0;;
esac
fi
function printout() {
if [ $quiet -eq 0 ]; then
echo "$@"
fi
}
function printstatus() {
if [ $quiet -eq 0 ] || [ "$batch" == "0" ]; then
printf "Completion status: % ${#2}d/$2\r" $1
fi
}
function printerr() {
if [ $quiet -eq 0 ]; then
echo "$@" >&2
fi
}
function timestamp() { date +'%Y-%m-%dT%Hh%Mm%Ss%Nns%z'; }
function collect_proc() {
set +e
run 'for d in /proc/'$pid' /proc/'$pid'/task/*; do
lwp=$(basename "$d");
echo "Thread LWP $lwp";
cat "$d/'$1'";
done'
set -e
}
# If the user aborts the script midway through data collection, we still want
# to zip up the results.
subprocesses=''
function wait_subprocesses_revive_pid() {
# `wait` won't work because we use setsid for stack collection, so we
# improvise in a very ugly way
for proc in $subprocesses; do
run "while [ -e /proc/$proc ]; do sleep 0.1; done"
done
if run [ -e /proc/$pid ] && run grep -q "[[:space:]]*State:[[:space:]]*[Tt]" "/proc/$pid/status" >/dev/null 2>&1; then
kill -CONT $pid
fi
}
function archive_on_abort() {
trap '' SIGINT
echo "** Trapped CTRL-C. Archiving partial results. Please wait. **"
wait_subprocesses_revive_pid
archive
}
trap archive_on_abort INT
# zip up the results
function archive() {
local outroot="$(dirname "$outdir")"
local outleaf="$(basename "$outdir")"
local archive="$outdir.tar.xz"
tar --remove-files -C "$outroot" -cJf "$archive" "$outleaf"
printout "Stacks saved in $archive"
if [ "$rest" == "1" ]; then
"$SPLUNK_HOME/bin/splunk" logout
fi
}
outdir="$outdir/stacks-${pid}-${HOSTNAME}-$(timestamp)"
mkdir -p "$outdir"
collect_proc "maps" >"$outdir/proc-maps.out" 2>"$outdir/proc-maps.err"
declare -a suffixes
for ((i=0; $i < $samples; i = $continuous ? ($i+1)%$samples : $i+1)); do
if run [ ! -e /proc/$pid ]; then
printerr $'\n'"Process with pid=$pid no longer available, terminating stack dump collection."
break;
fi
printstatus $i $samples
suffix="$(timestamp)"
if [ "${suffixes[$i]+isset}" ]; then
rm -f "$outdir"/*"${suffixes[$i]}".{out,err}
fi
suffixes[$i]=$suffix
# collect application stack
stackdump_fname="$outdir/stack-$suffix"
if [ "$rest" != "1" ]; then
# TODO: use quiet mode (-q) and use `eu-nm` to collect symbol location
# for all libraries in /proc/<pid>/maps that are outside of $SPLUNK_HOME
# that way collection for each dump should take ~50ms and we can resolve
# all symbols at home -- kinda like jeprof does it
cmd[${#cmd[@]}-1]=$pid
if [ "$freeze" != "0" ]; then
kill -STOP $pid
fi
# use setsid to isolate subprocess from signals (like SIGINT)
fi
run setsid "${cmd[@]}" >"$stackdump_fname.out" 2>"$stackdump_fname.err" &
subprocesses=$!
# collect /proc/<pid> information for all tasks
for f in stack status; do
fname="$outdir/proc-$f-$suffix"
collect_proc "$f" >$fname.out 2>$fname.err &
done
wait
# wait for application stack program to wrap up
wait_subprocesses_revive_pid
grep_cmd=(grep -v 'no matching address range\|No DWARF information\|No such process' "$stackdump_fname.err")
if [ -s "$stackdump_fname.err" ] && "${grep_cmd[@]}" >/dev/null; then
printout $'\n'"--- Possibly harmless STDERR from \`${cmd[*]}\`:"
printout "$("${grep_cmd[@]}")"
fi
if ! grep -q "TID\|LWP\|services/server/pstacks" "$stackdump_fname.out" >/dev/null 2>&1; then
printerr $'\n'"ERROR: latest stack dump ($stackdump_fname.out) doesn't contain any thread information! Please try running manually and check output:" >&2
printerr " ${cmd[*]}" >&2
exit 1
fi
if [ "$isSplunkd" != "0" ] && ! grep -qi 'Thread.*main' "$stackdump_fname.out" >/dev/null 2>&1; then
printerr $'\n'"ERROR: latest stack dump ($stackdump_fname.out) doesn't contain 'Thread's or 'main()' call, which is very unexpected for splunkd! Please try running manually and check output:" >&2
printerr " ${cmd[*]}" >&2
fi
if run [ ! -e /proc/$pid ]; then
break;
fi
sleep $interval
done
printout
archive
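The script defaults to 1000 samples at 0.5-second intervals, so before starting a run it helps to estimate how long collection will take: the minimum run time is roughly samples × interval, plus per-sample eu-stack overhead. A quick sketch:

```shell
# Estimate the minimum wall-clock time for a collection run.
samples=1000
interval=0.5   # seconds; the script's default
duration=$(awk -v s="$samples" -v i="$interval" 'BEGIN { print s * i }')
echo "At least ${duration}s of collection time"   # 1000 x 0.5s -> 500s minimum
```

In practice each eu-stack invocation adds tens of milliseconds or more per sample, so budget extra time on a busy host.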
Creating the debug image
Your first approach to creating the debug image might be to base it directly on the splunk/splunk:9.3.3 image, but this can trigger the Splunk Ansible automation, which attempts to launch a full Splunk instance.
Instead, you can use the docker build command to create a new debug image based on the same operating system family as the Splunk platform version you are running. The example below is based on Splunk platform 9.3.3, but the steps should remain applicable to future releases as long as you update the version references in the Dockerfile.
The eu-stack command needs the /opt/splunk/bin/splunkd binary and its shared libraries to resolve symbols. You could copy the entire /opt/splunk/ directory, but it is quite large; instead, the Dockerfile below copies only splunkd and the shared libraries from /opt/splunk.
Dockerfile:
# ---- Stage 1: Extract Splunk from the official image ----
FROM splunk/splunk:9.3.3 AS splunksrc
FROM redhat/ubi8-minimal
# Switch to root user to install packages
USER root
# Install required tools
RUN microdnf install -y tar util-linux elfutils xz bash procps-ng && \
microdnf clean all
RUN groupadd -g 41812 splunk && \
useradd -u 41812 -g 41812 -d /home/splunk -m -s /bin/bash splunk
# Copy Splunk installation from the official image
# Use --chown to ensure correct ownership in the final image
COPY --from=splunksrc --chown=splunk:splunk /opt/splunk/bin/splunkd /opt/splunk/bin/splunkd
COPY --from=splunksrc --chown=splunk:splunk /opt/splunk/lib/*.so* /opt/splunk/lib/
# Copy the script and set permissions
COPY collect_stacks.sh /opt/splunk/collect_stacks.sh
RUN chmod +x /opt/splunk/collect_stacks.sh
CMD [ "/sbin/init" ]
Building the image
Run the following command to build the image. The build-arg options are only needed when Docker must reach the internet through a proxy; if you're not using a proxy, omit them:
docker build --build-arg http_proxy=proxy --build-arg https_proxy=proxy -t harborlocal/splunk/splunk-debug:9.3.3 .
The -t option in the build command already tagged the image for a local Harbor repository; push it with the command:
docker push harborlocal/splunk/splunk-debug:9.3.3
Gathering stacks from an instance
The kubectl debug command can be used to launch a container in the Splunk platform pod. The example below uses the splunk-test-cluster namespace and names the debug container “debugger”:
kubectl debug splunk-test-idxc-site1-indexer-0 --profile=general -n splunk-test-cluster --image=harborlocal/splunk/splunk-debug:9.3.3 --target=splunk -c debugger -it -- /bin/bash
Because the debug container runs as a non-root user, you cannot access the file system of the running Splunk platform container. From the debug container, run a ps command to find the main splunkd (mothership) process, or in the Splunk platform instance run:
head -n 1 /opt/splunk/var/run/splunk/splunkd.pid
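If you use ps instead, the main process is the splunkd whose parent is not itself another splunkd. The sketch below demonstrates that filter against simulated ps output; in the debug container you would pipe real `ps -eo pid,ppid,comm` output instead, and the PIDs shown here are made up:

```shell
# Simulated `ps -eo pid,ppid,comm` output for illustration only:
ps_output="  PID  PPID COMM
    1     0 tini
 6780     1 splunkd
 6912  6780 splunkd
 7001  6780 sh"
# Keep the splunkd row whose parent PID is not itself a splunkd PID:
# that is the main (mothership) process.
main_pid=$(echo "$ps_output" | awk '
  $3 == "splunkd" { pids[$1] = 1; parent[$1] = $2 }
  END { for (p in parent) if (!(parent[p] in pids)) print p }')
echo "$main_pid"   # -> 6780
```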
When you have the pid, you can run the collect_stacks.sh script. The command below collects 100 stacks in batch mode; if you supply only the pid argument, the default settings are used:
/opt/splunk/collect_stacks.sh --pid 6780 -s 100 -b
After collection finishes, copy the resulting archive out of the debug container from your workstation:
kubectl cp -n splunk-test-cluster splunk-test-idxc-site1-indexer-0:"/tmp/splunk/stacksfile" /tmp/stackfiles.xz -c debugger
The copy command is provided because after you exit the debug container, it will be terminated and its temporary file system will be lost.
You can try cdebug as an alternative to kubectl debug, but because it also lacks root access, it cannot mount the Splunk platform container's file system.
At this point, you have a stacks file you can upload to Support.
Uploading the file to support
Copy the file to a running Splunk instance, then run:
/opt/splunk/bin/splunk diag --upload-file= --case-number= --upload-user= --upload-description="`hostname` diag"
As an alternative, you might be able to script this with:
/opt/splunk/bin/splunk cmd rapidDiag upload --upload_description="`hostname` diag" --auth user:password
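If you script the upload, a small wrapper keeps the placeholders explicit. This is a hypothetical sketch: the function name, case number, user, and file path are made-up values, and the command is echoed rather than executed so you can review it before running it for real:

```shell
# Hypothetical wrapper around the diag upload command; all argument
# values below are placeholders, not real case details.
upload_diag() {
  local file="$1" case_number="$2" user="$3"
  # Echo instead of executing so the final command can be reviewed first;
  # remove the leading echo to actually upload.
  echo /opt/splunk/bin/splunk diag \
    --upload-file="$file" \
    --case-number="$case_number" \
    --upload-user="$user" \
    --upload-description="$(hostname) diag"
}

upload_diag /tmp/stackfiles.xz 1234567 you@example.com
```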
Next steps
These resources might help you understand and implement this guidance:
- Splunk GitHub: Splunk Operator for Kubernetes
- Splunk Lantern Article: Splunk Operator for Kubernetes

