
Splunk Operator for Kubernetes: Advanced operational learnings

 

This article provides insights and lessons learned from a year of working with the Splunk Operator for Kubernetes (SOK). For an overview of initial implementation experiences and common challenges, see Splunk Operator for Kubernetes: Initial implementation learnings.

If you are interested in understanding how to use the Splunk Operator for Kubernetes (SOK) to create Splunk indexer clusters running on a new Kubernetes environment, see Enabling access between Kubernetes indexer clusters and external search heads. If you are interested in the benefits that can be achieved at the indexing tier using the SOK, see Improving hardware utilization by running multiple indexers on bare metal servers.

Lessons learned from the Splunk Operator for Kubernetes

Lesson 1 - Using appropriate probe timeouts

The Splunk Operator for Kubernetes documentation outlines default values for each probe, which are generally suitable for instances with light workloads. However, in a production environment, indexers under heavy load became unresponsive on port 8089, causing both liveness and readiness probes to fail.

When the readiness probe fails, Kubernetes stops routing traffic to the pod, which reduces ingestion traffic and is a relatively harmless outcome. However, a liveness probe failure triggers a kill signal, leading to an unclean shutdown of the Splunk instance. This type of shutdown can corrupt buckets, which can be challenging to repair in a SmartStore environment.

The following procedure repairs corrupt buckets and uploads the repaired buckets back into SmartStore.

If you are unsure about any of these steps, I suggest contacting Splunk support, as this is a non-trivial procedure.

  1. Take a copy of the corrupt bucket. I usually copy the bucket from the live production instance to another Splunk instance where I can run fsck/repair.
  2. Exit the Splunk instance and repair the bucket. I do this on a remote server; if the repair succeeds, continue to step 3:
    splunk fsck repair --one-bucket --bucket-path=/path/to/bucket/theindexname/db/db_1654566842_1646714175_137_4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3/
  3. On the live indexer in question, freeze the local or corrupt bucket:
    curl -ku admin https://localhost:8089/services/data/indexes/theindexname/freeze-buckets -d bucket_ids=theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3
  4. Check remote storage to confirm deletion. This should show no results if frozen remotely:
    splunk cmd splunkd rfs -- ls --starts-with bucket:theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3
  5. If the bucket exists on other members (note this should not be required), then run this curl command on the cluster manager which will remove the bucket from all members:
    curl -ku admin https://localhost:8089/services/cluster/master/buckets/theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3/remove_all -X POST
  6. Copy the repaired bucket to the chosen live indexer, start the platform, then run the following command to update /opt/splunk/var/run/splunk/cachemanager_upload.json:
    curl -ku admin https://localhost:8089/services/admin/cacheman/_bulk_register -d cache_id="bid|theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3|" -X POST
    
  7. Restart the indexer again. This second restart will trigger the upload to SmartStore. At this point the corruption error should have stopped and you have a repaired bucket that will be uploaded to S3 again.

Other notes

If you want to back up the raw data, use:

index = theindexname | eval bkt=_bkt | search bkt=theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3

If you want to evict the bucket from the cache, use:

curl -ku admin: -X POST https://localhost:8089/services/admin/cacheman/"bid|theindexname~137~4A5E2AB2-C1F5-5ABD-1560-2D93211C82E3|"

To mitigate probe failure, I modified the liveness probe so that it would not fail unless the indexer is truly down.

For indexers with a large number of buckets, the startup probe sometimes fails. The settings I configured for these indexers are as follows:

startupProbe:
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 40 # default 12
livenessProbe:
  failureThreshold: 30

The cluster managers do not need similarly generous timeouts and failure thresholds. The settings I used for the cluster managers are:

startupProbe:
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 30 # default 12
livenessProbe:
  failureThreshold: 14

A side effect of these settings is that if an indexer crashes, the liveness probe will wait for 30 consecutive failures at 30-second intervals (totaling 900 seconds) before the pod is restarted.

SOK version 3.0.0 includes an enhancement for custom probes. You can configure the liveness probe to run /opt/splunk/bin/splunk status to confirm if the instance is truly running. The readiness probe can continue checking port 8089 to validate a ready state.
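
As an illustration of that idea, a custom probe of this kind might look roughly like the following in standard Kubernetes terms. Treat this as a sketch rather than the operator's documented syntax; the exact fields exposed by the SOK 3.0.0 CRDs may differ.

livenessProbe:
  exec:
    # Assumption: splunk status exits 0 only when splunkd is running
    command:
      - /opt/splunk/bin/splunk
      - status
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 30
readinessProbe:
  # Keep validating the ready state on the management port
  tcpSocket:
    port: 8089
  periodSeconds: 30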

Any changes to probes will trigger a rolling restart of the pods, so you should aim to get the probe values right before going live.

Lesson 2 - Reducing DNS failure

While DNS generally works well, occasional DNS resolution failures can occur. A DNS failure at the wrong moment can leave the cluster manager in a state where the search factor and replication factor can never be met. Restarting the cluster manager consistently resolves this state, and using a local DNS cache helps prevent it from occurring in the first place.

Kubernetes has a node-local DNS option (NodeLocal DNSCache), with the YAML available in the Kubernetes GitHub repository. Implementing the node-local cache reduced our DNS traffic and improved the reliability of the indexer cluster. Additionally, Spegel was implemented, which, while not fixing any issues, did reduce the time to restart Splunk pods by allowing nodes to pull container images from peers in the cluster rather than an external registry.

Lesson 3 - Monitoring

You can retrieve operator logs using:

kubectl logs -n splunk-operator deployment/splunk-operator-controller-manager -c manager

However, it's often better to access these logs within the Splunk platform for alerting purposes. I did this using the Splunk OpenTelemetry Collector, installed via Helm with the following overrides:

logsEngine: otel
logsCollection:
  containers:
    useSplunkIncludeAnnotation: true
clusterReceiver:
  enabled: true
  eventsEnabled: true

When combined with the splunkPlatform settings, this configuration allows for the collection of logs specifically from namespaces or pods tagged with the "splunk.com/include" annotation, as detailed on the OTel advanced configuration page. This setup ensures that pod logs are indexed exclusively from designated namespaces or pods.
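
For reference, the splunkPlatform side of those Helm values might look something like the sketch below; the endpoint, token, and index shown are placeholders rather than values from my environment.

splunkPlatform:
  endpoint: https://hec.example.com:8088/services/collector  # placeholder HEC endpoint
  token: 00000000-0000-0000-0000-000000000000                # placeholder HEC token
  index: kubernetes_logs                                     # placeholder index for pod logs

Namespaces or pods are then opted in by tagging them with the splunk.com/include annotation described above.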

The clusterReceiver guarantees that Kubernetes events are indexed. In contemporary Kubernetes versions, this is equivalent to: kubectl events -A

For older Kubernetes versions where the above command doesn't work, an alternative is: kubectl get events -A --sort-by=.metadata.creationTimestamp

As we were also the administrator of the Kubernetes environment, a Splunk universal forwarder was deployed on the host OS to collect journal logs for Kubernetes services (which, in my setup, run under systemd):

[journald://kubernetes_services]
journalctl-include-fields = PRIORITY,_SYSTEMD_UNIT,MESSAGE
journalctl-filter = _SYSTEMD_UNIT=etcd.service + _SYSTEMD_UNIT=kube-apiserver.service + _SYSTEMD_UNIT=kube-controller-manager.service + _SYSTEMD_UNIT=kube-scheduler.service + _SYSTEMD_UNIT=kubelet.service + _SYSTEMD_UNIT=kube-proxy.service
index = monitoring
sourcetype = journald

[journald://oom_errors]
journalctl-include-fields = PRIORITY,_SYSTEMD_UNIT,MESSAGE
journalctl-grep = Kill
journalctl-priority=3
index = monitoring
sourcetype = journald

The OOM errors are useful for tracking when the OOM killer is active on any node.

For OS-level monitoring, we used the OTel host metrics receiver through the Splunk Add-On for OpenTelemetry Collector on the universal forwarder. The Metricator application for Nmon can also be used for OS monitoring.

Lesson 4 - Using a defaultsUrl instead of inline defaults

While a minor improvement, I found that placing default values inline within the CRD files caused a rolling restart of all pods whenever any configuration was modified. By utilizing defaultsUrl instead, the ConfigMap could be updated without initiating a restart.

defaultsUrl: /mnt/defaults/default.yml

The example above can be paired with a volume setting, and the ConfigMap simply provides defaults for the Ansible playbook:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ansible-defaults-site1
  namespace: example
data:
  default.yml: |
    splunk:
      multisite_master: splunk-example-cm-cluster-manager-service
      site: site1
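
To show how the two pieces fit together, the CRD spec might mount that ConfigMap as a volume named defaults, which the SOK mounts under /mnt/<name> and which makes the /mnt/defaults/default.yml path above resolvable. The resource name below is hypothetical, and the apiVersion may vary with your SOK version.

apiVersion: enterprise.splunk.com/v4
kind: ClusterManager
metadata:
  name: example-cm
  namespace: example
spec:
  defaultsUrl: /mnt/defaults/default.yml
  volumes:
    - name: defaults          # mounted by the operator at /mnt/defaults
      configMap:
        name: ansible-defaults-site1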

Lesson 5 - Extremely slow pod performance

For an extended period, we experienced a performance issue affecting indexer pods, which in turn impacted our entire Splunk environment. The problem presented with multiple symptoms, and despite opening support cases with both Red Hat and our hardware vendor, the root cause remained unclear.

Symptoms

The symptoms observed for this problem included:

  • High system CPU, often between 55% and 99.9% system CPU on the Linux server.
  • iostat would show 100% (or even 100.10%) disk utilization on multiple disks, but the disk writes across all involved NVMe drives (mdraid with RAID level 0) would show 0 writes during the issue.
  • Attempting to launch a shell inside the indexer pod would freeze (blank screen) until the issue subsided.
  • Zombie processes would increase at the host level during the issue and disappear when the issue subsided.
  • Searches would continue to involve the problematic indexer, as the indexer was still online but responding extremely slowly.
  • At the Splunk level, the search head experienced long handoff times when initiating searches and took noticeably longer to return results. From the user’s perspective, dashboard panels appeared to be nearly loaded but remained in that state for minutes rather than seconds. This behavior led to search head clusters hitting concurrency limits, resulting in skipped searches — and a frustrated user base.

Restarting the Splunk indexer would fully resolve the issue, provided the indexer eventually responded to the offline or stop request, which could itself take several minutes.

The elevated system CPU usage typically persisted for about 10 minutes before often subsiding. However, as the workload increased, the issue began to occur for longer durations.

To measure the issue in real time I used top and iostat. I also tried iotop and the Red Hat eBPF tools (bcc-tools), and considered tools such as glances or htop, but these tools did not reveal any new information. While a bcc tool such as ext4slower could identify slow I/O transactions, all I/O was slow during the problem.

Our first attempt to solve this issue

After removing swap from the hosts “just in case”, we attempted to mitigate the issue by enabling the Slow peer disconnect feature in Splunk. Within the limits.conf file the slow_peer_disconnect stanza mentions:

# This stanza contains settings for the heuristic that will detect and
# disconnect slow peers towards the end of a search that has returned a
# large volume of data.

This solution does not apply to the "handoff" time, which involves the indexer in the initial phase of a search. Unfortunately, even though we accepted that we wouldn’t have perfectly accurate search results in a failure scenario, this option did not help.

Tracing the performance issue

You can use the limits.conf to enable search_metrics:

[search_metrics]
debug_metrics = true

This setting can be dynamically enabled or disabled by updating the limits.conf file on the filesystem, requiring no reload or restart.

debug_metrics shows statistics on a per-indexer basis, including the "handoff" time, which is normally a single value across all indexers. However, if you have a large number of peers, the job inspector becomes unreadable due to the large number of responses.

In the Alerts for Splunk Admins application (Splunkbase or GitHub) the below search provides the handoff time per-indexer:

SearchHeadLevel - Job performance data per indexer handoff time

Or for more detailed information, you can use:

SearchHeadLevel - Job performance data per indexer

These searches use the jobs endpoint to help view performance per indexer; the former only works with debug_metrics enabled.

Clarifying the issue

I eventually identified the issue as being related to memory usage within the cgroup or container. We have memory limits on all indexer pods, for example, 150GB on the larger indexers.

When the memory limit is exceeded, the OOM killer terminates the process. However, if the pod approaches the limit, I suspect the Linux system initiates an aggressive memory reclamation resulting in degraded performance.

This behavior is similar to how a Linux system thrashes its swap file when memory is scarce. The key difference here is that there is no swap file within the pod, and swap on the host was proactively disabled as a preventative measure.

At the OS level, this behavior is largely hidden; the only observable symptoms are elevated system CPU usage and the pod becoming almost unresponsive. It's only by inspecting the cgroup that you can observe the cache availability dropping and other activity consistent with a low-memory condition in Linux.

The issue isn’t exclusive to Kubernetes, as I’ve encountered similar behavior on Linux servers utilizing swap. Based on my research, there’s little documentation addressing performance degradation within pods or Linux cgroups under memory stress. While I’ve made some assumptions about the underlying cause, the implemented solution seems effective, which leads me to believe memory pressure is the root cause.

Solution discussion

My environment uses cgroups v1. Although a memory pressure file (memory.pressure_level) exists within the cgroup, it can't be read directly; you must subscribe to it via the cgroup notification API.

This GitHub gist provides an example C implementation to subscribe to memory pressure events. A C example for a cgroup event listener also exists, and there is also a python equivalent.

Since memory pressure appeared to be challenging to measure, I looked at the memory.limit_in_bytes and memory.usage_in_bytes files that cgroups also provide. While limit minus usage is a simple calculation, it is not a useful measure because the usage figure includes the page cache, which Linux can reclaim when needed.

Fortunately, the cgroups directory contains a memory.stat file with numerous metrics in plain text. A few relevant metrics included:

cache 110069547008
rss 42799886336
pgpgin 192985903033
pgpgout 192948581417
pgfault 205169464127
pgmajfault 1545206

Since the page faults, page-in, and page-out metrics seemed to be cumulative counters, I opted to use the available cache value instead, as it offered a more straightforward approach.

cgroupsv2 seems to provide even more statistics, which might simplify this process.

In Kubernetes 1.28 and later, the default OOM kill behavior changes under cgroups v2: the OOM killer terminates the entire process tree inside the container rather than a single process. This workaround allows the old behavior of not killing the entire process tree, and Kubernetes 1.32 introduces a kubelet configuration option named singleProcessOOMKill that restores the previous behavior.
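
As a hedged sketch only, the kubelet side of that flag might be set in a KubeletConfiguration file along these lines on Kubernetes 1.32 and later; verify the field against the documentation for your version before relying on it.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Kill only the offending process rather than the whole container process tree
# (restores the pre-1.28 behavior under cgroups v2)
singleProcessOOMKill: true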

Solution

This bash script was developed to address the issue and is configured for use with containerd. If you’re using a different container runtime, you might need to replace nerdctl with ctr or an equivalent command.

The code below:

  • Looks for any Splunk containers.
  • For each Splunk container, it identifies its cgroup directory.
  • If the memory.stat file shows the cache as below the threshold (800MB in the below code), it then finds the largest search process. If there is enough memory, the loop continues to the next container or exits.
  • After logging the details of the largest search process, it sends a kill signal to that process.
  • After a 2-second sleep, the memory check re-runs and it kills a process if required or exits the loop.

The while loop is included because I found that when an indexer pod nears its memory limit, terminating the largest search alone often isn’t sufficient for recovery. The loop ensures the pod’s memory usage drops back below the threshold.

In my environment, this script is configured to run every 20 seconds to prevent issues:

#!/bin/bash

# Log file
LOG_FILE="/opt/splunkforwarder/var/log/splunk/splunk_oom_killer.log"
MAX_LOG_SIZE=$((2 * 1024 * 1024))  # 2MB in bytes
# MB at which point we kill searches due to a potential issue
low_threshold=800

# Logging function
log() {
    local message="$1"
    if [ -f "$LOG_FILE" ]; then
        local log_size
        log_size=$(stat -c%s "$LOG_FILE")
        if [ "$log_size" -ge "$MAX_LOG_SIZE" ]; then
            mv "$LOG_FILE" "${LOG_FILE}.1"
        fi
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $message" | tee -a "$LOG_FILE"
}

cache_check() {
    local dir="$1"
    # use cache as the proxy measure for memory that can be freed if required
    # memory.limit_in_bytes - memory.usage_in_bytes did not prove to be a good measure
    # memory.pressure_level will work well but you must be "notified" of the issue, you cannot just read it as a text file
    # for now keeping this simple and just watching the cache
    cache=`grep "^cache" $dir/memory.stat | awk '{ print $2 }'`
    megabytes=$(awk "BEGIN {printf \"%.0f\", $cache / (1024 * 1024)}")
    log "Memory in cache: $megabytes MB for $id and instance $instance"
    # if we have less than $low_threshold MB remaining, prepare for process killing
    if [ $megabytes -lt ${low_threshold} ]; then
      log "Container id=$id, instance $instance is below the available cache of ${low_threshold} currently with memory=$megabytes MB available"
      pid=`ps -eo pid,rss,command --sort=-rss --cols 1400 | grep -E "search-launcher|search --id" | grep -Ff $dir/cgroup.procs | head -n 1 | awk '{ print $1 }'`
      if [ "x$pid" = "x" ]; then
        log "Unable to find a pid to kill -- $pid"
        return 0
      fi      
      log "Killing $pid -- `ps -o user,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args -p $pid --cols 1400`"
      kill $pid
      return 1
    fi
    return 0
}

#ctr shows all containers, nerdctl is showing running containers
#list=`ctr -n k8s.io containers list | grep "splunk:" | awk '{ print $1 }'`
nerdctl -n k8s.io ps --no-trunc | grep "splunk:" | awk '{ print $1 " " $NF }' > /tmp/oom_killer_pod_ids.txt
while read -r id instance; do
  # find any Splunk pods and check memory limits
  dir=`find /sys/fs/cgroup/memory -name "*$id"`
  if [ "x$dir" = "x" ]; then
      log "Unable to find directory for $id, continue"
      continue
  fi
  while true; do
      cache_check "$dir"
      if [ $? -eq 0 ]; then
        break  # Exit the loop if cache_check succeeds
      fi
      # give the system time to update the stats
      sleep 2
  done
done < /tmp/oom_killer_pod_ids.txt

chown splunk:splunk $LOG_FILE

Since deploying this solution, indexer overload caused by memory exhaustion has stopped completely.

SmartStore downloads still present challenges. Unfortunately, a solution for this is not currently available beyond increasing cache sizes or reducing workload.

To further prevent memory issues, the built-in memory tracker in limits.conf can also be utilized:

[search]
enable_memory_tracker = true
search_process_memory_usage_threshold = 50000

Data model acceleration jobs can consume up to 50GB of memory, so this setting is configured high in our environment. Often, a lower value can be used (though a rolling restart is required).

Reducing the chance of re-occurrence

Splunk platform version 9.3 introduced a new feature called usage-based data rebalancing that helps distribute search workloads more evenly among peers.

To help with running data rebalancing more frequently, I've built an application, Automatic Data Rebalance (Splunkbase or GitHub), that provides a modular input to trigger indexer cluster data rebalancing on a schedule. This modular input only triggers a rebalance when the search factor is met and the cluster is not in maintenance mode.

Lesson 6 - Preventing the cluster manager pod restart from triggering an indexer cluster rolling restart

In my previous article I discussed unexpected restarts, which unfortunately have persisted despite updated Operator and Splunk platform versions. However, after an extensive support case involving several teams at Splunk, we were able to identify the root cause of the issue and understand why it doesn't consistently manifest across all environments.

When the cluster manager pod restarts, the Ansible playbook automatically initiates a bundle push. This action can trigger a re-encryption of encrypted secrets located in the manager-apps directory. The re-encryption of existing values generates a new bundle version, which results in a rolling restart.

Ultimately, we discovered the problem was linked to the s3.access_key and s3.secret_key. A fix is available in Splunk platform versions 9.2.8, 9.3.6, 9.4.4, and above. In SOK versions before 3.0.1, add this environment variable to the cluster manager:

SPLUNK_SKIP_CLUSTER_BUNDLE_PUSH: "true"
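
In CRD terms this might be passed through the spec's extraEnv list on the cluster manager, for example; the resource name is hypothetical and the apiVersion may vary with your SOK version.

apiVersion: enterprise.splunk.com/v4
kind: ClusterManager
metadata:
  name: example-cm
spec:
  extraEnv:
    # Prevents the automatic bundle push when the cluster manager pod restarts
    - name: SPLUNK_SKIP_CLUSTER_BUNDLE_PUSH
      value: "true"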

From SOK version 3.0.1 onward, this setting is enabled by default, eliminating the need for the environment variable. For users running Splunk platform versions earlier than the fixed releases, use the configuration below. It modifies encrypt_fields to exclude s3.access_key and s3.secret_key. Note that these values must be decrypted on the cluster manager before applying this change:

# Specify Splunk default.yaml to override initial environment config
defaults:
  ## Example: Deploy a multi-site cluster manager
  splunk:
    conf:
      server:
        content:
          general:
            encrypt_fields: '"server: :sslKeysfilePassword", "server: :sslPassword", "server: :password", "outputs:tcpout:sslPassword", "outputs:tcpout:socksPassword","outputs:indexer_discovery:pass4SymmKey", "outputs:tcpout:token", "inputs:SSL:password", "inputs:SSL:sslPassword", "inputs:http:sslPassword", "inputs:http:sslKeysfilePassword", "inputs:splunktcptoken:token", "alert_actions:email:auth_password", "app:credential:password", "app:credential:sslPassword", "passwords:credential:password", "passwords:credential:sslPassword", "authentication: :bindDNpassword", "authentication: :sslKeysfilePassword", "authentication: :attributeQuerySoapPassword", "authentication: :scriptSecureArguments", "authentication: :sslPassword", "authentication: :accessKey", "web:settings:privKeyPassword", "web:settings:sslPassword", "server:indexer_discovery:pass4SymmKey", "server:clustermanager:pass4SymmKey", "server:dmc:pass4SymmKey", "server:kvstore:sslKeysPassword", "indexes: :remote.s3.access_key", "indexes: :remote.s3.secret_key", "indexes: :remote.s3.kms.key_id", "indexes: :remote.azure.access_key", "indexes: :remote.azure.secret_key", "indexes: :remote.azure.client_id", "indexes: :remote.azure.client_secret", "indexes: :remote.azure.tenant_id", "outputs: :remote.s3.access_key", "outputs: :remote.s3.secret_key", "outputs: :remote.s3.kms.key_id", "outputs: :remote.azure.access_key", "outputs: :remote.azure.secret_key", "outputs: :remote.azure.client_id", "outputs: :remote.azure.client_secret", "outputs: :remote.azure.tenant_id","server:scs:kvservice.principal.client.secret", "federated: :password"'

The s3.access_key and s3.secret_key will be encrypted on the peer nodes, while remaining in plain text on the manager. This configuration prevents rolling restarts from occurring.

An alternative approach is to define s3.access_key and s3.secret_key as defaults in the CRD and remove them entirely from the manager-apps files. I have done this by adding them to the Ansible defaults ConfigMaps and removing them from the manager-apps/indexes.conf file.
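
As a purely hypothetical illustration of that approach, and following the same defaults structure as the encrypt_fields example above, the keys could be carried in the defaults ConfigMap along these lines. The volume stanza name and key values are placeholders, and the exact nesting for indexes.conf content is my assumption rather than documented syntax.

splunk:
  conf:
    indexes:
      content:
        "volume:remote_store":
          remote.s3.access_key: EXAMPLE_ACCESS_KEY
          remote.s3.secret_key: EXAMPLE_SECRET_KEY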

Other lessons

Search head clusters

Since publishing the previous article, we’ve begun running search head clusters within the Kubernetes environment. Initially, there was an issue where the deployer required resource parity with all search head cluster members, but this has been resolved in SOK version 2.7.1.

Generally, search heads worked seamlessly within Kubernetes.

Platform upgrades

Upgrading Splunk platform with the SOK has proved quite straightforward, as the SOK appears to be following the recommended upgrade order.

Since the searchable rolling restart of the indexer cluster temporarily suspends data model acceleration searches, we were unable to perform a fully automated upgrade. The idea "Splunk indexing tier searchable rolling restart should allow the scheduler to run jobs as expected", which is expected to be implemented in a future version of the platform, could eventually make searchable rolling restarts and upgrades on the indexing tier fully automatable.

Indexer scaling

Given our existing investment in the Kubernetes stack, we’re now procuring higher-capacity hardware. This results in larger cost savings, as a server with double the CPU, memory, and disk capacity does not cost twice as much. This also helps reduce our data center footprint.

The upgraded high-capacity hardware also provides increased flexibility for mixing workloads. Indexers with greater CPU requirements can temporarily utilize the unused CPU resources of those handling fewer active searches.

Disk cache limits

We discovered that pods ingesting approximately 200GB of data per day were unable to sustain heavy search workloads with 7TB disks (providing 5.5TB of usable cache), as the capacity proved insufficient.

Unfortunately, this led to SmartStore downloads, which temporarily disrupted ingestion and degraded search performance. To address this, we plan to invest in higher-capacity disks for the servers and reduce ingestion workloads on those that cannot be upgraded.

Server restarts

In the previous article, I noted that server-level restarts did not gracefully shut down the Splunk platform, and this issue persists. However, I’ve made several updates to my systemd unit file and associated scripts to better manage search head restarts. For the Splunk pods, implementing a PreStop hook could be an effective solution, as this would ensure that any shutdown event automatically triggers a splunk offline or splunk stop command.
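
In plain Kubernetes terms such a hook might look like the sketch below. Whether and how the SOK exposes lifecycle hooks on its pods is an assumption on my part, so treat the placement as illustrative.

lifecycle:
  preStop:
    exec:
      # Gracefully stop splunkd before the container is terminated;
      # splunk offline could be used instead but may require authentication
      command: ["/opt/splunk/bin/splunk", "stop"]

The pod's terminationGracePeriodSeconds would also need to be long enough for the shutdown to complete.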

Backups

Velero provides Kubernetes configuration backups. I've tested a full restore in development; the only issue I encountered was duplicated ReplicaSets, and otherwise the tool worked perfectly.
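
For reference, a scheduled configuration backup with Velero might be defined roughly as follows; the namespace names and schedule are placeholders.

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: splunk-config-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  template:
    includedNamespaces:
      - splunk                     # namespace running the SOK-managed pods
    snapshotVolumes: false         # configuration objects only, not PV data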

Next steps

The SOK continues to deliver significant value in my environment, and our investment in higher-capacity physical servers is expected to further enhance the system’s overall cost-efficiency.

Although we’ve encountered challenges throughout the journey, I hope that sharing the insights I’ve gained will be beneficial to others.

The most difficult challenge we faced was extremely slow pod performance. If you have any thoughts or feedback on this issue, feel free to share them in the comments or on the Splunk Community Slack.

These additional resources might help you understand and implement this guidance: