Using the Splunk Cloud Monitoring Console effectively
As a Splunk Cloud Platform customer, you rely on your Splunk team to operate, manage, and control the service components of your Splunk platform, easing your operational burden. However, you still want insight into what is happening in your deployment. You want to identify KPIs to monitor to be aware of the health of your system and quickly be alerted to any potential issues. You also want a convenient method of monitoring those KPIs.
KPIs to monitor
Proactive monitoring
Proactive monitoring involves looking at trends or changes that might lead to degradation of the service. This helps you limit the issues that reach your users by detecting potential problems before they have an impact, and by anticipating future needs. You look at how KPI values are trending and whether they are nearing prescribed thresholds. KPIs that fall into the proactive category include the following (a sample search sketch follows this list):
- Daily ingest and index distribution
- User count
- Search execution count
- Indexer/search head load
- Skipped search percentage
- Scheduled search count and schedules
- Indexer and intermediate forwarding tier queue percentage
- Event latency
- GDI error rate
- SVC consumption over time
- Search execution delays
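For example, here is a minimal sketch for the first KPI in this list, daily ingest by index. It assumes the standard license_usage.log events are searchable in your stack; adjust the source and span to fit your environment.

index=_internal source=*license_usage.log* type="Usage" | eval GB = b/1024/1024/1024 | timechart span=1d sum(GB) AS daily_GB BY idx

A sustained shift in daily_GB for a single idx value is an early signal worth investigating before it becomes a license or storage problem.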
Reactive monitoring
Reactive monitoring occurs when you already have an issue and want to find out what happened and what to do about it. The searches you run target a specific issue, and isolating and remediating it generally requires in-depth Splunk knowledge. Users might already be aware of the issue by the time you begin reactive monitoring. KPIs that fall into the reactive category include the following (a sample triage search follows this list):
- Splunk logs:
  - audittrail
  - splunkd
  - mongod
  - splunk_python
  - web_access, splunkd_ui_access
- Performance metrics (_metrics):
  - spl
  - processor
  - process
  - system
- Messages in the GUI
- Health check. Be cautious with health check messages: because the health check doesn't always have access to everything it needs to accurately project the health of the stack, these messages can be unreliable.
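As a starting point for working through the Splunk logs listed above, here is a minimal triage sketch that trends splunkd warnings and errors by component. It assumes access to the _internal index; narrow the time range and add host or component filters to match the issue you are investigating.

index=_internal sourcetype=splunkd log_level IN (WARN, ERROR) | timechart span=10m count BY component

The components that spike alongside the user-visible symptom are usually the right place to begin a deeper reactive investigation.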
The Cloud Monitoring Console
The Cloud Monitoring Console (CMC) helps more with reactive than proactive monitoring, but you can find KPIs for both. Splunk Docs (Introduction to the Cloud Monitoring Console) provides comprehensive detail on all dashboards and panels in the CMC, so this article instead presents the ones most often used, from the perspective of a Splunk Professional Services Consultant. This abbreviated guidance can help you get started quickly with troubleshooting.
Searches
While the Cloud Monitoring Console provides pre-built panels and dashboards, you can use this section to peek under the hood at some of the searches that power the KPIs discussed in this article. You can use these searches to build your own dashboards, making adjustments as needed to fit your environment.
Search execution by provenance
index=_audit sourcetype=audittrail host=sh* action=search | where NOT match(info, "^granted|^denied|^finalize|^pause|^enable|^save") | eval search_type = case( match(search_id, "^SummaryDirector_"), "summarization", match(savedsearch_name, "^_ACCELERATE_"), "acceleration", match(savedsearch_name, "^ds_"), "Dashboard", match(search_id, "^'*((rt_)?scheduler_|alertsmanager_)"), "scheduled", match(search_id, "\d{10}\.\d+(_[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12})?$"), "ad hoc", savedsearch_name="","ad hoc", info="canceled", "Cancelled", info="cancel", "Cancelled", match(provenance,"^UI:[Dd]ashboard"), "Dashboard", provenance="rest", "REST", provenance="splunkjs", "SplunkJS", info="terminate", "Terminated", info="setttl", "Timedout", match(search_id,"_itsi_"), "ITSI", match(app,"^DA-ESS"), "DA-ESS", 1=1, "other") | append [search index=_internal host=sh* sourcetype=scheduler status="skipped" search_type!="*acceleration" | eval search_type="Skipped"] | timechart limit=25 span=10m count BY search_type
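The case() expression classifies each audited search by how it was launched. Watching the mix over time shows whether search load growth comes from scheduled searches, dashboards, or ad hoc activity, which tells you where to focus tuning.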
Indexer load
index=_introspection sourcetype=splunk_resource_usage component=Hostwide host=idx* | timechart limit=20 avg(data.normalized_load_avg_1min) AS avg_load BY host
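The data.normalized_load_avg_1min metric is the one-minute load average normalized by CPU core count, so sustained values near or above 1.0 suggest the indexer tier is saturated.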
Skipped search
index=_internal host IN (<SH list>) sourcetype=scheduler status="completed" OR status="success" OR status="skipped" OR status="continued" OR status="deferred" search_type!="*acceleration" | fillnull value="no_sid_yet" sid | eval tuple=savedsearch_name . ";" . tostring(round(relative_time(scheduled_time, "@m"), 0)).":::".sid.status | dedup tuple | eval window_time = if(isnotnull(window_time), window_time, 0) | eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0) | timechart span=10m partial=f avg(execution_latency) AS avg_exec_latency, count(eval(status="completed" OR status="success" OR status="skipped" OR status="continued" OR status="deferred")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec | eval skip_ratio = round(skipped_exec / total_exec * 100, 2) | eval avg_exec_latency = round(avg_exec_latency, 2) | fields _time, avg_exec_latency, skip_ratio | stats max(skip_ratio) AS Skip_Ratio BY _time
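Skip_Ratio expresses skipped executions as a percentage of all scheduler executions in each 10-minute bucket. When the ratio spikes, the reason field on the skipped scheduler events typically explains why, for example, that a concurrency limit was reached.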
Indexer blocked queues
index=_internal host=idx* blocked=true group=queue | bin _time span=5m | stats count BY name host _time | timechart partial=f minspan=5m sum(count) AS blocked_events BY name
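Occasional short-lived blocking is normal. Sustained blocking on a downstream queue such as indexqueue generally points to indexing or storage pressure, while persistent blocking in earlier queues such as parsingqueue or aggqueue points further up the ingestion pipeline.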
Event latency
index=_internal sourcetype=splunkd host IN (<SH list>) | eval latency = _indextime - _time | timechart minspan=10m perc80(latency) BY host
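Latency here is the difference in seconds between index time (_indextime) and event time (_time); perc80 trends typical latency without letting a few stragglers dominate. Sustained growth usually indicates queuing along the forwarding path, while large negative values often indicate timestamp-parsing problems.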
SVC consumption (minute of the hour)
index=summary source="splunk-svc" | stats max(utilized_svc) AS svcs values(date_minute) AS minute BY _time, role, indexer_type | stats sum(svcs) AS svcs values(minute) AS minute BY _time | stats max(svcs) AS max min(svcs) AS min BY minute
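Because date_minute is the minute of the hour, this search shows which minutes drive peak and minimum SVC consumption, which can reveal pileups of searches scheduled at the same minute, such as the top of the hour.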
SVC consumption (5-min)
index=summary source="splunk-svc" | stats max(utilized_svc) AS utilized_svc BY _time, role, indexer_type | stats sum(utilized_svc) AS utilized_svc BY _time | timechart span=5m max(utilized_svc) AS utilized_svc | trendline sma24(utilized_svc) AS "Trend"
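The sma24 trendline averages the 24 most recent 5-minute buckets, a two-hour simple moving average, smoothing short bursts so you can compare sustained consumption against your subscribed SVC level.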
Resources
The following additional Splunk resources might help you implement the guidance provided in this article.
- Splunk Docs: Introduction to the Cloud Monitoring Console
- Splunk Lantern: Creating efficient searches and dashboards for cost reduction
- Splunk Lantern: Managing your Splunk Cloud Platform deployment
- Splunk Lantern: Preventing concurrency issues and skipped searches
- Splunk Lantern: Understanding workload pricing in Splunk Cloud Platform
- Splunk Lantern: Running a Splunk platform health check