Using the Splunk Cloud Monitoring Console effectively
As a Splunk Cloud Platform customer, you rely on your Splunk team to operate, manage, and control the service components of your Splunk platform, easing your operational burden. However, you still want insight into what is happening in your deployment. You want to identify KPIs to monitor to be aware of the health of your system and quickly be alerted to any potential issues. You also want a convenient method of monitoring those KPIs.
KPIs to monitor
Proactive monitoring
Proactive monitoring involves looking at trends or changes that might lead to degradation of the service. This helps you limit the issues that reach your users by detecting potential problems before they have an impact, and by anticipating future needs. You look at how KPI values are trending and whether they are nearing prescribed thresholds. KPIs that fall into the proactive category include the following (a sample search sketch follows this list):
- Daily ingest and index distribution
- User count
- Search execution count
- Indexer/search head load
- Skipped search percentage
- Scheduled search count and schedules
- Indexer and intermediate forwarding tier queue percentage
- Event latency
- GDI error rate
- SVC consumption over time
- Search execution delays
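For example, here is a minimal sketch for the first KPI in this list, daily ingest by index. It assumes the standard license_usage.log events are searchable in your stack; adjust the source and span to fit your environment.

index=_internal source=*license_usage.log* type="Usage" | eval GB = b/1024/1024/1024 | timechart span=1d sum(GB) AS daily_GB BY idx

A sustained shift in daily_GB for a single idx value is an early signal worth investigating before it becomes a license or storage problem.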
Reactive monitoring
Reactive monitoring occurs when you already have an issue and want to find out what happened and what to do about it. The searches you run target a specific issue, and isolating and remediating it generally requires in-depth Splunk knowledge. Users might already be aware of the issue by the time you begin reactive monitoring. KPIs that fall into the reactive category include the following (a sample triage search follows this list):
- Splunk logs:
  - audittrail
  - splunkd
  - mongod
  - splunk_python
  - web_access, splunkd_ui_access
- Performance metrics (_metrics):
  - spl
  - processor
  - process
  - system
- Messages in the GUI
- Health check. Be cautious with health check messages: because the health check doesn't always have access to everything it needs to accurately project the health of the stack, these messages can be unreliable.
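As a starting point for working through the Splunk logs listed above, here is a minimal triage sketch that trends splunkd warnings and errors by component. It assumes access to the _internal index; narrow the time range and add host or component filters to match the issue you are investigating.

index=_internal sourcetype=splunkd log_level IN (WARN, ERROR) | timechart span=10m count BY component

The components that spike alongside the user-visible symptom are usually the right place to begin a deeper reactive investigation.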
The Cloud Monitoring Console
The Cloud Monitoring Console (CMC) helps more with reactive than proactive monitoring, but you can find KPIs for both. Splunk Docs (Introduction to the Cloud Monitoring Console) provides comprehensive detail on all dashboards and panels in the CMC, so this article instead presents the ones most often used, from the perspective of a Splunk Professional Services Consultant. This abbreviated guidance can help you get started quickly with troubleshooting.
Searches
While the Cloud Monitoring Console provides pre-built panels and dashboards, you can use this section to peek under the hood at some of the searches that power the KPIs discussed in this article. You can use these searches to build your own dashboards, making adjustments as needed to fit your environment.
Search execution by provenance
index=_audit sourcetype=audittrail host=sh* action=search | where NOT match(info, "^granted|^denied|^finalize|^pause|^enable|^save") | eval search_type = case( match(search_id, "^SummaryDirector_"), "summarization", match(savedsearch_name, "^_ACCELERATE_"), "acceleration", match(savedsearch_name, "^ds_"), "Dashboard", match(search_id, "^'*((rt_)?scheduler_|alertsmanager_)"), "scheduled", match(search_id, "\d{10}\.\d+(_[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12})?$"), "ad hoc", savedsearch_name="","ad hoc", info="canceled", "Cancelled", info="cancel", "Cancelled", match(provenance,"^UI:[Dd]ashboard"), "Dashboard", provenance="rest", "REST", provenance="splunkjs", "SplunkJS", info="terminate", "Terminated", info="setttl", "Timedout", match(search_id,"_itsi_"), "ITSI", match(app,"^DA-ESS"), "DA-ESS", 1=1, "other") | append [search index=_internal host=sh* sourcetype=scheduler status="skipped" search_type!="*acceleration" | eval search_type="Skipped"] | timechart limit=25 span=10m count BY search_type
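The case() expression classifies each audited search by how it was launched. Watching the mix over time shows whether search load growth comes from scheduled searches, dashboards, or ad hoc activity, which tells you where to focus tuning.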
Indexer load
index=_introspection sourcetype=splunk_resource_usage component=Hostwide host=idx* | timechart limit=20 avg(data.normalized_load_avg_1min) AS avg_load BY host
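The data.normalized_load_avg_1min metric is the one-minute load average normalized by CPU core count, so sustained values near or above 1.0 suggest the indexer tier is saturated.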
Skipped search
index=_internal host IN (<SH list>) sourcetype=scheduler status="completed" OR status="success" OR status="skipped" OR status="continued" OR status="deferred" search_type!="*acceleration" | fillnull value="no_sid_yet" sid | eval tuple=savedsearch_name . ";" . tostring(round(relative_time(scheduled_time, "@m"), 0)).":::".sid.status | dedup tuple | eval window_time = if(isnotnull(window_time), window_time, 0) | eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0) | timechart span=10m partial=f avg(execution_latency) AS avg_exec_latency, count(eval(status="completed" OR status="success" OR status="skipped" OR status="continued" OR status="deferred")) AS total_exec, count(eval(status=="skipped")) AS skipped_exec | eval skip_ratio = round(skipped_exec / total_exec * 100, 2) | eval avg_exec_latency = round(avg_exec_latency, 2) | fields _time, avg_exec_latency, skip_ratio | stats max(skip_ratio) AS Skip_Ratio BY _time
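Skip_Ratio expresses skipped executions as a percentage of all scheduler executions in each 10-minute bucket. When the ratio spikes, the reason field on the skipped scheduler events typically explains why, for example, that a concurrency limit was reached.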
Indexer blocked queues
index=_internal host=idx* blocked=true group=queue | bin _time span=5m | stats count BY name host _time | timechart partial=f minspan=5m sum(count) AS blocked_events BY name
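Occasional short-lived blocking is normal. Sustained blocking on a downstream queue such as indexqueue generally points to indexing or storage pressure, while persistent blocking in earlier queues such as parsingqueue or aggqueue points further up the ingestion pipeline.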
Event latency
index=_internal sourcetype=splunkd host IN (<SH list>) | eval latency = _indextime - _time | timechart minspan=10m perc80(latency) BY host
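Latency here is the difference in seconds between index time (_indextime) and event time (_time); perc80 trends typical latency without letting a few stragglers dominate. Sustained growth usually indicates queuing along the forwarding path, while large negative values often indicate timestamp-parsing problems.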
SVC consumption (minute of the hour)
index=summary source="splunk-svc" | stats max(utilized_svc) AS svcs values(date_minute) AS minute BY _time, role, indexer_type | stats sum(svcs) AS svcs values(minute) AS minute BY _time | stats max(svcs) AS max min(svcs) AS min BY minute
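Because date_minute is the minute of the hour, this search shows which minutes drive peak and minimum SVC consumption, which can reveal pileups of searches scheduled at the same minute, such as the top of the hour.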
SVC consumption (5-min)
index=summary source="splunk-svc" | stats max(utilized_svc) AS utilized_svc BY _time, role, indexer_type | stats sum(utilized_svc) AS utilized_svc BY _time | timechart span=5m max(utilized_svc) AS utilized_svc | trendline sma24(utilized_svc) AS "Trend"
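The sma24 trendline averages the 24 most recent 5-minute buckets, a two-hour simple moving average, smoothing short bursts so you can compare sustained consumption against your subscribed SVC level.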
Resources
The following additional Splunk resources might help you implement the guidance provided in this article.
- Splunk Docs: Introduction to the Cloud Monitoring Console
- Splunk Lantern: Creating efficient searches and dashboards for cost reduction
- Splunk Lantern: Managing your Splunk Cloud Platform deployment
- Splunk Lantern: Preventing concurrency issues and skipped searches
- Splunk Lantern: Understanding workload pricing in Splunk Cloud Platform
- Splunk Lantern: Running a Splunk platform health check