Splunk Lantern

Combining multiple detector conditions into a single detector

 

Combining many separate detector conditions into a single detector helps you consolidate alerts, stay within your organization’s detector limit (1,000 by default), and preserve context when multiple alerting conditions fire.

To combine more complex compound detector conditions into a single detector, see this article.

Using SignalFlow for detectors

To combine multiple separate alerting conditions into a single detector, you need to use SignalFlow. You can also use SignalFlow in detector configurations defined with the Terraform provider.

The easiest way to open the SignalFlow editor for a detector is to add a query parameter to the detector URL, as in the following:

  • Create new detectors: https://app.us1.signalfx.com/#/detector/v2/new?detectorSignalFlowEditor=1
  • Edit existing detectors: https://app.us1.signalfx.com/#/detector/v2/<Detector_ID>/edit?detectorSignalFlowEditor=1

You might need to change the realm (for example, us1) in the URLs above to match your own realm. Replace <Detector_ID> with the ID of the detector you want to edit.
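The URL pattern above can be sketched as a small Python helper. This is only an illustration: the function name is hypothetical, and it assumes the new-detector URL accepts the same detectorSignalFlowEditor=1 parameter as the edit URL.

```python
from typing import Optional


def detector_editor_url(realm: str, detector_id: Optional[str] = None) -> str:
    """Return the SignalFlow editor URL for a new or an existing detector."""
    base = f"https://app.{realm}.signalfx.com/#/detector/v2"
    if detector_id is None:
        # Assumption: the new-detector page takes the same query parameter.
        return f"{base}/new?detectorSignalFlowEditor=1"
    return f"{base}/{detector_id}/edit?detectorSignalFlowEditor=1"


print(detector_editor_url("us1"))
print(detector_editor_url("eu0", "ABC123xyz"))  # hypothetical detector ID
```

Passing the realm as an argument keeps the realm-specific part of the URL in one place.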

Multiple alert thresholds with one detector

SignalFlow allows adding an arbitrary number of alerting signals and conditions to a single detector. The example below shows how to create alert detection signals for the following metrics on a specific service (paymentservice):

  • service.request.count determines whether the service is receiving too little or too much traffic. It detects on either condition (LOW or HIGH).
    • service.request.count is also used a second time, filtered to errored requests only. Dividing errored requests by total requests produces an error rate signal, which detects on the ERRRATE condition.
  • cpu.utilization determines whether CPU utilization on a host running paymentservice is too high, and alerts on the CPUHIGH condition.
  • memory.utilization determines whether memory utilization on a host running paymentservice is too high, and alerts on the MEMHIGH condition.
  • disk.utilization determines whether disk utilization on a host running paymentservice is too high, and alerts on the DISKHIGH condition.

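If any of these conditions breaches its threshold, the alert fires and the detector sends notifications. Together these conditions cover the errors, traffic, and saturation signals for your service. Adding a latency signal and condition in the same way completes Latency, Errors, Traffic, and Saturation (L.E.T.S.), also known as the four golden signals.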

Example SignalFlow

# Total requests for paymentservice (traffic signal).
REQ = data('service.request.count', filter=filter('sf_service', 'paymentservice')).sum(by=['sf_service']).publish(label='demo')
# Errored requests only; used in the error rate but not published on its own.
ERR = data('service.request.count', filter=filter('sf_service', 'paymentservice') and (not filter('sf_error', 'false'))).sum(by=['sf_service']).publish(label='demo errors', enable=False)
# Error rate: errored requests as a percentage of total requests.
RATE = ((ERR/REQ)*100).publish(label='error_rate')
# Example threshold; tune it to the error rate your service can tolerate.
ERRRATE = detect(when(RATE > threshold(99))).publish('Error rate too high')
LOW = detect(when(REQ < threshold(1))).publish('Request traffic too low')
HIGH = detect(when(REQ > threshold(50000))).publish('Request traffic exceeding capacity')
# Host-level saturation signals for paymentservice.
CPU = data('cpu.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='cpu')
CPUHIGH = detect(when(CPU > threshold(95))).publish('cpu.utilization exceeding threshold')
MEM = data('memory.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='mem')
MEMHIGH = detect(when(MEM > threshold(90))).publish('memory.utilization exceeding threshold')
DISK = data('disk.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='disk')
DISKHIGH = detect(when(DISK > threshold(97))).publish('disk.utilization exceeding threshold')
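To sanity-check threshold values before putting them in a detector, the conditions above can be approximated offline in plain Python. This is a simplified sketch, not how Splunk Observability Cloud evaluates detectors: it checks one set of sample values rather than streaming time series, the sample values are hypothetical, and the error rate is taken as errored requests divided by total requests.

```python
def error_rate(total: float, errors: float) -> float:
    """Percentage of requests that errored; 0.0 when there is no traffic."""
    return 0.0 if total == 0 else errors * 100 / total


# (signal, comparison, limit) for each condition, copied from the example.
CONDITIONS = {
    "LOW": ("REQ", "<", 1),
    "HIGH": ("REQ", ">", 50000),
    "ERRRATE": ("RATE", ">", 99),
    "CPUHIGH": ("CPU", ">", 95),
    "MEMHIGH": ("MEM", ">", 90),
    "DISKHIGH": ("DISK", ">", 97),
}


def firing_conditions(signals: dict) -> list:
    """Return the names of all conditions whose threshold is breached."""
    fired = []
    for name, (signal, op, limit) in CONDITIONS.items():
        value = signals[signal]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            fired.append(name)
    return fired


# Hypothetical sample values for a single evaluation.
sample = {"REQ": 60000, "CPU": 97.5, "MEM": 42.0, "DISK": 71.0}
sample["RATE"] = error_rate(total=sample["REQ"], errors=1200)
print(sample["RATE"])             # 2.0
print(firing_conditions(sample))  # ['HIGH', 'CPUHIGH']
```

Because every condition lives in the same detector, a single evaluation can report several breaches at once, which is the context-preserving behavior described earlier.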

[Screenshot: an example detector with the SignalFlow above applied]

[Screenshot: the detector’s alert rules in Splunk Observability Cloud]
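Each label passed to detect(...).publish(...) is what an alert rule attaches to, so a combined detector ends up with one rule per label. The sketch below assembles such a definition as a Python dictionary. The field names (name, programText, rules, detectLabel, severity) follow my understanding of the Splunk Observability Cloud detector API and should be verified against the API reference. The helper name, the detector name, the shortened program, and the severities are illustrative assumptions, not values from this article.

```python
import json


def build_detector_payload(name: str, program_text: str, severities: dict) -> dict:
    """Assemble a detector definition with one alert rule per detect label."""
    # Naive substring check that every rule label is published by the program.
    missing = [label for label in severities if label not in program_text]
    if missing:
        raise ValueError(f"labels not published by the program: {missing}")
    rules = [
        {"detectLabel": label, "severity": severity}
        for label, severity in severities.items()
    ]
    return {"name": name, "programText": program_text, "rules": rules}


# Shortened two-condition version of the example program.
program = "\n".join([
    "REQ = data('service.request.count', filter=filter('sf_service', "
    "'paymentservice')).sum(by=['sf_service']).publish(label='demo')",
    "LOW = detect(when(REQ < threshold(1))).publish('Request traffic too low')",
    "HIGH = detect(when(REQ > threshold(50000)))"
    ".publish('Request traffic exceeding capacity')",
])

payload = build_detector_payload(
    "paymentservice combined detector",  # hypothetical detector name
    program,
    {
        "Request traffic too low": "Warning",           # illustrative severity
        "Request traffic exceeding capacity": "Major",  # illustrative severity
    },
)
print(json.dumps(payload, indent=2))
```

Checking the labels up front catches the most common mistake with combined detectors: a rule whose label no longer matches any published detect stream.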

Alert message context

Use variable tags in your alert rule messages to pass context along with each alert. Variable tags can carry the values of dimensions such as service.name, host, or any other dimension contained in the alerting metric signal.

You can find variable tag naming references in Splunk Docs.

For example, a set of variable tags like the ones below provides the service name and host, along with all other dimensions, in the message body of the alert email.

{{#if anomalous}}
Rule {{{ruleName}}} in detector {{{detectorName}}} triggered at {{timestamp}}.
{{else}}
Rule {{{ruleName}}} in detector {{{detectorName}}} cleared at {{timestamp}}.
{{/if}}


{{#if anomalous}}
Triggering condition: {{{readableRule}}}
{{/if}}


{{#if anomalous}}
Signal value for Requests: {{inputs.REQ.value}}
Signal value for Error Rate: {{inputs.RATE.value}}
Signal value for cpu.utilization: {{inputs.CPU.value}}
Signal value for memory.utilization: {{inputs.MEM.value}}
Signal value for disk.utilization: {{inputs.DISK.value}}
{{/if}}
sf_service: {{{dimensions.[service.name]}}}
host: {{{dimensions.[host]}}}


{{#notEmpty dimensions}}
Signal details:
{{{dimensions}}}
{{/notEmpty}}


{{#if anomalous}}
{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}
{{#if tip}}Tip: {{{tip}}}{{/if}}
{{/if}}

[Screenshot: the message preview, showing the message body with the variable tags filled in]
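As a rough stand-in for the preview, the following Python sketch approximates the body the template produces for one alert event. It is not the template engine Splunk Observability Cloud uses: it covers only the main branches of the template, and the function name and every sample value are hypothetical.

```python
def render_alert_body(ctx: dict) -> str:
    """Approximate the alert message body for a triggered or cleared event."""
    state = "triggered" if ctx["anomalous"] else "cleared"
    dims = ctx["dimensions"]
    lines = [
        f"Rule {ctx['ruleName']} in detector {ctx['detectorName']} "
        f"{state} at {ctx['timestamp']}."
    ]
    if ctx["anomalous"]:
        lines.append(f"Triggering condition: {ctx['readableRule']}")
        # One line per input signal, mirroring the {{inputs.X.value}} tags.
        for label, value in ctx["inputs"].items():
            lines.append(f"Signal value for {label}: {value}")
    lines.append(f"sf_service: {dims.get('service.name', '')}")
    lines.append(f"host: {dims.get('host', '')}")
    if ctx["anomalous"] and ctx.get("runbookUrl"):
        lines.append(f"Runbook: {ctx['runbookUrl']}")
    return "\n".join(lines)


# Hypothetical values; a real alert fills these in from the firing rule.
example = {
    "anomalous": True,
    "ruleName": "cpu.utilization exceeding threshold",
    "detectorName": "paymentservice combined detector",
    "timestamp": "Mon, 3 Jun 2024 14:05:00 GMT",
    "readableRule": "The value of cpu is above 95.",
    "inputs": {"Requests": 1200, "cpu.utilization": 97.5},
    "dimensions": {"service.name": "paymentservice", "host": "web-01"},
}
print(render_alert_body(example))
```

Rendering the same context with anomalous set to False shows the shorter message sent when the alert clears.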

Next steps

These resources might help you understand and implement this guidance:

  • Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their license support plan. Engage the ODS team at ondemand@splunk.com if you require assistance.