Combining multiple detector conditions into a single detector
Combining many separate detector conditions into a single detector can be useful for consolidating alerts, staying within your organization's detector limit (1,000 by default), and preserving context when multiple alerting conditions fire.
To combine more complex compound detector conditions into a single detector, see this article.
Using SignalFlow for detectors
To combine multiple separate alerting conditions into a single detector, you'll need to use SignalFlow. SignalFlow can also be used in detector configurations defined with the Terraform provider.
The easiest way to work with SignalFlow for detectors in the UI is to add specific query parameters to the detector URL, as in the following examples:
- Create new detectors:
https://app.us1.signalfx.com/#/detector/v2/new?detectorSignalFlowEditor=1
- Edit existing detectors:
https://app.us1.signalfx.com/#/detector/v2/<Detector_ID>/edit?detectorSignalFlowEditor=1
The realm (for example, us1) might need to be changed in the above URLs to match your realm. Replace <Detector_ID> with the ID of the detector you want to edit.
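If you manage detector configuration as code, the same SignalFlow program text can also be sent to the detectors REST API (the Terraform provider mentioned above works against the same API). Below is a minimal sketch, assuming the v2 detector endpoint, the Python requests library, and an access token with API permissions; the realm, token, detector name, and thresholds are illustrative placeholders:

import requests

REALM = "us1"                      # replace with your realm
TOKEN = "<YOUR_ORG_ACCESS_TOKEN>"  # org access token with API permissions

# The same SignalFlow you would paste into the UI SignalFlow editor
PROGRAM_TEXT = """
REQ = data('service.request.count', filter=filter('sf_service', 'paymentservice')).sum(by=['sf_service']).publish(label='demo')
LOW = detect(when(REQ < threshold(1))).publish('Request traffic too low')
HIGH = detect(when(REQ > threshold(50000))).publish('Request traffic exceeding capacity')
"""

detector = {
    "name": "paymentservice golden signals",
    "programText": PROGRAM_TEXT,
    # Each rule's detectLabel must match the label published by a detect() statement
    "rules": [
        {"detectLabel": "Request traffic too low", "severity": "Critical"},
        {"detectLabel": "Request traffic exceeding capacity", "severity": "Major"},
    ],
}

response = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"},
    json=detector,
)
response.raise_for_status()
print(response.json()["id"])  # the detector ID, usable in the edit URL shown above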
Multiple alert thresholds with one detector
SignalFlow allows adding an arbitrary number of alerting signals and conditions to a single detector. The example below shows how to create alert detection signals for the following metrics on a specific service (paymentservice):
- service.request.count is used for determining if too little or too much traffic is being served. It will detect on either condition (LOW or HIGH).
- Additionally, service.request.count will be used a second time to make a signal for only the errors and to create an error rate signal out of the total requests and errored requests metrics. That error rate will detect on the ERRRATE condition.
- cpu.utilization is used for determining if a given host CPU utilization is too saturated for paymentservice and will alert on the CPUHIGH condition.
- memory.utilization is used for determining if a given host memory utilization is too saturated for paymentservice and will alert on the MEMHIGH condition.
- disk.utilization is used for determining if a given host disk utilization is too saturated for paymentservice and will alert on the DISKHIGH condition.
If any of these conditions breaches its threshold, the alert fires and the detector sends out notifications. This covers Latency, Errors, Traffic, and Saturation (L.E.T.S.), also known as the four golden signals, for your service.
Example SignalFlow
# Total request traffic for paymentservice
REQ = data('service.request.count', filter=filter('sf_service', 'paymentservice')).sum(by=['sf_service']).publish(label='demo')
# Requests that resulted in an error
ERR = data('service.request.count', filter=filter('sf_service', 'paymentservice') and (not filter('sf_error', 'false'))).sum(by=['sf_service']).publish(label='demo error rate', enable=False)
# Percentage of requests that succeeded
RATE = (((REQ-ERR)/REQ)*100).publish(label='error_rate')
# Fire when the success percentage drops below 99% (error rate above 1%)
ERRRATE = detect(when(RATE < threshold(99))).publish('Error rate too high')
# Traffic too low or exceeding capacity
LOW = detect(when(REQ < threshold(1))).publish('Request traffic too low')
HIGH = detect(when(REQ > threshold(50000))).publish("Request traffic exceeding capacity")
# Host saturation signals for paymentservice
CPU = data('cpu.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='cpu')
CPUHIGH = detect(when(CPU > threshold(95))).publish("cpu.utilization exceeding threshold")
MEM = data('memory.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='mem')
MEMHIGH = detect(when(MEM > threshold(90))).publish("memory.utilization exceeding threshold")
DISK = data('disk.utilization', filter=filter('sf_service', 'paymentservice')).sum(by=['service.name', 'host']).publish(label='disk')
DISKHIGH = detect(when(DISK > threshold(97))).publish("disk.utilization exceeding threshold")
Here is an example detector with the above SignalFlow applied:
Here is an example of a detector’s alert rules in Splunk Observability Cloud:
Alert message context
It's important to use the appropriate variable tags to pass context along within your alert rules. Use detector message tagging to pass along valuable context from dimensions like service.name, host, or any other dimension contained in the alerting metric signal.
You can find variable tag naming references in Splunk Docs.
For example, a set of variable tags like the ones in the screenshot above provides the service name and host along with all other dimensions in the message body of the alert email.
{{#if anomalous}}
Rule {{{ruleName}}} in detector {{{detectorName}}} triggered at {{timestamp}}.
{{else}}
Rule {{{ruleName}}} in detector {{{detectorName}}} cleared at {{timestamp}}.
sf_service: {{dimensions.[service.name]}}
host: {{{dimensions.[host]}}}
{{/if}}

{{#if anomalous}}
Triggering condition: {{{readableRule}}}
{{/if}}

{{#if anomalous}}
Signal value for Requests: {{inputs.REQ.value}}
Signal value for Error Rate: {{inputs.RATE.value}}
Signal value for cpu.utilization: {{inputs.CPU.value}}
Signal value for memory.utilization: {{inputs.MEM.value}}
Signal value for disk.utilization: {{inputs.DISK.value}}
sf_service: {{dimensions.[service.name]}}
host: {{{dimensions.[host]}}}
{{else}}
sf_service: {{dimensions.[service.name]}}
host: {{{dimensions.[host]}}}
{{/if}}

{{#notEmpty dimensions}}
Signal details: {{{dimensions}}}
{{/notEmpty}}

{{#if anomalous}}
{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}
{{#if tip}}Tip: {{{tip}}}{{/if}}
{{/if}}
The message preview gives you an example of what the message body looks like with the variable tags filled in:
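If the preview isn't handy, here is roughly how a triggered message body renders with the template above; every value below is an illustrative placeholder, not real output:

Rule Request traffic too low in detector <detector name> triggered at <timestamp>.
Triggering condition: <readable rule>
Signal value for Requests: <value>
Signal value for Error Rate: <value>
Signal value for cpu.utilization: <value>
Signal value for memory.utilization: <value>
Signal value for disk.utilization: <value>
sf_service: paymentservice
host: <host>
Signal details: <all dimensions of the alerting signal>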
Next steps
These resources might help you understand and implement this guidance: