Configuring action rules in the ITSI Notable Event Aggregation Policy for Splunk On-Call Integration
You want to create a Splunk On-Call incident using a Splunk ITSI Notable Event Aggregation Policy (NEAP) action rule. This rule should also carry appropriate context and data to allow Splunk On-Call annotations or URI drill-downs for accelerated mean-time-to-detect (MTTD) and mean-time-to-restore (MTTR) in Splunk Observability Cloud.
Before you follow these steps, make sure you have done the following:
- Integrated Observability Cloud alerts with Splunk Cloud Platform or Splunk Enterprise
- Normalized Observability Cloud alerts into the ITSI Universal Alerting schema
- Configured Universal Correlation Searches to create notable events
- Configured the ITSI Notable Event Aggregation Policy (NEAP)
- Configured ITSI correlation searches for monitoring episodes
- Configured the Splunk On-Call integration with IT Service Intelligence
Solution
The diagram below shows the overarching architecture for the integration that's described in Managing the lifecycle of an alert: from detection to remediation. The scope for this article is indicated by the pink box in the diagram.
The Notable Event Aggregation Policy (NEAP) within the Content Pack for ITSI Monitoring and Alerting processes notable events from the ITSI_Notable_Event index, groups them into episodes, and processes them using action rules. Notable events that match the filtering rules in the NEAP then become part of a new or existing episode in Splunk ITSI.
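As a rough illustration of the filter-then-group behavior described above, the following Python sketch keeps only the events that match a filter and groups them into episodes by a shared field. It is a simplified model, not how Splunk ITSI implements its rules engine. The function name `group_into_episodes`, the filter, and the sample event fields `source` and `severity` are assumptions made for this example.

```python
from collections import defaultdict


def group_into_episodes(events, matches_filter, split_by="app_name"):
    """Group events that pass the filter into episodes keyed by one field."""
    episodes = defaultdict(list)
    for event in events:
        if not matches_filter(event):
            continue  # events that fail the filter never join an episode
        episodes[event.get(split_by, "unknown")].append(event)
    return dict(episodes)


# Hypothetical notable events; real events carry many more fields.
events = [
    {"app_name": "checkout", "source": "o11y", "severity": "critical"},
    {"app_name": "checkout", "source": "o11y", "severity": "normal"},
    {"app_name": "search", "source": "other", "severity": "critical"},
]

# Hypothetical filter: keep only events whose source is "o11y".
episodes = group_into_episodes(events, lambda e: e["source"] == "o11y")
print({name: len(members) for name, members in episodes.items()})
```

In this model, both "checkout" events land in one episode, and the "search" event is dropped because it does not match the filter.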
In a previous step of this workflow, you configured the ITSI Notable Event Aggregation Policy (NEAP) but did not fully configure the action rules associated with it. Now that Splunk On-Call is integrated with Splunk ITSI, you'll return to the NEAP and configure its action rules. Your rules should create the Splunk On-Call incident and also auto-close the incident when the Splunk ITSI correlation search has determined that the episode is healed.
You'll also round off this workflow by visualizing the full lifecycle of the alert, from the initial Splunk Application Performance Monitoring alert, to the auto-closing of the Splunk ITSI episode and auto-resolution of the Splunk On-Call incident.
Procedure
- From the Splunk ITSI main menu, click Configuration > Notable Event Aggregation Policies > Episodes by Application/SRC o11y > Edit > Action Rules.
- Expand the first rule by clicking the > next to the rule. The rule looks for a source name of "Episode Monitoring - Trigger OnCall Incident". This source name matches the originating correlation search you set up previously, which indicates that a new Splunk On-Call incident should be created. When a match is found, the rule adds a comment and calls "Create VictorOps Incident".
- Click the Configure button next to the "Create VictorOps Incident" selection. This takes you to a configuration screen. Set the following details:
- Message Type: Set this to CRITICAL. This tells the integration to create a new Splunk On-Call incident.
- Routing Key: Set this to the routing key created in Splunk On-Call. This provides information to route the new incident to the appropriate support team.
- The second rule looks for a source name of "Episode Monitoring - Set Episode to Highest Alarm Severity o11y". This source name matches the correlation search you set up previously, which indicates that the Splunk On-Call incident should be closed. The rule also requires "set_episode_status" to equal 5. When both conditions are true, the rules engine performs the following actions:
- The episode status is changed to "Closed".
- A comment is added to the episode to indicate closure.
- The "Create VictorOps Incident" integration is called with a value of "RECOVERY", indicating the incident should be closed. This allows for synchronization between the Splunk ITSI episode and the Splunk On-Call incident.
- Click the Configure button next to the "Create VictorOps Incident" selection. This takes you to a configuration screen. Set the following details:
- Message Type: Set this to RECOVERY. This tells the integration to close the Splunk On-Call incident.
- State Message: Set to "ITSI Episode is back to normal and closed".
- Now you can validate the creation and clearing of the Splunk On-Call incident in Splunk ITSI. Navigate to Episode Review to view the incident. It will be prefixed with "VictorOps - Incident" and the new incident number, and its status will be marked as "Closed".
- Click on the Events Timeline tab. The events timeline visualizes the lifecycle of an alert from Splunk Observability Cloud. It is also useful for root cause analysis, especially when many alerts occur for the same microservice-based application. You can group by the app_name metadata field so that all alerts originating from the services associated with an app_name are grouped into a single episode. Normally, the first alert indicates the root cause, and the subsequent alerts are symptoms of it.
- In the example below, note the numbered annotations. These tell you several things about the event:
- The event type bands allow you to visualize the different events that occurred during the entire alert management workflow.
- The first red event shows the original notable event that triggered the creation of the episode. Red indicates that the event's status is critical.
- In the same band, the green line indicates a cleared notable event. Green indicates an event has a cleared status.
- In the third band, labeled "Episode Monitoring Alert", you can see notable events generated by the correlation searches enabled with the "Episode Monitoring" prefix. The first light blue event tells the rules engine workflow to create a Splunk On-Call incident. Blue indicates that the event has a status of "Info". The time shown here is a little after the original critical event. This time difference is intentional, since the originating event might not be at a severity that would require an incident to be created.
- The second blue event indicates that the episode and the Splunk On-Call incident should be closed. This happens seconds after the green alert clear event.
- You can hover over the events in the bands to view further information about each, if needed.
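The two action rules configured in the procedure above can be summarized with a short Python sketch. It models only the matching logic and the settings named in this article (Message Type, Routing Key, State Message). The class `OnCallAction`, the function `evaluate_action_rules`, the event field `source`, and the sample routing key are hypothetical names chosen for illustration; they are not part of the Splunk ITSI or Splunk On-Call APIs.

```python
from dataclasses import dataclass
from typing import Optional

TRIGGER_SOURCE = "Episode Monitoring - Trigger OnCall Incident"
CLOSE_SOURCE = "Episode Monitoring - Set Episode to Highest Alarm Severity o11y"
CLOSED_STATUS = "5"  # value of set_episode_status that the second rule checks


@dataclass
class OnCallAction:
    """Settings passed to the "Create VictorOps Incident" action (modeled)."""
    message_type: str
    routing_key: str = ""
    state_message: str = ""


def evaluate_action_rules(event: dict, routing_key: str) -> Optional[OnCallAction]:
    """Return the action for a notable event, or None if no rule matches."""
    source = event.get("source")
    if source == TRIGGER_SOURCE:
        # Rule 1: open a new incident and route it to the support team.
        return OnCallAction("CRITICAL", routing_key=routing_key)
    if source == CLOSE_SOURCE and str(event.get("set_episode_status")) == CLOSED_STATUS:
        # Rule 2: both conditions hold, so close the incident.
        # The article lists only Message Type and State Message for this rule.
        return OnCallAction(
            "RECOVERY",
            state_message="ITSI Episode is back to normal and closed",
        )
    return None


# "my-team-key" is a placeholder routing key, not a real value.
print(evaluate_action_rules({"source": TRIGGER_SOURCE}, "my-team-key"))
print(evaluate_action_rules({"source": CLOSE_SOURCE, "set_episode_status": 5}, "my-team-key"))
```

An event from the second correlation search with any other "set_episode_status" value returns None in this model, which mirrors the requirement that both conditions be true before the incident is closed.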
Next steps
Still having trouble? Splunk has many resources available to help get you back on track.
- Splunk Answers: Ask your question to the Splunk Community, which has provided over 50,000 user solutions to date.
- Splunk Customer Support: Contact Splunk to discuss your environment and receive customer support.
- Splunk Observability Training Courses: Comprehensive Splunk training to fully unlock the power of Splunk Observability Cloud.