Planning an organizational on-call plan

Incidents happen. While you can't always control how and when they hit your organization, you can control the response plan you have in place to deal with them. Given that, the overall goals for an incident response tool are to:

Automate alert and escalation behavior
Minimize Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR)
Reduce on-call and alert fatigue for your team
Speed resolution by immediately routing alerts to the most relevant team members
Provide reporting and tools to continually improve the response plan
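
To make the MTTA and MTTR goal concrete, here is a minimal sketch of how the two metrics could be computed from incident timestamps. The `Incident` record and its field names are hypothetical and not part of Splunk On-Call. The sketch also assumes MTTR is measured from the moment the incident is triggered to the moment it is resolved; some teams measure from acknowledgment instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative assumptions.
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta(incidents):
    """Mean Time To Acknowledge: average of (acknowledged - triggered)."""
    seconds = [(i.acknowledged - i.triggered).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents):
    """Mean Time To Resolve: average of (resolved - triggered) (assumed definition)."""
    seconds = [(i.resolved - i.triggered).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=30)),
    Incident(start, start + timedelta(minutes=6), start + timedelta(minutes=50)),
]

print(mtta(incidents))  # 0:05:00
print(mttr(incidents))  # 0:40:00
```

Tracking these two numbers before and after a change to schedules or escalation policies gives you a simple way to judge whether the change helped.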

Splunk On-Call is incident response software that helps you accomplish all these goals. However, a tool is only as good as the way you implement it and the processes you put around it. You need a framework for planning your Splunk On-Call implementation.

Solution

If you have recently purchased Splunk On-Call, or if you want to review and improve the way you use it now, the following is a guide to the questions you and your team should discuss so you can set schedules, escalation policies, and alert behavior that will best serve your organization.

Review your current and proposed processes

For each of the following, consider what is working well and what is not working well for your team. If any of these processes don't happen in your organization, think about why not and whether you would like to implement them.

What is your current incident resolution process?
What triggers an incident in your current system (for example, an alert from a monitoring tool or a service desk ticket)?
What different systems do you integrate with?
What team and scheduling systems do you use?
What do your escalation policies look like? How are they implemented?
How do responders collaborate during incident resolution?
How do you share status with key stakeholders?
How do you track actions and collect data for post-incident review?

Determine your needs and ideal workflows

How would you like to be notified of an alert?
How many times and how often should you be notified?
What should happen if no one responds to an alert notification?
Are any responders receiving notifications from too many alerts, or too many notifications from the same alert?
How could you reduce the number of notifications that any individual receives?
What sort of automated alert processing would you like to have?
Can any alerts be resolved automatically?
What reporting do you want to see in an incident response system?
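
When discussing how to reduce repeat notifications, it can help to picture one common approach: suppressing repeats of the same alert within a time window. The following is only an illustrative sketch. The alert key, the 10-minute window, and the function name `alerts_to_page` are assumptions made for the example and do not describe how Splunk On-Call processes alerts.

```python
from datetime import datetime, timedelta


def alerts_to_page(alerts, window=timedelta(minutes=10)):
    """Return the alerts that would page a responder.

    `alerts` is a list of (key, timestamp) pairs, where `key` identifies
    the alert source (hypothetical, e.g. "db-cpu"). A repeat of the same
    key is suppressed if it arrives less than `window` after the last
    page sent for that key.
    """
    last_paged = {}
    paged = []
    for key, ts in sorted(alerts, key=lambda a: a[1]):
        previous = last_paged.get(key)
        if previous is None or ts - previous >= window:
            paged.append((key, ts))
            last_paged[key] = ts
    return paged


t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    ("db-cpu", t0),
    ("db-cpu", t0 + timedelta(minutes=3)),   # repeat inside the window
    ("api-5xx", t0 + timedelta(minutes=4)),  # different alert
    ("db-cpu", t0 + timedelta(minutes=12)),  # window has passed
]

pages = alerts_to_page(alerts)
print(len(pages))  # 3
```

Here four incoming alerts produce three pages, because the second `db-cpu` alert arrives only three minutes after the first. The right window length depends on how quickly your team expects to act on each kind of alert.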

Create a plan

When you have brainstormed and aggregated the answers to the two question sets above, it's time to craft a plan. Decide on the following:

Which teams will respond to issues?
What kind of hourly coverage do you need? Some examples are:
- 24x7
- Working hours only
- Follow-the-sun coverage
How will you decide who takes what shift?
Will you have an incident commander? Who will it be?
How will an alert escalate if no one responds to it?
- Who will it escalate to?
- With what timing?
What kind of automated alert resolution can you implement?
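
The escalation questions above (who an alert escalates to, and with what timing) can be written down as a simple table of steps before you configure anything. The sketch below models such a table. The step names, the timings, and the `EscalationStep` type are hypothetical examples, not Splunk On-Call objects or API calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the alert was triggered
    notify: str         # who is notified at this step (illustrative names)


def notified_so_far(steps: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """List everyone who should have been notified after the given wait."""
    ordered = sorted(steps, key=lambda s: s.after_minutes)
    return [s.notify for s in ordered if s.after_minutes <= minutes_unacknowledged]


# Example policy: assumed timings, adjust to your own plan.
policy = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "team manager"),
]

print(notified_so_far(policy, 20))  # ['primary on-call', 'secondary on-call']
```

Writing the plan in this form makes gaps easy to spot, such as a long wait between steps or a final step with no one assigned.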

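With a clear plan in place, configuring Splunk On-Call becomes much more straightforward, because each schedule, escalation policy, and alert rule maps to a decision you have already made.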

Next steps

If you found this article useful and want to advance your skills, Splunk Education offers a 4.5-hour, instructor-led course on Splunk On-Call Administration. The hands-on labs in the course will teach you how to:

Create new policies and schedules
Create teams and add users and managers using both the UI and API
Create a routing key using best practices
Configure Splunk On-Call integrations
Differentiate between the types of reports
Track flow of incidents after the fact using the Incident Frequency report
Use the Alert Rules Engine to add annotations to an incident and transform an alert
Create outgoing Webhooks to extend product functionality
Use the public API portal to find details on the public API

See the Splunk Education course catalog to read the details about this and other Splunk On-Call courses and to register.