You run an operations center for a global retailer. The center is open 24 hours a day, every day of the year to accommodate business from anywhere in the world. Events come in from multiple sources and must be triaged, prioritized, and routed to the appropriate resolver to minimize service disruption.
Your processes are heavily manual, which makes them prone to error because of the complexity of maintaining knowledge, tooling, and resolver escalation paths. Misrouted events have steadily increased and incident recovery times have grown longer, especially at peak times such as holiday shopping. You need a tool that can unify alerts and simplify the resolution process.
Splunk On-Call can improve many of your operations center processes. Within as little as six weeks, you can integrate all your monitoring tools to provide your team with a consolidated view of events and have all your teams onboarded and working.
Before you begin a Splunk On-Call implementation, you should complete the following:
- Identify an executive sponsor. Your Chief Information Officer or Head of Technical Support can be good options. This person will help ensure you have the funding and the process change support you need.
- Identify your timeline. You might want to start using Splunk On-Call before you are able to onboard all teams and processes.
- Map out which teams and individuals have access to which systems and processes so that those permissions can be replicated accurately in Splunk On-Call.
- Map your current triage, prioritization, and routing processes. You cannot replicate these into Splunk On-Call if you don't know what they are. In addition, it is possible that not all of them can be automated. You'll need to know which processes need to remain in their current workflows. You might need the help of system architects and engineers at this stage. The following diagram is a sample output for this stage:
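Alongside a diagram, it can help to record the mapped rules in a machine-readable form, so that you can see at a glance which steps are candidates for automation and which must stay in their current workflows. The following is a minimal sketch in Python, not a Splunk On-Call configuration format. The source names, severity scale, and resolver names are all hypothetical and stand in for whatever your own mapping exercise produces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingRule:
    source: str        # monitoring tool that raised the event (hypothetical name)
    min_severity: int  # lowest severity this rule handles (1 = low, 3 = critical)
    resolver: str      # team or path that should receive the event (hypothetical)
    automatable: bool  # False if the step must remain in its current manual workflow


# Hypothetical output of the mapping exercise.
RULES = [
    RoutingRule("payments-monitor", 3, "payments-oncall", True),
    RoutingRule("payments-monitor", 1, "payments-triage", True),
    RoutingRule("store-network", 2, "network-oncall", True),
    RoutingRule("warehouse-scanner", 1, "site-manager-phone-tree", False),
]


def route(source: str, severity: int) -> Optional[RoutingRule]:
    """Return the most specific matching rule (highest severity threshold), or None."""
    candidates = [
        rule for rule in RULES
        if rule.source == source and severity >= rule.min_severity
    ]
    return max(candidates, key=lambda rule: rule.min_severity, default=None)


if __name__ == "__main__":
    for source, severity in [("payments-monitor", 3), ("store-network", 1)]:
        rule = route(source, severity)
        print(source, severity, "->", rule.resolver if rule else "no rule: needs manual triage")
```

Events that return no rule, or a rule marked as not automatable, are the ones to review with your system architects and engineers before onboarding.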
The Splunk On-Call Incident Pane
The Incident Pane is where your team members spend most of their time. When responders receive an incident in this pane, their immediate options are to snooze it, reroute it to someone else, or acknowledge it. If they acknowledge it, all the other features in this pane help improve their response by giving them the information they need to resolve the incident in a single tool.
Right panel. This panel keeps track of how long the incident has been open, which helps responders manage service-level agreements. Responders can also see what integrations are available for notifications, what other team members are working on this issue, and any response or escalation policies that are relevant.
Center panel. In addition to showing the basic information about an alert, such as the systems involved, responders can view similar alerts, which might help them decide on an appropriate response more quickly. This information might also help them see other simultaneous incidents that could be a related cause or effect of the one they are actively managing. Responders can also see a list of non-responder stakeholders, so they know who needs to be informed about the incident outside of the response team.
Left panel. This panel keeps a running list of the incident history. Any action that the responder takes in Splunk On-Call and any automated responses are recorded here, and the responder can add comments as needed.
Other Splunk On-Call features
While most of your users will spend their time in the Incident Pane, as an administrator or manager, the following features of Splunk On-Call can help you improve overall operations.
Use the timeline for important operational information, not just incident review. Use the filters to track code merges, releases, and other significant events that are important to your organization.
- Post-Incident Review. Use this report to understand the issues your organization or department has faced to find ways to improve the overall response or opportunities for training and operational improvement.
- Response Metrics. Review the path to recovery of each incident to learn how to automate remediation or how you can improve the response time. Review the business impact of each incident to learn how to prioritize permanent resolutions for recurring problems.
- On-Call Review. Use this report to ensure that the workload is spread out evenly among your team members and no one is overworked or underutilized.
- Incident Frequency. Use this report to understand recurring failures in your systems that might need to be addressed at a higher level.
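To make the questions behind the On-Call Review and Incident Frequency reports concrete, the following Python sketch computes the same kind of summary from a small list of incident records. It is an illustration only. It does not call Splunk On-Call, and the record layout, service names, and responder names are hypothetical.

```python
from collections import Counter

# Hypothetical incident records: (service, responder, minutes_spent)
SAMPLE_INCIDENTS = [
    ("checkout", "alice", 45),
    ("checkout", "alice", 30),
    ("search", "bob", 20),
    ("checkout", "carol", 60),
    ("inventory", "alice", 15),
]


def incident_frequency(incidents):
    """Count incidents per service to surface recurring failures."""
    return Counter(service for service, _, _ in incidents)


def workload_minutes(incidents):
    """Total the minutes each responder spent, to check how evenly work is spread."""
    totals = Counter()
    for _, responder, minutes in incidents:
        totals[responder] += minutes
    return totals


if __name__ == "__main__":
    print("Incidents per service:", incident_frequency(SAMPLE_INCIDENTS).most_common())
    print("Minutes per responder:", workload_minutes(SAMPLE_INCIDENTS).most_common())
```

In this sample, one service accounts for most incidents and one responder carries the largest share of the time, which are the two patterns these reports are meant to expose.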
Use the Teams section of Splunk On-Call to improve operational efficiency by:
- Syncing team member calendars (such as Outlook or GCal) to rotation schedules
- Sharing escalation policies with the team so they always have the information they need to make good decisions
- Creating scheduled overrides to account for team members calling out sick
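The interaction between a rotation schedule and a scheduled override can be sketched as follows. This is a simplified Python model written for illustration, not how Splunk On-Call implements schedules. The weekly rotation, the start date, and the names are assumptions.

```python
from datetime import date

# Hypothetical weekly rotation; week 0 belongs to ROTATION[0].
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday

# Hypothetical scheduled overrides: (first_day, last_day, covered, covering)
OVERRIDES = [
    (date(2024, 1, 9), date(2024, 1, 10), "bob", "carol"),
]


def on_call(day: date) -> str:
    """Return who is on call on a given day, applying any matching override."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    scheduled = ROTATION[weeks_elapsed % len(ROTATION)]
    for first_day, last_day, covered, covering in OVERRIDES:
        if covered == scheduled and first_day <= day <= last_day:
            return covering
    return scheduled


if __name__ == "__main__":
    for day in (date(2024, 1, 3), date(2024, 1, 9), date(2024, 1, 11)):
        print(day.isoformat(), "->", on_call(day))
```

The override replaces the scheduled person only for the days it covers, and the rotation itself is left unchanged, which is why an override suits short absences such as sick days.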
Expected outcomes and benefits
After your organization has set up Splunk On-Call for incident notifications, you can expect the following:
- Events will be automatically routed and escalated to the appropriate resolver teams without delay, which improves response times.
- Resolver teams will be empowered to ensure their on-call rotas are accurate and to make rapid changes when required based on their resource availability, which reduces errors and unnecessary escalations.
- You will be prepared to adopt an overarching Command & Control posture with a focus on:
  - coordinating overall incident response
  - managing your stakeholders more effectively
  - consolidating underpinning tooling to improve both richness of functionality and efficiency
Measuring improvements in the areas described above is useful for reporting to executive leadership.
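Two measures often used for this kind of reporting are mean time to acknowledge (MTTA) and mean time to resolve (MTTR). The following Python sketch shows how they can be computed from incident timestamps. The records are hypothetical sample data, and the choice of these two metrics is an assumption rather than something Splunk On-Call prescribes.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved)
SAMPLE_TIMELINES = [
    (datetime(2024, 11, 29, 10, 0), datetime(2024, 11, 29, 10, 4), datetime(2024, 11, 29, 10, 40)),
    (datetime(2024, 11, 29, 14, 0), datetime(2024, 11, 29, 14, 10), datetime(2024, 11, 29, 15, 0)),
]


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def mtta(timelines) -> float:
    """Mean time from trigger to acknowledgement, in minutes."""
    return mean(minutes_between(triggered, acked) for triggered, acked, _ in timelines)


def mttr(timelines) -> float:
    """Mean time from trigger to resolution, in minutes."""
    return mean(minutes_between(triggered, resolved) for triggered, _, resolved in timelines)


if __name__ == "__main__":
    print(f"MTTA: {mtta(SAMPLE_TIMELINES):.1f} minutes")
    print(f"MTTR: {mttr(SAMPLE_TIMELINES):.1f} minutes")
```

Comparing these figures for a period before and after onboarding, for example two successive holiday peaks, gives executive leadership a simple before-and-after view.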
After you have implemented Splunk On-Call for one use case, look into what else you can do with it. Splunk On-Call has a wide variety of applications, and the more you use, the more value you get from your investment.
Now that you understand the basics of using Splunk On-Call, watch the full demo in this .conf22 talk, How an Online Retailer Group Tamed Black Friday. Then, download the add-on and get started in your deployment.
These additional Splunk resources might help you understand and implement this use case: