Service-level definitions form a contract between a service provider and the organization it serves, defining particular aspects of the service such as quality, availability, and responsibilities. They consist of service-level objectives (SLOs), service-level agreements (SLAs), case priority levels, and incident response times. When Splunk is operated as a service offering, service-level definitions assure all teams and organizations that Splunk operations and response models meet their needs without impacting other areas.
Audience
- Developer
- Engineer
- Search expert
- Program manager
- Project manager
- User community
For more about these roles, see Setting Roles & Responsibilities.
Key terms
- Mission critical: An outage that impacts revenue, the ability to meet an agreed SLA or operational-level agreement (OLA), or a designated mission-critical data source or app.
- Core service: Indexing, searching, or alerting.
- Routine change: A low-impact, low-risk change that does not require a change review.
- Emergency change: A change required to resolve a P1 or P2 condition.
- Service request: A request to add new capacity that is less complex than a project, for example, new inputs or new add-on installs.
Guidelines for implementing service-level definitions
There are many factors to consider when making a service-level commitment. Keep the following guidelines in mind:
- Don't over-commit
- Don't under-commit
- Consider the time it takes to gather the necessary information
- Think about the process for incoming requests
SLO templates
SLOs provide expectations for maintenance planning, release planning, and communication with business partners. You can divide SLOs into administrative tasks (day-to-day activities) and implementation tasks.
Use the example below as is, or adapt it as necessary.
Administrative SLOs (day-to-day activities) | Target |
---|---|
Delete new user | 5 business days |
Add new user | 5 business days |
Uplifting user | 25 business days |
Dashboard creation support | 10 business days |
Report generation support | 5 business days |
Alert creation and changes support | 5 business days |
Create a new Active Directory group for access (external dependencies) | 25 business days |
Create new role | 15 business days |
Implementation SLOs | Target |
---|---|
First response to new support request | 1 day |
Data ingest (standard add-on) | 5 days |
Data ingest (custom add-on) | 10 days |
New app install | 1 day |
Universal forwarder deployment (does not include change control SLO) | 10 business days |
Data source monitor (http, WMI, TCP/UDP) | 2 days |
Implement new global knowledge object | 1 day |
Upload data into Splunk (for example, static log, file, CSV) | 5 business days |
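Targets expressed in business days can be turned into concrete due dates. The following is a minimal sketch, not part of the template itself: it assumes a Monday-to-Friday work week and ignores holidays, and the names `add_business_days`, `ADMIN_SLO_BUSINESS_DAYS`, and `slo_due_date` are hypothetical. The targets are copied from the administrative SLO table above.

```python
from datetime import date, timedelta

# Example targets copied from the administrative SLO table (business days).
ADMIN_SLO_BUSINESS_DAYS = {
    "Add new user": 5,
    "Dashboard creation support": 10,
    "Create new role": 15,
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday to Friday; holidays are not handled.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def slo_due_date(task: str, requested: date) -> date:
    """Look up the task's target and return the date it falls due."""
    return add_business_days(requested, ADMIN_SLO_BUSINESS_DAYS[task])


if __name__ == "__main__":
    # A request received on Monday, 1 January 2024
    print(slo_due_date("Add new user", date(2024, 1, 1)))  # 2024-01-08
```

If your organization observes holidays, a holiday calendar would need to be added to the weekday check before relying on these dates.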
SLA templates
SLAs are key service definitions for platform availability and incident response.
Use the example below as is, or adapt it as necessary.
SLAs | Target |
---|---|
Platform availability | 99.9% uptime for all core services (< 8.76 hours unplanned downtime per year) |
Incident first response | Based on priority |
Incident status update | Based on priority |
Restore loss of data feed (ingestion) | 1 business day |
Restore universal forwarder not reporting (standard) | 5 business days |
Restore universal forwarder not reporting (mission critical applications) | 1 business day |
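The downtime figure in the platform availability row follows directly from the uptime percentage. As a quick check, the sketch below converts an availability target into an unplanned-downtime budget. It assumes a 365-day year (8,760 hours), and the function name `downtime_budget_hours` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime allowed by an availability target."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # 99.9% over one year matches the figure in the SLA table
    print(round(downtime_budget_hours(99.9), 2))  # 8.76

    # The same target over a 30-day month, in minutes
    print(round(downtime_budget_hours(99.9, period_hours=30 * 24) * 60, 1))  # 43.2
```

Running the same calculation for a stricter target shows how quickly the budget shrinks, which is useful when deciding whether a commitment is realistic.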
Case priorities
Case priorities are assigned based on the technical importance of the problem. The case priorities below are examples only; use them as is, or adapt them as necessary.
Case priority levels
Case priorities may vary by service or source. The following are general guidelines.
Case priority level | Definition |
---|---|
P1 | A mission-critical outage for which there is no workaround. This may be a complete outage of a core service. |
P2 | A mission-critical outage for which a less-than-ideal workaround exists. This may be a partial outage of a core service. |
P3 | An outage or issue impacting a single user. |
P4 | Standard service requests or routine changes, for example, access requests, data onboarding, or app installation. |
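The priority definitions above can be expressed as a small decision function, which may help when triage is automated in a ticketing workflow. This is only a sketch: the names `Priority` and `classify_case` and the three boolean inputs are hypothetical, and it assumes that any case that is neither a mission-critical outage nor a service request falls to P3, which the table does not state explicitly.

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"


def classify_case(mission_critical_outage: bool,
                  workaround_exists: bool,
                  is_service_request: bool) -> Priority:
    """Map a case to a priority using the example definitions."""
    if mission_critical_outage:
        # P1 has no workaround; P2 has a less-than-ideal one
        return Priority.P2 if workaround_exists else Priority.P1
    if is_service_request:
        # Standard service requests and routine changes
        return Priority.P4
    # Assumption: every remaining outage or issue is treated as P3
    return Priority.P3


if __name__ == "__main__":
    print(classify_case(True, False, False).value)  # P1
```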
Incident response times
Incident response | P1 | P2 | P3 | P4 |
---|---|---|---|---|
First response | 1 hour | 2 hours | 4 hours | 1 business day |
Communicated updates | Every 2 hours | Every 4 hours | Every business day | Every 5 business days |
Resolution time | Within 4 hours | Within 2 business days | Within 3 business days | Agreement with the customer |
Business hours | 24 hours / 7 days per week | 8:00am to 5:00pm / 5 days per week, excluding holidays | 8:00am to 5:00pm / 5 days per week, excluding holidays | 8:00am to 5:00pm / 5 days per week, excluding holidays |
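To illustrate how the response targets interact with the business-hours row, the sketch below computes a first-response deadline for P1 through P3 cases. It rests on assumptions the table does not spell out: the P1 clock runs around the clock, the P2 and P3 clocks run only from 8:00am to 5:00pm Monday to Friday, and holidays are ignored. All names (`add_business_hours`, `first_response_deadline`, `FIRST_RESPONSE_HOURS`) are hypothetical. P4 is omitted because its target is expressed in business days.

```python
from datetime import datetime, time, timedelta

OPEN, CLOSE = time(8, 0), time(17, 0)  # 8:00am to 5:00pm
FIRST_RESPONSE_HOURS = {"P1": 1, "P2": 2, "P3": 4}  # from the table above


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Add `hours` to `start`, counting only Mon-Fri business hours."""
    remaining = timedelta(hours=hours)
    current = start
    while True:
        # Outside the work week or after closing: jump to the next day's opening
        if current.weekday() >= 5 or current.time() >= CLOSE:
            current = datetime.combine(current.date() + timedelta(days=1), OPEN)
            continue
        # Before opening: wait until the office opens
        if current.time() < OPEN:
            current = datetime.combine(current.date(), OPEN)
        end_of_day = datetime.combine(current.date(), CLOSE)
        available = end_of_day - current
        if remaining <= available:
            return current + remaining
        remaining -= available
        current = end_of_day


def first_response_deadline(priority: str, opened: datetime) -> datetime:
    """Return the latest acceptable first-response time for a case."""
    hours = FIRST_RESPONSE_HOURS[priority]
    if priority == "P1":
        return opened + timedelta(hours=hours)  # P1 is covered 24/7
    return add_business_hours(opened, hours)


if __name__ == "__main__":
    # A P2 case opened Friday at 4:00pm is due Monday at 9:00am
    print(first_response_deadline("P2", datetime(2024, 1, 5, 16, 0)))
    # 2024-01-08 09:00:00
```

Whether the clock pauses outside business hours is itself a commitment worth stating in the agreement, since the two readings can differ by a full weekend.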