Building a self-serve and scalable observability practice
In an ideal world, platform engineers would have a single internal developer platform to power all their developer teams, across multiple use cases and processes. Every developer team would be able to rely on that single extensible platform, allowing them to easily collaborate, and get what they need without compromise. This platform would also provide the central tools team visibility across data usage and role management for cost control.
In reality, most organizations deal with more than 10 different tools at a time (23 on average, according to our recent State of Observability report). Because many tools in the market are built for team-specific needs, each has its own standards and processes, which leads to developer teams operating in silos and spending longer in war rooms. On top of that, the lack of shared visibility prevents platform engineers from monitoring and controlling data usage.
The result is bad for the engineering team, the business, and the customer.
- Tool sprawl leads to analytics silos, which slow the pace of innovation, limit insights, and ultimately degrade digital experiences
- A cumbersome observability workflow also impacts the developer experience and creates toil
- Costs rise while the practice struggles to scale effectively
The solution is to build a self-service model with a mature observability platform. This leads to:
- Higher productivity and increased velocity for better and more competitive digital experiences
- Reduced effort and fewer errors
- Subject matter expertise becomes more readily available, with no need to rely on a central tools team
- More agile processes and workflows that allow people to accomplish what they need to on their own without waiting for someone else who might have a huge backlog
- Templates with good defaults and a single, repeatable process or pattern, rather than starting from zero every time
- Fewer bottlenecks, more insights from other teams, and better knowledge transfer of analytics
How Splunk software can help with this use case
The Splunk observability architecture, shown below, has a number of integrations and components that will help you establish observability as a service. The six sections in this article explain how to leverage them and what their benefits are.

OpenTelemetry
What is it?
OpenTelemetry (OTel) combines distributed tracing, metrics, and logging into a single set of system components and language-specific libraries. It is an industry-backed, extensible architecture that is completely vendor-agnostic and that makes robust, portable telemetry a built-in feature of cloud-native software. Specifically, the components include:
- Client libraries
  - Application instrumentation
  - Support for traces, metrics, events, and logs
  - Mobile and browser instrumentation
- Collector
  - Receive, process, and export data
  - Default way to collect from instrumented apps
  - Can be deployed as an agent or service
- Specification
  - API: Baggage, tracing, metrics
  - SDK: Tracing, metrics, resource, configuration
  - Data: Semantic conventions, protocol
What problems does it solve?
- Cloud providers need to give users more visibility because data volumes keep growing and insights are needed quickly.
- Standardizes how observability telemetry is collected and transmitted across the industry.
- Leverages the wider community for velocity, expertise, and innovation to manage the vast and ever-changing needs for observability across digital systems.
- End users want a vendor-agnostic solution while retaining the freedom to choose the tools that are right for their business.
- Everyone's use cases are different, which means that data portability and data flexibility are critical. Users need to be able to collect and analyze custom metrics.
- OpenTelemetry is easy to set up: you instrument only once, in the way that works best for you.

How to do it
This is the foundation for everything else you will do to create observability as a service. To get started adopting OpenTelemetry standards, go deep into a few workflows and services to understand the key questions and KPIs you need to answer within your organization and the metadata you need to support them. Implement a few first to be sure you get it right before moving on. Building out this maturity across a smaller set of key workflows provides immediate value to your business and also helps ensure a smoother mass roll-out without propagating technical debt.
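To make that first step concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK, exporting spans over OTLP to a local Collector. The service name, attribute names, and endpoint are illustrative assumptions; the point is that the metadata you attach here is what answers the key questions and KPIs you identified.

```python
# Minimal sketch of manual tracing with the OpenTelemetry Python SDK.
# Assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages
# are installed and a Collector is listening on localhost:4317 (OTLP/gRPC).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes carry the metadata (service, environment) that your
# KPIs and downstream filtering depend on. Values here are placeholders.
resource = Resource.create({
    "service.name": "checkout-service",       # hypothetical service name
    "deployment.environment": "staging",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def process_order(order_id: str) -> None:
    # One span per unit of work; span attributes answer the questions you
    # identified up front (who, what, how long, which tenant, and so on).
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

if __name__ == "__main__":
    process_order("A-1001")
```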
Metrics Pipeline Management
What is it?
Metrics pipeline management is a centralized ingestion data processing system that helps organizations tune and control data before it is ingested. Pipelines can be configured to filter, aggregate, downsample, and archive data depending on data volume, cardinality, retention, granularity, and urgency needs. In Splunk Infrastructure Monitoring, metrics pipeline management gives admins the flexibility and choice to control their metrics data at the point of ingest, without re-instrumentation. Using either the UI or the API out of the box, admins can create aggregations on specific metrics. They can also filter out unused data using dynamically defined policy rules.
From this refined set of aggregated custom metrics, SREs and observability experts continue working in their familiar workflows of detectors and charts while efficiently managing their volume of data without sacrificing service reliability. With this metrics aggregation and filtering capability, distributed teams maintain end-to-end visibility and extend their Splunk data platform without having to compromise on insights.
What problems does it solve?
- Manage your metric cardinality centrally without the hassle of updating your collectors or ingest pipelines.
- Control your costs by downsampling, aggregating, archiving, or filtering less-important data.
- Improve performance by aggregating away high cardinality attributes.
How to do it
Use the guidance in Splunk Help to create rules that manage data in ways that reflect how you actually use the data. Keep in mind the following:
- Real-time metrics:
  - Real-time alerting and query processing in single-digit seconds
  - Select a whole metric or only part of one
  - Low to high cardinality
- Aggregated metrics:
  - At a select time interval
- Archived metrics:
  - Cold storage; pull into real time when necessary
  - No direct visualization or alerting
  - Good for high-noise, low-value data
  - Ideal for occasional or infrequent access
- Drop metrics completely
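If you prefer to script these rules rather than manage them in the UI, the call shape looks roughly like the hypothetical sketch below, which posts a metric ruleset that aggregates a noisy metric by service and archives the raw series. The /v2/metricruleset path and the payload field names are assumptions for illustration; confirm the exact schema in the Splunk Observability Cloud API reference before relying on it.

```python
# Hypothetical sketch: creating a metric ruleset through the Splunk
# Observability Cloud API with Python's requests library. The endpoint
# path and payload fields are assumptions for illustration only;
# verify them against the current API reference.
import os
import requests

REALM = os.environ.get("SPLUNK_REALM", "us0")   # your org's realm
TOKEN = os.environ["SPLUNK_ACCESS_TOKEN"]       # org-level access token

ruleset = {
    "metricName": "container.cpu.utilization",  # example noisy metric
    "aggregationRules": [{
        "enabled": True,
        # Keep only the service-level dimension; per-pod cardinality is
        # aggregated away into a new, cheaper output metric.
        "matcher": {"type": "dimension", "filters": []},
        "aggregator": {
            "type": "rollup",
            "dimensions": ["service.name"],
            "dropDimensions": False,
            "outputName": "container.cpu.utilization.by_service",
        },
    }],
    # Route the unaggregated raw series to archived (cold) storage.
    "routingRule": {"destination": "Archived"},
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/metricruleset",
    headers={"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"},
    json=ruleset,
    timeout=30,
)
resp.raise_for_status()
print("Created ruleset:", resp.json().get("id"))
```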

Role and User Management
What is it?
Pre-defined role-based access control limits user privileges so that you don't have to worry about users breaking anything in a self-service model. You manage users and roles across Splunk Cloud Platform and Splunk Observability Cloud through a single administrator experience.
What problems does it solve?
- One admin source of truth: Use Splunk Cloud Platform to assign out-of-the-box roles for Splunk Cloud Platform and Splunk Observability Cloud.
- Additional layer of security: Easily make sure users in your organization have the right and consistent level of privilege and access with a consolidated experience across both Splunk Cloud Platform and Splunk Observability Cloud.
- More extensibility with third-party identity provider compatibility: Extend your existing Splunk Cloud Platform SAML integration to manage users and roles in Splunk Observability Cloud through a single integration.
How to do it
Use the guidance in Splunk Help to set up the Splunk platform as the identity provider for Splunk Observability Cloud. The following roles are available after you complete the setup.
| Role | Privileges | Persona |
|---|---|---|
| Admin | Full admin privileges to manage the observability environment and other users | Observability administrator |
| Power | Create and edit dashboards, detectors, and other resources. Cannot create tokens, use integrations, or access billing/usage | Developers, SREs |
| Usage | Read-only privileges. Access to all data, including subscription usage and billing | Billing, finance, MSPs |
| Read-only | Read access only to resources like dashboards and detectors. Cannot view tokens | External teams, business owners |
Observability As Code
What is it?
An observability-as-code approach includes the following elements:
- Leveraging automation and orchestration for observability and monitoring
- Ensuring validation, checks, and visibility around observability changes within the organization
- Using a Terraform provider or APIs for configuration
- Keeping a comments history of why things are the way they are, an essential reference as people move on from roles or companies
- Using access tokens
What problems does it solve?
- Sets the foundation for templating and reusable observability assets
- Allows easy rollback to known working state when things go wrong
- Simplifies ongoing maintenance and upkeep of observability assets
Focusing specifically on tokens:
- Tokens are a safety net. If one data source becomes too noisy, you can stop that integration without severing everyone else's. You can allocate tokens per team, per service, or per business unit.
- You can measure and control usage by token. You can set token and quota limits on all billable metrics to visualize and control consumption by team or application.
- Maintain granular, real-time visibility into all billable metrics and receive proactive alerts as you approach consumption limits.
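As a concrete example of observability as code, the sketch below defines a simple detector in a file that can live in version control and creates it through the Splunk Observability Cloud REST API using a team-scoped access token. The service, metric, and threshold are placeholder assumptions, and the field names should be checked against the detector API reference; the Terraform provider offers an equivalent declarative path.

```python
# Minimal sketch: a detector defined as code and created via the Splunk
# Observability Cloud API. The service, metric, and threshold below are
# placeholders; keep this file in version control with a comment history
# explaining why the thresholds are what they are.
import os
import requests

REALM = os.environ.get("SPLUNK_REALM", "us0")
TOKEN = os.environ["SPLUNK_ACCESS_TOKEN"]   # team- or service-scoped token

detector = {
    "name": "checkout-service error rate",  # hypothetical service
    "description": "Alert when the error count stays above 5 for 10 minutes.",
    "programText": (
        "errors = data('http.server.errors').sum(by=['service.name'])\n"
        "detect(when(errors > 5, lasting='10m')).publish('too_many_errors')"
    ),
    "rules": [{
        "detectLabel": "too_many_errors",
        "severity": "Major",
        "notifications": [],                # add team channels here
    }],
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"},
    json=detector,
    timeout=30,
)
resp.raise_for_status()
print("Created detector:", resp.json().get("id"))
```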
Team Landing Page
What is it?
A team landing page in Splunk Observability Cloud lets you:
- Highlight any active alerts specific to the team
- View relevant detectors
- Create a ‘go-to’ list of the team’s dashboards and mirrors
- Favorite your most frequently viewed dashboards
What problems does it solve?
- Helps you find exactly what you need for your work instead of sifting through everything
- Saves on duplicated content because everyone can leverage what one person has built
- Eliminates the need to update many copies of a dashboard when an integration changes
How to do it
Use the guidance in Splunk Help to create a team landing page and mirrored dashboards. Note the following about mirrored dashboards:
- Mirrors of the same dashboard can be added to multiple dashboard groups (or even multiple times to one dashboard group)
- Users can make changes to the dashboard, as long as they have write permissions for the dashboard
- Changes made to the dashboard itself propagate to all of its mirrors
Data Usage Visibility
What is it?
Metrics usage analytics gives you detailed, self-serve visibility into your metrics usage. Out-of-the-box dashboards also provide insights into usage and license impact across the platform. These features allow you to:
- Understand which metrics, dimensions, and tokens generate the most metric time series
- Determine how or if metric data is consumed in charts and detectors
- Identify high cardinality dimensions contributing to metric usage
- Optimize consumption by filtering or aggregating metrics
What problems does it solve?
- Surfaces noisy metrics that don't provide value
- Helps you determine what rules you should establish around data ingestion
How to do it
Here is what you will want to keep track of:
- Metrics usage analytics page: Located in Metrics > Usage Analytics
- Splunk Infrastructure Monitoring: Located in-product at Settings > Subscription Usage
  - Monthly average usage and detailed hourly usage CSVs
- Splunk Application Performance Monitoring: Located in-product at Settings > Subscription Usage
  - Monthly average usage and detailed per-minute usage CSVs
  - Usage Analyzer button to drill into “expensive” spans and traces
- Splunk Real User Monitoring and Splunk Synthetic Monitoring: Located in-product at Dashboards > Org metrics > RUM or Synthetic Monitoring
  - RUM and Synthetics usage dashboards
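Once you download a detailed hourly usage CSV, a few lines of scripting can turn it into a per-token leaderboard that tells you where to aim pipeline rules or token limits first. This sketch assumes hypothetical column names ("token" and "mts_created"); rename them to match the headers in your actual export.

```python
# Sketch: summarize a detailed hourly usage CSV export by token.
# Column names below ("token", "mts_created") are assumptions;
# rename them to match the headers in the CSV you download.
import csv
from collections import defaultdict

usage_by_token = defaultdict(int)

with open("hourly_usage.csv", newline="") as f:
    for row in csv.DictReader(f):
        usage_by_token[row["token"]] += int(row["mts_created"])

# Print the ten noisiest tokens so you know where to aim
# metrics pipeline rules or token limits first.
for token, mts in sorted(usage_by_token.items(),
                         key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{token:40s} {mts:>12,d}")
```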
Next steps
Now that you have an idea of how to make your observability practice more scalable and self-service, watch the full .conf25 talk, Build a self-serve, scalable observability practice with Splunk. In the talk, you'll get more detail and practical tips for working through these six steps.
In addition, you might find these Splunk resources helpful:
- Splunk Resource: State of observability 2024: Charting the course to success
- Splunk Webinar: Mission archivable With Splunk & Atlassian
- Splunk How-To Video: Introduction to the Splunk Terraform provider
- Splunk How-To Video: Infrastructure and Observability as Code
- Splunk Community Blog: Best practices for managing data volume with the OpenTelemetry Collector
- Splunk Blog: Data storage costs keeping you up at night? Meet archived metrics
- Splunk Resource: Enable self-service observability
- Author Blog: OTEL Me What’s All The Fuss About?


