Splunk Lantern

Using SPL2 to conduct data quality analysis and validation

As a Splunk platform admin, when you build an index, you know what it should contain: what format events should be in, what fields they should have, and what those fields should look like. However, real-world indexes are full of corrupted events and incorrect or unexpected data, which makes data quality analysis and validation a significant challenge:

  • Data owners send poor quality data to the Splunk platform.
  • Malformed or incorrect data lowers analysis quality.
  • There is often no way to detect and block bad data from entering the system.
  • Making custom Python scripts for quality analysis and validation is complex.

You want to learn how SPL2 data types let you test that your data looks the way you expect it to.

How to use Splunk software for this use case

SPL2 is an evolution of SPL, not a completely new search language. It is available in Splunk Cloud Platform 10.2.0.2511 and higher, and in Splunk Enterprise 10.2 and higher for *nix operating systems.

Some versions of Linux are not supported in version 10.2. See the SPL2 Known issues for a list of these versions.

Switching to SPL2 requires minimal or no rewriting of the SPL queries you have already created. It is as performant as SPL and uses minimal additional processing power. SPL2 has the following characteristics:

  • More expressive
  • Multi-modal (SPL and SQL-like syntax)
  • Standardized
  • Unified across all Splunk products

How can SPL2 help?

Data types in SPL2 allow you to specify the format and range of values for a given piece of data. The data type definition can be:

  • Constrained, in which you use a regex to build a strict definition. While SPL allows only simple verification, such as distinguishing a string from a number, a constrained data type in SPL2 is more granular.
  • Advanced, in which you have more flexibility. For example, if you are working with JSON data, you might want to allow the absence or presence of a data point, such as a customer address.
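The two kinds of definition can be sketched roughly as follows. This is an illustrative sketch only: the type names (ipaddr, customer) are hypothetical, and the exact type-declaration syntax may differ by SPL2 version, so check the SPL2 documentation before using it.

```
// Hypothetical sketch -- names and exact syntax are illustrative, not authoritative.

// Constrained: a string that must match an IPv4-style regex
type ipaddr = string /^\d{1,3}(\.\d{1,3}){3}$/

// Advanced: a JSON-like object where the address field is optional,
// allowing for the absence or presence of that data point
type customer = {
    id: int,
    name: string,
    address: string?
}
```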

After you create your data types, the IS operator in your query checks the incoming data against them and warns you if there is a mismatch. That way, you can correct your search before you run it.
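A check like that might look something like the following. This is a hedged sketch: it assumes an ipaddr constrained type has already been defined, and my_index stands in for a dataset available to your module.

```
// Hypothetical: assumes a constrained type named "ipaddr" exists.
// The IS operator tests a field value against the data type.
$valid = from my_index
    | where src IS ipaddr        // keep only events whose src matches the type

$invalid = from my_index
    | where not(src IS ipaddr)   // surface events that fail validation
```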

Data types allow you to create a library of information about your data that you can share with the whole organization to ensure data quality. Additionally, through the use of custom eval functions, data types can be used throughout the data pipeline, including in Splunk Edge Processor and Splunk Ingest Processor, to further guarantee the quality of incoming data.
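As a rough illustration of the custom eval function idea, a type check could be wrapped in a function and reused across pipelines. Everything here is hypothetical: the function, the email type, and the field names are invented for the example, and the exact function-definition syntax may differ in your SPL2 version.

```
// Hypothetical sketch -- assumes a constrained "email" type exists.
function valid_email($e) {
    return $e IS email
}

// Reused in a pipeline (for example, in Edge Processor or Ingest Processor)
// to flag events carrying a malformed address before they reach the index:
$tagged = from my_index
    | eval email_ok = valid_email(customer_email)
```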

Watch the following video to learn how data types and custom functions work. 

Additional resources

Now that you have an introduction to some of the powerful features of SPL2, watch the full .Conf25 talk, A deep dive into SPL2: How does it actually compare to SPL?. In the talk, you'll learn about additional features and hear questions and answers from the live audience.

  • Written by Paul Vick (Senior Principal Engineer) and Vinayak Bhakta (Director of Engineering)
  • Splunk