
Splunk Lantern

Using grok custom classifiers to improve your Federated Search experience

Log files in structured formats like Parquet and JSON are easily searched through Federated Search for Amazon S3 using key values to identify, filter, and aggregate data. Plain text logs, however, consist of unstructured text with no pre-defined keys or mandatory delimiters. As a result, each entire event is treated as a single field, which limits the ability to query field values and often requires additional processing in the Splunk platform to extract fields.

In this article, you'll learn how to use regular expression-based grok patterns to define your data format and improve your search experience for unstructured logs.

This article primarily applies to unstructured, plain text data in S3. Data routed to S3 by Edge Processor or Ingest Processor uses the HEC format.

Example

When you search unstructured data without a custom classifier, the entire event is treated as a single field:

| sdselect event from federated:squid_nogrok


This means you need additional processing before you can easily filter, aggregate, or analyze specific field values within your log data.

By creating a custom classifier with a grok pattern, you can search your unstructured data with individual fields extracted:

| sdselect * from federated:squid


This article uses Squid proxy-formatted events. The format is:

timestamp duration client_address result_code/status_code bytes method URL user hierarchy_code/server mime_type

An example event:

1769025600.123 155 192.168.1.12 TCP_MISS/200 4200 GET http://www.buttercupgames.com/portal/home - DIRECT/10.10.1.5 text/html

If you don't create your own classifier (grok pattern), running the AWS Glue crawler against the sample data set results in a Glue table with an UNKNOWN classification and an empty schema.


How to use Splunk software for this use case

Step 1: Create a grok pattern

Use the grok pattern below to create a custom classifier in the Glue crawler and run it against your sample data set:

%{NUMBER:timestamp} %{NUMBER:duration} %{IP:client_address} %{DATA:result_code}/%{NUMBER:status_code} %{NUMBER:bytes} %{DATA:method} %{URI:URL} %{DATA:user} %{DATA:hierarchy_code}/%{DATA:server} %{GREEDYDATA:mime_type}

Grok patterns follow the %{PATTERN:field-name} format, where PATTERN is a built-in or custom-defined pattern and the field name is arbitrary. The available built-in patterns are listed in the Writing grok custom classifiers documentation, which also includes additional information on creating custom patterns.
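To see how the pattern maps onto an event, the grok pattern above can be approximated with a plain regular expression. The sketch below is illustrative only (grok's DATA and URI patterns are simplified to runs of non-space characters) and parses the sample event shown earlier in this article:

```python
import re

# Rough regex equivalent of the grok pattern; grok's DATA and URI patterns
# are approximated here with \S+ (any run of non-space characters).
SQUID_RE = re.compile(
    r"(?P<timestamp>\d+(?:\.\d+)?) (?P<duration>\d+) "
    r"(?P<client_address>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<result_code>\S+)/(?P<status_code>\d+) (?P<bytes>\d+) "
    r"(?P<method>\S+) (?P<URL>\S+) (?P<user>\S+) "
    r"(?P<hierarchy_code>[^/ ]+)/(?P<server>\S+) (?P<mime_type>.+)"
)

event = ("1769025600.123 155 192.168.1.12 TCP_MISS/200 4200 GET "
         "http://www.buttercupgames.com/portal/home - DIRECT/10.10.1.5 text/html")

# Each named group becomes a field, just as the grok classifier produces
# a column per %{PATTERN:field-name} term.
fields = SQUID_RE.match(event).groupdict()
# fields["result_code"] is "TCP_MISS", fields["status_code"] is "200", etc.
```

This is only a mental model for how grok tokenizes the event; the Glue classifier itself does the equivalent work when the crawler runs.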

Step 2: Create an AWS Glue classifier

With your custom grok pattern, create an AWS Glue classifier and associate the classifier with a Glue crawler as described in step 2 of the Configuring a crawler guide in the AWS documentation.
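If you prefer scripting over the console, the classifier can also be created with the AWS SDK. The sketch below uses boto3's create_classifier call; the classifier name "squid-logs" and the region are assumed, illustrative values, not names from this article's environment:

```python
# Sketch: create the grok classifier programmatically with boto3 (the AWS
# SDK for Python). Adjust names and region to your own environment.
GROK_PATTERN = (
    "%{NUMBER:timestamp} %{NUMBER:duration} %{IP:client_address} "
    "%{DATA:result_code}/%{NUMBER:status_code} %{NUMBER:bytes} %{DATA:method} "
    "%{URI:URL} %{DATA:user} %{DATA:hierarchy_code}/%{DATA:server} "
    "%{GREEDYDATA:mime_type}"
)

CLASSIFIER = {
    "GrokClassifier": {
        "Name": "squid-logs",        # assumed classifier name
        "Classification": "squid",   # becomes the table's classification
        "GrokPattern": GROK_PATTERN,
    }
}

def create_classifier(region: str = "us-east-1") -> None:
    import boto3  # imported here so the module loads without AWS credentials
    glue = boto3.client("glue", region_name=region)
    glue.create_classifier(**CLASSIFIER)
```

After the classifier exists, associate it with your crawler as described in the AWS documentation linked above.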

Step 3: Run the crawler

After creating the crawler, run it against your data set to create the Glue table for this data source. When the run completes, the output should include a Glue table with the classification name you specified in the custom classifier.
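The crawler run can likewise be started and monitored from code. This is a sketch using boto3's start_crawler and get_crawler calls; the crawler name is whatever you chose when creating the crawler:

```python
import time

def run_crawler(name: str, poll_seconds: int = 30) -> str:
    """Start a Glue crawler and block until it returns to the READY state.

    Returns the status of the last crawl (for example, "SUCCEEDED").
    """
    import boto3  # imported here so the module loads without AWS credentials
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    # The crawler reports RUNNING, then STOPPING, then READY when finished.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)
    return glue.get_crawler(Name=name)["Crawler"]["LastCrawl"]["Status"]
```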


Reviewing the Glue table should reveal a schema that aligns with the classifier. The Glue table and schema enable you to query the unstructured text as if it were structured data.

  • By default, the data type for each column is string. However, fields can be cast to supported types in the grok pattern using the format %{PATTERN:field-name:data-type} (for example, %{NUMBER:bytes:int}), or by updating the Glue table after creation.
  • Data type is important when using statistical or conversion functions. Splunk's Federated Search for Amazon S3 automatically casts data types, as in the following query, but explicitly setting the correct type can be helpful when searching from tools like Athena:

    | sdsql catalog=glue:arn:aws:glue:us-east-1:0123456789123:catalog database=default table="squid" providerType=aws_s3 reuse_search_results=1 "SELECT sum(TRY_CAST(bytes AS double)) AS \"sum(bytes)\" FROM \"squid\" WHERE (result_code = 'TCP_MISS') LIMIT 100000"
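To illustrate why the data type matters: comparing or aggregating byte counts stored as strings gives lexicographic results rather than numeric ones. A minimal Python illustration:

```python
byte_counts = ["4200", "512", "99"]  # column values left as untyped strings

# Lexicographic comparison: "99" sorts after "512" and "4200"
assert max(byte_counts) == "99"

# Cast to a numeric type first (the same idea as TRY_CAST(bytes AS double)
# in the sdsql query above) and the results behave as intended
assert max(int(b) for b in byte_counts) == 4200
assert sum(int(b) for b in byte_counts) == 4811
```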

Step 4: Create a federated provider and index in the Splunk platform

After creating a federated provider and federated index in the Splunk platform for this Glue table and data set, the data returned in your SPL or SPL2 queries will be in a structured format, reducing or eliminating the need for additional processing.


Additional resources

These resources might help you understand and implement this guidance: 

  • Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants with a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services per their Success Plan. Engage the ODS team at ondemand@cisco.com if you would like assistance.