Using federated search for Amazon S3 (FS-S3) with Edge Processor

 

Splunk Federated Search for Amazon S3 (FS-S3) allows you to search your data in Amazon S3 buckets directly from Splunk Cloud Platform without the need to ingest it. Ingest Actions (IA), Edge Processor (EP), and Ingest Processor (IP) are Splunk features and products that offer the capability to route data to customer-managed Amazon S3 buckets.

In this article, we’ll identify the fundamental considerations for using federated search with data routed to object storage through Splunk Edge Processor. We’ll review unique considerations for:

  • EP pipelines
  • AWS Glue
  • Splunk Cloud Platform

Prerequisites

You should ensure you are familiar with Amazon S3, AWS Glue, Amazon Athena, Splunk’s ingest actions, and Splunk’s federated search for Amazon S3. If any of these topics are not familiar, consider taking a few minutes to review them or make sure the documentation is handy. You can also find additional information about partitioning in the article Partitioning data in S3 for the best FS-S3 experience.

EP pipelines support Search Processing Language version 2 (SPL2). If the SPL2 syntax is new to you, review the SPL2 Search Reference documentation.

Simple pipeline and output

Splunk Edge Processor provides the capability to route data to S3 through defined pipelines. Configuring EP and authoring pipelines are well covered in the documentation, so we won't repeat them here, but we will show the syntax used in the following examples.

In the example pipeline below, we take a data source ($source) and route it to S3 ($destination). Both $source and $destination are defined during the pipeline creation.

$pipeline =
| from $source
| into $destination

The Splunk Edge Processor documentation describes how EP writes data to S3 in the following bucket and folder structure:

<bucket_name>/<folder_name>/<year>/<month>/<day>/<instance_ID>/<file_prefix>-<UUID>.json

EP supports the use of an optional prefix prior to the year key, which can be useful for reusing a bucket for multiple subsets of data, applying different partition schemes, and supporting individual Amazon S3 lifecycle rules. File prefixes can be used to describe the data further, but are not available as a partition.
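
For example, hypothetical folder names could separate two subsets of data in a single bucket, each with its own partition scheme and lifecycle rules:

<bucket_name>/linux_audit/<year>/<month>/<day>/<instance_ID>/<file_prefix>-<UUID>.json
<bucket_name>/windows_events/<year>/<month>/<day>/<instance_ID>/<file_prefix>-<UUID>.json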

The folder name and file prefix are optional keys. In this example, we’ll omit the folder and prefix resulting in data output in the following format:

<bucket_name>/<year>/<month>/<day>/<instance_ID>/<UUID>.json

Remove implicitly added fields

Splunk Edge Processor might implicitly write fields such as _path and other metadata to the output it sends to S3. We recommend removing implicit fields to maintain control of the schema that EP writes to Amazon S3 and to prevent special characters from impacting table creation. Implicit fields can include special characters in their key names, such as ':', that need special handling to be supported by the AWS Glue catalog. Specifically, Splunk protocol metadata uses the key::value format.

The example below shows an event that was written to S3 using the minimal from $source | into $destination syntax. Note the implicit _path field included in the fields object:

{
  "time": "1720117251.000",
  "host": "ip-10-0-0-81.ec2.splunkit.io",
  "source": "/var/log/audit/audit.log",
  "sourcetype": "linux_audit",
  "index": "default",
  "fields": { "_path": "/var/log/audit/audit.log" },
  "event": "type=CRYPTO_KEY_USER msg=audit(1720117251.966:268): pid=9306 uid=0 auid=2023 ses=1 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=destroy kind=session fp=? direction=from-server spid=9322 suid=2023 rport=58640 laddr=10.0.0.81 lport=22  exe=\"/usr/sbin/sshd\" hostname=? addr=10.0.0.59 terminal=? res=success'\u001dUID=\"root\" AUID=\"myuser\" SUID=\"myuser\""
}

The following example pipeline removes the _path field and any implicit fields with key names that include '::':

$pipeline =
| from $source
| where _raw.fieldName="some value"
| fields - '_path', '*::*'
| into $destination

Take care not to remove fields too broadly: using fields - '_*' also removes the _raw field and results in EP writing metadata with empty events to your S3 bucket, organized in a year=1970/month=01/day=01/ folder structure.

An example event written to S3 should look like:

{
  "time": "1720121403.000",
  "host": "ip-10-202-39-169.ec2.splunkit.io",
  "source": "/var/log/audit/audit.log",
  "sourcetype": "linux_audit",
  "index": "default",
  "fields": {},
  "event": "type=CRED_DISP msg=audit(1720121403.300:358): pid=29574 uid=0 auid=2023 ses=4 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_env,pam_localuser,pam_unix acct=\"root\" exe=\"/usr/bin/sudo\" hostname=? addr=? terminal=/dev/pts/1 res=success'\u001dUID=\"root\" AUID=\"myuser\""
}

Index-time fields

Index-time fields in the JSON payload might be defined as columns by the AWS Glue crawler or DDL, allowing these fields to be searchable in FS-S3 queries. Use eval statements in your pipelines to create new fields in events.

Extracting values from JSON events can also be accomplished using the json_extract function or the rex command, as shown in the pipeline below.

$pipeline =
| from $source
| eval indextimefieldx = json_extract(_raw, "<your_json_field>")
| fields - '_path', '*::*'
| into $destination

The resulting event written to S3 includes the new index-time field in the fields object:

{
  "time": "<epoch time>",
  "host": "<host>",
  "source": "<source>",
  "sourcetype": "<sourcetype>",
  "index": "<index>",
  "fields": {
    "<indextimefieldx>": "<value>"
  },
  "event": "<event string>"
}

Any new fields created using Edge Processor will automatically become indexed extractions at index time. Indexed extractions, especially high cardinality fields, can significantly increase the size of your indexes.

Also keep in mind that data intended for FS-S3 consumption should follow a schema shared by all records in the Amazon S3 location you will query with FS-S3.

If shared modules are used to build pipelines that operate on the same source type, special care must be taken to remove the extra fields or explicitly build the desired schema prior to sending to the Splunk platform or Amazon S3.

Writing your data to a temporary location first, so you can verify that events contain the fields you expect, is recommended.

Remember to remove any temporary fields (created with eval or rex) that were used only in pipeline logic and should not be written to S3, as in the sketch below.
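
A minimal sketch of this pattern, using a hypothetical temporary field named temp_value and the same removal syntax shown earlier:

$pipeline =
| from $source
// create a temporary field used only to drive pipeline logic
| eval temp_value = json_extract(_raw, "<your_json_field>")
| where temp_value="some value"
// drop the temporary field, along with implicit fields, before writing to S3
| fields - 'temp_value', '_path', '*::*'
| into $destination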

Pipeline updates and deployment considerations

When updating Splunk Edge Processor pipelines used to write data to Amazon S3, use dedicated test pipelines and temporary S3 locations to test and review output. This will help avoid unintended data output in production storage locations and limit interruptions to data collection.

Also, keep in mind that Edge Processor pipelines are deployed in place, and deployment is not coordinated across multiple instances. This might result in ingestion interruptions while pipelines are deployed. In multiple-instance deployments, deploying new pipelines to new EP instances and then shutting down the EP instances running the older pipelines is one way to help eliminate ingestion interruptions.

Glue catalog considerations

Creating a table in the AWS Glue catalog

The schema implemented by EP can be parsed with the AWS Glue crawler. Creating the table with DDL statements in Athena is also an option for users familiar with DDL, and specialized use cases might require creating or modifying the AWS Glue table definition through DDL.

After configuring the EP pipeline to send to your S3 bucket, run the AWS Glue crawler to examine the data in S3 and generate the AWS Glue table definition. The AWS Glue crawler configuration below is suitable for most use cases.

In the Glue crawler’s Set output and scheduling configuration screen, expand the Advanced options and do the following:

  1. Select the Update the table definition in the data catalog radio button.
  2. Select the Update all new and existing partitions with metadata from the table checkbox.
  3. Select the Create partition indexes automatically checkbox.

Leave the other fields as the default settings.


After the AWS Glue crawler run completes, inspect the table definition. It should look similar to one of the two examples below.

[Screenshot: example AWS Glue table definition generated by the crawler]

In the second example, an index-time field has been defined, so the data type of the fields column is struct. The struct data type also applies to events of complex types.

[Screenshot: example AWS Glue table definition with a struct data type for the fields column]
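
In DDL terms, a struct-typed fields column for the hypothetical index-time field from the earlier pipeline example could be declared like this:

`fields` struct<indextimefieldx:string>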

The Glue crawler should be scheduled to run so that new partitions are added, with the schedule aligned to partition creation. In the case of EP, partitions are created daily as new events are written, so scheduling the crawler to align with the creation of new partitions ensures they are promptly detected and included in future searches.

Alternatively, it is possible to define the table using Data Definition Language (DDL) in Athena without using the Glue crawler at all. Manually creating the table definition allows complete control over the table, including partitions and permits remapping of special characters (see Handling field names with special characters). The sample Athena query below shows how Glue table columns and partitions can be defined using DDL.

CREATE EXTERNAL TABLE `my_table_name`(
  `time` string, 
  `host` string, 
  `source` string,
  `sourcetype` string,
  `index` string,
  `fields` string,
  `event` string
)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `instanceid` string)

When using DDL defined tables, new partitions created by EP must also be created in the table metadata. The ALTER TABLE statement below shows how this can be accomplished.

ALTER TABLE <REPLACE-NAME> ADD PARTITION (year = 'REPLACE', month = 'REPLACE', day = 'REPLACE', instanceid = 'REPLACE');

Unlike the Glue crawler, which has built-in scheduling, DDL-defined tables require you to implement your own solution to update the table metadata as new partitions are created.
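
For example, a scheduled job could run a statement like the following sketch. The IF NOT EXISTS clause makes it safe to re-run, and the LOCATION follows the EP output structure described earlier; all values are placeholders:

ALTER TABLE my_table_name ADD IF NOT EXISTS
PARTITION (year = '<year>', month = '<month>', day = '<day>', instanceid = '<instance_ID>')
LOCATION 's3://<bucket_name>/<year>/<month>/<day>/<instance_ID>/';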

To produce a table definition that will be queryable by Federated Search for S3, even when partitioning based on source type in addition to date, you can either make the configuration changes to the Glue crawler or manually create a Glue table using DDL.

Handling field names with special characters

Special characters in field names need to be considered. For example, any ':' characters in your data will be interpreted as a column name:datatype separator. Key names that contain special characters such as ':' or '.' need to be remapped via DDL. Field names can be remapped in the SERDEPROPERTIES using the syntax below.

WITH SERDEPROPERTIES (
  "mapping.my_field_name"="my:field-name"
  )

SERDEPROPERTIES can also be used to replace a '.' in field names with an underscore ('_'). The statement below shows an example of this application:

WITH SERDEPROPERTIES (
  "dots.in.keys"="true"
  )
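
For context, the following is a minimal sketch of how these properties fit into a DDL-defined table. It assumes the OpenX JSON SerDe (org.openx.data.jsonserde.JsonSerDe) and uses hypothetical column, key, and bucket names, so adapt it to your own schema and location:

CREATE EXTERNAL TABLE `my_table_name`(
  `time` string,
  `host` string,
  `my_field_name` string,  -- remapped from the JSON key 'my:field-name'
  `event` string)
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string,
  `instanceid` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "dots.in.keys"="true",
  "mapping.my_field_name"="my:field-name")
LOCATION 's3://<bucket_name>/'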

Splunk platform configuration considerations

Timestamp format

The federated index definition requires Time format values in the Time settings configuration that are not part of the standard Splunk platform time format variables. The AWS Glue tables for data written by EP should create a time column with the string data type by default. As shown below, it is necessary to use the %UT time format with the numeric double data type.

  • %s for UNIX time values in string data type.
  • %UT for UNIX time values in numeric data type.
  • %ST for values in SQL timestamp data type.

[Screenshot: federated index Time settings configuration]

As discussed in Partitioning data in S3 for the best FS-S3 experience, you’ll want to include Partition time settings information to minimize data scanning. The screenshot below illustrates values for examples in this document.

[Screenshot: federated index Partition time settings used for the examples in this document]

Search your data

After you have your federated provider and federated index defined, you can start searching your S3 data using Federated Search for Amazon S3. The example below shows a basic sdselect command returning events from S3.

[Screenshot: a basic sdselect search returning events from S3]
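
As a rough sketch, a basic query of that shape, assuming a hypothetical federated index named fs_s3_index, might look like the following:

| sdselect * from federated:fs_s3_index limit 10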

Service limits

When planning your architecture, you'll need to consider service limits. Review the current Edge Processor and Federated Search for Amazon S3 documentation for the latest service limits.

Next steps

There are many technical details outside the scope of this article that might influence your decision making when implementing these concepts at scale. The documentation for Federated Search for Amazon S3 and Edge Processor (EP) is a good starting point for expanding your knowledge beyond the concepts discussed in this article. Error handling and data destination availability are particularly important as you consider EP and FS-S3 for your use cases. Familiarize yourself with these behaviors before implementing EP and FS-S3 in a production environment.

If you need additional technical implementation guidance, engage your Splunk and/or AWS account teams.