
Ingesting AWS S3 data written by ingest actions

 

Federated search for Amazon S3 lets you easily search data that the Splunk platform has written to S3. In some cases, however, you might need to fully ingest certain parts of that data into the Splunk platform. Ingest actions lets you write data to S3 and ingest it only when you need it later, avoiding unnecessary ingest usage. This article shows you how to configure the Splunk Add-on for Amazon Web Services (AWS) to ingest this data after it has been written to S3.

Create an S3 destination and route data

Follow the steps in Splunk Docs to create an S3 destination in ingest actions so you have a location to write the data to. When you set up partitioning for the destination, it's best to partition by day, with source type as a secondary key. This separates the data by day and by source type, making it easier to later select only the data you want to ingest, as illustrated below.
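With that partitioning scheme, the object keys written by ingest actions group data first by date and then by source type. The listing below is a hypothetical illustration only; the bucket name, prefix, and file names are placeholders, and the exact key layout depends on your destination's prefix and partitioning settings, so verify it against your own bucket.

    s3://my-ia-bucket/ingest_actions/2024/06/18/access_combined/events_0001.json.gz
    s3://my-ia-bucket/ingest_actions/2024/06/18/cisco_asa/events_0001.json.gz
    s3://my-ia-bucket/ingest_actions/2024/06/19/access_combined/events_0001.json.gz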

After the new destination is set up with partitioning, create a new ruleset in ingest actions to route the data to the S3 destination you created.

When your ruleset is working properly, you will no longer see the routed events in the Splunk platform, and you will see new objects being written to your S3 bucket.
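To confirm that objects are arriving, you can list the destination prefix with the AWS CLI. The bucket name and prefix below are placeholders for your own destination settings.

    aws s3 ls s3://my-ia-bucket/ingest_actions/ --recursive --human-readable | tail -n 20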

Ingesting S3 data

To ingest data after it's been sent to S3 via ingest actions, install the Splunk Add-on for Amazon Web Services (AWS).

You'll need to configure the AWS account the add-on uses to pull data from S3. After the account is set up, you can configure a Generic S3 input to ingest the specific data you need.
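The account or role that the add-on uses needs read access to the destination bucket. The IAM policy below is a minimal sketch only; the bucket name is a placeholder, and the full set of permissions the add-on requires is listed in the Splunk Add-on for AWS documentation.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
          "Resource": "arn:aws:s3:::my-ia-bucket"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::my-ia-bucket/*"
        }
      ]
    }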

  • Use the Create New Input > Custom Data Type > Generic S3 input type.
  • Set the Start Date/Time to a point before the data was written to S3. If an S3 object's last modified time is earlier than the Start Date/Time, the input will not ingest it.
  • Specify a source type, and make sure it is not the same source type used in the ingest actions ruleset that writes the data to S3. If you reuse that source type, the reingested events match the ruleset again and are routed straight back to S3, so the data never lands in an index. You can also specify an S3 key prefix or allowlist to limit the amount of data that is reingested (see the sample configuration after this list).
  • Amazon S3 buckets that contain an excessive number of files or a very large volume of data can cause significant performance degradation and ingestion delays.
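If you prefer to manage the input in configuration files rather than the add-on UI, a Generic S3 input corresponds to an aws_s3 stanza in inputs.conf. The stanza below is a rough sketch under the assumptions used throughout this article (placeholder account, bucket, prefix, index, and source type names); confirm the exact parameter names and values against the Splunk Add-on for AWS documentation before using it.

    [aws_s3://reingest_ia_data]
    aws_account = my_aws_account
    bucket_name = my-ia-bucket
    key_name = ingest_actions/2024/06/18/access_combined/
    initial_scan_datetime = 2024-06-17T00:00:00Z
    sourcetype = access_combined:reingested
    index = reingested_data
    interval = 300
    disabled = 0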

The Generic S3 input will take a few minutes to pull and ingest the data. When it is done, you should see the events in the index you specified and be able to search the data as normal.

Each event is ingested as a JSON blob, so you will need to do some additional work to get field extractions working properly for the original source type of the data. Here is an example of what an event looks like after it is ingested:

[Screenshot: an ingested event displayed as a JSON blob in Splunk search]
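As a starting point for working with those JSON events at search time, you can let the spath command extract the JSON fields and then review what is available. This is a minimal sketch: the index and source type names are placeholders, and the actual JSON structure depends on how your ingest actions destination formats events.

    index=reingested_data sourcetype=access_combined:reingested
    | spath
    | table _time host source sourcetype *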

If you're not sure how to work with JSON data in the Splunk platform, the following resources might be useful:

Next steps

These resources might help you understand and implement this guidance: