Implementing a reingestion pipeline for AWS logs using Data Firehose

This article describes a solution to efficiently handle log delivery failures from Amazon Web Services (AWS) to the Splunk platform. These failures occur within the Data Firehose pipeline, particularly when attempting to write logs to the Splunk platform via the HTTP Event Collector (HEC). Failures such as connection timeouts or incorrect HEC tokens result in logs being diverted to a Dead Letter Queue (DLQ) stored in an Amazon S3 bucket. The primary challenge arises when attempting to reingest these logs into the Splunk platform. The logs in the DLQ are not only encapsulated in metadata about the failure but also encoded in base64, complicating their extraction and interpretation.
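
To make the wrapping concrete, here is a minimal sketch of unwrapping one failed record. It assumes each DLQ entry is a JSON object whose original payload sits base64-encoded in a rawData field next to failure metadata such as errorCode. The sample values are hypothetical, so check the field names against a real object in your bucket.

```python
import base64
import json

# Hypothetical failed-delivery record. The field names are an assumption
# about the DLQ format and should be verified against your own bucket.
sample_line = json.dumps({
    "attemptsMade": 4,
    "arrivalTimestamp": 1704110400000,
    "errorCode": "Splunk.ConnectionTimeout",
    "errorMessage": "hypothetical error text",
    "attemptEndingTimestamp": 1704110700000,
    "rawData": base64.b64encode(
        b'{"event": "user login", "host": "web-01"}'
    ).decode("ascii"),
})


def extract_payload(dlq_line: str) -> bytes:
    """Return the original log payload from one DLQ JSON line."""
    record = json.loads(dlq_line)
    # Everything except rawData is failure metadata and is discarded.
    return base64.b64decode(record["rawData"])


print(extract_payload(sample_line).decode("utf-8"))
```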

To successfully reingest the failed logs into the Splunk platform, you'll follow these steps:

1. Deploy the reingestion pipeline. Set up a dedicated pipeline that will handle the extraction, transformation, and loading (ETL) of failed log data from the DLQ. This involves configuring AWS services and ensuring they are ready to process the data.
2. Prepare the data for reingestion. Before reingestion, it’s necessary to extract the failed log data from the S3 DLQ, decode it from base64 format, and strip any additional metadata that was added during the initial failure. This step ensures that only the relevant log data is sent to the Splunk platform.
3. Reingest data into the Splunk platform. With the data prepared, the next step is to feed it back into the Splunk platform. This typically involves using the HEC again but with the corrected parameters to ensure a successful ingestion this time.
4. Confirm the presence of data in the Splunk platform. Finally, verify that the reingested logs are accurately reflected in the Splunk platform. This involves checking for the presence and correctness of the logs in the platform, ensuring that the reingestion process has been completed successfully.

Each of these steps is explained in detail in the rest of this article.

Prerequisites

Before diving into the reingestion pipeline setup, it’s important to understand the data ingestion landscape from AWS into the Splunk platform. There are two primary methods to achieve this:

Pull Method: Uses a combination of Amazon SQS (Simple Queue Service) and S3 (Simple Storage Service), from which the Splunk platform pulls logs.
Push Method: Uses Amazon Data Firehose to push logs directly to the Splunk platform through the HTTP Event Collector (HEC).

For the scope of this article, we’ll focus on the push method using Amazon Data Firehose due to its direct integration and streamlined process for log ingestion into the Splunk platform.

To effectively implement the reingestion pipeline discussed in this guide, familiarity with the following AWS services is beneficial:

Amazon Simple Storage Service (S3): Used for storing failed logs in a Dead Letter Queue (DLQ) and as a temporary storage solution during the reingestion process.
Amazon Data Firehose: Acts as the primary conduit for log data flowing between AWS and Splunk, especially for the initial ingestion and the reingestion process.
AWS Lambda: Provides serverless compute capabilities to process and transform the data (for example, decoding base64 encoded logs) before reingestion.
AWS Identity and Access Management (IAM): Ensures secure access control to AWS resources involved in the reingestion process.

Understanding and having practical experience with these services will facilitate the setup and execution of a robust reingestion pipeline from the DLQ back into the Splunk platform.

Step 1: Deploy the reingestion pipeline

To streamline the deployment of the reingestion pipeline, you can use this CloudFormation (CFN) template. It automates the setup of the required components, which reduces the chance of manual configuration errors. Here’s an overview of the process and the components involved:

Set up components

S3 Temporary Bucket: Acts as a holding area for failed logs awaiting reingestion. This temporary storage is crucial for preprocessing logs before they are fed back into the Splunk platform.
Lambda Functions: Two Lambda functions are deployed to handle different stages of the log processing.
- S3 Re-ingest Function: Triggered by new log arrivals in the S3 temp bucket. It performs several key operations:
  - Retrieves log files from the S3 bucket.
  - Decompresses and decodes the base64 encoded logs.
  - Removes any extraneous metadata.
  - Forwards the cleaned logs to the Data Firehose stream for reingestion.
- Transformation Function: Invoked by log entries in the Data Firehose stream, it ensures the logs are formatted correctly to meet Splunk’s ingestion requirements.
Data Firehose Stream: Serves as the conduit through which cleaned logs are sent to the Splunk platform, utilizing the HTTP Event Collector (HEC) endpoint.
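
The operations of the S3 Re-ingest Function listed above can be sketched roughly as follows. This is not the code shipped in the CFN template. It assumes the DLQ object holds one JSON record per line with the payload in a rawData field, and that the object may or may not be gzip-compressed. The batch limits reflect the Data Firehose PutRecordBatch quotas of 500 records and 4 MiB per call.

```python
import base64
import gzip
import json
from typing import Iterable, Iterator, List

MAX_BATCH_RECORDS = 500             # PutRecordBatch record limit per call
MAX_BATCH_BYTES = 4 * 1024 * 1024   # PutRecordBatch size limit per call


def iter_payloads(object_body: bytes) -> Iterator[bytes]:
    """Yield decoded log payloads from the raw bytes of one DLQ object."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if object_body[:2] == b"\x1f\x8b":
        object_body = gzip.decompress(object_body)
    for line in object_body.decode("utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumed field name; the surrounding metadata is dropped here.
        yield base64.b64decode(record["rawData"])


def batch_records(payloads: Iterable[bytes]) -> Iterator[List[dict]]:
    """Group payloads into batches that respect the PutRecordBatch limits.

    Oversized single records are not handled in this sketch.
    """
    batch: List[dict] = []
    size = 0
    for payload in payloads:
        too_many = len(batch) >= MAX_BATCH_RECORDS
        too_big = size + len(payload) > MAX_BATCH_BYTES
        if batch and (too_many or too_big):
            yield batch
            batch, size = [], 0
        batch.append({"Data": payload})
        size += len(payload)
    if batch:
        yield batch
```

Inside a Lambda handler, each batch could then be passed to the boto3 Firehose client, for example client.put_record_batch(DeliveryStreamName=stream_name, Records=batch), where stream_name is whatever name your deployment gives the reingestion stream.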

Deployment process

Utilize the CFN template to initiate the deployment. This template automatically creates and configures all necessary AWS resources according to best practices.
After it is deployed, the pipeline architecture facilitates a seamless flow of logs from the temporary storage in S3, through processing and transformation via Lambda functions, and finally into the Splunk platform via the Data Firehose stream.
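
As an illustration of the transformation stage in this flow, the sketch below follows the Data Firehose data-transformation contract: each incoming record carries a recordId and base64-encoded data, and each returned record must echo the recordId with a result of Ok, Dropped, or ProcessingFailed. Wrapping the payload in an "event" field and the sourcetype value are assumptions for illustration, not the template's actual logic.

```python
import base64
import json


def lambda_handler(event, context):
    """Wrap each incoming Firehose record in a HEC-style JSON event."""
    output = []
    for record in event["records"]:
        try:
            raw = base64.b64decode(record["data"]).decode("utf-8")
            # Hypothetical sourcetype; use whatever your index expects.
            hec_event = {"event": raw, "sourcetype": "aws:reingested"}
            data = base64.b64encode(
                json.dumps(hec_event).encode("utf-8")
            ).decode("ascii")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError):
            # Undecodable records are flagged rather than silently dropped.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record.get("data", ""),
            })
    return {"records": output}
```

Marking a record as ProcessingFailed lets Data Firehose treat it as a failed record instead of discarding it, so a malformed entry does not disappear unnoticed.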

To help visualize the deployment and operational flow of the reingestion pipeline, an architecture diagram is provided below. This diagram illustrates the interaction between AWS services, the data flow, and the role of each component in ensuring failed logs are successfully reingested into the Splunk platform.

By following these steps and utilizing the CFN template, you can efficiently deploy a reingestion pipeline that minimizes manual intervention and maximizes the reliability of log delivery to the Splunk platform.

Step 2: Prepare the data for reingestion

The effectiveness of the reingestion pipeline largely depends on the initial preparation of the data. This preparation involves a critical step: transferring the logs that failed ingestion from the Dead Letter Queue (DLQ) in the S3 bucket to a designated temporary S3 bucket. Here’s how to approach this process:

Identify failed logs: Start by identifying the logs in your S3 DLQ bucket that failed to ingest into Splunk. These logs are the target for reingestion.
Manual transfer: The transfer of logs from the DLQ bucket to the temporary S3 bucket is intentionally manual. This approach ensures you have the opportunity to:
1. Review the logs to confirm they are the correct ones for reingestion.
2. Resolve any issues that caused the initial ingestion failure to prevent the same issues from occurring again.
Triggering the reingestion pipeline: By manually moving the logs to the S3 temp bucket, you control when the reingestion process begins. This step acts as a safeguard, ensuring the pipeline is activated only when you’re confident that the ingestion issues have been addressed and that the logs are ready for reingestion.
Verification before reingestion: Before proceeding with the reingestion, it’s advisable to verify that:
1. The ingestion issue (for example, an incorrect HEC token or connection timeouts) has been resolved.
2. The logs are correctly formatted and encoded for successful reingestion.
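
One way to support this review is to tally the failure reasons recorded in the DLQ before moving anything. The sketch below assumes each DLQ line is a JSON object with an errorCode field (for example Splunk.InvalidToken or Splunk.ConnectionTimeout) and that you have already downloaded and, if necessary, decompressed the object into text lines.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_errors(dlq_lines: Iterable[str]) -> Counter:
    """Count DLQ records per error code to see what needs fixing first."""
    counts: Counter = Counter()
    for line in dlq_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # errorCode is an assumed field name; missing values are grouped.
        counts[record.get("errorCode", "unknown")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"errorCode": "Splunk.InvalidToken", "rawData": ""}',
        '{"errorCode": "Splunk.InvalidToken", "rawData": ""}',
        '{"errorCode": "Splunk.ConnectionTimeout", "rawData": ""}',
    ]
    for code, count in summarize_errors(sample).most_common():
        print(code, count)
```

If a single error code dominates, that is usually the issue to resolve before triggering the pipeline.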

This manual step is crucial for maintaining the integrity of your data within the Splunk platform. It allows for a final review to ensure that only logs that are ready and correctly formatted are introduced into the reingestion pipeline, minimizing the risk of repeating past ingestion errors.

Step 3: Reingest data into the Splunk platform

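To initiate the reingestion pipeline, manually trigger the copying of log files with the following command:

aws s3 sync <source> <dest> --exclude "*" --include "*splunk-failed/YYYY/MM/DD/HH/*"

Because the AWS CLI includes every file by default, the --include filter only narrows the copy when it follows an --exclude "*" filter.

Here is an example with the variables completed:

aws s3 sync s3://example-splashback s3://example-reingest --exclude "*" --include "*splunk-failed/2024/01/01/12/*"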
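
When several hours of failures need to be copied, the filter can be generated rather than typed by hand. The sketch below builds the pattern for one hour and assembles the CLI arguments. It assumes the DLQ keys use the splunk-failed/YYYY/MM/DD/HH/ layout in UTC, and it places an --exclude "*" filter before --include because the AWS CLI only applies --include to files that an earlier filter excluded. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List


def include_pattern(hour: datetime, error_prefix: str = "splunk-failed") -> str:
    """Build the --include pattern for one hour of failed deliveries.

    Pass a timezone-aware datetime; it is converted to UTC first.
    """
    utc = hour.astimezone(timezone.utc)
    return f"*{error_prefix}/{utc:%Y/%m/%d/%H}/*"


def sync_command(source: str, dest: str, hour: datetime) -> List[str]:
    """Return the aws s3 sync argument list for the given hour."""
    return [
        "aws", "s3", "sync", source, dest,
        "--exclude", "*",
        "--include", include_pattern(hour),
    ]


if __name__ == "__main__":
    when = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
    print(" ".join(sync_command("s3://example-splashback", "s3://example-reingest", when)))
```

The returned list can be passed to subprocess.run, which avoids shell quoting problems with the wildcard characters.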

Step 4: Confirm data presence in the Splunk platform

After the reingestion process is complete, the final step is to verify that the data has been successfully ingested into the Splunk platform. This verification ensures that not only is the data present, but it is also correctly formatted and there are no gaps in the log sequences. Follow these steps to confirm:

Log into the Splunk platform: Access your Splunk dashboard using your credentials to start the verification process.
Initiate a query for recent data: Utilize search functionality in the Splunk platform to query for the recently ingested data. You can specify the time range that covers the reingestion period to narrow down the search results.
Review the ingested data:
1. Confirm data presence: Check that the logs from the reingestion process are visible in the platform.
2. Verify correct formatting: Ensure that the logs appear in the expected format, indicating that the transformation steps were successful.
3. Check for continuity: Look for any gaps or breaks in the log sequences. The absence of gaps confirms that the reingestion pipeline has worked as intended, and all necessary logs have been successfully transferred.
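
The continuity check can be partly automated. The sketch below assumes you have exported the event timestamps for the affected time range from a Splunk search as epoch seconds, and it reports every pair of consecutive events separated by more than a chosen threshold. A sensible threshold depends on how frequently your source normally logs.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float], max_gap_seconds: float
) -> List[Tuple[float, float]]:
    """Return (before, after) timestamp pairs whose spacing exceeds the threshold."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap_seconds
    ]


if __name__ == "__main__":
    # Hypothetical epoch timestamps with one 80-second hole in the middle.
    events = [0, 10, 20, 100, 110]
    print(find_gaps(events, max_gap_seconds=30))
```

An empty result only shows that no gap exceeds the threshold. It does not prove that every record arrived, so comparing event counts against the number of DLQ records is a useful second check.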

This step is not just about confirming the success of the reingestion process; it also serves as a quality assurance check to ensure that the data within the Splunk platform is complete, accurately formatted, and ready for analysis.