Selecting the best method for Amazon data ingestion

 

Amazon is the industry’s largest cloud provider. Organizations use cloud service providers for a variety of utilities and services, and each service generally produces a specific type of telemetry data. Typically, organizations want to analyze logs, metrics, traces, or a combination of these from the following services:

  • Serverless application deployments: Many organizations use cloud services to deploy serverless applications that can be rapidly deployed and scaled up or down. These are represented by Lambda, Amazon Elastic Kubernetes Service (EKS), EKS Fargate, Amazon Elastic Container Service (ECS), ECS Fargate, and Amazon Common Software.
  • Object storage or archives: Many organizations use cloud providers to store objects, logs, metrics, or traces. S3 can be used for this purpose.
  • Virtual servers: Organizations might also use the cloud to host VMs or servers that perform critical functions or host applications. Typically these are represented by Amazon Elastic Compute Cloud (EC2), ECS, and sometimes Kubernetes distributions like Amazon Elastic Kubernetes Service (EKS).

How to assess methods to collect cloud service provider data

Splunk offers many different services for data collection. These include: Splunk OpenTelemetry Collector (OTel Collector), Splunk add-ons, Splunk Ingest Processor, Splunk Edge Processor, the Universal Forwarder, the Splunk Stream App, and the Data Manager.

  • The OTel Collector and the Ingest and Edge Processors were built on the same architecture, so their functionality is quite similar. However, the OTel Collector offers more ingest sources and exporting services.
  • The Universal Forwarder can be used to send logs directly to an indexer or to send them via the HTTP Event Collector (HEC).
  • The Splunk Stream app is useful for services that don't have a Splunk add-on.
  • The Data Manager is a more streamlined alternative to add-ons and can allow you to ingest from multiple cloud organizations.

In addition to considering the differences described above, consider the following questions to find the best collection method for your situation.

Does the data need any transformations before ingestion?

The first thing to ask is whether the data needs any transformation, aggregation, or added context before it arrives in Splunk software. Many of the tools Splunk offers pull the data as-is, so if you need a transformation to occur, that's important to know up front. For example, if the data is in the correct format and doesn’t need any changes, it can be pulled directly with the appropriate Splunk add-on or the Data Manager. But if the data is sensitive, for example, if it includes PII or IP addresses, you might have a requirement to mask that data. In that case, you'll need the Ingest Processor, the Edge Processor, or the OTel Collector, as in the sketch below.
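
For example, the OTel Collector's transform processor can mask values in-stream before export. The following is a minimal sketch, assuming the contrib distribution of the collector; the processor name transform/mask_pii and the IPv4 regex are illustrative placeholders, not required names.

processors:
  transform/mask_pii:
    log_statements:
      - context: log
        statements:
          # Replace anything that looks like an IPv4 address in the log body
          - replace_pattern(body, "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}", "<masked-ip>")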

Can this data traverse outside the cloud?

Another key question is whether the data can traverse outside the cloud or whether the ingestion needs to stay within the cloud, a peered VPC, AWS PrivateLink, or a similar boundary. Some companies restrict what data can cross the public internet, and you might have to use restricted methods to sync the data directly from your cloud instance to Splunk software. In most circumstances, this means that the Splunk instance needs to be in the same region as the originating data (for example, data cannot traverse from US East 1 to US West 2) or in the same cloud service provider classification (for example, AWS Commercial versus AWS GovCloud).

What if the service isn’t supported as part of the cloud provider add-on, OTel receiver, etc.?

There are times when an organization's service isn’t listed as a supported data source in the application or add-on’s documentation. In some cases, you can transport the data through a different method. For example, this could mean forwarding logs from AWS Directory Service to CloudWatch, because Splunk software can collect CloudWatch logs but not AWS Directory Service logs directly. At other times, you can use cloud services to stream the data directly from unsupported services. This can be accomplished with Splunk Edge Hub, Amazon Data Firehose, Google Pub/Sub, AWS FireLens, and other such services.

Will you need to scale the integration?

Some solutions require a one-to-one integration for each iteration, whereas other solutions scale much more easily.

AWS services you can ingest data from

What services are directly supported out of the box?

Directly supported services include S3, SQS, IAM, CloudWatch, SNS, EC2, RDS, Lambda, CloudFront, Kinesis, ECS, EKS, ELB, and many more. These are supported for all telemetry types: logs, metrics, and traces.

What services aren’t supported directly?

EKS Fargate (logs), ECS Fargate (logs), and AWS Directory Service.

What if the service isn’t supported as part of the cloud provider add-on or the OTel receiver?

In this case, you could use Amazon Data Firehose, formerly called Amazon Kinesis Data Firehose. Firehose typically allows you to stream data from one AWS service to another, or it can stream telemetry data from an AWS service directly to your Splunk instance via the HEC. However, note that Firehose grabs all the data as-is and drops it off in the same format. So if you want only portions of your data, or you need transformations done, this is not a great option. For example, if you want to pull AWS Directory Service data into CloudWatch or directly into your Splunk instance, you might end up with far more data than you want. You can sample the data, but in terms of formatting and the breadth of logs, Firehose grabs everything, and you'll need to make adjustments at the ingestion engine level.
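
As a rough sketch, a Firehose delivery stream pointed at the HEC can be defined in CloudFormation. The endpoint URL, token, bucket, and role values below are hypothetical placeholders; your Splunk administrator must create the corresponding HEC token first.

Resources:
  SplunkDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamType: DirectPut
      SplunkDestinationConfiguration:
        # Hypothetical HEC endpoint for your Splunk deployment
        HECEndpoint: https://http-inputs-firehose-<your-stack>.splunkcloud.com:443
        HECEndpointType: Raw
        HECToken: <hec-token>
        # Firehose requires an S3 location for events that fail delivery
        S3BackupMode: FailedEventsOnly
        S3Configuration:
          BucketARN: arn:aws:s3:::<backup-bucket>
          RoleARN: arn:aws:iam::<account-id>:role/<firehose-role>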

What about EKS Fargate and ECS Fargate logs?

AWS offers Fargate deployments for its Kubernetes service (EKS) as well as its Elastic Container Service (ECS). With this method, all the containers are provisioned by AWS directly. There are a couple of key benefits to this. It makes it much easier for users to scale their Kubernetes and container clusters. It can also be cheaper, because AWS selects the cheapest resources available to scale, whereas the cost of customer-managed resources depends on the time of day, time of year, the organization, its subscription, and more. Customers typically choose Fargate deployments because they are cheaper overall. Fargate also allows for more flexibility because AWS does the heavy lifting when it comes to horizontal and vertical scaling.

The problem is that because Amazon does the work, it is difficult to grab logs. Grabbing metrics or traces from these services is easy, but not logs. AWS uses a separate monitoring mechanism, based on Fluent Bit, to push logs from Fargate deployments into CloudWatch. You can intercept those logs and use something like Data Firehose, or you can use the OTel receiver to grab those logs directly and then feed them back into your Splunk instance, as in the sketch below.
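
For the OTel route, a minimal sketch of the awscloudwatch receiver using log group autodiscovery follows. The /aws/fargate/ prefix is a hypothetical stand-in for whatever log groups your Fluent Bit configuration actually writes to.

receivers:
  awscloudwatch/fargate:
    region: us-east-1
    logs:
      poll_interval: 1m
      groups:
        autodiscover:
          # Hypothetical prefix: match the log groups Fluent Bit writes to
          prefix: /aws/fargate/
          limit: 50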

Deciding which solution to implement

Sending data to Splunk Enterprise or Splunk Cloud Platform only

Use the Data Manager or the Splunk Add-on for AWS if you want to ingest the data as-is. If you have multiple organizations or regions deployed, you need one Data Manager integration for each organization or region. The Data Manager allows you to set up integrations for multiple AWS organizations and regions at the same time. However, if you ever need to modify those Data Manager integrations, the changes have to be made on a one-to-one basis. If you have only two or three AWS organizations, this is not a big deal, but if you have hundreds, this is not a scalable option. For example, if you change which index is pointed to, or you change source types, going through each integration one by one would be difficult.

Another option to ingest data as-is is to use Amazon Data Firehose, as described above, in conjunction with an HTTP Event Collector input and a token assigned to an index.

If you need transformations performed, or want to enrich the data, consider using the OTel Collector. It allows you to process and parse the data before it ever makes it into your deployment. It also allows you to see all of your logs and transformations in real time, with no added latency, because the transformations happen in-stream. If you use post-ingress actions or the Edge Processor, there's a little inherent latency. Sometimes it's very short, but sometimes it can be long, depending on the number of transformations you need and the number of data sources you're performing them on. The pipeline sketch below shows where these transformations sit.
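
The following minimal sketch reuses the hypothetical transform/mask_pii processor from the earlier sketch, along with the collector's standard batch processor, to show where in-stream processing sits.

service:
  pipelines:
    logs:
      receivers: [awscloudwatch/example]
      # Processors run in-stream, before data is exported to Splunk
      processors: [transform/mask_pii, batch]
      exporters: [splunk_hec]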

Sending data to Splunk Observability Cloud only

The OTel Collector is essentially the only supported method of ingesting data into Splunk Observability Cloud.

Sending data to both the Splunk platform and Splunk Observability Cloud

The OTel Collector is your best option because you already have to set it up for Splunk Observability Cloud, and adding the Splunk platform as a destination is relatively easy. You can add Data Manager integrations if necessary.

Sending segregated data to multiple indexes

The OTel Collector is your best option because it acts as the ingestion agent. It can also transform or separate the destination index of the logs based on the attributes, location, fields, or regex pattern of the log itself, as in the sketch below. For example, maybe you want to send different information to stacks for your dev team, security team, and audit team. The OTel Collector can do that, whereas the Data Manager can only pull into one Splunk instance at a time.
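
A minimal sketch of attribute-based index routing with the transform processor follows. It assumes the splunk_hec exporter honors the com.splunk.index attribute as its per-event index override (its default index label); the AUDIT pattern and index name are illustrative only.

processors:
  transform/route_index:
    log_statements:
      - context: log
        statements:
          # Hypothetical rule: anything matching AUDIT goes to the audit index
          - set(resource.attributes["com.splunk.index"], "audit") where IsMatch(body, ".*AUDIT.*")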

What does an OTel AWS receiver look like?

You will need to set up one receiver for every AWS organization you want to pull data from. Use the following configuration and specify it in your agent_config.yaml or values.yaml file.

receivers:
  awscloudwatch/example:
    region: <region-name>
    logs:
      poll_interval: <poll_rate_m>
      groups:
        named:
          # CloudWatch log group path to poll
          /specify/cloudwatch/directory:
            names: [<names_of_log_files>]
…
service:
  pipelines:
    logs:
      receivers: [awscloudwatch/example, awscloudwatch/secondsource]

The polling rate can impact throttling on the AWS side. Talk with your AWS account team or AWS administrators to be sure that you have allocated enough throughput for the logs to be pulled through. If throttling occurs, you will only see a message indicating it on the AWS side; Splunk software will not indicate a problem.

What would an OTel multi-index setup look like?

With the Data Manager, not only will you have multiple Data Manager instances for multiple AWS accounts, but you will also be limited to using only one index. With OTel, you don’t need entirely separate OTel deployments for multiple accounts, and you can assign multiple indexes per AWS integration or account.

# Specify your configuration in your agent_config.yaml or values.yaml file
receivers:
  awscloudwatch/example:
    region: <region-name>
    logs:
      poll_interval: <poll_rate_m>
      groups:
        named:
          /specify/cloudwatch/directory:
            names: [<names_of_log_files>]
…
exporters:
  splunk_hec/index1:
    index: index1
    token: <hec_token_for_index1>
    # (other required exporter settings, such as endpoint, elided)
…
service:
  pipelines:
    logs:
      receivers: [awscloudwatch/example, awscloudwatch/secondsource]
      exporters: [splunk_hec/index1, splunk_hec/index2]

What does a Data Manager AWS integration look like?

You might prefer to use the Data Manager because it offers a GUI for configuration, rather than the text-based configuration files an OTel receiver requires. The interface prompts you through the decisions you need to make.

You will also have to choose between single or multiple accounts. Let's say you have one AWS account that holds data for your whole organization. From that one AWS account, you might only have access to a role provisioned by your AWS administrators that can pull CloudWatch logs from the dev directory only. However, you might also need to pull CloudWatch logs from an audit directory, and the current role being used to sync data cannot pull from that directory. In that case, you could use multiple AWS accounts, in terms of IAM roles, to pull the additional data sources.

The other option is that you might have multiple AWS organizations within the same region, say US East 1. You could pull from those multiple organizations, assuming that you have the same role and permissions configured across both. For example, suppose your AWS administrator created a read-only role called splunk, and you have two organizations in US East 1. If that same role is configured across both organizations with the same permissions, you could pull data from both of them and specify both in the Data Manager. But this creates separate iterations of the resulting Data Manager integration, so if role permissions change or you need to specify a different index, you'll have to manually change each of those integrations, as discussed above.

Additional resources

To see all available AWS receivers for OTel, see the OpenTelemetry Collector GitHub page. For more information on additional Splunk and AWS integrations, see the Amazon data descriptor page.