Selecting the best method for Google data ingestion

 

Google has the smallest market share of the three major cloud service providers. Organizations use cloud service providers for a variety of utilities and services, and each generally represents a specific type of telemetry data. Typically, organizations want to analyze logs, metrics, traces, or a combination of these from the following services:

  • Serverless application deployments. Many organizations use cloud services to deploy serverless applications that can be rolled out quickly and scaled up or down. In GCP, these are represented by Google Kubernetes Engine or Google Cloud Functions.
  • Object storage or archives. Many organizations use cloud providers to store objects, logs, metrics, or traces. Google Cloud Storage can be used for this purpose.
  • Virtual servers. Organizations might also use the cloud to host VMs or servers that perform critical functions or host applications. Typically these are represented by Google Compute Engine, Google Cloud Run, and sometimes Kubernetes distributions like Google Kubernetes Engine (GKE).

How to assess methods to collect cloud service provider data

Splunk offers many different services for data collection. These include: Splunk OpenTelemetry Collector (OTel Collector), Splunk add-ons, Splunk Ingest Processor, Splunk Edge Processor, the Universal Forwarder, the Splunk Stream App, and the Data Manager.

  • The OTel Collector and the Ingest and Edge Processors were built on the same architecture, so their functionality is quite similar. However, the OTel Collector offers more ingest sources and exporting services.
  • The Universal Forwarder can be used to send logs directly to an indexer or to send them via the HTTP Event Collector.
  • The Splunk Stream app is useful for services that don't have a Splunk add-on.
  • The Data Manager is a more streamlined alternative to add-ons and allows you to ingest from multiple cloud organizations.

In addition to considering the differences described above, consider the following questions to find the best collection method for your situation.

Does the data need any transformations before ingestion?

The first thing to ask is whether the data needs any transformation, aggregation, or added context before it arrives in Splunk software. Many of the tools Splunk offers pull the data as-is, so if you need a transformation to occur, that's important to know up front. For example, if the data is in the correct format and doesn't need any changes, it can be pulled directly with the appropriate Splunk add-on or the Data Manager. But if the data is sensitive, for example if it includes PII or IP addresses, you might have a requirement to mask that data. In that case, you'll need the Ingest Processor, the Edge Processor, or the OTel Collector.
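
To make the masking idea concrete, here is a minimal Python sketch of the kind of transformation you might apply before data reaches Splunk software. It only illustrates the logic: in practice you would configure equivalent rules in the Edge Processor, Ingest Processor, or an OTel Collector processor rather than hand-coding them, and the pattern and placeholder below are assumptions chosen for illustration.

    import re

    # Illustrative only: matches IPv4 addresses so they can be masked before
    # the event is forwarded. Real deployments would define equivalent rules
    # in Edge Processor, Ingest Processor, or an OTel Collector processor.
    IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

    def mask_sensitive_fields(event: str) -> str:
        """Replace IPv4 addresses with a fixed placeholder."""
        return IPV4_PATTERN.sub("x.x.x.x", event)

    print(mask_sensitive_fields("Denied login from 203.0.113.42 for user alice"))
    # Prints: Denied login from x.x.x.x for user alice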

Can this data traverse outside the cloud?

Another key question is whether the data can traverse outside the cloud or whether ingestion needs to stay within the cloud, a peered VPC, AWS PrivateLink, or a similar private path. Some companies restrict which data can cross the public internet, and you might have to use restricted methods to sync the data directly from your cloud instance to Splunk software. In most circumstances, this means that the Splunk instance needs to be in the same region as the originating data (for example, not traversing from US East 1 to US West 2) or in the same cloud service provider classification (for example, AWS Commercial versus AWS GovCloud).

What if the service isn’t supported as part of the cloud provider add-on, OTel receiver, etc.?

There are times when an organization's service isn’t supported in terms of data ingestion in the application or add-on’s documentation. In some cases, you can transport the data through a different method. For example, this could mean forwarding logs from AWS Directory Service to CloudWatch because Splunk software can grab CloudWatch logs but not AWS Directory logs directly. There other times you can use cloud services to stream the data directly from unsupported services. This can be accomplished with Splunk Edge Hub, Amazon Data Firehose, Google Pub/Sub, AWS FireLens, and other such services.

Will you need to scale the integration?

Some solutions require a one-to-one integration for each instance, whereas other solutions scale much more easily.

GCP services you can ingest data from

What services are directly supported out of the box?

Pub/Sub, Cloud Monitoring, BigQuery Billing, Storage, Workspace, Compute Engine, and GKE

What services aren’t supported directly?

Dataflow, Dataproc, and Machine Learning APIs

What if the service isn’t supported as part of the cloud provider add-on or the OTel receiver?

For sources that aren't directly supported, you can use Pub/Sub to stream data to a source that is supported, or you can use third-party integrations to ingest the data. Note, however, that not all third-party services are Splunk-owned or Splunk-supported. You can also use Pub/Sub to stream telemetry data from a GCP service directly to your Splunk instance via the HTTP Event Collector (HEC). However, note that Pub/Sub delivers all the data as-is, in the same format it was published. So if you want only portions of your data, or you need transformations performed, this is not a great option.
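
As a rough illustration of that as-is behavior, here is a minimal streaming-pull subscriber using the google-cloud-pubsub Python client. The project and subscription names are hypothetical, and the callback simply prints each payload unchanged; any filtering or reshaping would have to be added by you.

    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    # Hypothetical project and subscription names.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-gcp-project", "splunk-ingest-sub")

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        # The payload arrives exactly as it was published -- no filtering
        # or reshaping happens unless you add it here yourself.
        print(message.data.decode("utf-8"))
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    with subscriber:
        try:
            streaming_pull.result(timeout=30)  # Listen for 30 seconds, then stop.
        except TimeoutError:
            streaming_pull.cancel()
            streaming_pull.result()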

What about Autopilot?

Google Kubernetes Engine (GKE) and Cloud Run deployments that use Autopilot are containerized deployments that require sidecars for monitoring. With Autopilot, GCP handles the heavy lifting and the provisioning of resources, so it's a much cheaper option and allows horizontal and vertical scaling to happen more quickly and easily than if you had to provision resources manually. The logs can be collected directly via the OTel Collector or sent via Pub/Sub.

Deciding which solution to implement

Sending data to Splunk Enterprise or Splunk Cloud Platform only

Use the Data Manager or the Splunk Add-on for Google Cloud Platform if you want to ingest the data as-is. The Data Manager allows you to set up integrations for multiple GCP organizations and regions at the same time, but if you have multiple organizations or regions deployed, you need one Data Manager created for each organization or region. And if you ever need to modify those Data Manager integrations, the changes have to be made on a one-to-one basis. If you have only two or three GCP organizations, this is not a big deal, but if you have hundreds, this is not a scalable option. For example, if you change which index data is sent to, or you change source types, going through each integration one by one would be difficult.

Another option to ingest data as-is is to use Pub/Sub in conjunction with an HTTP Event Collector input and a token assigned to an index.
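
For reference, delivering an event to an HEC input comes down to an authenticated POST. This minimal sketch uses the Python requests library; the URL, token, sourcetype, and index values are placeholders, and if you omit the index field, events land in the token's default index.

    import requests

    # Placeholders -- use your own HEC endpoint and token values.
    HEC_URL = "https://splunk.example.com:8088/services/collector/event"
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

    event = {
        "event": "raw message pulled from Pub/Sub",
        "sourcetype": "google:gcp:pubsub:message",
        "index": "gcp_logs",  # Omit to fall back to the token's default index.
    }

    response = requests.post(
        HEC_URL,
        json=event,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()  # Raises if Splunk did not accept the event.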

If you need transformations performed or want to enrich the data, consider using the OTel Collector. It allows you to process and parse the data before it ever reaches your deployment, and you can see all of your logs and transformations in real time, with no latency between when a log arrives and when the transformations are applied. If you use post-ingest actions or the Edge Processor, there's some inherent latency. Sometimes it's very short, but it can be long, depending on how many transformations you need and how many data sources you're operating on.

Sending data to Splunk Observability Cloud only

The OTel Collector is essentially the only supported method of ingesting data into Splunk Observability Cloud.

Sending data to both the Splunk platform and Splunk Observability Cloud

The OTel Collector is your best option because you have to set it up for Splunk Observability Cloud anyway, and adding the Splunk platform as an additional destination is relatively easy. You can add Data Manager integrations alongside it if necessary.

Sending segregated data to multiple indexes

The OTel Collector is your best option because it acts as the ingestion agent. It can also transform the data and route it to a destination index based on attributes, location, fields, or a regex pattern in the log itself. For example, you might want to send different information to stacks for your dev team, security team, and audit team. The OTel Collector can do that, whereas the Data Manager can only pull into one Splunk instance at a time.
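
The snippet below sketches that routing logic in Python purely for illustration. In a real deployment the OTel Collector expresses this declaratively in its configuration; the patterns and index names here are invented examples.

    import re

    # Invented routing rules -- the OTel Collector defines these in its
    # configuration rather than in code.
    ROUTES = [
        (re.compile(r"\baudit\b", re.IGNORECASE), "audit_index"),
        (re.compile(r"\b(denied|unauthorized)\b", re.IGNORECASE), "security_index"),
    ]

    def choose_index(log_line: str, default: str = "dev_index") -> str:
        """Return the destination index for the first matching pattern."""
        for pattern, index in ROUTES:
            if pattern.search(log_line):
                return index
        return default

    print(choose_index("GET /admin unauthorized for 10.0.0.5"))  # security_index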

Additional resources

To see all available GCP receivers for OTel, see the OpenTelemetry Collector GitHub page. For more information on additional Splunk and Google Cloud integrations, see the Google data descriptor page.