Ingesting Google Cloud data into Splunk using command line programs
If you are ingesting Google Cloud data into your environment, you're likely familiar with the Splunk Add-on for Google Cloud Platform and the Splunk Dataflow template. Both of these solutions are great for moving logs via Pub/Sub into Splunk. But there are ways of extracting other non-logging Google Cloud data and quickly shipping it to the Splunk platform using common command-line tools. You can then leverage this data to gain useful insights about your Google Cloud environment.
This article follows the Unix philosophy of "do one thing and do it well" by showing you how to use small single-purpose tools, then how to combine them to accomplish more complex tasks and gain useful insights about your Google Cloud environment.
First, review the prerequisites and the tools you'll need. Then, learn the approach used to retrieve a list of assets in a specified GCP project or organization. You can then apply this same approach to:
- Export Google's IAM Recommender findings and send them to a Splunk HEC
- List all the static and reserved IP addresses in a project
- List all the SSL/TLS certificates that have been created for use with Google Cloud Load Balancer and Cloud CDN
- List all the virtual machine instances that have been created within a project
- List all the snapshots of persistent disks that have been created within a project
- List all the network routes that have been created within a project
- List all the firewall rules that have been created within a project
- List all the virtual networks that have been created within a project
- Retrieve VPC flow logs from a GCS bucket using a federated query
Finally, this article shows you how to develop some real-world investigative use cases that leverage this data.
Prerequisites
To follow along with the examples in this article, you need to have the Google Cloud CLI installed, as it provides the gcloud and bq commands. In addition, you need to have both jq and curl installed. These utilities can be downloaded and installed directly or obtained through common package managers such as homebrew, yum, or apt. Alternatively, you can use an environment such as the Google Cloud Shell, which comes pre-installed with the necessary tools.
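If you want to confirm everything is in place, each tool can report its version. This is an optional sanity check; exact output varies by platform and version.
gcloud --version
bq version
jq --version
curl --version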
You will also need a Splunk Cloud Platform or Splunk Enterprise environment configured with an HTTP Event Collector (HEC) token and an index for the data.
- Export the HEC token to the shell environment, as shown below.
export HEC_TOKEN=<TOKEN>
- You'll also need to set the full HEC URL. For example:
export HEC_URL=https://<ADDRESS>:8088/services/collector/event
For more information on how to construct an HEC URL for Splunk Cloud Platform trials, Splunk Cloud Platform on AWS, and Splunk Cloud Platform on GCP, see the Send data to HTTP Event Collector section of the HTTP Event Collector documentation.
- Set an existing destination index name, as shown below. Some of the data might carry timestamps that are quite old, so ensure the index retention period is generous enough that events are not aged out immediately. You should use a dedicated index for the purposes of this process.
export SPLUNK_INDEX=gcp-data
- To set curl arguments such as -k or --insecure for untrusted certificate environments, export the CURL_ARGS variable as shown below.
export CURL_ARGS=-k
- Set an explicit Google Cloud project. You can find a list of projects using the gcloud projects list --format=flattened command. For example:
gcloud projects list --format=flattened
---
createTime: 2022-01-12T21:34:27.797Z
lifecycleState: ACTIVE
name: <REDACTED>
parent.id: <REDACTED>
parent.type: organization
projectId: abcd-123456
projectNumber: <REDACTED>
---
createTime: 2016-11-29T18:10:06.711Z
labels.firebase: enabled
lifecycleState: ACTIVE
name: <REDACTED>
projectId: my-project-123456
projectNumber: <REDACTED>
- Look for the projectId value and set it as an environment variable, as shown below.
export GCP_PROJECT=<PROJECT_ID>
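Before moving on, you can optionally confirm that gcloud is authenticated and pointed at the expected project, and that the HEC endpoint is reachable. The health endpoint shown below is the commonly used path on recent Splunk versions; adjust the address and port if your deployment differs.
gcloud auth list
gcloud config set project ${GCP_PROJECT}
curl ${CURL_ARGS} https://<ADDRESS>:8088/services/collector/health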
Tools
The following is a summary of the tools used throughout the examples:
- gcloud is a command-line tool that allows users to manage and interact with GCP resources and services. It is included in the Google Cloud CLI.
- bq allows interacting with BigQuery, which is GCP's fully-managed, serverless data warehouse. It is also included in the Google Cloud CLI.
- jq is like sed but for working with JSON data. It is commonly used to parse, filter, and manipulate JSON data from the command line.
- split breaks a file into smaller files. It is part of the GNU Core Utilities package and is usually available by default on Unix-like systems.
- curl is a command-line tool for transferring data using various protocols, primarily used for making HTTP requests to retrieve or send data to web servers.
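If jq is new to you, the following toy example (with made-up input) demonstrates the core pattern used throughout this article: read a JSON list on stdin and emit one compact JSON object per line.
echo '[{"name": "vm-1"}, {"name": "vm-2"}]' | jq -c '.[] | {"event": .}'
{"event":{"name":"vm-1"}}
{"event":{"name":"vm-2"}}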
Examples
Each of these examples follows roughly the same approach:
- The data is extracted using gcloud.
- The output is parsed and enriched using jq to create a payload suitable for sending to a HEC endpoint.
- After it is formatted for HEC, curl is invoked to deliver the data to the Splunk platform.
- In cases where the output of gcloud could be large, split is introduced to break the data into chunks.
Retrieve a list of assets in a specified GCP project or organization
- The gcloud asset list command can be used to retrieve a list of assets in a specified GCP project or organization. To handle large asset lists, break them into smaller files using the split command before sending them to a Splunk HEC.
mkdir assets && cd $_; gcloud asset list --project ${GCP_PROJECT} --format=json | jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '{"host": $host, "source": "gcloud", "sourcetype": "google:gcp:asset", "index": $index, "event": .[]}' | split - assets- ; for FILE in *; do echo processing ${FILE}; curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @${FILE}; done
- Break the commands into components to better understand each step of the pipeline.
mkdir assets && cd $_
This creates a directory called assets and switches into it, assuming creation is successful.
gcloud asset list --project ${GCP_PROJECT} --format=json
This invokes gcloud and requests a list of assets. Using the --format=json parameter returns the results as JSON. The results are returned as a list of dictionary objects, where each item in the list is an asset. Here is some example data returned from this command.
[
  {
    "ancestors": [
      "projects/<REDACTED>"
    ],
    "assetType": "apikeys.googleapis.com/Key",
    "name": "//apikeys.googleapis.com/projects/<REDACTED>/locations/global/keys/<REDACTED>",
    "updateTime": "2022-10-18T09:15:12.026452Z"
  },
  {
    "ancestors": [
      "projects/<REDACTED>"
    ],
    "assetType": "appengine.googleapis.com/Application",
    "name": "//appengine.googleapis.com/apps/<REDACTED>",
    "updateTime": "2022-10-21T02:43:20.551Z"
  },
  ...
  {
    "ancestors": [
      "projects/<REDACTED>"
    ],
    "assetType": "storage.googleapis.com/Bucket",
    "name": "//storage.googleapis.com/us.artifacts.<REDACTED>.appspot.com",
    "updateTime": "2022-10-22T00:35:56.935Z"
  }
]
- Pipe this lengthy output into jq, iterate through each item in the JSON list, and output each individual item as a separate, new-line delimited JSON structure.
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '{"host": $host, "source": "gcloud", "sourcetype": "google:gcp:asset", "index": $index, "event": .[]}'
This jq command uses the -c flag to ensure each JSON object appears on a single line. A variable named host is set to the local system hostname. Additionally, a variable named index is set to the Splunk index name you previously set via export in the prerequisites section of this article.
- Provide the scaffolding for a HEC-compliant data structure with a host, source, sourcetype, index, and event field. The .[] portion of the JSON payload tells jq to iterate through the list of items in the input stream and apply the transformation across each item.
The final result is a new-line delimited collection of JSON objects, as seen below. You can see each line is a distinct HEC event message and JSON data structure.
{"host":"cs-<REDACTED>-default","source":"gcloud","sourcetype":"google:gcp:asset","index":"gcp-data","event":{"ancestors":["projects/<REDACTED>"],"assetType":"apikeys.googleapis.com/Key","name":"//apikeys.googleap is.com/projects/<REDACTED>/locations/global/keys/<REDACTED>","updateTime":"2022-10-18T09:15:12.026452Z"}} {"host":"cs-<REDACTED>-default","source":"gcloud","sourcetype":"google:gcp:asset","index":"gcp-data","event":{"ancestors":["projects/<REDACTED>"],"assetType":"appengine.googleapis.com/Application","name":"//appeng ine.googleapis.com/apps/<REDACTED>","updateTime":"2022-10-21T02:43:20.551Z"}} ... {"host":"cs-<REDACTED>-default","source":"gcloud","sourcetype":"google:gcp:asset","index":"gcp-data","event":{"ancestors":["projects/<REDACTED>"],"assetType":"storage.googleapis.com/Bucket","name":"//storage.googl eapis.com/us.artifacts.<REDACTED>.appspot.com","updateTime":"2022-10-22T00:35:56.935Z"}}
- Since asset lists are normally quite lengthy, split this series of new-line delimited JSON into separate file chunks. This can be accomplished using the split command.
split - assets-
- By supplying - as the filename, split reads from stdin and writes chunks to files whose names begin with the prefix assets-. For example:
ls -al
total 1528
drwxr-xr-x  2 mhite mhite   4096 Jan 31 18:26 .
drwxr-xr-x 35 mhite 1001    4096 Jan 31 18:26 ..
-rw-r--r--  1 mhite mhite 407298 Jan 31 18:26 assets-aa
-rw-r--r--  1 mhite mhite 458000 Jan 31 18:26 assets-ab
-rw-r--r--  1 mhite mhite 458000 Jan 31 18:26 assets-ac
-rw-r--r--  1 mhite mhite 226798 Jan 31 18:26 assets-ad
- Iterate through each file in the current directory and send the contents as a batch to the HEC endpoint.
for FILE in *; do echo processing ${FILE}; curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @${FILE}; done
This final series of commands loops through each file, outputs a message to the console indicating the file is being processed, and then invokes curl to POST a new-line delimited batch of event messages to a destination HEC URL.
These steps are the basic recipe for most of the following examples. You can refer back to this initial example for the general approach being used.
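If you find yourself repeating this recipe, you can wrap the jq and curl stages in a small shell function. The sketch below uses a hypothetical helper name, send_to_hec, and skips the split step, so it is only suitable for outputs small enough to post in a single batch.
send_to_hec() {
  # $1 = sourcetype to assign; expects a JSON list on stdin
  jq -c --arg host "$(hostname)" --arg index "${SPLUNK_INDEX}" --arg st "$1" '.[] | {"host": $host, "source": "gcloud", "sourcetype": $st, "index": $index, "event": .}' |
  curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
}
For example, the same pattern can ship a list of persistent disks:
gcloud compute disks list --project=${GCP_PROJECT} --format=json | send_to_hec "google:gcp:disk"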
Export Google's IAM Recommender findings and send them to a Splunk HEC
Google's IAM Recommender service analyzes an organization's Identity and Access Management (IAM) policies and recommends actions to help improve its security posture. For example, it can spot issues such as over-privileged roles and users.
Use the following command to export recommender findings and send them to a Splunk HEC:
gcloud recommender insights list --project=${GCP_PROJECT} --insight-type=google.iam.policy.Insight --location=global --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:recommender:insight", "index": $index, "time": (.lastRefreshTime | fromdateiso8601), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
Notice the (.lastRefreshTime | fromdateiso8601) portion of the jq command. This allows you to read the lastRefreshTime field from the input stream and convert it from an ISO-8601 timestamp into an epoch timestamp. You'll then assign this to the time field of the HEC event.
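You can try this conversion on its own with a sample (made-up) timestamp:
jq -n '"2023-01-31T18:26:00Z" | fromdateiso8601'
1675189560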
List all the static and reserved IP addresses in a project
The command gcloud compute addresses list lists all the static and reserved IP addresses in a project.
gcloud compute addresses list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:address", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
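The timestamp handling here differs from the previous example. Compute creationTimestamp values include fractional seconds and a UTC offset, which jq's fromdateiso8601 does not accept, so the sub() call strips the fractional part before strptime and mktime convert the remainder to an epoch value. You can experiment with the filter in isolation using a made-up timestamp; the exact value printed depends on how your platform's strptime handles the %z offset.
echo '"2023-01-31T18:26:00.123-08:00"' | jq 'sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime'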
List all the SSL/TLS certificates that have been created for use with Google Cloud Load Balancer and Cloud CDN
The command gcloud compute ssl-certificates list lists all the SSL/TLS certificates that have been created for use with Google Cloud Load Balancer and Cloud CDN.
gcloud compute ssl-certificates list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:certificate", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
List all the virtual machine instances that have been created within a project
The command gcloud compute instances list lists all the virtual machine instances that have been created within a project. We will leverage the split command again, as this can be quite an extensive list.
mkdir instances && cd $_; gcloud compute instances list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:instance", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
split - instances- ; for FILE in *; do echo processing ${FILE}; curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @${FILE}; done
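By default, split writes 1000 lines per output file. If your HEC endpoint is configured with a smaller maximum payload size, you can reduce the number of events per batch with the -l flag, for example:
split -l 250 - instances-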
List all the snapshots of persistent disks that have been created within a project
The command gcloud compute snapshots list lists all the snapshots of persistent disks that have been created within a project.
gcloud compute snapshots list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:snapshot", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
List all the network routes that have been created within a project
The command gcloud compute routes list lists all the network routes that have been created within a project.
gcloud compute routes list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:route", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
List all the firewall rules that have been created within a project
The command gcloud compute firewall-rules list lists all the firewall rules that have been created within a project.
gcloud compute firewall-rules list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:firewall", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
List all the virtual networks that have been created within a project
The command gcloud compute networks list lists all the virtual networks that have been created within a project.
gcloud compute networks list --project=${GCP_PROJECT} --format=json |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:network", "index": $index, "time": (.creationTimestamp | sub("\\.[0-9]{3}"; "") | strptime("%Y-%m-%dT%H:%M:%S%z") | mktime), "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
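All of these examples target a single project. To sweep every project you can access, one approach is to loop over the project list, shown here for firewall rules. This is only a sketch: it can be slow for large organizations, and projects without the Compute Engine API enabled will return errors you can safely ignore.
for PROJECT in $(gcloud projects list --format="value(projectId)"); do
  echo processing ${PROJECT}
  gcloud compute firewall-rules list --project=${PROJECT} --format=json |
  jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:firewall", "index": $index, "event": .}' |
  curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
done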
Retrieve VPC flow logs from a GCS bucket using a federated query
BigQuery is a fully-managed, serverless data warehouse that allows you to run SQL-like queries on large datasets. One of the features of BigQuery is the ability to federate queries to other data sources such as S3, GCS, or Azure Blob Storage. The following example shows how to retrieve VPC flow logs from a GCS bucket by way of a federated query.
It's important to note that when you use federated queries, you incur additional costs and latency. Additionally, BigQuery charges on a bytes-scanned model, so only perform this example against a small data set. No one likes surprise cloud bills!
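To estimate cost before running the real query, bq supports a dry run that reports how many bytes the query would scan without executing it:
bq query --nouse_legacy_sql --dry_run 'SELECT * FROM `mh_bq_test.flows`'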
mkdir flow && cd $_; bq query --format=json --nouse_legacy_sql 'SELECT * FROM `mh_bq_test.flows`' |
jq -c --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:vpc:flow", "index": $index, "time": (.timestamp | strptime("%Y-%m-%d %H:%M:%S")| mktime), "event": .}' |
split - flow- ; for FILE in *; do echo processing ${FILE}; curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @${FILE}; done
To learn more about setting up external tables in BigQuery, see the Google Cloud documentation.
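As a rough illustration only (the bucket path is a placeholder and your flow log export format may differ, so treat this as a sketch and confirm the syntax against the documentation), defining an external table over newline-delimited JSON in a GCS bucket looks something like this:
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON "gs://<BUCKET>/vpc-flows/*.json" > flows_def.json
bq mk --external_table_definition=flows_def.json mh_bq_test.flows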
More ideas for insights from this data
So far, you've seen how single-purpose tools like gcloud, jq, and curl can be used together to bring data from Google Cloud into the Splunk platform. However, the ultimate goal of transferring data to the Splunk platform is to gain insights from it. Let's consider some real-world investigative use cases that leverage this data.
Access a list of virtual machines along with who created each one
Assuming you are also ingesting Google Cloud audit logs, you can enrich the data you collected in the previous examples with related data from the cloud audit logs. The following SPL (Search Processing Language) can achieve this.
(index=gcp-data sourcetype="google:gcp:instance") OR (index="gsa-gcp-log-index" protoPayload.methodName="v1.compute.instances.insert" OR protoPayload.methodName="beta.compute.instances.insert")
| eval resourceName=case(index="gcp-data",mvindex(split('selfLink',"https://www.googleapis.com/compute/v1/"),1),index="gsa-gcp-log-index",'protoPayload.resourceName')
| eval creator=case(index="gsa-gcp-log-index",'protoPayload.authenticationInfo.principalEmail')
| stats values(index) values(creator) by resourceName
| rename values(*) -> *
| where index="gcp-data" AND index="gsa-gcp-log-index"
| fields - index
Assuming gsa-gcp-log-index contains audit logs, this query performs the equivalent of an INNER JOIN against virtual machine names from your instance list (sourcetype="google:gcp:instance") and the audit log recording the machine's creation event. After the join is performed, you can display a list of current virtual machines alongside their creator email address.
This query assumes you have audit logs that go back far enough to find the initial API call used to create a machine.
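One quick way to confirm how far back that index reaches is a tstats search like the one below (a sketch; substitute your own index name):
| tstats earliest(_time) AS first_event WHERE index="gsa-gcp-log-index"
| eval first_event=strftime(first_event, "%Y-%m-%d %H:%M:%S")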
Establish a record of API calls made by an account listed in an IAM Recommender "Permission Usage" finding
The SPL shown below provides a summary of API calls made by the accounts flagged in these findings during a designated time frame. By presenting this information alongside accounts under investigation, you can gain insight into their purpose or typical behavior.
(index="gcp-data" sourcetype="google:gcp:recommender:insight" insightSubtype=PERMISSIONS_USAGE) OR (index="gsa-gcp-log-index")
| eval account=case(index="gcp-data",mvindex(split('content.member',":"),1),index="gsa-gcp-log-index",'protoPayload.authenticationInfo.principalEmail')
| eval methods=case(index="gsa-gcp-log-index",'protoPayload.methodName')
| stats values(index) values(methods) by account
| rename values(*) -> *
| where index="gcp-data" AND index="gsa-gcp-log-index"
| fields - index
Other potential use cases
- Find other interesting gcloud commands with "list" options.
- To facilitate multiple runs over time, create timestamp checkpoint files and compare them against the creationTimestamp fields in gcloud output to avoid duplicates in an index (see the sketch after this list).
- Load data into a lookup table for better use within the Splunk platform.
- Find a solution to keep fractional timestamps intact during jq extraction.
- Consider using alternative ingestion methods, like writing new-line delimited JSON to disk and using a Universal Forwarder or OpenTelemetry Agent to send to Splunk, instead of HEC.
- Try these same techniques with other cloud provider CLIs such as aws-cli, az, linode-cli, and doctl.
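As referenced in the checkpoint item above, a very rough sketch for the instance list might look like the following. The checkpoint file name, date format, and naive string comparison are illustrative assumptions, not a tested implementation.
# Only forward instances created since the last recorded run
CHECKPOINT=.instances_checkpoint
SINCE=$(cat ${CHECKPOINT} 2>/dev/null || echo "1970-01-01T00:00:00")
gcloud compute instances list --project=${GCP_PROJECT} --format=json |
jq -c --arg since "${SINCE}" --arg host $(hostname) --arg index ${SPLUNK_INDEX} '.[] | select(.creationTimestamp > $since) | {"host": $host, "source": "gcloud", "sourcetype": "google:gcp:instance", "index": $index, "event": .}' |
curl ${CURL_ARGS} ${HEC_URL} -H "Authorization: Splunk ${HEC_TOKEN}" --data-binary @-
date +%Y-%m-%dT%H:%M:%S > ${CHECKPOINT}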
Use the examples as inspiration for your own use cases.
Next steps
These resources might help you understand and implement this guidance:
- JQ: Reference manual
- JQ: Cheat sheet
- Splunk Docs: Format events for HTTP Event Collector