Enabling persistent queue for metrics and traces

The persistent queue is an option in the OpenTelemetry Collector exporter helper that can be enabled on exporters for logs, metrics, or traces. Its purpose is to ensure that data is not lost during communication problems that the exporter might encounter. Combined with the file storage extension, it can serve as a disaster recovery (DR) strategy, allowing the persistent queue to use on-disk storage to handle large outages.

This article reviews setting up a persistent queue for metrics and traces to handle an eight-hour outage for a single host.

The guidance provided in this article is for a single Splunk collector in gateway mode (not load balanced). Adding a load balancer adds complexity to the design because of how metrics are processed and dropped if received out of order. If you are using a load-balanced gateway design, a persistent queue on the load-balanced gateways will not work as outlined here.

Batches

To set up the persistent queue, it's important to understand how batches work within the OpenTelemetry exporters. A batch is a group of data points or spans that are packaged and exported together. A batch is sent when certain trigger events happen. These trigger events are configured in the batch processor object, which becomes part of the data pipeline. You can have a different batch processor for each data pipeline.

The important trigger events are send_batch_size and timeout.

  • The timeout trigger fires after a certain amount of time. The default is 200ms. This means that every 200ms the batch processor sends the data it has collected to the exporter; this collection of data is called a batch. If no data has been collected when the 200ms elapses, no batch is sent to the exporter.
  • The send_batch_size trigger sends the data after a certain number of data points or spans has been collected but not yet sent. The default is 8192. If the processor collects 8192 data points or spans before the timeout occurs, this data is packaged together and sent to the exporter as a batch.

    The send_batch_size is a count of data points and not a size such as kB.
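
To make this concrete, here is a minimal sketch of a batch processor with both triggers set explicitly to their default values. The processor name and placement are illustrative only; adjust them to match your own pipeline configuration.

processors:
  batch:
    send_batch_size: 8192   # Trigger: send once this many data points or spans have been collected
    timeout: 200ms          # Trigger: send whatever has been collected (if anything) every 200ms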

Setup

There are two main configurations that must be in place for the persistent queue to work. You need a file_storage extension and an exporter that supports the exporter helper, so that you have access to the retry_on_failure and sending_queue settings.

File storage extension settings

Within the extensions section of the config file, you need to define a file storage extension using the file_storage item. You should give this extension a name so that it can easily be accessed within the exporter sections.

The file storage extension has many configuration options that are worth exploring. For this example, we keep all the default settings except for the timeout value, which is the timeout used to grab file locks.

extensions:
  # Persistent queue file storage
  file_storage/persist:
    # Default Windows directory: c:\ProgramData\Splunk\OpenTelemetry Collector\FileStorage
    timeout: 10s
    # Lots of additional options for file_storage are worth looking at:
    # directory: "./myDirectory"       # Define directory
    # create_directory: true           # Create directory
    # timeout: 1s                      # Timeout for file operations
    # compaction:                      # Compaction settings
    #   on_start: true                 # Start compaction at Collector startup
    #   # Define compaction directory
    #   directory: "./myDirectory/tmp"
    #   # Max. size limit before compaction occurs
    #   max_transaction_size: 65536

After we define the file_storage extension, we need to add it to the service extensions list so that it is enabled. Below is an example of the default extensions with the new file_storage extension added.

service:
  extensions: [health_check, http_forwarder, zpages, smartagent, file_storage/persist]

Exporter helper settings

By default, most exporters have access to the exporter helper. This allows trace and metrics exporters to enable a persistent queue.

The exporter helper provides both the retry_on_failure and sending_queue settings. Both items support additional configuration options that are worth exploring.

Within the retry_on_failure section, the important setting is max_elapsed_time. This setting is the amount of time the collector spends trying to send the batch. It's important to understand that this works more like a time-to-live (TTL) value than a timeout. Each time the batch is attempted, the time spent trying to send it counts towards the max_elapsed_time. On the exporters, the default timeout on a data export operation is 5s, and the default max_elapsed_time is 300s. So, as an example, if the data export fails due to timeout (5s), then the batch gets roughly 60 attempts (5s x 60 = 300s) to send the data before the max_elapsed_time is reached. However, errors such as DNS failures or bad URLs fail faster, which allows more than 60 data export attempts.

This retry rate is further affected by the config value max_interval, which is the maximum pause between retry attempts. The default is 30s. With roughly 60 retry attempts and a 30s pause between attempts, you end up with about 30 minutes, meaning that with the default settings the queue holds batches that are roughly 30 minutes old before they hit the max_elapsed_time and are dropped with the "No more retries" error. However, some randomness is intentionally added to the max_interval to help prevent all queued batches from sending at the same time, so this calculation is approximate.

To force the retry to be indefinite, set the max_elapsed_time to 0.

For our example of handling an 8-hour outage, we want to ensure that all data is eventually received, so we set the max_elapsed_time to 0 to force infinite retries. However, infinite retries are not ideal in a production environment, so as an alternative (and to still have enough retries to cover the outage), you can set the max_elapsed_time to the same time frame as the outage you need to handle, in this case 8h. Note that since this is a duration field, it understands s for seconds, m for minutes, and h for hours.

This is the upper bound required. Using the details above, you can calculate a value that more closely matches your needs.
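
As a sketch, the bounded alternative for the 8-hour example could look like the following. The otlphttp exporter is used here purely for illustration, and the commented max_interval line simply restates the default described above; the full sample config later in this article uses max_elapsed_time: 0 instead.

exporters:
  otlphttp:
    retry_on_failure:
      max_elapsed_time: 8h   # Keep retrying each batch for up to the 8-hour outage window
      # max_interval: 30s    # Default maximum pause between retry attempts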

Troubleshooting max_elapsed_time

When troubleshooting issues with the persistent queue, if you see errors in the collector log or event viewer such as “Exporting failed. Dropping Data, No more retries left”, that means this maximum is being reached for the given data batch and the batch is being dropped. Try increasing the value, or setting it to 0, to see if the problem is fixed. This error shows up on the metric graphs as random gaps in the metric data.

In the following image you can see gaps in the data because max_elapsed_time was not set large enough (or to zero for infinite retries). Because this value works like a TTL rather than a timeout, you end up with gaps as certain batches of data use up their allotted time for upload attempts.

[Image: metric graph with gaps where batches exceeded max_elapsed_time and were dropped]

Queue size

Getting the queue_size right is important because we need a queue large enough to hold all the batches we can expect during the outage duration we are trying to handle. The queue_size is a batch count, not a size such as kB. We can calculate an approximate value based on the config values timeout and send_batch_size.

Both values, timeout and send_batch_size, are used to trigger a batch export operation, and it's the total count of these batch export operations over a duration that determines the queue_size needed. For more details, see the preceding section on Batches. If queue_size is too low, then, when connectivity is restored, you will see data backfilled from when the outage started until the point that the queue was full and no more data could be persisted. From the point the queue was full until the end of the outage, there will be a gap in the data. If you are filling your queue, you will see errors in your collector logs and event viewer like “Sender Error...Sending queue is full”.

For our example of handling an 8-hour outage, we can perform some simple math to calculate the upper bound of the queue_size.

The default timeout is 200ms, meaning that it will send a data batch every 200ms. This equates to five data batches per second, which equals 300 batches per minute, which equals 18,000 batches per hour. To cover an 8-hour outage, you need a queue_size of 8 hours x 18,000 = 144,000. This calculation is meant only as a guide. Many factors can impact queue_size, so you should perform testing to ensure the coverage you require.

Below is a sample config showing these settings for the otlphttp exporter (traces) and the signalfx exporter (metrics). These exporters must also be listed in the traces and metrics pipelines to be enabled.

exporters:
  # Traces
  otlphttp:
    traces_endpoint: https://ingest.us1.signalfx.com/v2/trace/otlp
    headers:
      X-SF-Token: "${SPLUNK_ACCESS_TOKEN}"
    retry_on_failure:
      max_elapsed_time: 0
    sending_queue:
      storage: file_storage/persist
      queue_size: 144000 #calculated value to handle 8 hour outage
  # Metrics + Events
  signalfx:
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    api_url: "${SPLUNK_API_URL}"
    ingest_url: "${SPLUNK_INGEST_URL}"
    # Use instead when sending to gateway
    #api_url: http://${SPLUNK_GATEWAY_URL}:6060
    #ingest_url: http://${SPLUNK_GATEWAY_URL}:9943
    sync_host_metadata: true
    correlation:
    retry_on_failure:
      max_elapsed_time: 0
    sending_queue:
      storage: file_storage/persist
      queue_size: 144000 #calculated value to handle 8 hour outage

Caveats

The above calculation assumes that your data batching is triggered by the default 200ms timeout value for the batch processor. If this value has been changed, that will impact the math above.

If your data batching is triggered by the send_batch_size (default 8192) limit for the batch processor, instead of the timeout, then the calculated queue_size value using the math above will be too low. In this situation, you have to figure out roughly how often the batches are being triggered by the timeout versus the send_batch_size, and then apply a multiplier to the queue_size. To help determine which trigger is sending the batches, you can investigate the OtelCol metrics for the collector you are trying to size. The metrics to look at are called:

  • otelcol_processor_batch_timeout_trigger_send
  • otelcol_processor_batch_batch_size_trigger_send
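
If these internal metrics are not already collected in your environment, one possible way to surface them is to scrape the collector's own telemetry endpoint with a prometheus receiver and send the results through a metrics pipeline. The sketch below is an assumption-based example: it assumes the collector's internal metrics are exposed on the default port 8888 and reuses the signalfx exporter defined earlier; adjust names and ports for your deployment (the Splunk distribution may already include a similar prometheus/internal receiver).

receivers:
  prometheus/internal:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]   # Assumed internal telemetry endpoint of the collector

service:
  pipelines:
    metrics/internal:
      receivers: [prometheus/internal]
      exporters: [signalfx]   # Reuse the metrics exporter defined in your config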

If you are sizing this for a single host or very small gateway, then the above calculation will result in an oversized queue. This is because, by default, receivers collect data on a 10s interval, and the data is sent on a 200ms interval. The result is a lot of null batches that are not sent and do not need to be stored by the queue. Test out smaller sizes to see just how small you can make the queue.

Unfortunately, there is not an easy way to calculate this. Some receivers take longer than the 200ms timeout to capture all their data, so it's not as simple as replacing the 200ms timeout with a 10s receiver collection time in the calculation above. However, if you replace the 200ms with 10s and redo the calculation, you will get a lower bound to start experimenting with.
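
For example, assuming the 10s receiver interval dominates (roughly one batch every 10s), 8 hours works out to 8 x (3600 / 10) = 2,880 batches. The fragment below is only a hedged starting point for experimentation on a single host, not a recommended value.

# Fragment only: this goes under the exporter's sending_queue section
sending_queue:
  storage: file_storage/persist
  queue_size: 2880   # Lower-bound estimate: 8 hours at one batch every 10s; test and tune upward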

Troubleshooting queue size

When troubleshooting issues with the queue size, if you see errors in the collector log or event viewer such as “Sender Error...<various details>…Sending queue is full”, this means the queue has reached the maximum number of batches it can hold and no additional data can be stored in it. If the collector is still in an outage and cannot reach the ingest URL, any data that does not fit in the queue is dropped.

Increasing the queue_size helps prevent this error. The queue_size scales roughly linearly, so if a value of 2000 allows your collector to hold 30 minutes of data, then 4000 will allow you to hold approximately 60 minutes of data. When the queue_size is too small, it is visible on the metric graph. After connectivity is restored, you will see data backfilled from when the outage started until the point that the queue became full. The metric graph then shows null values until the point that connectivity was restored, because no more batches could be persisted into the queue after it hit the queue_size value.

In the following image you can see the alert icon (filled red triangle) that shows the start of the outage, and the alert cleared icon (outlined red triangle). The bar chart setting helps highlight blanks, which indicate no data points. This is how the data appears when the queue_size is set too small: data is collected at the start of the outage, but roughly halfway through no more data is persisted because the queue is full. This situation can be corrected by increasing the queue_size.

[Image: metric graph showing data backfilled only for the first part of the outage because the queue filled]

A note on traces and Splunk Application Performance Monitoring

While the persistent queue works with traces, the delay in the data arriving in Splunk Application Performance Monitoring results in some unexpected UI quirks. To handle some of its real-time calculations, APM uses the trace ingest time as the timestamp for certain functions. The effect is that after an outage ends, Tag Spotlight, service maps, and Trace Analyzer see a spike in traces, because APM considers all the persisted traces to have been received at the time connectivity is restored, as that is their ingest time. However, the traces still include their span timestamps, which are captured and used for other calculations, so if you open a trace, its details display the correct span timestamps.
