Managing ephemeral infrastructure
In modern application environments, it is increasingly common to use ephemeral infrastructure: instances that are auto-scaled up or down, containers that are spun up on demand, or ‘immutable’ infrastructure that is brought up with a new code version and discarded or recycled when the next version is deployed. Ephemeral infrastructure has many advantages, but it also introduces new variants of traditional monitoring challenges. Ephemeral infrastructure can:
- cause you to encounter the per-plot line limits more quickly
- render traditional mechanisms for monitoring useless
- lead to gaps in datapoints from newly created instances’ or containers’ metrics
You need good strategies to deal with these problems.
Make sure the groupings of metric time series fit within the limit
Each metric on a new instance or container is represented in Splunk Infrastructure Monitoring as a new metric time series. This effect compounds over longer time ranges: the shorter the lifespan of your instances or containers, and the longer the period over which you want to view their metrics, the more quickly a single plot line accumulates a large number of time series.
Break down the metric time series into groups that fit within the per-plot line limits. For example, if you have 17,000 hosts emitting a free memory metric `memory.free`, and you want to sum up the total free memory across all of your hosts, then you need to use a dimension to filter the metrics into groupings of 5,000 or fewer hosts. A common choice of dimension is datacenter, region (in the case of Amazon Web Services), or availability zone, each of which might contain 3,000 - 4,000 hosts.
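For illustration, here is a minimal SignalFlow sketch of this approach. The dimension name `aws_availability_zone` and the zone values are assumptions; substitute whatever dimension partitions your fleet into groups of 5,000 or fewer hosts.
```
# Each plot line is filtered to one partition of the fleet, so each
# computation stays within the per-plot time series limit.
data('memory.free', filter=filter('aws_availability_zone', 'us-east-1a')).sum().publish('free_memory_1a')
data('memory.free', filter=filter('aws_availability_zone', 'us-east-1b')).sum().publish('free_memory_1b')
```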
Use analytics to understand when a service is unexpectedly down
In an environment where things are constantly coming up and going down, traditional monitoring mechanisms do not work: they require manual configuration for new elements, and they assume that any non-reporting of a metric is problematic and alert-worthy, rather than the expected effect of autoscaling when, say, an instance is turned down on purpose. Using analytics, however, you can alert only when the non-reporting is unexpected.
The analytics function that helps in this situation is `count`. Be sure to select the analytics function, not the rollup. `count` tells you how many time series are reporting a value at a given point in time; if an instance stops reporting a metric, for example because it has been terminated purposefully, its time series is not counted.
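As a minimal SignalFlow sketch, using the `memory.free` metric from the earlier example:
```
# count() as an analytics function: how many time series are reporting
# memory.free at each point in time. Terminated instances drop out.
data('memory.free').count().publish('reporting_hosts')
```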
You can take advantage of this function to tell you how many instances are reporting, but you need one more thing: a property that tells you the expected state of the instance. For example, Amazon publishes the state of an EC2 instance (terminated, running, etc.) and Splunk Infrastructure Monitoring imports that as `aws_state`. With this information:
- Set up a plot that uses a heartbeat metric of your choosing (say, `memory.free`).
- Filter out the emitters that have been purposefully terminated (`!aws_state:terminated`).
- Apply the count function, with a group-by on a dimension that represents a single emitter (e.g. `aws_tag_Name`).
This plot then emits a 0 or a 1 for each emitter, and an alert when the output is 0 tells you that the emitter (the instance) is unexpectedly down.
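A SignalFlow sketch of this heartbeat check might look like the following; the metric, the dimension names, and the 10-minute duration are assumptions to adapt to your environment.
```
# Count reporting time series per instance, excluding instances that
# AWS reports as purposefully terminated.
hosts = data('memory.free', filter=(not filter('aws_state', 'terminated'))).count(by=['aws_tag_Name'])
hosts.publish('heartbeat')

# Alert when an instance that should still be running has not reported
# for 10 minutes.
detect(when(hosts < 1, '10m')).publish('Instance unexpectedly down')
```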
You can apply this general concept to anything you want; you just need:
- a heartbeat metric that reports regularly
- a canonical dimension that represents the emitter or source you care about
- a property on that dimension that denotes the expected state of that emitter
This is packaged in the “Heartbeat Check” built-in alert condition (not_reporting in the SignalFlow library).
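If you want to see the underlying SignalFlow, the following sketch uses the `not_reporting` module from the library. The parameter names are based on the published library source and may differ in your version; check `not_reporting.flow` before relying on them.
```
from signalfx.detectors.not_reporting import not_reporting

# Heartbeat stream, excluding emitters that were terminated on purpose.
hb = data('memory.free', filter=(not filter('aws_state', 'terminated')))

# Fire when an emitter (identified by aws_tag_Name) stops reporting for
# the specified duration.
not_reporting.detector(stream=hb, resource_identifier=['aws_tag_Name'], duration='15m').publish('Heartbeat Check')
```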
Configure detectors to account for gaps in datapoints
The amount of time required for newly created instances’ or containers’ metrics to become visible to existing detectors can be problematic. This typically does not affect detector functioning, but in more extreme cases (for example, when an immutable infrastructure approach is used and some set of sources goes down completely and is replaced wholesale) your detector may need to account for the resulting gap in datapoints. The following are possible solutions:
- Use a duration that is sufficiently long to account for the period Splunk Infrastructure Monitoring detectors need to update the sources included in their computations (see the sketch after this list).
- Treat the data source as if it is aperiodic and adjust your detector logic accordingly.
- Re-save the detector when new sources are being added. This forces the detector to update the metric time series included in its computation, and would therefore ensure that newly created sources are included.
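As an illustration of the first option, here is a sketch with a deliberately long lasting duration; the metric, threshold, and 15-minute window are placeholders.
```
# A longer 'lasting' duration gives newly created sources time to appear
# in the detector's computation before the condition can fire.
cpu = data('cpu.utilization').mean(by=['aws_tag_Name'])
detect(when(cpu > 90, '15m')).publish('High CPU utilization')
```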
Next steps
These additional Splunk resources might help you understand and implement these recommendations:
- Tool: Backup tool for detectors and dashboards
- SignalFlow Library: not_reporting.flow