Splunk Lantern

Scaling Edge Processor infrastructure

 

In this series we’ll be looking at scaling Edge Processor using Amazon EKS, as well as some steps to get us there. There are a number of factors that can affect the required scale of your Edge Processor infrastructure, including:

  • Changes to data volumes
  • Implementation or retirement of use cases
  • Increasing or decreasing pipeline complexity

All of these are very common scenarios that require Edge Processor scale to be adaptable, and in this first article we’ll explore scaling concepts.

How does Edge Processor scale?

As your data processing requirements change you’ll have to decide whether to scale by altering the available resources for your Edge Processor nodes, or by altering the number of nodes doing the processing. Below we call out some common scenarios and the most common scale result.

As is the case with most technology, Edge Processor nodes scale both vertically and horizontally, depending on the observed behavior.

Scenario: Scale up (vertical)

Examples:

  • Data pipeline complexity increases, such as:
    • More complex regular expressions
    • Multi-value evals and mvexpand
    • More destinations
  • Significant event size or event volume increases
  • Long persistent queue requirements

Scale example (diagrams): from 3.png to 4.png

Scenario: Scale out (horizontal)

Examples:

  • Number of universal forwarders increases
  • Need to improve indexer event distribution (avoid funneling)
  • Spread out persistent queues
  • Improve resiliency, reduce impact of node failures

Scale example (diagrams): from 1.png to 2.png

For most purposes we’ll consider any substantial change to any of the following as cause to evaluate our scale:

  • Event volume, both number and size of events
  • Number of agents
  • Number and complexity of pipelines
  • Change in target destinations
  • Risk tolerance

Any change to these factors will play a role in the overall resource consumption and processing speed of your Edge Processor nodes. 

How do I know when it’s time to scale?

Monitor, monitor, monitor. Because Edge Processor nodes run on your infrastructure, you need to initiate and manage changes to scale. With the telemetry reporting built into Edge Processor, you can monitor and react to resource utilization reported by the Edge Processor nodes. This could result in manually adjusting scale, executing automation frameworks, invoking cloud workflows, or, as we’ll explore in this series, using containers and Kubernetes to scale to demand.

You can explore all of the available metrics in Splunk Cloud Platform using mcatalog:

| mcatalog values(metric_name) WHERE index=_metrics AND sourcetype="edge-metrics"

You can also explore common sizing metrics on the sizing dashboards (link to dashboards).

There’s no single metric that can tell you it’s time to scale up or down. Instead, we suggest watching some key metrics across your Edge Processors in order to establish a baseline of expected usage. In particular:

  • GB in/out per day
  • Event counts in/out per day
  • Exporter queue size
  • CPU & memory consumption
  • Unique source types and agents per day

Additionally, consider measuring event lag by comparing index time with event time as a general practice for getting data in (GDI) health, irrespective of Edge Processor.
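The event lag comparison above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (index time minus event time). The timestamps are made-up sample values and the `utc` helper is hypothetical, not part of any Splunk API.

```python
from datetime import datetime, timezone
from statistics import median


def event_lag_seconds(event_time, index_time):
    """Seconds between when an event occurred and when it was indexed."""
    return (index_time - event_time).total_seconds()


def utc(hour, minute, second):
    # Hypothetical helper: builds a timestamp on an arbitrary sample day.
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical (event time, index time) pairs, not real measurements.
samples = [
    (utc(12, 0, 0), utc(12, 0, 2)),
    (utc(12, 0, 5), utc(12, 0, 10)),
    (utc(12, 1, 0), utc(12, 1, 45)),
]

lags = [event_lag_seconds(event, indexed) for event, indexed in samples]
print(f"median lag: {median(lags)}s, worst lag: {max(lags)}s")
```

Tracking both a typical value (median) and the worst case is one reasonable choice here, since a healthy median can hide a small number of badly delayed events.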

As you prepare for or encounter scaling events, you can use these common metrics to help determine the effectiveness of your scale operation.  As a general rule, we recommend scaling once CPU or memory reaches and sustains 70% utilization during normal operations.
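As a rough sketch of what "reaches and sustains 70%" could mean in practice, the Python below flags a node only when several consecutive utilization samples are at or above the threshold. The number of consecutive samples and the sampling interval are assumptions you would tune yourself, and the sample values are invented for illustration.

```python
def sustained_above(samples, threshold=70.0, min_consecutive=3):
    """Return True if `min_consecutive` samples in a row are at or above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when utilization drops below the threshold.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU utilization samples (percent), for example one per 5 minutes.
brief_spike = [55, 90, 60, 62, 58]
steady_climb = [55, 72, 68, 71, 74, 78, 66]

print(sustained_above(brief_spike))   # a single spike is not sustained
print(sustained_above(steady_climb))  # three samples in a row at or above 70
```

Requiring a run of samples rather than a single reading keeps short bursts from triggering a scaling action.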

Scaling up (vertical)

As CPU and memory become overutilized due to changes in data processing, the most straightforward approach to scaling is often to add more of those constrained resources. Supporting metrics such as data rates, queue size, number of source types, and agent counts will often indicate that CPU and memory are likely to become constrained, but the best indicator for deciding when to allocate or deallocate those resources is the measurement of the core resources themselves. As CPU or memory utilization approaches 80% or more, use the supporting metrics to determine whether additional resources will properly address the situation. You can use the magnitude of the increases in the supporting metrics to help inform your resource allocation estimates.

As the diagrams above illustrate, adding more CPU and memory resources is the most common approach when pipeline complexity and event load increase. This organic, gradual growth is the most likely scenario as use of the platform increases over time. However, at some point you will see diminishing returns on vertical growth, whether due to CPU parallelism constraints, infrastructure limitations, increased risk from node failure, or practical or internal hardware provisioning guidelines. As you reach vertical limitations, or as your situation aligns better to horizontal scaling, it’s probably time to consider adding more nodes rather than just making bigger nodes.

Pros

  • Generally very easy to do in virtualized or containerized environments
  • Fewer nodes to manage
  • No need to provision or maintain new hardware
  • Most common solution
  • Easy to see results

Cons

  • Puts more data at risk due to node failure
  • Infrastructure will likely have maximum allowable sizes
  • Diminishing returns above 6 CPU cores (as of this writing)
  • Additional resource cost

Scaling out (horizontal)

Vertically scaling the hardware resources for a given workload on a given node is typically the first and most important step in scaling. However, as your utilization increases beyond your chosen vertical scale, and as you look to address other scaling challenges, it’s time to scale out. Scaling out Edge Processor is the process of provisioning additional nodes that are associated with a given Edge Processor. Remember, an Edge Processor is a logical grouping of configuration settings, and each Edge Processor has one or more nodes assigned to it. Just as you provisioned your first Edge Processor node by using the Manage Instances screen, you will use that same screen to add more nodes.

5.png

As illustrated in the above diagrams, the primary considerations for scaling out are a bit different than when scaling up. Just as with Splunk indexers and intermediate forwarders, you will generally look at horizontal scaling of Edge Processors to support other horizontally-focused growth, such as supporting a significantly larger number of forwarders, reducing data loss risk, and providing a wider funnel for event distribution. The great part of horizontal scaling is that more nodes will generally achieve these resiliency-focused goals while also reducing resource constraints on a per-node basis.
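To make the trade-off concrete, here is a minimal Python sketch of how a node count might be derived once a single node can no longer grow. The default of 6 cores per node reflects the diminishing-returns note in this article (as of this writing). The idea of adding spare nodes so the workload survives a node failure is an assumption for illustration, not official sizing guidance, and the function name is hypothetical.

```python
import math


def nodes_needed(total_cores, cores_per_node=6, tolerated_failures=1):
    """Nodes required to supply `total_cores`, plus spares for node failures."""
    # Nodes needed to carry the workload when every node is healthy.
    base_nodes = math.ceil(total_cores / cores_per_node)
    # Extra nodes so the remaining ones still cover the workload after failures.
    return base_nodes + tolerated_failures


# Hypothetical workload that needs about 14 cores in total.
nodes = nodes_needed(total_cores=14)
print(f"{nodes} nodes of 6 cores each")
```

With these sample numbers, losing any one node still leaves enough cores for the estimated workload, which is the resiliency benefit that scaling out is meant to provide.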

Pros

  • Reduce per node resource constraints
  • Reduce per node data loss risk
  • Can improve event distribution
  • Increase workload within vertical scaling constraints

Cons

  • More nodes to manage
  • Time spent onboarding new instances
  • Managing firewall and network routing rules
  • Additional cost

As you can see, the primary challenge with horizontal scaling is that there is more infrastructure to manage, which also means more administrative time for onboarding and offboarding these instances as your requirements change. This can in turn lead to CMDB churn, more reliance on other teams, additional troubleshooting, and other common infrastructure-related toil.

Next steps

This series is meant to help alleviate these horizontal scaling challenges and provide a fast on-ramp for growing your Edge Processor footprint in a rapid and easily supported way.