Scaling Edge Processor infrastructure

In this series we’ll be looking at scaling Splunk Edge Processor using Amazon EKS, as well as some steps to get us there. There are a number of factors that can affect the required scale of your Splunk Edge Processor infrastructure, including:

Changes to data volumes
Implementation or retirement of use cases
Increasing or decreasing pipeline complexity

All of these are very common scenarios that require Edge Processor scale to be adaptable, and in this first article we’ll explore scaling concepts.

How does Edge Processor scale?

As your data processing requirements change, you’ll have to decide whether to scale by altering the resources available to your Edge Processor nodes (vertical scaling) or by altering the number of nodes doing the processing (horizontal scaling). As is the case with most technology, Edge Processor nodes scale both vertically and horizontally, depending on the observed behavior. The scenarios below call out common situations and the scaling outcome each most often leads to.

Scale up (vertical)

Scenario examples:
- Data pipeline complexity increases, such as:
  - More complex regular expressions
  - Multi-value `evals` and `mvexpand`
  - More destinations
- Significant event size or event volume increases
- Long persistent queue requirements

Scale example: [diagram showing the "From" and "To" states]

Scale out (horizontal)

Scenario examples:
- Number of universal forwarders increases
- Need to improve indexer event distribution (avoid funneling)
- Spread out persistent queues
- Improve resiliency, reduce impact of node failures

Scale example: [diagram showing the "From" and "To" states]

For most purposes we’ll consider any substantial change to any of the following as cause to evaluate our scale:

Event volume, both number and size of events
Number of agents
Number and complexity of pipelines
Change in target destinations
Risk tolerance

Any change to these factors will play a role in the overall resource consumption and processing speed of your Edge Processor nodes.

How do I know when it’s time to scale?

Monitor, monitor, monitor. Because Edge Processor nodes run on your infrastructure, you need to initiate and manage changes to scale. With the telemetry reporting built into Edge Processor, you can monitor and react to resource utilization reported by the Edge Processor nodes. This could result in manually adjusting scale, executing automation frameworks, invoking cloud workflows, or, as we’ll explore in this series, using containers and Kubernetes to scale to demand.

You can explore all of the available metrics in Splunk Cloud Platform using the mcatalog command:

| mcatalog values(metric_name) WHERE index=_metrics AND sourcetype="edge-metrics"

No single metric can tell you it’s time to scale up or down. Instead, we suggest watching some key metrics across your Edge Processors to establish a baseline of expected usage. In particular:

GB in/out per day
Event counts in/out per day
Exporter queue size
CPU & memory consumption
Unique source types and agents per day

Additionally, consider measuring event lag by comparing index time with event time. This is a good general practice for getting data in (GDI) health, irrespective of Edge Processor.
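Event lag is simply index time minus event time. Here is a minimal Python sketch of that calculation, assuming you have exported each event's two timestamps as epoch seconds (in Splunk these correspond to the `_time` and `_indextime` fields). The function names and sample values are illustrative only.

```python
from statistics import median

def event_lag_seconds(event_time, index_time):
    """Seconds between when an event occurred and when it was indexed."""
    return index_time - event_time

def summarize_lag(pairs):
    """Summarize lag for an iterable of (event_time, index_time) tuples."""
    lags = [event_lag_seconds(e, i) for e, i in pairs]
    return {"median": median(lags), "max": max(lags)}

# Illustrative sample: three events with lags of 2, 5, and 1 seconds.
sample = [
    (1700000000, 1700000002),
    (1700000010, 1700000015),
    (1700000020, 1700000021),
]
summary = summarize_lag(sample)
print(summary)
```

Tracking the median alongside the maximum helps separate a general slowdown from a handful of late-arriving events.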

As you prepare for or encounter scaling events, you can use these common metrics to help determine the effectiveness of your scale operation. As a general rule, we recommend scaling once CPU or memory reaches and sustains 70% utilization during normal operations.
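The 70% guideline above can be checked mechanically. The following minimal Python sketch is not part of Edge Processor. It assumes you have already collected CPU or memory utilization percentages at a fixed interval. The function name and the six-sample window are hypothetical choices; pick a window that matches what "sustained" means in your environment.

```python
def sustained_above(samples, threshold=70.0, min_consecutive=6):
    """Return True if the most recent `min_consecutive` samples are all
    at or above `threshold` (utilization expressed in percent)."""
    if len(samples) < min_consecutive:
        return False  # not enough history to call the load "sustained"
    return all(s >= threshold for s in samples[-min_consecutive:])

# Hypothetical CPU readings, oldest first.
steady_load = [55, 72, 71, 74, 78, 73, 75]
brief_spike = [40, 45, 92, 50, 48, 47, 44]

print(sustained_above(steady_load))  # last six readings are all >= 70
print(sustained_above(brief_spike))  # a single spike is not sustained
```

Requiring several consecutive readings keeps a short burst from triggering a scale operation.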

Scaling up (vertical)

As CPU and memory become overutilized due to changes in data processing, often the most straightforward approach to scaling is to add more of those constrained resources. Supporting metrics such as data rates, queue size, number of source types, and agent counts will often indicate that the core resources of CPU and memory are likely to become constrained, but the best indicator for deciding when to allocate or deallocate those resources is the measurement of the core resources themselves. As CPU or memory approaches 80% or more, use the supporting metrics to determine whether additional resources will properly address the situation. You can use the magnitude of the increases in the supporting metrics to help inform your resource allocation estimates.
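One rough way to turn a supporting-metric increase into a resource estimate is sketched below in Python. It assumes CPU demand grows linearly with the supporting metric (for example, event volume). That is a simplifying assumption, not documented Edge Processor behavior, and the function name, parameters, and numbers are hypothetical.

```python
import math

def estimate_cores(current_cores, current_util_pct, growth_factor, target_util_pct):
    """Estimate cores needed to land near `target_util_pct` after the
    workload grows by `growth_factor`, ASSUMING linear scaling."""
    busy_cores = current_cores * current_util_pct * growth_factor / target_util_pct
    return math.ceil(busy_cores)

# 4 cores at 80% today, event volume expected to grow 1.5x,
# aiming for about 60% utilization afterwards.
print(estimate_cores(4, 80, 1.5, 60))
```

Treat the result as a starting point and confirm it against the measured CPU and memory after the change.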

As the diagrams above illustrate, adding more CPU and memory resources is the most common approach when pipeline complexity and event load increase. This organic, gradual growth is the most likely scenario over time as use of the platform increases. However, at some point you will see diminishing returns on vertical growth, whether due to CPU parallelism constraints, infrastructure limitations, increased risk from node failure, or just practical or internal hardware provisioning guidelines. As you reach vertical limitations, or as your situation aligns better with horizontal scaling, it’s probably time to consider adding more nodes rather than just making bigger nodes.

Pros

Generally very easy to do in virtualized or containerized environments
Fewer nodes to manage
No need to provision or maintain new hardware
Most common solution
Easy to see results

Cons

Puts more data at risk due to node failure
Infrastructure will likely have maximum allowable sizes
Diminishing returns above 6 CPU cores (as of this writing)
Additional resource cost
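The diminishing returns above 6 CPU cores suggest a simple planning rule: once the total core requirement exceeds that ceiling, split it across more nodes. Here is a minimal Python sketch of that rule. The 6-core ceiling comes from the list above, while the function name and the even split across nodes are assumptions made for illustration.

```python
def plan_nodes(total_cores_needed, max_cores_per_node=6):
    """Return (node_count, cores_per_node), keeping each node at or
    below `max_cores_per_node` and splitting the total evenly."""
    # Integer ceiling division avoids floating-point rounding.
    node_count = -(-total_cores_needed // max_cores_per_node)
    cores_per_node = -(-total_cores_needed // node_count)
    return node_count, cores_per_node

print(plan_nodes(6))   # fits on a single node
print(plan_nodes(8))   # exceeds the ceiling, so split across nodes
print(plan_nodes(20))
```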

Scaling out (horizontal)

Vertically scaling the hardware resources for a given workload on a given node is typically the first and most important step in scaling. However, as your utilization grows beyond your chosen vertical scale, and as you look to address other scaling challenges, it’s time to scale out. Scaling out Edge Processor is the process of provisioning additional nodes that are associated with a given Edge Processor. Remember, an Edge Processor is a logical grouping of configuration settings, and each Edge Processor has one or more nodes assigned to it. Just as you provisioned your first Edge Processor node by using the Manage Instances screen, you will use that same screen to add more nodes.

As illustrated in the above diagrams, the primary considerations for scaling out are a bit different from those for scaling up. As with Splunk indexers and intermediate forwarders, you will generally look to horizontal scaling of Edge Processors to support other horizontally focused growth, such as supporting a significantly larger number of forwarders, reducing data loss risk, and providing a wider funnel for event distribution. The great part of horizontal scaling is that more nodes will generally achieve these resiliency-focused goals while also reducing resource constraints on a per-node basis.
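The resiliency benefit can be put into rough numbers. The Python sketch below assumes traffic is spread evenly across nodes, which real forwarder load balancing only approximates. The function names and values are hypothetical.

```python
def share_at_risk(node_count):
    """Fraction of traffic handled by any one node, assuming even distribution."""
    return 1 / node_count

def utilization_after_failure(node_count, util_pct):
    """Per-node utilization once one node fails and the survivors absorb
    its load. Requires at least two nodes."""
    if node_count < 2:
        raise ValueError("need at least two nodes to absorb a failure")
    return util_pct * node_count / (node_count - 1)

print(share_at_risk(4))                   # one of four nodes carries a quarter of the traffic
print(utilization_after_failure(4, 60))   # survivors run hotter after a failure
```

A check like this helps confirm that the remaining nodes have enough headroom to carry the load while a failed node is replaced.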

Pros

Reduce per node resource constraints
Reduce per node data loss risk
Can improve event distribution
Increase workload within vertical scaling constraints

Cons

More nodes to manage
Time spent onboarding new instances
Managing firewall and network routing rules
Additional cost

As you can see, the primary challenge with horizontal scaling is that it’s more infrastructure to manage, which also means more administrative time for onboarding and offboarding these instances as your requirements change. This can in turn lead to configuration management database (CMDB) churn, more reliance on other teams, additional troubleshooting, and other common infrastructure-related toil.

Next steps

This series is meant to help alleviate these horizontal scaling challenges and provide a fast on-ramp for growing your Edge Processor footprint in a rapid and easily supported way. Now, move on to the next article to learn about bootstrapping Splunk Edge Processor authentication.

To follow along with the steps in the subsequent articles, you will need to request access to the API Token Automation Beta program, which contains the executables used in the processes outlined in this series.