Data collection architecture


The Splunk platform can index any type of data sent from the collection tier, making it available for search. Efficient and reliable forwarding to the indexing tier is critical to the success of any Splunk deployment. There are several different aspects you should consider when planning your data collection tier architecture:

Data source varieties: log files, syslog, APIs, databases, HEC, network inputs, OS event logging facilities, applications, message buses
Requirements for data ingest latency and throughput
Requirements for security and compliance
Requirements for fault tolerance and high availability
Strategy for ideal event distribution across the indexing tier

For more information specific to getting data in (GDI), see:

Data collection components

Forwarders

Universal forwarder. The universal forwarder (UF) is the best choice for a large set of data collection requirements from systems in your environment. It is a purpose-built data collection mechanism with minimal resource requirements. The UF should be the default choice for collecting and forwarding log data.

Heavy forwarder. The heavy forwarder (HF) is a full Splunk Enterprise deployment configured to act as a forwarder with indexing disabled. An HF generally performs no other Splunk roles. The key difference between a UF and an HF is that the HF contains the full parsing pipeline and performs the same functions as an indexer, except that it does not write and index events on disk. This lets the HF understand and act on individual events, for example to mask data or to filter and route based on event content. Because it is a full Splunk Enterprise installation, it can host modular inputs that require a full Python stack for data collection, or serve as an endpoint for the Splunk HTTP Event Collector (HEC).
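To make the per-event capabilities concrete, the following Python sketch mimics what filtering, masking, and routing mean conceptually. It is not how a heavy forwarder is implemented, and in Splunk these behaviors are set up through configuration (such as props.conf and transforms.conf) rather than code. The function name, the regular expression, and the destination labels are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a standalone 16-digit number, keeping the last four digits.
CARD_RE = re.compile(r"\b\d{12}(\d{4})\b")


def process_event(raw: str) -> Optional[Tuple[str, str]]:
    """Return (destination, event), or None if the event is filtered out."""
    # Filtering: drop events that should never be forwarded.
    if "DEBUG" in raw:
        return None
    # Masking: rewrite sensitive data before it leaves the forwarder.
    masked = CARD_RE.sub(r"XXXXXXXXXXXX\1", raw)
    # Routing: choose a destination based on the content of the event.
    if "authentication failure" in masked:
        return "security_indexers", masked
    return "default_indexers", masked
```

A universal forwarder generally cannot make these decisions because it does not parse the stream into individual events, which is why these rows read "No" for the UF in the comparison table.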

Comparison of universal and heavy forwarders
Features and capabilities	Universal forwarder	Heavy forwarder
Type of Splunk Enterprise instance	Dedicated executable	Full Splunk Enterprise (with some features disabled)
Footprint (memory, CPU load)	Smallest	Medium-to-large (depending on enabled features)
Bundles Python?	No	Yes
Handles data inputs?	All types (scripted inputs might require Python installation)	All types
Forwards to Splunk Enterprise?	Yes	Yes
Forwards to third party systems?	Yes	Yes
Serves as intermediate forwarder?	Yes	Yes
Indexer acknowledgment?	Optional	Optional
Load balancing?	Yes	Yes
Data cloning?	Yes	Yes
Per-event filtering?	No	Yes
Event routing?	No	Yes
Event parsing?	Sometimes	Yes
Local indexing?	No	Optional
Searching/alerting?	No	Optional
Splunk Web?	No	Optional
Anonymize data?	No	Yes

HTTP Event Collector (HEC)

The HEC provides a listener service that accepts HTTP and HTTPS connections, allowing applications to post log data payloads directly to either the indexing tier or a dedicated HEC receiver tier that consists of one or more heavy forwarders (HF). HEC provides two endpoints: one accepts data in raw format and the other accepts data in JSON format. With JSON, additional metadata can be included in the event payload, which can make the data easier to search later.
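As a rough illustration of the two endpoints, the sketch below builds (but does not send) one request for the JSON event endpoint and one for the raw endpoint, using only the Python standard library. The host name, token, and helper function names are placeholders. The sketch assumes the commonly documented paths /services/collector/event and /services/collector/raw and the "Authorization: Splunk <token>" header; depending on how HEC is configured, the raw endpoint may also require a channel identifier, which is omitted here.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_hec_event_request(base_url: str, token: str, event: Any,
                            sourcetype: Optional[str] = None,
                            index: Optional[str] = None,
                            fields: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Build a POST for the JSON endpoint; metadata travels with the event."""
    payload: Dict[str, Any] = {"event": event}
    if sourcetype is not None:
        payload["sourcetype"] = sourcetype
    if index is not None:
        payload["index"] = index
    if fields is not None:
        payload["fields"] = fields  # extra key/value metadata for later searching
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Splunk {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def build_hec_raw_request(base_url: str, token: str, raw_text: str) -> urllib.request.Request:
    """Build a POST for the raw endpoint; the body is sent as-is."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/raw",
        data=raw_text.encode("utf-8"),
        headers={"Authorization": f"Splunk {token}"},
        method="POST",
    )
```

Passing either request object to urllib.request.urlopen would send it. In a real deployment the base URL would normally point at the load balancer in front of the HEC receivers rather than at a single host.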

Data collection node (DCN)

Some data sources must be collected through an API. The query mechanisms involved can include REST, web services, JMS, and JDBC. Splunk and third-party developers provide a wide variety of applications that handle these API interactions. Most commonly, these applications are implemented using the Splunk Modular Input framework, which requires a full Splunk Enterprise installation to function properly. The best practice is to deploy one or more heavy forwarders configured to work as data collection nodes.

Syslog data collection

The syslog protocol delivers a ubiquitous source for log data in the enterprise. Most scalable and reliable data collection tiers contain a syslog ingestion component. There are multiple ways to get syslog data into Splunk:

Splunk Connect for Syslog (SC4S): This is the current best practice recommendation to collect syslog data. It provides a Splunk-supported turn-key solution and utilizes the HTTP Event Collector to send data to Splunk for indexing. It scales well and addresses the shortcomings of other methods. For more information, see Understanding best practices for Splunk Connect for Syslog.
Universal forwarder: Use a Splunk UF to monitor (ingest) files written out by a syslog server (such as rsyslog or syslog-ng). Although this approach is still widely used, it is no longer the recommended best practice; use SC4S instead.
Direct TCP/UDP input: Splunk can listen on a TCP or UDP port (the default syslog port is UDP 514) and accept syslog traffic directly. While this is acceptable for lab and test environments, Splunk strongly discourages this practice in any production environment.

Collection tier topology example

Intermediary forwarding tier (IF)

In some situations, intermediary forwarders (IFs) are needed for data forwarding. IFs receive log streams from endpoints and forward them to the indexing tier. IFs introduce architectural challenges that require careful design to avoid negative impacts on the overall Splunk environment. Most prominently, IFs concentrate connections from hundreds to tens of thousands of endpoint forwarders and forward to indexers using a far smaller number of connections. This can degrade data distribution across the indexing tier, because only a subset of indexers receives traffic at any given point in time. However, these side effects can be mitigated by proper sizing and configuration.
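A back-of-the-envelope model can show why a small number of connections leaves indexers idle. The sketch below assumes, purely for illustration, that each IF output pipeline is connected to exactly one indexer at a time and that the indexer is chosen uniformly at random and independently of the other pipelines. Under that assumption, with N indexers and P pipelines, the expected fraction of indexers receiving data at a given instant is 1 - (1 - 1/N)^P. The function name is hypothetical, and real behavior depends on forwarder configuration.

```python
def expected_indexer_coverage(indexers: int, pipelines: int) -> float:
    """Expected fraction of indexers with at least one connection at an instant.

    Simplified model: every pipeline independently picks one indexer
    uniformly at random.
    """
    if indexers < 1 or pipelines < 0:
        raise ValueError("indexers must be >= 1 and pipelines must be >= 0")
    # Probability that a given indexer is missed by all pipelines.
    missed = (1.0 - 1.0 / indexers) ** pipelines
    return 1.0 - missed
```

Under this model, 4 pipelines in front of 20 indexers reach roughly 19% of the indexers at any instant, while 40 pipelines reach roughly 87%. Because forwarders switch targets over time, distribution evens out over longer windows, but the model illustrates why more pipelines improve the instantaneous spread.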

Intermediary forwarding topology example

Data collection tier recommended best practices

Use the UF to forward data whenever possible

Limit use of a heavy forwarder to the use cases that require it. UFs have a number of benefits:

Built-in autoLB
Restart capable
Centrally configurable
Small resource demand

Have at least twice as many IF pipelines as indexers when funneling a large number of UFs

Funneling a large number of endpoint forwarders through a small number of intermediary forwarders can impact balanced event distribution across indexers, which can negatively impact search performance. Having at least twice as many IF pipelines as indexers can reduce this impact. Note that you should only deploy intermediary forwarders if absolutely necessary.
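The sizing rule above reduces to simple arithmetic. The following sketch computes the minimum number of intermediary forwarders for a given indexer count. It assumes every IF runs the same number of ingestion pipelines (in Splunk this is typically controlled by the parallelIngestionPipelines setting in server.conf); the function name and the example numbers are hypothetical.

```python
import math


def min_intermediate_forwarders(indexers: int, pipelines_per_forwarder: int = 1) -> int:
    """Smallest IF count such that total IF pipelines >= 2 * indexers."""
    if indexers < 1 or pipelines_per_forwarder < 1:
        raise ValueError("both arguments must be >= 1")
    required_pipelines = 2 * indexers
    # Round up: a partially needed forwarder is still a whole forwarder.
    return math.ceil(required_pipelines / pipelines_per_forwarder)
```

For example, 10 indexers call for 20 IF pipelines, which means 5 intermediary forwarders at 4 pipelines each, or 7 at 3 pipelines each. Additional pipelines consume additional CPU and memory on each IF, so the per-forwarder pipeline count should be validated against the available hardware.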

Consider securing UF traffic sent to indexers using SSL/TLS

Encrypting data in transit improves the security of your deployment. When SSL/TLS is used, the forwarded stream can also be compressed, which reduces the amount of data transmitted.

Use the native Splunk load balancer to spray data to the indexing tier

Network load balancers are not currently supported between forwarders and indexers.

Utilize Splunk Connect for Syslog (SC4S) containers for syslog collection as close to the data sources as possible

SC4S can be quickly deployed and configured to collect data from most popular data sources with minimal effort. For more information, see Understanding best practices for Splunk Connect for Syslog.

Use HEC for agentless collection (instead of native TCP/UDP)

The HTTP Event Collector (HEC) is a listening service that allows events to be posted over HTTP or HTTPS. It can be enabled directly on indexers or configured on a heavy forwarder tier. In either case, place a load balancer in front of the HEC receivers.

Next steps

These resources might help you understand and implement this guidance:

Splunk Docs: Splunk Validated Architectures
Splunk Docs: Reference hardware
Splunk Outcome Path: Monitoring and alerting for key event readiness
