Data collection architecture

 

The Splunk platform can index any type of data sent from the collection tier, making it available for search. Efficient and reliable forwarding to the indexing tier is critical to the success of any Splunk deployment. Consider the following aspects when planning your data collection tier architecture:

  • Data source varieties: log files, syslog, APIs, databases, HEC, network inputs, OS event logging facilities, applications, and message buses 
  • Requirements for data ingest latency and throughput 
  • Requirements for security and compliance 
  • Requirements for fault tolerance and high availability
  • Strategy for ideal event distribution across the indexing tier 

Data collection components

Forwarders

Universal forwarder. The universal forwarder (UF) is the best choice for a large set of data collection requirements from systems in your environment. It is a purpose-built data collection mechanism with minimal resource requirements. The UF should be the default choice for collecting and forwarding log data.
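As a sketch of a typical UF setup, the following configuration monitors an application log file and forwards it to the indexing tier. The hostnames, sourcetype, and file path are hypothetical placeholders, not values from this document:

```
# inputs.conf -- monitor an application log file (path and sourcetype are examples)
[monitor:///var/log/myapp/app.log]
sourcetype = myapp:log
index = main

# outputs.conf -- forward to the indexing tier (hostnames are examples)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
```

With more than one server listed in the target group, the UF automatically load-balances across the indexers.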

Heavy forwarder. The heavy forwarder (HF) is a full Splunk Enterprise deployment configured to act as a forwarder with indexing disabled. An HF generally performs no other Splunk roles. The key difference between a UF and an HF is that the HF contains the full parsing pipeline and performs the same functions as an indexer without actually writing events to disk. This enables the HF to understand and act on individual events, for example to mask data or to perform filtering and routing based on event data. Because it is a full Splunk Enterprise installation, it can host modular inputs that require a full Python stack to function properly for data collection, or serve as an endpoint for the Splunk HTTP Event Collector (HEC).
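For example, because the HF runs the full parsing pipeline, it can mask sensitive data before events ever reach the indexers. The following props.conf fragment is a minimal sketch, assuming a hypothetical sourcetype and a US-style SSN pattern:

```
# props.conf on the heavy forwarder -- mask SSN-like patterns in events
# of a hypothetical sourcetype before forwarding
[myapp:log]
SEDCMD-mask_ssn = s/\d{3}-\d{2}-\d{4}/XXX-XX-XXXX/g
```

A UF cannot apply this kind of per-event transformation because it does not parse individual events.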

Comparison of universal and heavy forwarders

| Features and capabilities | Universal forwarder | Heavy forwarder |
|---|---|---|
| Type of Splunk Enterprise instance | Dedicated executable | Full Splunk Enterprise (with some features disabled) |
| Footprint (memory, CPU load) | Smallest | Medium-to-large (depending on enabled features) |
| Bundles Python? | No | Yes |
| Handles data inputs? | All types (scripted inputs might require Python installation) | All types |
| Forwards to Splunk Enterprise? | Yes | Yes |
| Forwards to third-party systems? | Yes | Yes |
| Serves as intermediate forwarder? | Yes | Yes |
| Indexer acknowledgment? | Optional | Optional |
| Load balancing? | Yes | Yes |
| Data cloning? | Yes | Yes |
| Per-event filtering? | No | Yes |
| Event routing? | No | Yes |
| Event parsing? | Sometimes | Yes |
| Local indexing? | No | Optional |
| Searching/alerting? | No | Optional |
| Splunk Web? | No | Optional |
| Anonymize data? | No | Yes |

HTTP Event Collector (HEC) 

The HEC provides a listener service that accepts HTTP and HTTPS connections on the server side, allowing applications to post log data payloads directly to either the indexing tier or a dedicated HEC receiver tier that consists of one or more heavy forwarders. HEC provides two endpoints that accept data in either raw format or JSON format. Using JSON allows additional metadata to be included in the event payload, which can provide greater flexibility when searching the data later.
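As an illustration, HEC is enabled through an inputs.conf stanza on the receiver, and applications then post events with the token in an Authorization header. The hostname and token name below are hypothetical:

```
# inputs.conf on the HEC receiver (indexer or heavy forwarder)
[http]
disabled = 0
port = 8088

# A named token for a hypothetical application
[http://myapp_token]
token = <generated-token-guid>
index = main
sourcetype = myapp:json

# An application can then post a JSON event to the event endpoint:
# curl -k https://hec.example.com:8088/services/collector/event \
#      -H "Authorization: Splunk <generated-token-guid>" \
#      -d '{"event": {"action": "login"}, "sourcetype": "myapp:json"}'
```

The /services/collector/event endpoint accepts JSON payloads with metadata fields; /services/collector/raw accepts unstructured data.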

Data collection node (DCN) 

Some data sources require collection through an API. These APIs can include REST, web services, JMS, and JDBC as the query mechanism. Splunk and third-party developers provide a wide variety of applications that allow these API interactions to occur. Most commonly, these applications are implemented using the Splunk Modular Input framework, which requires a full Splunk Enterprise software installation to function properly. The best practice is to deploy one or more heavy forwarders configured as dedicated data collection nodes.

Syslog data collection

The syslog protocol delivers a ubiquitous source for log data in the enterprise. Most scalable and reliable data collection tiers contain a syslog ingestion component. There are multiple ways to get syslog data into Splunk: 

  • Splunk Connect for Syslog (SC4S): This is the current best practice recommendation to collect syslog data. It provides a Splunk-supported turn-key solution and utilizes the HTTP Event Collector to send data to Splunk for indexing. It scales well and addresses the shortcomings of other methods. For more information, see Understanding best practices for Splunk Connect for Syslog.
  • Universal forwarder: Use a Splunk UF to monitor (ingest) files written out by a syslog server (such as rsyslog or syslog-ng). While still widely used, this approach is no longer recommended as a best practice; use SC4S instead. 
  • Direct TCP/UDP input: Splunk can listen on a TCP or UDP port (the default syslog port is UDP 514) and accept syslog traffic directly. While this is acceptable for lab and test environments, Splunk strongly discourages this practice in any production environment.
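For the UF-based approach, a common pattern is to have the syslog server write one directory per sending host and let the UF derive the host field from the path. This is a sketch with a hypothetical directory layout:

```
# inputs.conf on the UF -- monitor files written by rsyslog/syslog-ng,
# assuming a layout of /var/log/remote/<hostname>/messages.log
[monitor:///var/log/remote/*/messages.log]
sourcetype = syslog
# Use the 4th path segment (the <hostname> directory) as the host field
host_segment = 4
```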

Collection tier topology example

[Image: collection tier topology diagram]

Intermediary forwarding tier (IF)

In some situations, intermediary forwarders (IFs) are needed for data forwarding. IFs receive log streams from endpoints and forward them on to the indexing tier. IFs introduce architectural challenges that require careful design to avoid negative impacts on the overall Splunk environment. Most prominently, IFs concentrate connections from hundreds to tens of thousands of endpoint forwarders into a far smaller number of connections to the indexers. This can skew data distribution across the indexing tier because only a subset of indexers receives traffic at any given point in time. However, these side effects can be mitigated by proper sizing and configuration.

Intermediary forwarding topology example

[Image: intermediary forwarding topology diagram]

Data collection tier recommended best practices

Use the UF to forward data whenever possible 

Limit use of a heavy forwarder to the use cases that require it. UFs have a number of benefits:

  • Built-in autoLB
  • Restart capable
  • Centrally configurable
  • Small resource demand

Have at least twice as many IF pipelines as indexers when funneling a large number of UFs

Funneling a large number of endpoint forwarders through a small number of intermediary forwarders can impact balanced event distribution across indexers, which can negatively impact search performance. Having at least twice as many IF pipelines as indexers can reduce this impact. Note that you should only deploy intermediary forwarders if absolutely necessary.
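Additional pipelines are enabled per IF instance in server.conf, so the total pipeline count is instances multiplied by pipelines per instance. A minimal sketch:

```
# server.conf on each intermediary forwarder -- run two ingestion
# pipelines per instance, so e.g. 8 IFs provide 16 pipelines total
[general]
parallelIngestionPipelines = 2
```

Each additional pipeline consumes additional CPU and memory, so verify the host is sized accordingly before raising this value.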

Consider securing UF traffic sent to indexers using SSL/TLS

Encrypting forwarder-to-indexer traffic with SSL/TLS improves the security of your deployment, and because the Splunk platform compresses SSL traffic, it can also reduce the amount of data transmitted.
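On the forwarder side, TLS is configured in outputs.conf. The following is a sketch only; the certificate path and password are placeholders, and attribute names vary somewhat across Splunk versions:

```
# outputs.conf on the forwarder -- example TLS settings (paths are placeholders)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
clientCert = $SPLUNK_HOME/etc/auth/client.pem
sslPassword = <certificate-password>
sslVerifyServerCert = true
```

The receiving indexers must be configured with matching SSL settings on their listening port.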

Use the native Splunk load balancer to spray data to the indexing tier

Network load balancers are not currently supported between forwarders and indexers.
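Splunk's native load balancing is configured simply by listing multiple indexers in the forwarder's target group; the forwarder switches targets on an interval. Hostnames below are examples:

```
# outputs.conf on the forwarder -- native auto load balancing
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
# Switch to a new randomly chosen indexer roughly every 30 seconds
autoLBFrequency = 30
```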

Utilize Splunk Connect for Syslog (SC4S) containers for syslog collection as close to the data sources as possible

SC4S can be quickly deployed and configured to collect data from most popular data sources with minimal effort. For more information, see Understanding best practices for Splunk Connect for Syslog.

Use HEC for agentless collection (instead of native TCP/UDP)

The HTTP Event Collector (HEC) is a listening service that allows events to be posted over HTTP or HTTPS. It can be enabled directly on indexers or configured on a heavy forwarder tier, with either option fronted by a load balancer.

Next steps

These resources might help you understand and implement this guidance: