Data collection architecture
The Splunk platform can index any type of data sent from the collection tier, making it available for search. Efficient and reliable forwarding to the indexing tier is critical to the success of any Splunk deployment. There are several different aspects you should consider when planning your data collection tier architecture:
- Data source varieties: log files, syslog, APIs, databases, HEC, network inputs, OS event logging facilities, applications, message buses
- Requirements for data ingest latency and throughput
- Requirements for security and compliance
- Requirements for fault tolerance and high availability
- Strategy for ideal event distribution across the indexing tier
For more information specific to getting data in (GDI), see:
Data collection components
Forwarders
Universal forwarder. The universal forwarder (UF) is the best choice for a large set of data collection requirements from systems in your environment. It is a purpose-built data collection mechanism with minimal resource requirements. The UF should be the default choice for collecting and forwarding log data.
Heavy forwarder. The heavy forwarder (HF) is a full Splunk Enterprise deployment configured to act as a forwarder with indexing disabled. An HF generally performs no other Splunk roles. The key difference between a UF and an HF is that the HF contains the full parsing pipeline and performs the same functions as an indexer, without writing and indexing events to disk. This enables the HF to understand and act on individual events, for example to mask data or to filter and route based on event data. Because it is a full Splunk Enterprise install, it can host modular inputs that require a full Python stack for data collection, or serve as an endpoint for the Splunk HTTP Event Collector (HEC).
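To make "act on individual events" concrete, the following is a conceptual Python sketch only. On an HF this behavior is defined through Splunk configuration rather than code, and the pattern, the filter rule, and the destination names (`security_indexers`, `default_indexers`) are hypothetical.

```python
import re

# Hypothetical pattern: keep only the last four digits of a 16-digit number.
CARD_PATTERN = re.compile(r"\b\d{12}(\d{4})\b")


def process_event(raw_event):
    """Return (destination, event) for one event, or None if it is filtered out."""
    # Filtering: drop noisy debug events before they are forwarded.
    if " DEBUG " in raw_event:
        return None
    # Masking: rewrite sensitive data inside the event.
    masked = CARD_PATTERN.sub(r"XXXXXXXXXXXX\1", raw_event)
    # Routing: choose a destination based on event content.
    destination = "security_indexers" if "auth" in masked else "default_indexers"
    return destination, masked


events = [
    "2024-01-01 12:00:00 INFO auth card=4111111111111111 accepted",
    "2024-01-01 12:00:01 DEBUG cache warm-up complete",
    "2024-01-01 12:00:02 INFO disk usage at 71%",
]
results = [process_event(e) for e in events]
```

All three decisions depend on the content of a single event, which is why they need the full parsing pipeline and are the HF-only rows in the comparison table.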
Comparison of universal and heavy forwarders

| Features and capabilities | Universal forwarder | Heavy forwarder |
|---|---|---|
| Type of Splunk Enterprise instance | Dedicated executable | Full Splunk Enterprise |
| Footprint (memory, CPU load) | Smallest | Medium-to-large |
| Bundles Python? | No | Yes |
| Handles data inputs? | All types | All types |
| Forwards to Splunk Enterprise? | Yes | Yes |
| Forwards to third-party systems? | Yes | Yes |
| Serves as intermediate forwarder? | Yes | Yes |
| Indexer acknowledgment? | Optional | Optional |
| Load balancing? | Yes | Yes |
| Data cloning? | Yes | Yes |
| Per-event filtering? | No | Yes |
| Event routing? | No | Yes |
| Event parsing? | Yes | Yes |
| Local indexing? | No | Optional |
| Searching/alerting? | No | Optional |
| Splunk Web? | No | Optional |
| Anonymize data? | No | Yes |
HTTP Event Collector (HEC)
The HEC provides a listener service that accepts HTTP and HTTPS connections on the server side, allowing applications to post log data payloads directly to either the indexing tier or a dedicated HEC receiver tier that consists of one or more heavy forwarders (HFs). HEC provides two endpoints, one that accepts data in raw format and one that accepts data in JSON format. Using JSON allows additional metadata to be included in the event payload, which can give you greater flexibility when searching the data later.
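As a minimal sketch of the JSON option, the following Python builds (but does not send) a request for the HEC JSON event endpoint, `/services/collector/event`, with metadata alongside the event. The receiver URL, token value, and `sourcetype` are placeholders, and `build_hec_request` is a hypothetical helper, not part of any Splunk library.

```python
import json
import urllib.request


def build_hec_request(base_url, token, event, sourcetype=None, index=None, host=None):
    """Build (but do not send) a POST request for the HEC JSON event endpoint."""
    # The JSON envelope carries the event plus optional metadata fields.
    payload = {"event": event}
    if sourcetype is not None:
        payload["sourcetype"] = sourcetype
    if index is not None:
        payload["index"] = index
    if host is not None:
        payload["host"] = host

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # HEC authenticates with a token in the Authorization header.
            "Authorization": "Splunk " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder receiver URL and token, for illustration only.
request = build_hec_request(
    "https://hec.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"action": "login", "user": "alice"},
    sourcetype="app:auth",
)
# To actually send the event: urllib.request.urlopen(request)
```

In production, the base URL would typically point at the load balancer in front of the indexers or the HF receiver tier rather than at a single host.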
Data collection node (DCN)
Some data sources can only be collected through an API, using REST, web services, JMS, or JDBC as the query mechanism. Splunk and third-party developers provide a wide variety of applications that handle these API interactions. Most commonly, these applications are implemented using the Splunk Modular Input framework, which requires a full Splunk Enterprise software install to function properly. The best practice is to deploy one or more heavy forwarders configured to work as data collection nodes.
Syslog data collection
The syslog protocol delivers a ubiquitous source for log data in the enterprise. Most scalable and reliable data collection tiers contain a syslog ingestion component. There are multiple ways to get syslog data into Splunk:
- Splunk Connect for Syslog (SC4S): This is the current best practice recommendation to collect syslog data. It provides a Splunk-supported turn-key solution and utilizes the HTTP Event Collector to send data to Splunk for indexing. It scales well and addresses the shortcomings of other methods. For more information, see Understanding best practices for Splunk Connect for Syslog.
- Universal forwarder: Use a Splunk UF to monitor (ingest) files written out by a syslog server (such as rsyslog or syslog-ng). Although this approach is still widely used, it is no longer the recommended best practice; use SC4S instead.
- Direct TCP/UDP input: Splunk can listen on a TCP or UDP port (the default syslog port is UDP 514) and accept syslog traffic directly. While this is acceptable for lab and test environments, Splunk strongly discourages this practice in any production environment.
Collection tier topology example
Intermediary forwarding tier (IF)
In some situations, intermediary forwarders (IFs) are needed for data forwarding. IFs receive log streams from endpoints and forward them to the indexing tier. IFs introduce architectural challenges that require careful design to avoid negative impacts on the overall Splunk environment. Most prominently, IFs concentrate connections from hundreds to tens of thousands of endpoint forwarders and forward to indexers using a far smaller number of connections. This can degrade data distribution across the indexing tier because only a subset of indexers receives traffic at any given point in time. However, these side effects can be mitigated by proper sizing and configuration.
Intermediary forwarding topology example
Data collection tier recommended best practices
Use the UF to forward data whenever possible
Limit use of a heavy forwarder to the use cases that require it. UFs have a number of benefits:
- Built-in autoLB
- Restart capable
- Centrally configurable
- Small resource demand
Have at least twice as many IF pipelines as indexers when funneling a large number of UFs
Funneling a large number of endpoint forwarders through a small number of intermediary forwarders can unbalance event distribution across indexers, which can negatively impact search performance. Having at least twice as many IF pipelines as indexers can reduce this impact. Note that you should only deploy intermediary forwarders if absolutely necessary.
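A rough way to see why the pipeline count matters is to estimate how many indexers receive data at any one moment. The sketch below assumes a simplified model that is not taken from Splunk documentation: each IF pipeline independently sends to one indexer chosen uniformly at random during a load-balancing interval. The function name is hypothetical.

```python
def expected_active_fraction(pipelines, indexers):
    """Expected share of indexers receiving data in one load-balancing interval.

    Simplifying assumption: each pipeline independently picks one indexer
    uniformly at random.
    """
    if pipelines < 0 or indexers <= 0:
        raise ValueError("pipelines must be >= 0 and indexers must be > 0")
    # Probability that a given indexer is picked by no pipeline at all.
    p_idle = (1 - 1 / indexers) ** pipelines
    return 1 - p_idle


indexers = 20
for pipelines in (4, 20, 40):
    share = expected_active_fraction(pipelines, indexers)
    print(f"{pipelines:>3} pipelines -> {share:.0%} of {indexers} indexers active")
```

Under this model, 4 pipelines reach roughly 19% of 20 indexers at a time, 20 pipelines roughly 64%, and 40 pipelines (twice the indexer count) roughly 87%. Real distribution also depends on load-balancing settings and data volume per pipeline, so treat these figures as an illustration of the trend, not a sizing formula.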
Consider securing UF traffic sent to indexers using SSL/TLS
Encrypting data in transit with SSL/TLS improves the security of your deployment. When SSL compression is enabled on the connection, it can also reduce the amount of data transmitted.
Use the native Splunk load balancer to spray data to the indexing tier
Network load balancers are not currently supported between forwarders and indexers.
Utilize Splunk Connect for Syslog (SC4S) containers for syslog collection as close to the data sources as possible
SC4S can be quickly deployed and configured to collect data from most popular data sources with minimal effort. For more information, see Understanding best practices for Splunk Connect for Syslog.
Use HEC for agentless collection (instead of native TCP/UDP)
The HTTP Event Collector (HEC) is a listening service that allows events to be posted over HTTP or HTTPS. It can be enabled directly on indexers or configured on a heavy forwarder tier. In both cases, place a load balancer in front of the HEC endpoints.
Next steps
These resources might help you understand and implement this guidance:
- Splunk Docs: Splunk Validated Architectures
- Splunk Docs: Reference hardware
- Splunk Outcome Path: Monitoring and alerting for key event readiness