
Configuring new source types

 

Employing good data onboarding practices is essential to making a Splunk deployment work well, or even work at all. If a data source is ingested with default configurations, the Splunk platform spends a lot of time and processing power guessing the right settings for each event before it ingests the data.

You need best practices for data onboarding to make your Splunk deployment more efficient and to save money with workload pricing.

Solution

When setting up a new source type, there are eight main configurations that need to be set up in all cases. These save the Splunk platform the most work when parsing events and sending data to indexers. This article explains these eight configurations, as well as two more configurations you might need to fully configure a source type. These ten configurations are summarized in the table below.

If you are unfamiliar with how Splunk event processing works, you might want to read the Splunk Docs overview before continuing with this article.

Splunk's Great Eight (always configure for all source types):
  • SHOULD_LINEMERGE
  • LINE_BREAKER
  • TRUNCATE
  • EVENT_BREAKER_ENABLE
  • EVENT_BREAKER
  • TIME_PREFIX
  • MAX_TIMESTAMP_LOOKAHEAD
  • TIME_FORMAT

Optional Extras (but worth considering):
  • CHARSET
  • ANNOTATE_PUNCT

Example data source

The examples in this article are based on the following sample event data set. This is referred to throughout the article as the "Worked Example".

50 Event Date: 2020-07-21 02:04:54.214 fshdc.dom.com iis http://www.google.com/query=fishy
20 Event Date: 2020-07-21 02:05:58.004 fshdc.dom.com iis https://www.outlook.com/login
ERROR 404 Request aborted
90 Event Date: 2020-07-21 03:25:01.023 fshdc.dom.com iis http:/iplayer.bbc.com

Event line breaking

One of the first things that the Splunk platform needs to know about a source type is where each event starts and stops. This is called event line breaking. If every event sits on a single line, this is simple: wait for a new line or carriage return combination.

However, some events span two or more lines, for example error events that include the contents of a code stack. In the Worked Example, the "ERROR 404 Request aborted" line is a continuation of the second event. In cases like this, looking for the end of a line is not effective. Instead, we need to find the beginning of the next event.

Therefore, the configurations that you must set in all cases are: LINE_BREAKER, TRUNCATE, and SHOULD_LINEMERGE.

LINE_BREAKER

This is a regular expression (regex) that sets the pattern that the Splunk platform looks for to place a line break. The Splunk platform, by default, looks for any number of carriage returns and line feed characters as the line breakers, which is configured as ([\r\n]+). If each event is just one line, then this is adequate. However, best practice is to also check for the start of each event and then add that into the configuration in case the file has extra lines added when the software is upgraded or changed.

For the Worked Example, to stop the second event from being broken up too soon, we use the following setting. It places a break only where a new line is followed by two digits, the text "Event Date: ", and a timestamp, so a previous event can safely run over more than one line:

LINE_BREAKER = ([\n\r]+)\d{2}\sEvent Date:\s\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3}

Note that the Splunk platform treats the contents of the parentheses (known as the regex capture group) as the line breaker itself. In the example above, that is any number of new lines (\n) and/or carriage returns (\r). Regex101 can help you check your work, since writing a regex correctly can be difficult.
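
To get a feel for how this plays out on the Worked Example, the following rough Python sketch mimics the LINE_BREAKER semantics. It is not Splunk's parsing engine; the regex and sample data are taken from above. The text matched by the capture group is treated as the event boundary, and the rest of the match begins the next event:

import re

# Rough sketch of LINE_BREAKER semantics -- not Splunk's actual parser.
LINE_BREAKER = r"([\n\r]+)\d{2}\sEvent Date:\s\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3}"

raw = (
    "50 Event Date: 2020-07-21 02:04:54.214 fshdc.dom.com iis http://www.google.com/query=fishy\n"
    "20 Event Date: 2020-07-21 02:05:58.004 fshdc.dom.com iis https://www.outlook.com/login\n"
    "ERROR 404 Request aborted\n"
    "90 Event Date: 2020-07-21 03:25:01.023 fshdc.dom.com iis http:/iplayer.bbc.com"
)

events, start = [], 0
for match in re.finditer(LINE_BREAKER, raw):
    events.append(raw[start:match.start(1)])  # event ends where the capture group begins
    start = match.end(1)                      # next event starts right after the capture group
events.append(raw[start:])

for event in events:
    print(repr(event))
# Three events are produced; "ERROR 404 Request aborted" stays attached to the second one.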

TRUNCATE

This setting is, by default, set to 10,000 characters. For most source types, this value needs to be tailored, because leaving it at the default can cause information to be lost. For example, a long JSON event could be curtailed, and the Splunk platform might then not show the event in its nice JSON formatting. So, best practice dictates that this value should be tailored to the source type rather than left at the default.

To select a good value, you can upload some representative data (two days' worth is often enough) to a test index and then look at the highest event length. Take this number and add a margin of about 10 percent. A useful search for finding the recommended TRUNCATE value this way is as follows:

index="test" sourcetype="<Your new Sourcetype name>"
| eval event_size=len(_raw)
| stats max(event_size) as max_event_size 
| eval "Recommended TRUNCATE Value"=(max_event_size * 1.10)
| fields - max_event_size

An alternative approach is to set TRUNCATE to 999999, which lets the Splunk platform accept unusually large events while still stopping it from crashing because it has ingested a nonsensical file, such as a picture or database file planted by a malicious user.

The Peril of Zero

It is possible to set TRUNCATE to zero, meaning the Splunk platform never truncates an event, but that is not recommended. If a log file becomes corrupted and the line breaker pattern never appears, the Splunk platform can crash while waiting for an event that never ends. That is never a good thing!

For the Worked Example, as the maximum event size that we have is 119 characters, we set TRUNCATE to 119 plus 10 percent, as below:

TRUNCATE = 131

SHOULD_LINEMERGE

The Splunk platform, by default, sets this value to true, in case a new source type is multi-lined. However, leaving this setting as true is not considered a best practice. When SHOULD_LINEMERGE is set to true, the line merge process runs in the aggregator pipeline, which appends all of the lines of an event together and then splits them out again in the next process. If the file is large, this can be a resource-intensive process.

As we should have already used the LINE_BREAKER setting according to best practice, in the Worked Example we use:

SHOULD_LINEMERGE = false

In most cases, we should set SHOULD_LINEMERGE to false, but there are some niche cases where it should be set to true, such as when the event formats are so variable that a single LINE_BREAKER regex cannot cater for them. If this configuration is set to true, there are a few extra settings that can be used as well, as sketched below.
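
For illustration only (a hypothetical multi-line source type, not the Worked Example), a line-merging configuration might pair SHOULD_LINEMERGE = true with BREAK_ONLY_BEFORE, which starts a new event only when a line matches the given regex, and MAX_EVENTS, which caps how many lines can be merged into a single event:

SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}
MAX_EVENTS = 1000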

EVENT_BREAKER_ENABLE and EVENT_BREAKER

These two configurations are relatively new and apply only to universal forwarders (UFs) running version 6.5 and above. A UF uses this pair of configurations to work out when it is safe to switch between indexers in a cluster, so that only whole events are sent. In most cases, EVENT_BREAKER_ENABLE should be set to true and EVENT_BREAKER should match the value in LINE_BREAKER. This allows the forwarder to distribute whole events evenly across all of the indexers in the cluster.

So in the Worked Example above with the data being read by a Splunk 7.3 universal forwarder, we would provide the following configurations:

EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\n\r]+)\d{2}\sEvent Date:\s\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3}

Further information about event line breaking can be found in Splunk Docs.

Event timestamps

The other great performance boost that we can apply to Splunk data onboarding is telling the Splunk platform how to find the timestamps.

The Splunk platform is pretty clever about finding timestamps for its events, because without a timestamp, the Splunk platform cannot organize the data. So, when it has separated the data into events, it searches each event to find a timestamp. Notable features of this generic process are:

  • The Splunk platform looks at the first 128 characters in an event for the timestamp.
  • The Splunk platform uses the first timestamp that it finds in the event.
  • The Splunk platform uses over 30 different regex patterns to search the event for a suitable timestamp that it can use.

There are a few issues with this behavior:

  • The timestamp might not be in the first 128 characters of the event.
  • The first timestamp that is in the data may not be the one that is needed.
  • The date may be in a different format than the Splunk platform thinks that it is (such as the month/day being switched), although there are some clever checks it does before validating this.
  • Running over 30 different regex pattern searches for each event is a lot of work.
  • The time zone may be wrong, and therefore the event time may be recorded incorrectly.
  • Incorrect timestamps can result in data being aged out prematurely or retained for too long.
  • Wrong timestamps can also cause the Splunk platform to create unduly small index buckets, making searches inefficient.
  • All of this extra work can clog up a heavy forwarder or indexer in the aggregator pipeline and cost more on a workload processing license model.

So the case for configuring timestamps is strong. By applying the best practices below, we can drastically reduce the amount of work the Splunk platform needs to do, and therefore give it more time for other tasks, such as searching, reporting, alerting, and dashboarding.

TIME_PREFIX

This configuration is really useful to the Splunk platform for several reasons, not least because it points to where in the event it will find its most important piece of data: the timestamp. This location could be further than 128 characters into the event, so rather than making the Splunk platform look through the first 128 characters for the right pattern of numbers and letters, TIME_PREFIX points to exactly where to look.

TIME_PREFIX is a regex and has no default value. It should be set to the unique part of the raw event that immediately precedes the timestamp. For example, taking the event snippet above, the time prefix overlaps part of the LINE_BREAKER pattern, so we use this setting:

TIME_PREFIX = Event Date:\s

MAX_TIMESTAMP_LOOKAHEAD

This configuration tells the Splunk platform how far from the beginning of the event, or from the end of the TIME_PREFIX match, to look for the timestamp. The default value is 128 characters. However, this value should always be changed to exactly match the length of the timestamp, so that the Splunk platform has no opportunity to find the wrong one.

In the Worked Example, we would set:

MAX_TIMESTAMP_LOOKAHEAD = 23

Some events can have variable-length timestamps, for instance “3 August 2020 1:12:9.001”. In this case, make sure that this setting is given the largest length possible, which would be 30.
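
As a quick sanity check on these lengths (a throwaway Python snippet, not part of any Splunk configuration; the 30-character figure assumes the longest possible form of the variable-length timestamp above), you can simply measure the candidate timestamps:

print(len("2020-07-21 02:04:54.214"))         # 23 -> MAX_TIMESTAMP_LOOKAHEAD = 23 for the Worked Example
print(len("30 September 2020 11:12:59.001"))  # 30 -> the variable-length case above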

TIME_FORMAT

This configuration is probably the best one for speed improvement. It has no default, and without it, the Splunk platform runs through more than 30 timestamp checks. Specifying the TIME_FORMAT also reduces the chance of the Splunk platform getting the wrong date. TIME_FORMAT uses the strptime syntax. If you do not know strptime, review Splunk Docs Date and time format variables for the syntax. With this setting, you describe each element of the timestamp within the region that TIME_PREFIX and MAX_TIMESTAMP_LOOKAHEAD outline, stating what it actually represents.

In the Worked Example, we would set the value:

TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N

The strptime syntax is case sensitive. For instance, %Y indicates a 4-digit year, but %y is a 2-digit year.
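
If you want to double-check a TIME_FORMAT string against a sample timestamp before deploying it, one rough approach (outside Splunk) is to try it with Python's strptime. Note that Splunk's %3N subsecond variable has no direct Python equivalent, so %f is substituted here; the other variables map one to one:

from datetime import datetime

# Splunk TIME_FORMAT:    %Y-%m-%d %H:%M:%S.%3N
# Python approximation:  %Y-%m-%d %H:%M:%S.%f   (%f happily parses 3-digit subseconds)
sample = "2020-07-21 02:04:54.214"
print(datetime.strptime(sample, "%Y-%m-%d %H:%M:%S.%f"))
# 2020-07-21 02:04:54.214000 -- the format string matches the sample data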

Character sets

The CHARSET field tells the Splunk platform what type of character set the file is written in. The Splunk platform, by default, uses UTF-8 decoding, but this can be incorrect and introduce errors in the interpretation if the document is in a different format, such as UTF-7 or CP874. For more information on encoding, read Configure character set encoding.

In the Worked Example, the data is in UTF-7 format, so we set:

CHARSET = UTF-7

The punct field

The punct field is added to every source type by default in the Splunk platform. It is sometimes useful for highlighting events with outlying patterns in searches, and therefore for spotting unusual, dangerous, or illegal activity. However, it is rarely used, and it can be generated at search time in the searches that need it.

Generating the punct field is extra processing that the Splunk platform does not need to do, and the field is written into your index whether or not you ever use it. Therefore, if you switch it off, you can reduce:

  • Indexing load on your indexers
  • The amount of space each event takes up

Depending on your system, number of source types, indexes, and sizes of events, your deployment performance and data ingestion reductions will vary. However, this is something that the Splunk platform allows for and it should be considered as an optimization.

To switch off the punct field, add the following to the source type stanza in the props.conf:

ANNOTATE_PUNCT = false

At search time, if you find that the punct field is not there and you need it, you can use the eval command's replace function in your search to add punct into your results. However, remember that, because the computation may run over a large number of events, it will require a lot of extra power and be slow:

| eval punct = replace(replace(_raw, "\s", "_"), "\w", "")

Saving your source type

Name it properly

After you start to create your own source types, make sure that you give them sensible names that follow a consistent naming convention. Splunk provides recommendations for the Splunk administrator that follow the overall format:

vendor:product:technology:format

Generally, this is a good method for naming the source types and should be followed. When looking at the Worked Example, the product that generated it was a custom in-house item from Rhododendron Games, running in IIS, so the name and completed definition in the props.conf could be:

[rg:gamesale:iis:webtype]
LINE_BREAKER = ([\n\r]+)\d{2}\sEvent Date:\s\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3}
TRUNCATE = 131
SHOULD_LINEMERGE = false
TIME_PREFIX = Event Date:\s
MAX_TIMESTAMP_LOOKAHEAD = 23
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
CHARSET = UTF-7
ANNOTATE_PUNCT = false
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\n\r]+)\d{2}\sEvent Date:\s\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3}

Place the definition

Next, on what server do you place the props.conf? In the case of a single, all-in-one Splunk server, this answer is easy - you put it on that server and on the forwarders. However, in the case of a distributed Splunk environment, this is a bit trickier. The short answer is that you place it on the following Splunk servers:

  • Search heads
  • Indexers
  • Heavy forwarders (where accompanied by the inputs.conf or on an intermediate heavy forwarder, where it is the first full Splunk installation in the data path)
  • Universal forwarders (where accompanied by the inputs.conf)

The long answer is that the configurations in props.conf can be split up, placing only the elements relevant to the universal forwarders on those servers and the parts needed by the search heads, heavy forwarders, and indexers on theirs. However, because each Splunk installation is clever enough to use only the elements it needs, it is not worth maintaining different configurations for each server. Using the same file everywhere simplifies the administration of the Splunk platform.

Regardless, it is worth knowing where every configuration is used, so that if anything goes wrong on one server, you know where to change it.

Use an app, not system/local

Best practice is to add your props.conf into a custom app that is then deployed to your infrastructure via deployment server, cluster master, or search head deployer. To learn how to create a custom app, see Create a Splunk app and set properties.
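
As a sketch (the app name rg_gamesale_props is hypothetical), the deployed app could look like the layout below, with the source type stanza from earlier sitting in default/props.conf:

$SPLUNK_HOME/etc/apps/rg_gamesale_props/
    default/
        app.conf
        props.conf    (contains the [rg:gamesale:iis:webtype] stanza)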

Next steps

If you need additional help with data onboarding, UK-based Somerford Associates can help. Somerford Associates is an award-winning Elite Partner with Splunk and the largest Partner Practice of Consultants in EMEA. We protect data, demonstrate that it is being managed effectively, and derive greater value by providing real-time insights to support effective decision making. With our specialist knowledge, skills, experience, and strong reputation for enabling digital transformation at scale and at pace, we provide full delivery, including design, implementation, deployment, and support.

In addition, Splunk's documentation resources can help you understand and implement the recommendations in this article.
