Checking the quality of your data sources
There might be times when you want to check the quality of your data sources to ensure that your source types are parsing properly. Incorrect line breaking, timestamp parsing problems, and aggregation problems can cause issues with searches and make it harder for you to get value from your data.
Ideally, check your data in a test instance or environment before implementing changes in production.
Searching on your data
You can check whether your data is being parsed properly by searching on it, using the index and source type that apply to your data source.
Run the following search in your environment, with a time range of at least the last 15 minutes. It is a modified version of the search found in the Splunk Monitoring Console under Indexing > Inputs > Data Quality, adjusted so that you can run it on any of your search heads:
index=_internal source=*splunkd.log* splunk_server=* (log_level=ERROR OR log_level=WARN) (component=AggregatorMiningProcessor OR component=DateParserVerbose OR component=LineBreakingProcessor)
| rex field=event_message "Context: source(::|=)(?<context_source>[^\\|]*?)\\|host(::|=)(?<context_host>[^\\|]*?)\\|(?<context_sourcetype>[^\\|]*?)\\|"
| eval data_source=if((isnull(data_source) AND isnotnull(context_source)),context_source,data_source),
       data_host=if((isnull(data_host) AND isnotnull(context_host)),context_host,data_host),
       data_sourcetype=if((isnull(data_sourcetype) AND isnotnull(context_sourcetype)),context_sourcetype,data_sourcetype)
| stats count(eval(component=="LineBreakingProcessor" OR component=="DateParserVerbose" OR component=="AggregatorMiningProcessor")) AS total_issues
        dc(data_host) AS "Host Count"
        dc(data_source) AS "Source Count"
        count(eval(component=="LineBreakingProcessor")) AS "Line Breaking Issues"
        count(eval(component=="DateParserVerbose")) AS "Timestamp Parsing Issues"
        count(eval(component=="AggregatorMiningProcessor")) AS "Aggregation Issues"
        by data_sourcetype
| sort - total_issues
| rename data_sourcetype AS Sourcetype, total_issues AS "Total Issues"
The results show the number of line breaking issues, timestamp parsing issues, and aggregation issues for each of your source types.
You can drill down into the data by clicking on one of the numbers in the columns.
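To investigate a single source type manually, you can narrow the same internal data to one component and one source type. A minimal sketch, where your_sourcetype is a hypothetical placeholder for a source type flagged in the results:

index=_internal source=*splunkd.log* (log_level=ERROR OR log_level=WARN) component=DateParserVerbose "your_sourcetype"

Swap component=DateParserVerbose for LineBreakingProcessor or AggregatorMiningProcessor to inspect the other issue types; the raw event messages describe exactly what the parser could not handle.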
Adjusting settings in props.conf
To correctly parse your data, Splunk recommends that you always have the following settings in your props.conf:
On Splunk Enterprise:
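The exact stanza was not preserved here, but a typical sketch of the commonly recommended parsing settings, using a hypothetical sourcetype name and example values you would adapt to your own data, looks like this:

[my_sourcetype]
# Break events explicitly rather than relying on line merging
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
# Tell Splunk exactly where and how to find the timestamp
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 25

Defining these explicitly prevents Splunk from guessing at line breaks and timestamps, which is the usual source of the LineBreakingProcessor, DateParserVerbose, and AggregatorMiningProcessor warnings surfaced by the search above.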
On the universal forwarder:
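Again, the original stanza is not preserved; a typical sketch, with the same hypothetical sourcetype name, enables event breaking on the forwarder so data is distributed cleanly across indexers:

[my_sourcetype]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)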
These resources might help you understand and implement this guidance:
- TekStream blog: Data onboarding in Splunk
- Product tip: Improving data pipeline processing in Splunk Enterprise
Want to learn more about improving data quality? Contact us today! TekStream accelerates clients’ digital transformation by navigating complex technology environments with a combination of technical expertise and staffing solutions. We guide clients’ decisions, quickly implement the right technologies with the right people, and keep them running for sustainable growth. Our battle-tested processes and methodology help companies with legacy systems get to the cloud faster, so they can be agile, reduce costs, and improve operational efficiencies. And with hundreds of deployments under our belt, we can guarantee on-time and on-budget project delivery. That’s why 97% of clients are repeat customers.