Splunk Lantern

Ingesting non-standard data for law enforcement search warrant returns


When responding to a search warrant, investigators, intelligence analysts, and digital operations specialists need to easily analyze information from multiple data sources to identify criminal activity, GJ-2703D orders, or counterterrorism activity. This use case could also support national security investigations into insider threats, advanced persistent threats (APTs), counterespionage, software patch corruption, and supply chain risk management (SCRM).

The data types that often need to be investigated include MBOX data, chat logs, social media intelligence (SOCMINT), and ZetX cell phone returns, with data in a range of formats such as .xls, .json, .pdf, or .csv. When requesting search warrant information from Internet Service Providers (ISPs), you might contact any one of the hundreds of providers on the ISP List, and the format of the data they return varies widely from one provider to the next.

Alongside issues with file types and data formatting, the amount of data that needs to be investigated can also be overwhelming. For example, you might need to investigate millions of emails from a known spear phisher. You need to be able to quickly exclude messages that carry little information and focus on messages that are more likely to contain something of interest, for example messages of 500 characters or more. You also need to be able to easily find keywords that can indicate criminal activity. For example, the terms Football, Hurricane, or Disneyland could indicate that a money laundering operation is underway.
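The triage described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Splunk platform: the function name triage, the keyword list, and the 500-character threshold are assumptions taken from the examples in the text, and a real investigation would supply its own terms.

```python
import re

# Hypothetical keyword list, taken from the examples in the text.
KEYWORDS = ("football", "hurricane", "disneyland")
MIN_LENGTH = 500  # characters; the threshold suggested in the text


def triage(messages, keywords=KEYWORDS, min_length=MIN_LENGTH):
    """Return (message, matched_keywords) pairs worth a closer look.

    A message is kept if it is at least min_length characters long
    or contains any keyword as a whole word (case-insensitive).
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    kept = []
    for text in messages:
        # Deduplicate and normalize the keywords found in this message.
        hits = sorted({m.lower() for m in pattern.findall(text)})
        if len(text) >= min_length or hits:
            kept.append((text, hits))
    return kept
```

Short messages with no keyword match are dropped, so reviewers see only long messages and messages that contain a flagged term.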

Law enforcement agencies typically have access to methods that can make this process easier. For example, data scientists can develop Python scripts to parse out and pull To and From data fields or other information. MBOX data can be loaded into a third-party MBOX app, such as Thunderbird, for manual tagging and reviewing of information pertinent to the investigation. But it can still be challenging to parse the full range of data types, and reading through files manually is not feasible.
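A script of the kind mentioned above might look like the following sketch, which uses Python's standard mailbox module to pull the From and To fields out of an MBOX file. The function name extract_contacts and the dictionary keys are hypothetical, and the sketch assumes the return is a well-formed MBOX file.

```python
import mailbox
from email.utils import getaddresses


def extract_contacts(mbox_path):
    """Return one dict per message with sender, recipient, and subject data."""
    # create=False raises an error instead of silently creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    rows = []
    try:
        for msg in box:
            # get_all returns every occurrence of a header; str() guards
            # against header objects returned for non-ASCII values.
            from_hdrs = [str(v) for v in msg.get_all("From", [])]
            to_hdrs = [str(v) for v in msg.get_all("To", [])]
            rows.append({
                "from": [addr for _, addr in getaddresses(from_hdrs)],
                "to": [addr for _, addr in getaddresses(to_hdrs)],
                "subject": str(msg.get("Subject", "")),
            })
    finally:
        box.close()
    return rows
```

Each dictionary could then be written out as one line of JSON or CSV so that the result is plain text that can be indexed.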

Required data

Data types are diverse but can include:

  • Social media and messaging account data, for example Facebook, Snapchat, MS Office apps, Instagram, Twitter, Google Gmail, Microsoft Outlook and Skype, Teams, Google Chat, or Parler
  • Mail server data, often including a second email provider so you have differently formatted returns, such as Hotmail, Yahoo, or AOL
  • Microsoft O365 logs
  • Apple iCloud data
  • Spotify or Pandora playlists (in-range IPs for criminal activity)
  • Dark web data
  • 911 proxy data


You can use the Splunk platform to correlate data fields and external contacts (including finding unknown unknowns) and search across large swaths of data for known identifiers. You can also use the Splunk platform to easily investigate conversations, contacts, call timelines, geolocation data, and other data types captured by law enforcement agencies or obtained through search warrant returns.

Indexing ASCII data

Splunk Enterprise can index various types of ASCII data. It can monitor archive files in formats such as .tar, .gz, .zip, and .tgz. Splunk Enterprise can also ingest the metadata associated with Body Worn Cameras (BWC), photos, or videos.

The process for an indexer to obtain data for parsing is:

  1. Install a Splunk universal forwarder on a production device.
  2. Set up the universal forwarder to monitor specific directories on the production device and send only changes to indexers.
  3. The universal forwarder sends changes to the monitored files or directories to the indexers, where information such as host, source, and source type, as well as interesting fields, is extracted. The extracted fields should supply information that might be pertinent to the investigation, for example email accounts, IP addresses, domain names, or C2 servers.
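To illustrate what extracting "interesting fields" means in practice, the sketch below pulls email addresses, IPv4 addresses, and email domains out of a raw event with regular expressions. It is a Python illustration of the idea only and is not how the Splunk platform performs field extraction; the function name and the field names email, ip, and domain are hypothetical.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_fields(event):
    """Return email addresses, IPv4 addresses, and email domains in an event."""
    emails = sorted({m.lower() for m in EMAIL_RE.findall(event)})
    ips = []
    for candidate in IPV4_RE.findall(event):
        try:
            # Reject look-alikes such as 999.1.1.1 that are not valid IPv4.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        if candidate not in ips:
            ips.append(candidate)
    domains = sorted({e.split("@", 1)[1] for e in emails})
    return {"email": emails, "ip": ips, "domain": domains}
```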

Users can then search and analyze the data and create reports, charts, graphs, and alerts.

Indexing non-standard data types

Sometimes the data you want to ingest is not in a supported format, such as a PDF. Before the Splunk platform can read that data, it must be converted. In a *nix environment, the easiest way to convert a PDF to an ASCII format is to download and install a pdftotext application.

To obtain pdftotext:

  • In RedHat/RHEL/Fedora/CentOS, use: yum install poppler-utils
  • In Debian or Ubuntu, use: sudo apt-get install poppler-utils
  • You can also use tools like Xpdfreader.

The process to ingest and monitor this data is:

  1. On your *nix universal forwarder (UF), download and install a pdftotext application.
  2. Create a .sh script to automatically convert all .pdf files within a directory to .txt files.
  3. Set the script to run at a specific interval.
  4. Set up the UF to monitor the directory where the output .txt files are stored.
  5. Set the UF to send the data from the monitored directory to your Splunk indexers.
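The conversion step in this process can be sketched as follows. The text calls for a .sh script; this sketch uses Python instead and shells out to pdftotext, which it assumes is installed and on the PATH. The function name convert_new_pdfs, the directory arguments, and the run parameter (which exists only so the command runner can be swapped out for testing) are hypothetical.

```python
import subprocess
from pathlib import Path


def convert_new_pdfs(src_dir, out_dir, run=subprocess.run):
    """Convert each .pdf in src_dir to a .txt file in out_dir.

    PDFs whose .txt output already exists and is at least as new as
    the PDF are skipped. Returns the list of .txt paths written.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(src.glob("*.pdf")):
        txt = out / (pdf.stem + ".txt")
        if txt.exists() and txt.stat().st_mtime >= pdf.stat().st_mtime:
            continue  # already converted; do not hand the monitor a duplicate
        # pdftotext usage: pdftotext [options] PDF-file [text-file]
        run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
        converted.append(txt)
    return converted
```

A scheduler such as cron could call this function at the chosen interval, with out_dir set to the directory the UF monitors. The modification-time check is one way to avoid converting the same PDF repeatedly; deleting or moving the source PDF after conversion is another.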

Example of non-standard data format ingestion

The screenshots below do not show a script. Instead, they show the process by which a PDF is converted to a text file in a specific location, where the Splunk platform is set to automatically ingest new files. These examples were created on a stand-alone instance rather than in a distributed environment, but the principle is the same.

This is an example of a Splunk platform instance set up to monitor the directory "Test".

[Screenshot: Splunk-PUBSEC-Splunk Data Ingestion-101-TB-web (1).jpg]

This is an example of a command that uses pdftotext to convert a .pdf to a .txt file and send that file to the "Test" directory. In a production environment, this task would be performed by a script set to repeat at intervals. Other steps could be added to the script, for example deleting the source PDF so that the same file is not converted over and over again.

[Screenshot: Splunk-PUBSEC-Splunk Data Ingestion-101-TB-web (2).jpg]

This is an example of searchable data from the newly created "Test" directory. The data in the search results was converted from .pdf to .txt and then picked up by the monitor.

[Screenshot: Splunk-PUBSEC-Splunk Data Ingestion-101-TB-web (3).jpg]

Ensuring compliance

The Criminal Justice Information Services Division (CJIS) of the U.S. Federal Bureau of Investigation (FBI) sets standards for information security, guidelines, and agreements for protecting Criminal Justice Information (CJI). These standards are reflected in the CJIS Security Policy, which describes the appropriate controls to protect the sources, methods, transmission, storage, and access to data.

The Splunk platform is able to support law enforcement agencies in states that have executed a CJIS Information Management Agreement with Splunk. For certain products, Splunk Cloud Platform offers security controls to protect and store Criminal Justice Information (CJI) data through assured controls and workload for Splunk Cloud Platform.

If you are a Splunk Cloud Platform user, note that Splunk Cloud Platform meets the FedRAMP and StateRAMP security standards, helping U.S. federal agencies and their partners drive confident decisions and decisive actions at mission speeds. Agencies can ingest data in real-time and use that same data to address a variety of challenges across various programs and initiatives that span security and IT operations, as well as modernization and mission objectives.

Next steps

When data is ingested and readily searchable, you can combine it with additional input from investigators that is unique to their subject matter, such as tactics, techniques, and procedures (TTPs), or characters of interest in Chinese, Russian, or Farsi, or in MS-13 gang, criminal, or counterterrorism chats. You might seek to add insights such as frequently used cyber actor terms and tradecraft. Other relevant information might also be worth drawing in, such as information from fusion centers, leads, tips, referrals, walk-ins, virtual walk-ins, and sources.

At the end of the process, you can use the Splunk platform to produce reports that are courtroom-friendly and easy for non-technical people to understand.

These additional Splunk resources might help you understand and implement this use case:

For more information on using Splunk software for law enforcement purposes, see Splunk for public safety. You can also contribute to the Splunk law enforcement GitHub repository, or contact Splunk to learn more about Splunk for law enforcement.

Splunk OnDemand Services: Use these credit-based services for direct access to Splunk technical consultants, who offer a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services as part of their license support plan. Engage the ODS team if you require assistance.