Splunk Lantern

Identifying fabricated data by using the Ahlstrom conjecture

 

When people are asked to create random data, they often use non-random patterns. For instance, if you ask someone to quickly write down a "random" 10-digit number, they will likely avoid using the same digit twice in a row, such as 44 or 99. This is because, to the human mind, repetition doesn't feel random.

This psychological quirk can be a powerful tool. What if you could analyze sequences of numbers in your data to see if they were created by a human trying to appear random, rather than being genuinely generated by a system? This could be a novel method for flagging fabricated data.

Data required

Financial data

About the Ahlstrom conjecture

The Ahlstrom conjecture suggests that in naturally occurring or truly random data, consecutive identical digits (for example, "33", "77") are quite common. However, in data fabricated by a human, these repeated digits appear far less frequently. The person creating the fake data, in an attempt to make it look random, will subconsciously avoid these repetitions.

The theory states that for any given sequence of 10 random digits, the probability of it containing at least one pair of identical consecutive digits is approximately 65%. Therefore, if you analyze a large set of legitimate data, you should expect to find that about two-thirds of your numerical sequences contain repeated digits. A significant deviation below this baseline could indicate that the data was manually created and is potentially fraudulent.
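The baseline can be checked with a short calculation. This is a sketch that assumes each digit is drawn independently and uniformly from 0 to 9, which real identifiers might not satisfy. The symbol n for the sequence length is introduced here for illustration only. A sequence of n digits has n − 1 adjacent pairs, and each pair differs with probability 9/10:

```latex
P(\text{no repeat in } n \text{ digits}) = \left(\tfrac{9}{10}\right)^{n-1},
\qquad
P(\text{at least one repeat}) = 1 - \left(\tfrac{9}{10}\right)^{n-1}

\text{For } n = 10: \quad 1 - 0.9^{9} \approx 1 - 0.387 = 0.613
```

Under these assumptions the figure is about 61%, slightly below the 65% quoted above. This is one more reason to measure the baseline on your own data rather than rely on a fixed number.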

In fraud detection, applying this technique is a powerful complement to using Benford's law. While Benford's law analyzes the distribution of first digits, the Ahlstrom conjecture analyzes the internal structure of numbers, targeting the psychology of a human fraudster.

Solution

After establishing a baseline for your own data, you can use the eval command with regular expressions to detect the presence of consecutive identical digits and stats to calculate the distribution.
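Before pointing the search at real data, it can help to sanity-check the logic against simulated IDs. The following is a minimal sketch, not part of the original procedure. It assumes that building each ID from two zero-padded five-digit draws of random() gives digits that are close enough to uniform for this purpose (the modulo introduces a very small bias). The event count of 100000 is an arbitrary choice.

```spl
| makeresults count=100000
| eval left_half = random() % 100000, right_half = random() % 100000
| eval transaction_id = printf("%05d%05d", left_half, right_half)
| eval has_repeat = if(match(transaction_id, "(\d)\1"), "Yes", "No")
| stats count BY has_repeat
| eventstats sum(count) AS total
| eval percentage = round(count*100/total, 2)
| fields - total
```

In the regular expression, (\d) captures one digit and \1 requires the same digit to follow immediately. If the simulated digits are reasonably uniform, the "Yes" percentage should land near 61%, the value of 1 − 0.9^9 for independent digits.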

Artificial sample data (Potential fraud)

Let's start by running a query against a dataset that was intentionally created to look "random" by a human, and therefore has very few repeated digits.

| makeresults count=10
| streamstats count
| eval transaction_id = case(
    count=1, "4829104721",
    count=2, "3810472912",
    count=3, "9821046284",
    count=4, "1298371092",
    count=5, "5837201839",
    count=6, "8134893458",
    count=7, "7365926450",
    count=8, "8270562374",
    count=9, "7903490472",
    count=10, "3413850235"
  )
| fields - count
| eval has_repeat = if(match(transaction_id, "(\d)\1"), "Yes", "No")
| stats count BY has_repeat
| eventstats sum(count) AS total
| eval percentage = round(count*100/total, 2)
| fields - total

0% of these fabricated transaction IDs contain a repeated consecutive digit. This is a strong indicator that the data was manually created to avoid patterns, directly opposing what we'd expect from natural data.
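To gauge how unusual this result is, you can estimate the chance of seeing it in genuinely random IDs. This sketch assumes the ten IDs are independent of each other and that their digits are independent and uniform. The symbol p_0 is introduced here for illustration only.

```latex
p_0 = P(\text{one 10-digit ID has no repeat}) = 0.9^{9} \approx 0.387

P(\text{all 10 IDs have no repeat}) = p_0^{10} = 0.9^{90} \approx 7.6 \times 10^{-5}
```

Under these assumptions, fewer than one in 10,000 random samples of ten IDs would show no repeats at all. A sample of ten is still small, so a larger dataset gives a more reliable signal.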

Sample real-world data

Now, let's run the same query against a more realistic, machine-generated dataset where natural repetitions are expected to occur.

| makeresults count=10
| streamstats count
| eval transaction_id = case(
    count=1, "8667894929",
    count=2, "8949298949",
    count=3, "4117654517",
    count=4, "1881615806",
    count=5, "6740213173",
    count=6, "7772743758",
    count=7, "1422022321",
    count=8, "2989753760",
    count=9, "2227449921",
    count=10, "8772307006"
  )
| fields - count
| eval has_repeat = if(match(transaction_id, "(\d)\1"), "Yes", "No")
| stats count BY has_repeat
| eventstats sum(count) AS total
| eval percentage = round(count*100/total, 2)
| fields - total

This result, with 70% of records (7 of 10) containing repeated digits, is much closer to the theoretically expected 65%. This distribution is more indicative of authentic, randomly generated data.

Next steps

The Ahlstrom conjecture provides another data point for risk scoring or fraud detection models. By establishing a baseline for what is normal in your environment, you can create alerts that trigger when the percentage of transactions with repeated digits drops significantly. A sudden dip could indicate an attempt to inject fabricated records into your systems.
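An alert along these lines could start from a scheduled search like the sketch below. The index name example_transactions, the sourcetype example_payments, and the field name transaction_id are hypothetical placeholders, so substitute your own. The thresholds (at least 100 transactions per day, and fewer than 50% containing a repeat) are illustrative assumptions and should be replaced with values derived from your measured baseline.

```spl
index=example_transactions sourcetype=example_payments earliest=-30d@d latest=@d
| eval has_repeat = if(match(transaction_id, "(\d)\1"), 1, 0)
| bin _time span=1d
| stats count AS total, sum(has_repeat) AS with_repeat BY _time
| eval pct_repeat = round(with_repeat*100/total, 2)
| where total >= 100 AND pct_repeat < 50
```

Each returned row is a day on which the share of IDs with repeated digits fell below the chosen threshold. The minimum-volume condition is there to keep low-traffic days, where the percentage is noisy, from triggering the alert.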

To further advance your use cases, the Splunk Essentials for the Financial Services Industry app helps you automate the searches to detect financial crime. The app also provides more insight on how searches can be applied in your environment, how they work, the difficulty level, and what data can be valuable to run them successfully.

The Splunk App for Fraud Analytics provides Splunk Enterprise Security users a number of other fraud detection solutions for financial services such as account takeover and new account abuse.

If you have questions about applying this technique in your environment, you can reach out to your Splunk account team or representative for comprehensive advice and assistance. You can contact your account team through the Contact Us page.

For more in-depth support, Splunk OnDemand Services is a credit-based offering that gives you direct access to Splunk technical consultants for a variety of technical services from a pre-defined catalog. Most customers have OnDemand Services included in their license support plan. Contact the ODS team at ondemand@splunk.com if you would like assistance.

In addition, these resources might help you understand and implement this guidance: