Splunk Lantern

Identifying indicators of fraud using geometric principles

 

You can use percentiles to look for indicators of fraud. In the Splunk platform, you can use the eventstats command to find the 90th percentile of any numerical variable across all events. In theory, if a variable’s value is above the 90th percentile, it is an outlier. Furthermore, if two variables of interest are both above the 90th percentile of their respective distributions, the chance that the event is fraudulent increases.

For example, suppose we have financial transaction data containing the total amount each customer transferred and the number of transfers they made within a day. If a customer is above the 90th percentile for both the total amount transferred and the transfer frequency, this looks like suspicious behavior and should be flagged as such. In SPL, this would look like the following:

index=transactions sourcetype=transfers
| eventstats perc90(Amount) AS percAmount perc90(Frequency) AS percFrequency
| where Amount>percAmount AND Frequency>percFrequency
| eval RiskScore_HighFreqAmount=30
| table Customer Amount Frequency percAmount percFrequency RiskScore_HighFreqAmount


We added a risk score to the end of the search to send to a risk index for further processing. The risk score you assign should be based on your own usage and normalization techniques according to what suits your use case. In this example, all customers who are over the 90th percentile for amount and frequency are flagged.
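For readers who want to check the logic outside the Splunk platform, the following is a minimal Python sketch of the same two-percentile rule. The `percentile` helper uses a simple nearest-rank definition, which is an assumption and may not match the value that perc90 returns in SPL; the function names and the sample events are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; SPL's perc90 may differ slightly)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_high_amount_and_frequency(events, pct=90, risk_score=30):
    """Return events that are above the percentile cutoff in BOTH variables."""
    perc_amount = percentile([e["Amount"] for e in events], pct)
    perc_frequency = percentile([e["Frequency"] for e in events], pct)
    flagged = []
    for e in events:
        if e["Amount"] > perc_amount and e["Frequency"] > perc_frequency:
            # Attach the risk score, as the eval in the search does.
            flagged.append({**e, "RiskScore_HighFreqAmount": risk_score})
    return flagged


# Hypothetical sample data: 20 customers with steadily increasing activity.
events = [
    {"Customer": f"c{i}", "Amount": 100.0 * i, "Frequency": i}
    for i in range(1, 21)
]
flagged = flag_high_amount_and_frequency(events)
print([e["Customer"] for e in flagged])
```

Only customers that exceed both cutoffs at once are returned, which mirrors the where clause in the search.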

Because the amount can be orders of magnitude larger than the frequency value, we can use the log function, which defaults to base 10 in SPL, to bring the amount to a scale comparable to the frequency.

index=transactions sourcetype=transfers
| eval Amount=log(Amount)
| eventstats perc90(Amount) AS percAmount perc90(Frequency) AS percFrequency
| where Amount>percAmount AND Frequency>percFrequency
| eval RiskScore_HighFreqAmount=30
| table Customer Amount Frequency percAmount percFrequency RiskScore_HighFreqAmount


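Now, the Amount field is at a more reasonable magnitude, but the results are the same: the logarithm preserves the ordering of the values, so the same events fall above the 90th percentile. Let's find a better way to look for an outlier.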
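The reason the results do not change can be checked with a small Python sketch. It assumes a nearest-rank 90th percentile (not necessarily what perc90 computes) and uses made-up amounts; the point is only that a monotonic transform such as a base-10 logarithm keeps every value on the same side of the cutoff.

```python
import math


def above_p90(values):
    """Indexes of the values that exceed the nearest-rank 90th percentile."""
    ordered = sorted(values)
    cutoff = ordered[math.ceil(90 * len(ordered) / 100) - 1]
    return [i for i, v in enumerate(values) if v > cutoff]


# Hypothetical transfer amounts.
amounts = [120.0, 950.0, 15.0, 48000.0, 3300.0, 710.0, 60.0, 2500.0, 89000.0, 400.0]
log_amounts = [math.log10(a) for a in amounts]

# The same positions are flagged before and after the log transform.
print(above_p90(amounts) == above_p90(log_amounts))
```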

Solution

Fraud detection is based on outliers because they have a low probability of occurring. If multiple variables fall in a very high percentile at the same time, which should not happen in normal situations, this could be an indication of fraud. Another way to calculate this is to multiply the influencing variables in the dataset together to form the measure of an N-dimensional shape. An extremely high value in any one variable then shows up in the product, because it influences the final result.

Two variables

We know the area of a rectangle is length times width. What if we view our two variables as measurements, so that multiplying them produces a virtual area? In this case, we multiply log(Amount) by Frequency. Let’s continue with our example and find the minimum area among the events that exceed the 90th percentile of both variables. We can use this search to accomplish this:

index=transactions sourcetype=transfers
| eval Amount=log(Amount)
| eventstats perc90(Amount) AS percAmount perc90(Frequency) AS percFrequency
| where Amount>percAmount AND Frequency>percFrequency
| eval area = Amount * Frequency
| stats min(area) AS min_area


The search returns 20.87, the smallest virtual area among the events whose two variables are both above the 90th percentile, so every event that the percentile rule flags as an outlier has an area of at least that size. Can we then assume that any combination of Frequency * Amount greater than 20.87 should be considered suspicious, because our training dataset told us this area involved values above the 90th percentile of each variable?
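The threshold derivation in the search above can be sketched in Python as follows. This is an illustration under stated assumptions: the nearest-rank `percentile` helper, the function name `min_area_threshold`, and the sample events are all hypothetical, and the resulting number will not be 20.87 because the article's dataset is not reproduced here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; SPL's perc90 may differ slightly)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def min_area_threshold(events, pct=90):
    """Smallest log10(Amount) * Frequency among events above both cutoffs."""
    logged = [(math.log10(e["Amount"]), e["Frequency"]) for e in events]
    perc_amount = percentile([a for a, _ in logged], pct)
    perc_frequency = percentile([f for _, f in logged], pct)
    areas = [a * f for a, f in logged if a > perc_amount and f > perc_frequency]
    # None signals that no event exceeded both cutoffs in this training set.
    return min(areas) if areas else None


# Hypothetical training data.
events = [{"Amount": 100.0 * i, "Frequency": i} for i in range(1, 21)]
threshold = min_area_threshold(events)
print(threshold)
```

Returning `None` when nothing exceeds both cutoffs is a design choice for the sketch; a small or poorly aligned training set may simply not contain such events.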

If so, we no longer need the 90th percentiles to find possible fraud. We can dismiss any virtual area of 20.87 or less and assign a risk score to any combination of Frequency * Amount that is greater than that number. This heuristic is different from the previous one, but it is easier to calculate, and it evaluates the two variables together instead of testing each one against its own cutoff. Our new search is as follows:

index=transactions sourcetype=transfers
| eval Amount=log(Amount)
| eval area=Amount * Frequency
| where area > 20.87
| eval RiskScore_HighFreqAmount=30
| table Customer Amount Frequency area RiskScore_HighFreqAmount
| sort - area


The first thing to notice is that the search is simpler: any event whose virtual area is greater than the threshold is considered suspicious. The next thing to notice is that one more customer (English) is listed here. Their original measurements were not both above the 90th percentile, but the combination of the two variables is enough to exceed our threshold.

Why bother to do this? Think about the case where one variable is near the very top percentile while the other is at the 89th percentile. If we simply checked whether both variables were above the 90th percentile, which in this case they are not, we would miss a potential outlier such as customer English. However, the product of the two variables easily surpasses our threshold for a risk indicator. This makes us more flexible in discovering possible fraud, because one variable may fall just short of its threshold while the other easily surpasses it.
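A short Python sketch makes the difference between the two rules concrete. The 20.87 threshold comes from the article; the percentile cutoffs and the customer's values are hypothetical numbers chosen to illustrate a near miss, not figures from the article's dataset.

```python
import math

AREA_THRESHOLD = 20.87  # minimum virtual area found earlier in the article


def flagged_by_percentiles(amount, frequency, perc_amount, perc_frequency):
    """Original rule: both variables must exceed their own cutoff."""
    return math.log10(amount) > perc_amount and frequency > perc_frequency


def flagged_by_area(amount, frequency, threshold=AREA_THRESHOLD):
    """Area rule: the product of log10(amount) and frequency must exceed the threshold."""
    return math.log10(amount) * frequency > threshold


# Hypothetical cutoffs: log10(amount) of 4.0 and 5 transfers per day.
# Hypothetical customer: amount just under its cutoff, frequency above its cutoff.
amount, frequency = 9000.0, 6
print(flagged_by_percentiles(amount, frequency, perc_amount=4.0, perc_frequency=5))
print(flagged_by_area(amount, frequency))
```

With these assumed values the percentile rule misses the customer because the logged amount sits just below its cutoff, while the area rule flags them.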

Three variables

What if we used three variables instead of two? We’ll continue to use the wire transfer data but also add a suspicious country score. All countries will get an initial score of 1 and if they are suspicious, they will get a score of 2. A really suspicious country will get a score of 3.

Our formula will now be amount * Frequency * suspicious, where amount is the log of the transferred amount. This computes a virtual cubic volume. If the virtual cubic volume is over a predefined threshold, the entity should get a risk score, just as before.

In the next SPL example, we artificially provide a suspicious risk rating using a case statement for each country in the dataset with a default rating of 1. 

index=transactions sourcetype=wire_transfer
| eval amount=log(amount)
| iplocation destIP
| rename Country as DestCountry
| eval suspicious = case(DestCountry=="United States", 1, DestCountry=="Ghana", 2, DestCountry=="North Korea", 3, 1=1, 1)
| eval volume=amount * Frequency * suspicious
| where volume > 40
| eval RiskScore_Volume=30
| table customer, suspicious, amount, Frequency, DestCountry, volume, RiskScore_Volume
| sort - volume


The new things in this search are the use of iplocation to find the country name of the destination IP address, the use of the case function to assign scores to countries, and finally using a virtual volume to look for suspicious behavior. In deployment, instead of using a case function, it is better to use a lookup to determine the risk rating as there are over 200 countries in the world.
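The lookup idea can be sketched in Python with a dictionary standing in for the lookup table. The three country scores are the illustrative ones from the case statement; the default score of 1 follows the article's description, and the function names, field names, and sample events are assumptions made for this sketch.

```python
import math

# Stand-in for a lookup table; only the article's illustrative entries are listed.
COUNTRY_SCORES = {"United States": 1, "Ghana": 2, "North Korea": 3}
DEFAULT_SCORE = 1  # every country not listed gets the initial score of 1


def volume(amount, frequency, dest_country):
    """Virtual volume: log10(amount) * frequency * country score."""
    suspicious = COUNTRY_SCORES.get(dest_country, DEFAULT_SCORE)
    return math.log10(amount) * frequency * suspicious


def flag_by_volume(events, threshold=40.0, risk_score=30):
    """Return events whose volume exceeds the threshold, largest volume first."""
    flagged = []
    for e in events:
        v = volume(e["amount"], e["Frequency"], e["DestCountry"])
        if v > threshold:
            flagged.append({**e, "volume": v, "RiskScore_Volume": risk_score})
    return sorted(flagged, key=lambda e: e["volume"], reverse=True)


# Hypothetical wire transfers.
events = [
    {"customer": "a", "amount": 1000.0, "Frequency": 5, "DestCountry": "Ghana"},
    {"customer": "b", "amount": 10000.0, "Frequency": 5, "DestCountry": "North Korea"},
    {"customer": "c", "amount": 1000.0, "Frequency": 30, "DestCountry": "France"},
]
flagged = flag_by_volume(events)
print([e["customer"] for e in flagged])
```

Keeping the scores in a table rather than in code means new countries can be rated without editing the scoring logic, which is the same motivation for using a lookup in the search.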

The threshold of 40 is arbitrary. We could have used the percentile method from above, selecting events where Frequency>p_Frequency and suspicious>=p_suspicious and amount>p_amount, to help compute the threshold, but the sample dataset used here did not contain events with all three variables in the highest percentiles at the same time. This threshold should be fine-tuned over time with more training data. Using three variables instead of two could lead to fewer false positives, if a proper threshold is used.

Four variables

Could we take this one step further and multiply four variables instead of three? Mathematically, any number of dimensions can be used. However, beyond four dimensions it becomes difficult to tell whether any one variable is driving the result without resorting to machine learning to work out how each variable correlates with the others.

Next steps

Since multiple variables are combined into a single number that can be compared to a predetermined threshold, a question comes up: are risk scores even needed, or can a judgment of possible fraud be made on the spot? This depends on the efficacy of each variable for predicting fraud. A large training set of real-world data should be used to see at which point false positives or false negatives occur too often.

The best approach is to continue attaching a risk score to each result and adding it to a risk index for further summation to accurately predict fraud. After a few weeks of testing with real data, once you understand what works and what does not, risk scores may be abandoned for some use cases where we are certain it is fraud, because a judgment of fraud can be made immediately from the multiplication of variables.

For instance, if someone is withdrawing thousands of dollars from two different ATMs at the same time, we can be certain this is fraud. Multiplying high-percentile variables as indicators of fraud is an implicit way to calculate risk. In the traditional approach, each variable would be part of a rule, and the rule would be given a risk score because of the variable’s outlier value. Then, all of the entity’s risk scores would be summed and compared to a threshold to determine fraud. In contrast, the virtual geometric area or volume method described here does the same thing by implicitly using the outlier values of variables to ascertain risk all at once, making the approach simpler.

Additional resources

This article came from a previously published blog. You might be interested in the following additional resources for financial use cases: