Splunk Lantern

Applying Zipf's law in fraud detection

 

Zipf’s law states that in any natural language, the frequency of a word is roughly inversely proportional to its rank in the frequency table. The most common word appears about twice as often as the second most common word, about three times as often as the third, and so on down the list. For example, in English, the most common word is “the” at about 7 percent of usage, followed by “of” at about 3.5 percent. Real data rarely matches these ratios exactly, but the steep downward pattern holds.

This can have implications in cybersecurity. For example, in any corporation, one outbound IP address will have the highest frequency, while the next most frequent outbound IP address will appear roughly half as often. This decline eventually leads to a long tail of rare outbound IP addresses, completing the pattern of Zipf’s law. It is these rare IP addresses that might make for notable events. Why were they used? Did a phishing scam or embedded bot lead to their appearance?

Solution

You can apply Zipf’s law to the question posed above, as well as to many types of data and topics in cybersecurity and fraud detection. 

Investigating aggregate risk scores

Within Splunk products, we often assign risk scores to fraudulent patterns of behavior. If an entity’s aggregate risk score goes over a predetermined threshold, that entity might have engaged in fraudulent activity. Fortunately, most aggregate risk scores do not go over the threshold, which means only a subset of entities needs to be investigated.

Here’s a distribution graph. Notice that the descending pattern in this distribution follows the idea behind Zipf’s law.

[Image: distribution graph of aggregate risk scores, showing a descending pattern]

In this case, investigations should focus on the entities with the highest aggregate risk scores rather than the lowest ones, which are probably benign in a system that accurately measures risk.
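
As a rough sketch, a search like the following could total the risk scores for each entity and keep only the entities that exceed the threshold. The index name risk, the field names risk_score and risk_object, and the threshold of 100 are assumptions for illustration, so substitute whatever your environment uses.

```spl
index=risk ```assumed name of the risk index```
| stats sum(risk_score) AS aggregate_risk_score BY risk_object ```risk_score and risk_object are assumed field names```
| where aggregate_risk_score>100 ```hypothetical threshold; tune it to your data```
| sort 0 - aggregate_risk_score
```

The final sort puts the highest aggregate scores at the top, so the entities most in need of investigation appear first.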

Finding unusual wire transfers

A wire transfer might go from one bank to another bank in a different country. If you count how often each country appears as the recipient country, the counts descend along a curve similar to Zipf’s law. Some countries might rarely appear as the recipient country on any given day.

Countries where wire transfers are expected, due to the nature of the business and its connected entities, will usually receive more than one transfer in a day. If a country receives only one wire transfer in a day, that transfer might be suspicious. This does not mean it is fraud, but it does add to the risk.

In the Splunk platform, we can use the rare command to see this.

index=transactions sourcetype=wire_transfer ```transformation, where clauses, etc.```
| rare destCountry
| sort - count

[Image: search results charting the counts of the rarest wire transfer destination countries, sorted from highest to lowest]

In this example, we use the rare command to count how often the rarest destination countries appear in wire transfers. Because rare returns results from the lowest count to the highest, we sort the results from highest to lowest to stay consistent with our other charts. This doesn’t exactly follow Zipf’s law, but it does show a declining curve that follows the law in spirit.

What is missing here is context for the transaction, such as the account name and the amount. Let’s add that to the search.

index=transactions sourcetype=wire_transfer ```transformation, where clauses, etc.```
| stats count, list(_time) AS Time, values(customer) AS Customer, values(FromAccount) AS FromAccount, values(ToAccount) AS ToAccount, values(action) AS action, values(amount) AS amount BY destCountry
| where count=1
| eval RiskScore_DestCountry=30

[Image: search results listing destination countries with a count of one, along with the transaction context and the assigned risk score]

In this example, we use the stats command to keep the context for the transaction so it can feed an alert or be collected into a risk index. We no longer need rare because counting how often each destination country appears lets us filter for the rarest case, which here is a count of exactly one. Finally, an arbitrary risk score is added to the results of the search so that it can be stored in a risk index for further aggregation of risk scores against the same entity.

This additional risk score, combined with the entity’s other risk scores, might be the one that tips the aggregate risk score over a predetermined threshold, which would indicate probable fraud.
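
One possible way to store these results is the collect command, which writes search results to a summary index. This is only a sketch: the index name fraud_risk is hypothetical, and the index must already exist before the search runs. The other fields come from the wire transfer search described above, shortened here for readability.

```spl
index=transactions sourcetype=wire_transfer
| stats count, values(customer) AS Customer, values(amount) AS amount BY destCountry
| where count=1
| eval RiskScore_DestCountry=30
| collect index=fraud_risk ```hypothetical risk index; create it before running this search```
```

Scheduling a search like this once a day would match the per-day framing used above, although the right schedule depends on your transaction volume.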

Next steps

We have used the pattern of Zipf’s law to help determine risk in our wire transfers, which then helps determine possible fraud when combined with other risky actions. For instance, suppose the same entity flagged for a wire transfer to a rarely used destination country also failed to log in multiple times before succeeding, and then changed their password before initiating the wire transfer.

Each of those actions should generate a risk score. If we add up the risk scores, we would get:

TotalRiskScore=RiskScore_DestCountry + RiskScore_LoginFailures + RiskScore_PasswordChange

This would most likely push the total risk score over the predetermined threshold, indicating a likelihood of fraud.
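
The sum above could be computed with a search along these lines. It is a sketch under several assumptions: a hypothetical risk index named fraud_risk, events that each carry a Customer field plus one of the three risk score fields, and an illustrative threshold of 75.

```spl
index=fraud_risk ```hypothetical risk index```
| stats sum(RiskScore_DestCountry) AS RiskScore_DestCountry, sum(RiskScore_LoginFailures) AS RiskScore_LoginFailures, sum(RiskScore_PasswordChange) AS RiskScore_PasswordChange BY Customer
| fillnull value=0 RiskScore_DestCountry RiskScore_LoginFailures RiskScore_PasswordChange
| eval TotalRiskScore=RiskScore_DestCountry + RiskScore_LoginFailures + RiskScore_PasswordChange
| where TotalRiskScore>=75 ```hypothetical threshold```
```

The fillnull step matters because an entity with no events for one of the risk types would otherwise have a null value for that field, and the eval addition would then produce no total for that entity.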

In Splunk Enterprise Security, the use of risk-based alerting would make this a streamlined process. In Splunk Enterprise or Splunk Cloud Platform, it can still be done, but with additional Splunk commands to manage a risk index and to create an alert. In either case, a SOAR product could initiate further actions for the alert, one of which would be to block the wire transfer from happening. Hence, our approach for adding in risk factors using rare events leads to better fraud detection and prevention.

Finally, automatic threat intelligence can be applied via SOAR products to the least frequently occurring IP addresses mentioned in our introduction to see if they are a cause for concern. 
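
The rare outbound IP addresses themselves could be surfaced with a search similar to the wire transfer example. The following is a minimal sketch, assuming a hypothetical index named network, a sourcetype named firewall, and a dest_ip field that holds the outbound destination address. It ranks destinations by frequency, adds the count that an idealized Zipf curve would predict for each rank, and keeps only the addresses seen once.

```spl
index=network sourcetype=firewall ```hypothetical index and sourcetype```
| stats count BY dest_ip ```dest_ip is an assumed field name```
| sort 0 - count
| streamstats count AS rank
| eventstats max(count) AS top_count
| eval zipf_expected=round(top_count/rank)
| where count=1
```

The zipf_expected column is only a reference point: real traffic will not match the curve exactly, and the count=1 cutoff is as arbitrary here as it is in the wire transfer search.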

Additional resources

This article came from a previously published blog. You might be interested in the following additional resources for financial use cases: