Using the DensityFunction algorithm in Machine Learning Toolkit 5.5


You are a user of the Splunk platform looking to enhance your machine learning capabilities by leveraging the new supervised grouping feature of the DensityFunction algorithm in the Machine Learning Toolkit (MLTK) 5.5. This will help you improve the efficiency of your model training and data processing workflows.

Solution

This article shows you how to use the new version of the DensityFunction in MLTK 5.5 by going through a short example using the app_usage.csv dataset that ships with MLTK. You can see what this dataset looks like in the screenshot below.

Use the DensityFunction with the old approach, without supervised grouping. The search below structures your data so that each row has a timestamp, the app name, and the count associated with that app. It then enriches the data with the day of the week and fits a DensityFunction model on the count field, using the app and day_of_week fields in the BY clause.
```
| inputlookup app_usage.csv
| table _time *
| untable _time app count
| eval day_of_week=strftime(strptime(_time,"%Y-%m-%d"),"%a")
| fit DensityFunction count BY "app,day_of_week" into test_df
```
The model created from this search can be viewed using the search below. The "Statistics" column of the resulting table shows 77 groups, one for each combination of app (11 apps) and day of the week (7 days).
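The viewing search itself is not reproduced in this text. A minimal sketch, assuming the MLTK `summary` command is what is used to inspect the saved model, might look like this:

```
| summary test_df
```

The exact columns returned can vary by algorithm and MLTK version, so treat this as a starting point rather than the definitive search.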
Run the same model training search, but this time with the supervise_split_by option set to true. This search is identical to the model training search above, except for the supervise_split_by=true in the fit command and a new model name.
```
| inputlookup app_usage.csv
| table _time *
| untable _time app count
| eval day_of_week=strftime(strptime(_time,"%Y-%m-%d"),"%a")
| fit DensityFunction count BY "app,day_of_week" supervise_split_by=true into supervised_test_df
```
Viewing the model trained with the supervised approach, you can see that it now contains 11 groups instead of the previous 77.
Check the search completion time. In this example, the old approach took just over 42 seconds, whereas the supervised approach took under 12 seconds, making the search roughly 3.5 times faster.

The supervised approach is also accessible with the Smart Outlier Detection assistant. On the Learn Data step, you can now select the supervised grouping using a checkbox, as shown in the screenshot below.

How does the supervised grouping work?

The addition of the supervise_split_by option presents a decision point for DensityFunction during model training:

If set to false, the previous method of running DensityFunction is used, where each combination of categories in the fields included in the BY clause determines the model groups.
If set to true, a decision tree algorithm determines the model groups based on the categories present in the fields included in the BY clause.

This process is shown in the diagram below.

Using a decision tree involves partitioning the feature space into regions. A fully grown tree is built first, using all the features and levels. For DensityFunction, the features and levels come from the fields selected in the BY clause. For example, if you want to detect outliers in a count of logons and select HourOfDay and DayOfWeek in the BY clause, the initial tree reflects how the logons value varies across each hour-of-day and day-of-week combination, such as Monday at midnight, Monday at 1am, Monday at 2am, and so on. The weakest branches of this initial tree are then pruned to find the smallest number of distinct groups in the data. For example, the initial Monday groups at midnight, 1am, and 2am might be so similar that pruning places them in the same group.
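As a rough sketch of the logon example above, a training search might look like the one below. The index, sourcetype, and action filter in the base search and the model name logons_supervised_df are hypothetical placeholders, not values from this article; only the fit syntax follows the app_usage.csv example shown earlier.

```
index=security sourcetype=auth action=success
| timechart span=1h count AS logons
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| fit DensityFunction logons BY "HourOfDay,DayOfWeek" supervise_split_by=true into logons_supervised_df
```

Replace the first line with whatever search returns your own logon events. The hourly span is an assumption chosen so that each row maps to one hour-of-day and day-of-week combination.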

The end result of the supervised decision tree approach is that the number of groups processed by the DensityFunction algorithm should be smaller than the number of category combinations present in the fields in the BY clause. So if you process data by hour of the day and day of the week, you should see fewer than 7 × 24 = 168 groups, provided that some of the hour-of-day and day-of-week combinations are similar enough to be merged.
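To see how many category combinations your data actually contains before training, you can count them directly. The sketch below reuses the data preparation from the app_usage.csv example and counts the distinct app and day_of_week pairs. The field name combinations is an illustrative choice.

```
| inputlookup app_usage.csv
| table _time *
| untable _time app count
| eval day_of_week=strftime(strptime(_time,"%Y-%m-%d"),"%a")
| stats count BY app day_of_week
| stats count AS combinations
```

For this dataset the result should match the number of groups reported for the unsupervised model. A supervised model trained on the same fields should report that many groups or fewer.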

Next steps

These Splunk resources might help you understand and implement this use case: