Using the DensityFunction algorithm in Machine Learning Toolkit 5.5
You are a user of the Splunk platform looking to enhance your machine learning capabilities by leveraging the new supervised grouping feature of the DensityFunction algorithm in the Machine Learning Toolkit (MLTK) 5.5. This will help you improve the efficiency of your model training and data processing workflows.
Solution
This article shows you how to use the new version of the DensityFunction in MLTK 5.5 by going through a short example using the app_usage.csv dataset that ships with MLTK. You can see what this dataset looks like in the screenshot below.
- Use the DensityFunction with the old approach. The search below structures your data so that you have a timestamp, the app name, and the count associated with the app. It then enriches the data with the day of the week and fits a DensityFunction model on the count field, using the app and day_of_week fields in the by clause.
  | inputlookup app_usage.csv
  | table _time *
  | untable _time app count
  | eval day_of_week=strftime(strptime(_time,"%Y-%m-%d"),"%a")
  | fit DensityFunction count BY "app,day_of_week" into test_df
You can view the model created by this search with the MLTK summary command, for example | summary test_df. In the "Statistics" column of the resulting table, you can see that there are 77 groups, one for each combination of app (11 apps) and day of the week (7 days).
- Run the same model training search, but this time with the supervise_split_by option set to true. This search is identical to the model training search above, except for supervise_split_by=true in the fit command and a new model name.
  | inputlookup app_usage.csv
  | table _time *
  | untable _time app count
  | eval day_of_week=strftime(strptime(_time,"%Y-%m-%d"),"%a")
  | fit DensityFunction count BY "app,day_of_week" supervise_split_by=true into supervised_test_df
Viewing the model trained with the supervised approach, you can see that your data now contains 11 groups instead of the previous 77.
- Check the search completion time. In this example, the old approach took just over 42 seconds, whereas the supervised approach took under 12 seconds, which is more than 3.5 times faster.
The supervised approach is also accessible with the Smart Outlier Detection assistant. On the Learn Data step, you can now select the supervised grouping using a checkbox, as shown in the screenshot below.
How does the supervised grouping work?
The addition of the supervise_split_by option presents a decision point for DensityFunction during model training:
- If set to false, the previous method of running DensityFunction is used, where each combination of categories in the fields included in the BY clause determines the model groups.
- If set to true, a decision tree algorithm determines the model groups based on the categories present in the fields included in the BY clause.
This process is shown in the diagram below.
Using a decision tree involves partitioning the feature space into regions. A fully grown tree is built first with all the features and levels. For the DensityFunction, the features and levels are determined by the fields selected in the BY clause. For example, if you were to detect outliers in a count of logons and select HourOfDay and DayOfWeek in the BY clause, the initial tree would be determined from how the logons value is affected by each hour-of-day and day-of-week combination, for example, Monday at midnight, 1am, 2am, and so on. From this initial tree, the weakest branches are removed to identify the smallest number of real groups in the data. For example, the initial Monday groups at midnight, 1am, and 2am might be so similar that the tree pruning process places them in the same group.
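The grouping idea can be illustrated with a small Python sketch. This is not MLTK's implementation: it is a simplified regression tree that treats each BY-clause combination as one category, orders the categories by their mean value, and stops splitting when a split no longer reduces the squared error by at least min_gain (early stopping rather than growing a full tree and pruning it). The function names, the min_gain parameter, and the logon counts are all hypothetical and exist only for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_groups(samples, min_gain):
    """Group category keys whose values behave alike.

    samples maps a category key, such as (day, hour), to its observed values.
    Returns a list of groups, each a sorted list of category keys.
    """
    # Order categories by mean value so every candidate split is a cut point.
    ordered = sorted(samples, key=lambda k: mean(samples[k]))

    def values_of(keys):
        return [v for k in keys for v in samples[k]]

    def split(keys):
        if len(keys) == 1:
            return [sorted(keys)]
        parent_error = sse(values_of(keys))
        best_gain, best_cut = 0.0, None
        for cut in range(1, len(keys)):
            child_error = sse(values_of(keys[:cut])) + sse(values_of(keys[cut:]))
            gain = parent_error - child_error
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        # Stop when no split is worth keeping: these keys form one group.
        if best_cut is None or best_gain < min_gain:
            return [sorted(keys)]
        return split(keys[:best_cut]) + split(keys[best_cut:])

    return split(ordered)


# Hypothetical logon counts per (DayOfWeek, HourOfDay) combination.
logons = {
    ("Mon", 0): [2, 3, 2],
    ("Mon", 1): [3, 2, 3],
    ("Mon", 2): [2, 2, 3],
    ("Mon", 9): [40, 42, 41],
    ("Mon", 10): [43, 41, 40],
}

groups = grow_groups(logons, min_gain=10)
for group in groups:
    print(group)
```

With this invented data, the five hour-of-day combinations collapse into two groups: the quiet overnight hours and the busy morning hours. A lower min_gain keeps more splits and therefore more groups.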
The end result of the decision tree supervised approach is that the number of groups processed by the DensityFunction algorithm should be smaller than the number of categorical combinations present in the fields in the BY clause. So if you process data by hour of the day and day of the week, you should see fewer than 7 x 24 (168) groups, provided there is enough similarity in some of the different hour-of-day and day-of-week combinations.
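As a quick check of the arithmetic, the upper bound on the number of groups is the number of distinct value combinations across the BY fields. The short Python sketch below computes that bound for the two examples in this article. The max_groups helper and the app_1 to app_11 placeholder names are hypothetical; the real app names come from app_usage.csv.

```python
from itertools import product

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
hours = list(range(24))
apps = [f"app_{i}" for i in range(1, 12)]  # hypothetical placeholder names


def max_groups(*fields):
    """Count every distinct combination of values across the BY fields."""
    return len(set(product(*fields)))


print(max_groups(days, hours))  # day of week x hour of day
print(max_groups(apps, days))   # app x day of week
```

These counts are what you get with supervise_split_by=false when every combination appears in the data. With supervise_split_by=true, the trained model should report fewer groups whenever some combinations behave alike.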
Next steps
These Splunk resources might help you understand and implement this use case:
- Lantern: Alerting on source type volume with machine learning
- Lantern: Automating Know Your Customer continuous monitoring requirements
- Conf Talk: Accelerate your ability to sniff out application exceptions and detect outliers in performance KPIs
- Conf Talk: Augment your security monitoring use cases with Splunk's Machine Learning Toolkit
- Blog: MLOps - Logs, metrics, and traces to improve your machine learning systems