Measuring the quality of publicly available synthetic IDS datasets

A comparative study

Abstract

Year after year, the number of network intrusions and the costs associated with them rise. Research in this area is therefore of high importance and provides valuable insight into how to prevent or counteract intrusions. Machine learning algorithms seem to be a promising answer for automated network intrusion detection, as their reported scores on benchmark datasets often exceed 99%. Yet even with these results, the problem does not appear to be solved, as the same models do not reach similar scores on real, live network traffic. This points to a problem with the datasets.

In this work, we explored six recent network intrusion datasets and measured their quality both from a design perspective and through a practical binary classification approach. Furthermore, we examined whether complex classification models are needed on these datasets, as research has increasingly shifted towards black-box models and away from white-box models.

Through a literature study on dataset quality metrics, we found a general lack of agreement amongst researchers on what makes a dataset good and realistic. We also identified areas in which datasets are often lacking and provide concrete advice on how to improve their quality.

For our practical classification approach, we built a general classification pipeline using a Random Forest. Three feature sets were tested, two of which form an ablation study measuring the effect of trigrams on the classification score (see the sketch below). The general classification model reaches good results (>90%) on all datasets and even state-of-the-art results on three of them. The ablation study showed that trigrams have a positive effect on classification performance across the datasets. Our white-box approach performed on par with or better than most black-box techniques. We conclude that black-box models are unnecessary in this problem context and that research should shift back to white-box approaches.
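For illustration, the following is a minimal sketch of such a pipeline in Python with scikit-learn, not the pipeline from the report itself. It assumes a CSV dataset with a binary "label" column (0 = benign, 1 = attack) and a free-text "payload" column from which character trigrams are extracted; the file name, column names, and hyperparameters are hypothetical.

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ids_dataset.csv")  # hypothetical dataset file
y = df["label"]
numeric = df.select_dtypes("number").drop(columns=["label"])

def build_features(use_trigrams: bool):
    """Return a sparse feature matrix, with or without trigram features."""
    X = csr_matrix(numeric.values)
    if use_trigrams:
        # Character-level trigrams over the payload text: the second arm of the ablation.
        vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=5000)
        X = hstack([X, vec.fit_transform(df["payload"].fillna(""))]).tocsr()
    return X

for use_trigrams in (False, True):  # the two arms of the ablation study
    X = build_features(use_trigrams)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(X_tr, y_tr)
    print(f"trigrams={use_trigrams}: F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")

Comparing the two F1 scores mirrors the ablation described above: any gain in the trigram run quantifies the contribution of the trigram features on that dataset.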

Next, we attempted to link the quality of a dataset's methodology to the difficulty of obtaining state-of-the-art classification results on it. Apart from the complexity of the attack vectors and the variety of benign traffic, we found no further properties that define this relationship.

Finally, this work started from the assumption that real, live network traffic is more complex, and therefore more difficult to classify, than most available datasets. Newer datasets do show improvements over older ones, and our classification results corroborate the validity of this assumption.
