Machine learning is recognised as a relevant approach for detecting attacks and other anomalies in network traffic. However, there are still no suitable network datasets that would enable effective detection. Moreover, preparing a network dataset is difficult, both because of privacy constraints and because tools for assessing dataset quality are lacking. In a previous paper, we proposed a new method for data quality assessment based on permutation testing. This paper presents a parallel study on the detection limits of that approach. We focus on the problem of network flow classification and use well-known machine learning techniques. The experiments were performed using publicly available network datasets.
<p>Intelligent and autonomous networks require precise and fast mechanisms that ensure error-free and efficient operation. Modern solutions increasingly rely on artificial intelligence, in particular machine learning, to reliably process huge amounts of data. High-quality datasets are therefore essential for training machine learning models. Unfortunately, assessing dataset quality is very challenging and often overlooked. This paper proposes a method for assessing dataset quality in the context of binary classification. It is based on permutation testing and examines the strength of the relationship between observations and labels. Experiments carried out on simulated and real network datasets show that the method is sensitive enough to detect labeling errors in a labeled dataset. We also present theoretical considerations justifying our results.</p>
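The idea of using permutation testing to measure how strongly observations and labels are related can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier, the fixed 50/50 train/test split, the synthetic Gaussian data, and the names `centroid_accuracy`, `permutation_test`, and `make_data` are all assumptions chosen only to keep the example self-contained. The sketch compares the accuracy obtained with the given labels against the accuracies obtained after randomly shuffling the labels; a small p-value suggests the labels carry real information about the observations.

```python
import random


def centroid_accuracy(X, y, train_idx, test_idx):
    """Fit a nearest-centroid classifier on train_idx; return accuracy on test_idx."""
    dims = len(X[0])
    sums = {0: [0.0] * dims, 1: [0.0] * dims}
    counts = {0: 0, 1: 0}
    for i in train_idx:
        counts[y[i]] += 1
        for d in range(dims):
            sums[y[i]][d] += X[i][d]
    if counts[0] == 0 or counts[1] == 0:
        # Degenerate split: fall back to predicting the only class seen.
        majority = 0 if counts[0] >= counts[1] else 1
        return sum(y[i] == majority for i in test_idx) / len(test_idx)
    centroids = {c: [s / counts[c] for s in sums[c]] for c in (0, 1)}
    correct = 0
    for i in test_idx:
        dist = {
            c: sum((X[i][d] - centroids[c][d]) ** 2 for d in range(dims))
            for c in (0, 1)
        }
        pred = 0 if dist[0] <= dist[1] else 1
        correct += int(pred == y[i])
    return correct / len(test_idx)


def permutation_test(X, y, n_permutations=200, seed=0):
    """Return (observed accuracy, p-value) for the observation-label relationship."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train_idx, test_idx = idx[:half], idx[half:]

    observed = centroid_accuracy(X, y, train_idx, test_idx)

    # Count how often shuffled labels do at least as well as the given labels.
    at_least_as_good = 0
    for _ in range(n_permutations):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if centroid_accuracy(X, y_perm, train_idx, test_idx) >= observed:
            at_least_as_good += 1

    # The +1 terms count the unpermuted labelling as one of the permutations.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value


def make_data(n, flip_fraction, seed=1):
    """Synthetic two-class data (assumption); flip_fraction simulates mislabeling."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        if rng.random() < flip_fraction:
            label = 1 - label  # inject a labeling error
        y.append(label)
    return X, y


if __name__ == "__main__":
    for flip in (0.0, 0.2, 0.5):
        X, y = make_data(400, flip)
        acc, p = permutation_test(X, y)
        print(f"flip_fraction={flip:.1f}  accuracy={acc:.3f}  p_value={p:.4f}")
```

In this toy setting, raising `flip_fraction` weakens the link between observations and labels, so the observed accuracy moves toward the accuracies seen under shuffled labels. How such a comparison is turned into a dataset quality measure is specific to the paper and is not reproduced here.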