Numerous research studies have demonstrated the effectiveness of machine learning techniques applied to network intrusion detection. Yet the adoption of machine learning for securing large-scale network environments remains limited. The community acknowledges that network security presents unique challenges for machine learning, and the lack of training data representative of modern traffic remains one of the most intractable issues. New attempts are continuously made to develop high-quality benchmark datasets and proper data collection methodologies. The CICIDS2017 dataset is one of the recent results, created to meet the demanding criterion of representativeness for network intrusion detection. In this paper, we revisit CICIDS2017 and its data collection pipeline and analyze the correctness, relevance and overall utility of the dataset for the learning task. In the process, we uncover a series of traffic generation, flow extraction and labeling problems that severely affect the validity of the data. To counter these issues, we investigate and eliminate their causes and regenerate the dataset. As a result, more than 20% of traffic traces are reconstructed or relabeled, and machine learning benchmarks demonstrate significant improvements. Our study exemplifies how data collection issues can have an enormous impact on model evaluation and provides recommendations for their anticipation and prevention.