Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions from large amounts of data, yet they overlook the direct impact that poor data quality has on the performance of intrusion detection systems. This article first summarizes existing machine learning-based intrusion detection systems and the datasets used to build them. It then discusses the data preparation workflow and the data quality requirements for intrusion detection. To investigate how data quality affects model performance, we conducted experiments on 11 host-based intrusion detection system (HIDS) datasets using eight machine learning (ML) models and two pre-trained language models (BERT and GPT-2). The experimental results show the following: (1) BERT and GPT-2 outperform the other models on all of the datasets. (2) The pre-trained models and the classic ML models behave differently when duplicate and overlapped data are removed from a dataset; the pre-trained models are more capable of learning from duplicate and overlapped data than the classic ML models. (3) Removing overlaps and duplicates improves the performance of both the pre-trained models and the classic ML models on most datasets used in this study, although it can sometimes decrease model performance. (4) The reliability of the measured model performance is affected when the testing data contain duplicates. (5) The overlap rate between the normal and intrusion classes appears to be inversely related to the performance of the pre-trained models on the intrusion detection task. Given these results, we discuss model selection for HIDS and quality assurance for training and testing data based on nine data quality dimensions.
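To make the notions of duplicate and overlapped data concrete, the following is a minimal Python sketch and not the procedure used in the study. It assumes each sample is a pair of an event sequence (for example, system call IDs) and a class label; a duplicate is taken to be a repeated identical (sequence, label) pair, and an overlap is taken to be a sequence that appears under more than one label. The function names and the overlap-rate formula (overlapped distinct sequences divided by all distinct sequences) are illustrative assumptions.

```python
def remove_duplicates(samples):
    """Keep only the first occurrence of each (sequence, label) pair."""
    seen = set()
    unique = []
    for seq, label in samples:
        key = (tuple(seq), label)
        if key not in seen:
            seen.add(key)
            unique.append((seq, label))
    return unique


def overlapped_sequences(samples):
    """Return the set of sequences that occur under more than one label."""
    labels_by_seq = {}
    for seq, label in samples:
        labels_by_seq.setdefault(tuple(seq), set()).add(label)
    return {seq for seq, labels in labels_by_seq.items() if len(labels) > 1}


def remove_overlaps(samples):
    """Drop every sample whose sequence is shared between classes."""
    overlapped = overlapped_sequences(samples)
    return [(seq, label) for seq, label in samples if tuple(seq) not in overlapped]


def overlap_rate(samples):
    """Assumed definition: overlapped distinct sequences / all distinct sequences."""
    distinct = {tuple(seq) for seq, _ in samples}
    if not distinct:
        return 0.0
    return len(overlapped_sequences(samples)) / len(distinct)


if __name__ == "__main__":
    # Hypothetical toy data: sequences of system call IDs with class labels.
    toy = [
        ([1, 2, 3], "normal"),
        ([1, 2, 3], "normal"),     # duplicate of the first sample
        ([4, 5], "intrusion"),
        ([6, 7], "normal"),
        ([6, 7], "intrusion"),     # same sequence under both classes: overlap
    ]
    print(len(remove_duplicates(toy)))
    print(len(remove_overlaps(toy)))
    print(overlap_rate(toy))
```

In this sketch the two cleaning steps are kept separate so their effect on a model can be measured independently, which mirrors the distinction the findings draw between duplicates and overlaps.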