The rapid rise of cyberattacks and the gradual failure of traditional
defense systems and approaches have driven the adoption of Machine
Learning (ML) techniques to build more efficient and reliable Intrusion
Detection Systems (IDSs). However, the growing size of IDS datasets has
degraded the performance and increased the computational cost of
ML-based IDSs. Many researchers have applied data preprocessing
techniques such as feature selection and normalization to overcome
these issues. While most of these researchers report the success of
such preprocessing techniques at only a shallow level, very few studies
have examined their effects on a wider scale. Furthermore, the
performance of an IDS model depends not only on the preprocessing
techniques applied but also on the dataset and the ML algorithm used, a
dependency that most existing studies give little emphasis. Thus, this
study provides an in-depth analysis of the effects of feature selection
and normalization on various IDS models built using two IDS datasets,
namely NSL-KDD and UNSW-NB15, and five different ML algorithms: support
vector machine, k-nearest neighbor, random forest, naive Bayes, and
artificial neural network. For feature selection, a decision-tree
wrapper-based approach, which tends to yield superior model
performance, was used; for normalization, min-max scaling was applied.
A total of 30
unique IDS models were implemented using the full and feature-selected
copies of the datasets. The models were evaluated using popular IDS
evaluation metrics, and intra- and inter-model comparisons were
performed among the models and against state-of-the-art works. Random
forest achieved the best performance on both the NSL-KDD and UNSW-NB15
datasets, with prediction accuracies of 99.87% and 98.5% and detection
rates of 99.79% and 99.17%, respectively; it also performed excellently
in comparison with recent works. The results show that
both normalization and feature selection positively affect IDS
modeling, with normalization proving more important than feature
selection for improving both performance and computational time. The
study also found that the UNSW-NB15 dataset is more complex than
NSL-KDD and more suitable for building and evaluating modern-day IDSs.
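The preprocessing pipeline named above (min-max normalization followed by decision-tree wrapper-based feature selection, feeding a random forest classifier) can be sketched as follows. This is a minimal illustration using scikit-learn and synthetic data; the dataset, the number of selected features, and all parameter values are assumptions for demonstration, not the study's exact configuration.

```python
# Sketch: min-max normalization + decision-tree wrapper-based feature
# selection + random forest, as a single scikit-learn pipeline.
# Synthetic data stands in for an encoded IDS dataset; all parameters
# here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification task (normal vs. attack stand-in).
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

pipe = Pipeline([
    # Min-max normalization: rescale every feature to [0, 1].
    ("scale", MinMaxScaler()),
    # Wrapper-based selection: a decision tree evaluates candidate
    # feature subsets via cross-validated forward selection.
    ("select", SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=8, cv=3)),
    # Final classifier trained on the selected, normalized features.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipe.fit(X_tr, y_tr)
print("test accuracy:", round(pipe.score(X_te, y_te), 3))
```

Wrapping all three steps in one `Pipeline` ensures the scaler and selector are fitted on training data only, so test-set evaluation remains leak-free.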