In this study, we set up a scalable framework for large-scale data processing and analytics using a big data framework. Popular classification methods are implemented, tuned, and evaluated on intrusion datasets, with the objective of selecting the best classifier after hyper-parameter optimization. We observed that the decision tree (DT) approach outperforms the other methods in terms of classification accuracy, training time, and average prediction rate. It is therefore selected as the base classifier in our proposed ensemble approach for studying class imbalance. Because intrusion datasets are imbalanced, most classification techniques are biased toward the majority class, and the misclassification rate is higher for the minority class. We propose an ensemble-based method that combines K-Means, RUSBoost, and DT to mitigate the class imbalance problem, empirically investigate the impact of class imbalance on classifier performance, and compare the results using performance metrics such as Balanced Accuracy, Matthews Correlation Coefficient, and F-Measure, which are better suited to assessing imbalanced datasets.
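To illustrate why these three metrics are preferred over plain accuracy on imbalanced data, the following minimal sketch computes them from binary confusion-matrix counts. The abstract does not give the exact definitions used in the study, so the standard binary formulas are assumed here; the function name `imbalance_metrics` and the example counts are hypothetical and are not results from the paper.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Return (balanced_accuracy, mcc, f1) from binary confusion-matrix counts.

    Assumes the standard textbook definitions; the minority (attack) class
    is treated as the positive class. Undefined ratios fall back to 0.0.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Balanced Accuracy: mean of the per-class recalls.
    balanced_accuracy = (recall + specificity) / 2

    # Matthews Correlation Coefficient: correlation between labels and predictions.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F-Measure (F1): harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    return balanced_accuracy, mcc, f1


# Hypothetical example: 990 normal records, 10 attacks, and a degenerate
# classifier that labels every record as normal (the majority class).
tp, fp, tn, fn = 0, 0, 990, 10
accuracy = (tp + tn) / (tp + fp + tn + fn)
balanced_accuracy, mcc, f1 = imbalance_metrics(tp, fp, tn, fn)
print(accuracy, balanced_accuracy, mcc, f1)
```

In this made-up case plain accuracy is 0.99 even though no attack is detected, whereas Balanced Accuracy drops to 0.5 and both MCC and F-Measure fall to 0, which is the kind of majority-class bias the abstract describes.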