The first successful isolation-based anomaly detector, i.e., iForest, uses trees as a means to perform isolation. Although it has been shown to have advantages over existing anomaly detectors, we have identified four weaknesses, namely, its inability to detect local anomalies, anomalies with a high percentage of irrelevant attributes, anomalies that are masked by axis-parallel clusters, and anomalies in multimodal data sets. To overcome these weaknesses, this paper shows that an alternative isolation mechanism is required and thus presents iNNE, or isolation using Nearest Neighbor Ensemble. Although it relies on nearest neighbors, iNNE runs significantly faster than existing nearest neighbor-based methods such as the local outlier factor, especially in data sets having thousands of dimensions or millions of instances, because it has linear time complexity and constant space complexity.

KEYWORDS: anomaly detection, ensemble learning, isolation-based, nearest neighbor, outlier detection

INTRODUCTION

Anomaly detection is an important data mining task with a diverse range of applications across domains [1, 2]. The explosive growth of databases in both size and dimensionality challenges anomaly detection methods in two important respects: the requirement of low computational cost and susceptibility to the issues that arise in high-dimensional data sets. Efficient methods are required in time-critical applications such as network intrusion detection and credit card fraud detection. However, the time complexity of most existing methods is on the order of O(n²) (where n is the data set size), which is prohibitively expensive for large data sets. Therefore, efficient and scalable methods for large data sets are highly desirable.

iForest [3] is a unique anomaly detector because it utilizes an isolation mechanism to detect anomalies: it isolates each instance from the rest through recursive axis-parallel subdivisions, and instances that can be isolated easily are likely to be anomalies. The key advantage of iForest is its linear execution time, which makes it extremely efficient in comparison to other methods and thus a very attractive option for large data sets. iForest has been shown [3, 4] to have better detection accuracy and faster runtime than many state-of-the-art methods, including the local outlier factor (LOF) [5] and ORCA [6]. Despite these advantages, our investigation finds that the current isolation mechanism has weaknesses in detecting the following four types of anomalies.

1. Local anomalies: iForest uses a global anomaly score that is not sensitive to the local data distribution of a data set.
2. Anomalies with few relevant dimensions: in high-dimensional data, iForest can utilize only a subset of the dimensions to create each isolation tree, and such a subset does not usually contain sufficient relevant dimensions to detect anomalies when the number of relevant dimensions is low.
3. Anomalies masked by axis-parallel clusters.
4. Anomalies in multimodal data sets.
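To make the alternative isolation mechanism concrete, the sketch below implements the hypersphere scheme that iNNE is built on, as we read it from the paper's description: each ensemble member draws a subsample of ψ instances, surrounds each sampled instance c with a hypersphere whose radius τ(c) is the distance from c to its nearest neighbor within the subsample, and scores a query as 1 if no hypersphere covers it, or 1 − τ(nn(c))/τ(c) for the smallest covering hypersphere c otherwise; scores are averaged over t members. This is a minimal Python illustration; the names (inne_scores, psi, t) are ours, and details such as tie handling may differ from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_member(X, psi):
    """One ensemble member: sample psi points and, for each sampled
    point c, record the radius of the hypersphere centred at c that
    reaches c's nearest neighbour within the subsample."""
    idx = rng.choice(len(X), size=psi, replace=False)
    S = X[idx]
    # pairwise distances within the subsample
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)           # index of each centre's nearest neighbour
    radius = D[np.arange(psi), nn]  # tau(c): distance to that neighbour
    return S, radius, nn

def member_score(x, member):
    """Isolation score of x under one member: 1 if x falls outside all
    hyperspheres; otherwise 1 - tau(nn(c))/tau(c) for the smallest
    covering hypersphere c."""
    S, radius, nn = member
    d = np.linalg.norm(S - x, axis=1)
    covering = np.where(d <= radius)[0]
    if covering.size == 0:
        return 1.0
    c = covering[radius[covering].argmin()]
    return 1.0 - radius[nn[c]] / radius[c]

def inne_scores(X_train, X_test, psi=16, t=100):
    members = [build_member(X_train, psi) for _ in range(t)]
    return np.array([np.mean([member_score(x, m) for m in members])
                     for x in X_test])

# toy usage: a dense cluster plus a few scattered anomalies
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.uniform(-6, 6, (10, 2))])
scores = inne_scores(X, X, psi=16, t=100)
print(X[np.argsort(scores)[-10:]])  # highest-scoring (most anomalous) points
```

Because psi and t are constants chosen independently of the data set size n, building the ensemble and scoring are both linear in n, and each member stores only psi centres and radii, which is the source of the constant-space claim.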
Conventional wisdom in machine learning holds that every algorithm follows the trajectory of a learning curve, colloquially summarised as 'the more data, the better'. We call this 'the gravity of the learning curve', and it is assumed that no learning algorithm is 'gravity defiant'. Contrary to this conventional wisdom, this paper provides theoretical analysis and empirical evidence that nearest neighbour anomaly detectors are gravity-defiant algorithms.
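The abstract does not detail the experiments; as a purely hypothetical illustration of how such a learning curve can be traced, the sketch below scores test points by their distance to the nearest training instance and records AUC as the training sample grows. The synthetic setup is our own and results will vary; the 'gravity-defiant' behaviour the paper refers to is a curve that peaks at a small sample size rather than improving monotonically with more data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def nn_scores(train, test):
    """Anomaly score = distance to the nearest training point."""
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return d.min(axis=1)

# synthetic data: one Gaussian cluster of normal points plus sparse anomalies
normal = rng.normal(0, 1, (2000, 5))
anomalies = rng.uniform(-6, 6, (100, 5))
X_test = np.vstack([normal[:500], anomalies])
y_test = np.r_[np.zeros(500), np.ones(100)]
pool = normal[500:]  # training pool of normal data

# trace the learning curve: AUC as the training sample grows
for n in [4, 16, 64, 256, 1024]:
    sample = pool[rng.choice(len(pool), size=n, replace=False)]
    auc = roc_auc_score(y_test, nn_scores(sample, X_test))
    print(f"sample size {n:5d}: AUC = {auc:.3f}")
```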
This paper introduces a new ensemble approach, Feature-Subspace Aggregating (Feating), which builds local models instead of global models. Feating is a generic ensemble approach that can enhance the predictive performance of both stable and unstable learners, whereas most existing ensemble approaches can improve the predictive performance of unstable learners only. Our analysis shows that the increased level of localisation in Feating reduces the execution time needed to generate each model in an ensemble. Our empirical evaluation shows that Feating performs significantly better than Boosting, Random Subspace, and Bagging in terms of predictive accuracy when a stable learner, SVM, is used as the base learner. The speed-up achieved by Feating makes feasible SVM ensembles that would otherwise be infeasible for large data sets. When SVM is the preferred base learner, we show that Feating SVM performs better than Boosting decision trees and Random Forests. We further demonstrate that Feating also substantially reduces the error of another stable learner, k-nearest neighbour, and of an unstable learner, the decision tree.
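The abstract leaves the mechanism implicit: Feating builds local models over regions defined by small feature subspaces and aggregates their predictions. The simplified sketch below is our own construction, not the authors' algorithm: each ensemble member picks a random feature pair, partitions the data into a quantile grid on those two features, trains a base learner (here SVC) per cell on all attributes, and members vote at prediction time.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

class FeatingLike:
    """Simplified feature-subspace aggregating: one local model per cell
    of a quantile grid over a random feature pair, voted across members."""
    def __init__(self, n_members=10, bins=3):
        self.n_members, self.bins = n_members, bins

    def _cell(self, X, feats, edges):
        # map each row to a grid cell id over the chosen feature pair
        cols = [np.digitize(X[:, f], e) for f, e in zip(feats, edges)]
        return cols[0] * self.bins + cols[1]

    def fit(self, X, y):
        self.members = []
        for _ in range(self.n_members):
            feats = rng.choice(X.shape[1], size=2, replace=False)
            edges = [np.quantile(X[:, f], np.linspace(0, 1, self.bins + 1)[1:-1])
                     for f in feats]
            cells = self._cell(X, feats, edges)
            models = {}
            for c in np.unique(cells):
                mask = cells == c
                if len(np.unique(y[mask])) > 1:    # need both classes to fit
                    models[c] = SVC().fit(X[mask], y[mask])
                else:
                    models[c] = float(y[mask][0])  # constant prediction
            self.members.append((feats, edges, models))
        return self

    def predict(self, X):
        votes = np.zeros(len(X))
        for feats, edges, models in self.members:
            cells = self._cell(X, feats, edges)
            for i, c in enumerate(cells):
                m = models.get(c)
                if m is None:
                    votes[i] += 0.5  # unseen cell: fall back to the prior
                elif isinstance(m, float):
                    votes[i] += m
                else:
                    votes[i] += m.predict(X[i:i + 1])[0]
        return (votes / self.n_members > 0.5).astype(int)

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
print("accuracy:", (FeatingLike().fit(Xtr, ytr).predict(Xte) == yte).mean())
```

Localisation is what lets a stable learner such as SVM benefit: a global resampling ensemble barely changes an SVM's decision boundary, whereas different local regions yield genuinely different models, and each local model trains on far fewer instances, which is where the claimed speed-up comes from.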