Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors

Ting, Kai Ming; Washio, Takashi; Wells, Jonathan R.; Aryal, Sunil

doi:10.1007/s10994-016-5586-4

Cited by 34 publications

(25 citation statements)

References 21 publications


“…When the number of clusters increases further, the data distribution becomes ill represented by the subsamples, resulting in a decrease of AUC (ie, iNNE (ψ=256): AUC degrades when the number of clusters >200, and iNNE (ψ=1024): AUC degrades when the number of clusters >700). This phenomenon is further explained in the work of Ting et al using computational geometry.…”

Section: Conceptual Comparisons With iForest, LOF, and Sp (mentioning)

confidence: 77%

“…First, mass-based dissimilarity measures [36,37] have been shown to outperform distance measures using the same NN algorithms in classification, clustering, anomaly detection, and information retrieval tasks [13,27,38]. Incorporating these into iNNE will enhance its effectiveness and guide in setting the appropriate sample size for different data sets, independent of the given data set size [37]. Second, theories have been developed to explain the reason why NN anomaly detectors can perform well with small samples.…”

Section: Discussion (mentioning)

confidence: 99%

“…This is because the sample is more likely to be contaminated by anomalies with large ψ. The maximum AUC is reached when the sample size is sufficient to represent the data distribution in the data set.…”

Section: Conceptual Comparisons With iForest, LOF, and Sp (mentioning)

confidence: 99%
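
The role of the subsample size ψ in the statements above can be illustrated with a minimal sketch. It is not iNNE itself: it assumes a simpler detector that scores a query by its nearest-neighbour distance to a random subsample of size ψ, averaged over several subsamples. The function name `nn_ensemble_scores`, the parameters `psi` and `t`, and the toy data are hypothetical choices made for this illustration.

```python
import math
import random


def nn_ensemble_scores(data, queries, psi=8, t=50, seed=0):
    """Average nearest-neighbour distance from each query to t random
    subsamples of size psi (an illustrative stand-in, not iNNE)."""
    rng = random.Random(seed)
    totals = [0.0] * len(queries)
    for _ in range(t):
        # Each ensemble member only sees psi points drawn without replacement.
        sample = rng.sample(data, psi)
        for i, q in enumerate(queries):
            totals[i] += min(math.dist(q, s) for s in sample)
    return [total / t for total in totals]


# Toy data (assumed): one dense cluster near the origin plus one distant point.
rng = random.Random(1)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
anomaly = (8.0, 8.0)
data = normal + [anomaly]

# Score a point at the cluster centre and the distant point.
scores = nn_ensemble_scores(data, [(0.0, 0.0), anomaly], psi=8, t=50)
print(scores)
```

A larger `psi` raises the chance that a subsample contains the distant point itself, in which case its distance for that ensemble member drops to zero. That is the contamination effect the quoted statement attributes to large ψ.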


Isolation‐based anomaly detection using nearest‐neighbor ensembles

Bandaragoda, Ting, Albrecht, et al. 2018

Computational Intelligence

Self Cite

121


The first successful isolation-based anomaly detector, ie, iForest, uses trees as a means to perform isolation. Although it has been shown to have advantages over existing anomaly detectors, we have identified 4 weaknesses, ie, its inability to detect local anomalies, anomalies with a high percentage of irrelevant attributes, anomalies that are masked by axis-parallel clusters, and anomalies in multimodal data sets. To overcome these weaknesses, this paper shows that an alternative isolation mechanism is required and thus presents iNNE or isolation using Nearest Neighbor Ensemble.

Although relying on nearest neighbors, iNNE runs significantly faster than the existing nearest neighbor-based methods such as the local outlier factor, especially in data sets having thousands of dimensions or millions of instances. This is because the proposed method has linear time complexity and constant space complexity.

KEYWORDS: anomaly detection, ensemble learning, isolation-based, nearest neighbor, outlier detection

Computational Intelligence. 2018;34:968-998.

INTRODUCTION

Anomaly detection is an important data mining task that has a diverse range of applications in various domains [1,2]. The explosive growth of databases in both size and dimensionality is challenging for anomaly detection methods in two important aspects: the requirement of low computational cost and the susceptibility to issues in high-dimensional data sets. Efficient methods are required in time-critical applications such as network intrusion detection and credit card fraud detection. However, the time complexity of most existing methods is on the order of O(n^2) (where n is the data set size), which is prohibitively expensive for large data sets. Therefore, efficient and scalable methods for large data sets are highly desirable.

iForest [3] is a unique anomaly detector because it utilizes an isolation mechanism to detect anomalies. iForest isolates each instance from the rest of the instances through recursive axis-parallel subdivisions. Those instances that can be easily isolated are likely to be anomalies.

The key advantage of iForest is its linear execution time, which makes it extremely efficient in comparison to other methods, and thus, it is a very attractive option for large data sets. iForest has been shown [3,4] to have better detection accuracy and faster runtime than many state-of-the-art methods including the local outlier factor (LOF) [5] and optimal reciprocal collision avoidance [6]. Despite these advantages, our investigation finds that the current isolation mechanism has weaknesses in detecting the following 4 types of anomalies.

1. Local anomalies: iForest uses a global anomaly score that is not sensitive to the local data distribution of a data set.
2. Anomalies with low relevant dimensions: In high-dimensional data, iForest can only utilize a subset of the dimensions to create isolation trees. Each subset does not usually contain sufficient relevant dimensions to detect anomalies when the number of relevant dimens...


“…A detailed analysis of the advantages and drawbacks of these measures for unsupervised outlier detection can be found in [6]. Following the literature [6,16,30,32,37], the popular measure AUC is used. AUC inherently considers the class-imbalance nature of outlier detection, making it comparable across data sets with different outlier proportions [6].…”

Section: Performance Evaluation Methods (mentioning)

confidence: 99%
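
As a rough illustration of why AUC is comparable across data sets with different outlier proportions, the sketch below computes it as the fraction of (outlier, inlier) pairs that are ranked correctly, with ties counted as half. The function name `auc` and the toy scores are assumptions made for this example and are not taken from any of the cited papers.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen outlier (label 1)
    receives a higher score than a randomly chosen inlier (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ordered correctly
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores (assumed): higher means more anomalous.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
base = auc(scores, labels)  # 5 of the 6 outlier/inlier pairs are ordered correctly

# Duplicating the inliers changes the outlier proportion but not the AUC,
# because the statistic is normalised by the number of pairs.
diluted = auc(scores + [0.8, 0.2, 0.1], labels + [0, 0, 0])
print(base, diluted)
```

The pairwise loop is quadratic in the worst case; it is written this way for clarity rather than speed.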

“…The time complexity may be reduced to be nearly linear by using indexing [4] or distributed computing techniques [8]. Recent studies [26,30,32] show that random distance-based methods or distance-based ensemble methods can achieve not only a similar time complexity reduction but also low false positive errors, resulting in scalable state-of-the-art distance-based detectors. However, these techniques still do not address the curse of dimensionality issue.…”

Section: Related Work, 2.1 Distance-based Outlier Detection (mentioning)

confidence: 99%

Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection

Pang, Cao, Chen, et al. 2018

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

179

134


Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers).

This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach: the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO.

Extensive empirical results on eight real-world ultrahigh-dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.


Anomaly detection of aircraft lead‐acid battery

Zhao, Zhang, Zhu, et al. 2020

Quality & Reliability Eng


The lead‐acid battery has been widely used in various fields. In civil aviation aircraft, it plays a vital role in the power system to maintain normal operation during the flight mission. Thus, an effective anomaly detection system for monitoring and diagnosing the status of the aircraft lead‐acid battery is essential to ensure its safety and reliability. This paper aims to effectively identify aircraft battery faults using unsupervised anomaly detection techniques. It introduces state‐of‐the‐art anomaly detection algorithms and evaluates their performance on a large real civil aviation battery data set. The experimental results show that the latest isolation‐based anomaly detectors, iForest and iNNE, have outstanding performance on this task and have promising applicability as efficient methods for guaranteeing lead‐acid battery quality and reliability in civil aviation aircraft.
