Outlier mining in large high-dimensional data sets

Angiulli, Fabrizio; Pizzuti, Clara

doi:10.1109/tkde.2005.31

Cited by 305 publications

(196 citation statements)

References 26 publications

Supporting

Mentioning

189

Contrasting

Unclassified

“…More specifically, for two arbitrary data points p_1 and p_2 in DS, F_out(p_1) and F_out(p_2) can be compared with each other, and if F_out(p_1) > F_out(p_2), p_1 has a larger possibility than p_2 to be an outlier. The definitions proposed by Angiulli et al. [6], Breunig et al. [3], and Ramaswamy et al. [7] straightforwardly adhere to this category. On the other hand, the definition of Ng and Knorr [4] can be converted to this category by taking the inverse of the number of neighbors within distance r of each data point.…”

Section: Introduction (mentioning)

confidence: 99%
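
The conversion described in the statement above can be sketched as follows. This is a minimal illustration, not code from the citing paper: the helper names `neighbor_count` and `f_out`, the choice to exclude the point itself from its own neighborhood, the use of "distance <= r" for "within distance r", and the infinite score for a point with no neighbors are all assumptions made for the example.

```python
import math

def neighbor_count(points, i, r):
    """Number of other points within distance r of points[i] (assumed: <= r)."""
    return sum(
        1
        for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= r
    )

def f_out(points, i, r):
    """Outlier score as the inverse of the neighbor count.

    A point with no neighbors gets an infinite score (an assumption made
    here to avoid division by zero; the quoted text does not cover this case).
    """
    count = neighbor_count(points, i, r)
    return math.inf if count == 0 else 1.0 / count

# Hypothetical one-dimensional data: three close points and one isolated point.
points = [(0.0,), (0.5,), (1.0,), (5.0,)]
scores = [f_out(points, i, r=1.0) for i in range(len(points))]
```

With this score, a larger value of `f_out` means fewer neighbors, so points can be ranked in the same way as under the other score-based definitions.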

“…Researchers have developed several supervised and unsupervised techniques to mine outliers in static databases and also recently in data streams [9]. Unsupervised outlier detection can be further classified as distance-based [5,6,4,7], density-based [3,8,9] and deviation-based [10]. In this paper, we focus on distance-based outliers which have been popularly defined as: (a) data points from which there are fewer than p points that are within distance r [4], (b) top n data points whose distance to their corresponding k-th nearest neighbor are largest [7], and (c) top n data points whose total distance to their corresponding k nearest neighbors are largest [6].…”

Section: Introduction (mentioning)

confidence: 99%
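
The three distance-based definitions (a), (b), and (c) quoted above can be compared side by side with a brute-force sketch. The function names, the toy data, and the convention that a point is not counted as its own neighbor are assumptions for illustration; none of the cited papers' optimized algorithms are reproduced here.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest other points."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def is_outlier_count(points, i, p, r):
    """Definition (a): fewer than p other points lie within distance r."""
    within = sum(
        1
        for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= r
    )
    return within < p

def score_kth(points, i, k):
    """Definition (b): distance to the k-th nearest neighbor."""
    return knn_distances(points, i, k)[-1]

def score_sum(points, i, k):
    """Definition (c): total distance to the k nearest neighbors."""
    return sum(knn_distances(points, i, k))

def top_n(points, score, k, n):
    """Indices of the n points with the largest score."""
    ranked = sorted(
        range(len(points)), key=lambda i: score(points, i, k), reverse=True
    )
    return ranked[:n]

# Hypothetical data: a unit square plus one far-away point at index 4.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
top_by_kth = top_n(points, score_kth, k=2, n=1)
top_by_sum = top_n(points, score_sum, k=2, n=1)
```

On this toy data all three definitions single out the isolated point; on real data they can disagree, which is why the choice of definition matters for algorithm design.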

“…Based on the definition [6], we develop an outlier scoring criterion. Then in the first phase, we partition the data into clusters, and make an early estimate on the lower bound of outlier scores.…”

Section: Introduction (mentioning)

confidence: 99%

“…The choice of a global or local outlier score function clearly affects later stages of the algorithm design process. In this work, we employ a global outlier function based on [6], although the ideas employed in MIRO can also be adapted to use other functions. The intuition and quality of detection results of the chosen outlier definition are based on solid foundations as shown by prior work [6,11].…”

Section: Introduction (mentioning)

confidence: 99%

“…In this work, we employ a global outlier function based on [6], although the ideas employed in MIRO can also be adapted to use other functions. The intuition and quality of detection results of the chosen outlier definition are based on solid foundations as shown by prior work [6,11]. This definition is also employed in other popular techniques on outlier detection [12].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Pruning Schemes for Distance-Based Outlier Detection

Gopalkrishnan

2009

Lecture Notes in Computer Science

Abstract. Outlier detection finds many applications, especially in domains that have scope for abnormal behavior. In this paper, we present a new technique for detecting distance-based outliers, aimed at reducing execution time associated with the detection process. Our approach operates in two phases and employs three pruning rules. In the first phase, we partition the data into clusters, and make an early estimate on the lower bound of outlier scores. Based on this lower bound, the second phase then processes relevant clusters using the traditional block nested-loop algorithm. Here two efficient pruning rules are utilized to quickly discard more non-outliers and reduce the search space. Detailed analysis of our approach shows that the additional overhead of the first phase is offset by the reduction in cost of the second phase. We also demonstrate the superiority of our approach over existing distance-based outlier detection methods by extensive empirical studies on real datasets.
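
The abstract does not spell out its three pruning rules, so the sketch below shows only the general idea of pruning inside a nested-loop scan, using a generic cutoff rule rather than the paper's own. It assumes the "sum of distances to the k nearest neighbors" score; the function name `top_n_outliers` and the toy data are hypothetical. A candidate is discarded as soon as its running score, which can only shrink as more neighbors are examined, drops below the weakest score in the current top-n list.

```python
import heapq
import math

def top_n_outliers(points, k, n):
    """Nested-loop top-n outlier search with a simple cutoff pruning rule.

    Score: sum of distances to the k nearest neighbors. This is a generic
    sketch, not the pruning scheme of the paper summarized above.
    """
    top = []       # min-heap of (score, index); holds at most n entries
    cutoff = 0.0   # smallest score currently in the top-n list
    for i, p in enumerate(points):
        nearest = []   # max-heap (negated) of the k smallest distances so far
        total = 0.0    # sum of the distances stored in `nearest`
        pruned = False
        for j, q in enumerate(points):
            if j == i:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
                total += d
            elif d < -nearest[0]:
                # Swap the largest stored distance for the closer neighbor.
                total += d + heapq.heapreplace(nearest, -d)
            # Once k neighbors are known, `total` is an upper bound on the
            # final score, so a value below the cutoff rules the point out.
            if len(nearest) == k and total < cutoff:
                pruned = True
                break
        if pruned:
            continue
        if len(top) < n:
            heapq.heappush(top, (total, i))
        elif total > top[0][0]:
            heapq.heapreplace(top, (total, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Hypothetical data: a unit square plus one far-away point at index 4.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
result = top_n_outliers(points, k=2, n=1)
```

The cutoff rises as stronger outliers are found, so later candidates tend to be discarded after only a few distance computations. A first phase that estimates a lower bound on the cutoff, as the abstract describes, would presumably make this pruning effective from the start of the scan.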

Distributed anomaly detection using 1‐class SVM for vertically partitioned data

Das

Bhaduri

Votava

2011

Statistical Analysis and Data Mining

There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of data sets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only because of the massive volume of data but also because these data sets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available data sets: (i) the NASA MODIS satellite images and (ii) a simulated aviation data set generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS). 

A survey on unsupervised outlier detection in high‐dimensional numerical data

Zimek

Schubert

Kriegel

2012

Statistical Analysis and Data Mining

High‐dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so‐called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In about just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high‐dimensional data in Euclidean space. These approaches fall under mainly two categories, namely considering or not considering subspaces (subsets of attributes) for the definition of outliers. The former are specifically addressing the presence of irrelevant attributes, the latter do consider the presence of irrelevant attributes implicitly at best but are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high‐dimensional data. In this survey article, we discuss some important aspects of the ‘curse of dimensionality’ in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
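
The "distance concentration effect" mentioned in the abstract can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not material from the survey: it draws points uniformly from the unit hypercube, and the helper name `relative_contrast`, the sample size, and the seed are arbitrary choices. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for uniform random points around a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Compare a low-dimensional and a high-dimensional setting.
low_dim_contrast = relative_contrast(dim=2)
high_dim_contrast = relative_contrast(dim=500)
```

In this setup the contrast is much smaller at 500 dimensions than at 2, meaning nearest and farthest neighbors become hard to tell apart. Distance-based outlier scores lose discriminative power in the same way, which is one of the challenges the survey discusses.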
