Fast and reliable anomaly detection in categorical data

Akoglu, Leman; Tong, Hanghang; Vreeken, Jilles; Faloutsos, Christos

doi:10.1145/2396761.2396816

Cited by 90 publications

(93 citation statements)

References 27 publications


“…Most existing categorical data oriented methods are based on a general assumption that anomalies lie in regions of low frequency (Akoglu et al, 2012; Ghoting, Otey, & Parthasarathy, 2004; He et al, 2005; Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, & Reynolds, 2007; Koufakou & Georgiopoulos, 2010; Smets & Vreeken, 2011; He, Deng, Xu, & Huang, 2006). Typical examples are frequent patterns based methods FPOF (He et al, 2005) and infrequent patterns based methods LOADED (Ghoting et al, 2004).…”

Section: Methods For Categorical Data (mentioning)

confidence: 99%

“…FPOF and LOADED build a single model on the entire training set, and identify anomalies based on frequent patterns and infrequent patterns, respectively. KRIMP (Smets & Vreeken, 2011) and COMPREX (Akoglu et al, 2012) also build a single model on the entire training set using pattern-based compression techniques. KRIMP generates the patterns based on frequent itemsets, while COMPREX employs the Minimum Description Length (Barron, Rissanen, & Yu, 1998) principle to automatically generate patterns from attribute groups (subspaces) with high information gain and avoid the costly frequent itemset search.…”

Section: Methods For Categorical Data (mentioning)

confidence: 99%

“…We compared ZERO++ with FPOF (He et al, 2005), COMPREX (Akoglu et al, 2012), iForest (Liu et al, 2012) and LOF (Breunig et al, 2000). FPOF is a state-of-the-art frequency-based method for categorical data.…”

Section: Contenders and Their Parameter Settings (mentioning)

confidence: 99%

“…ZERO++ is unique in that it works in regions of subspaces that are not occupied by data; whereas existing methods (Akoglu, Tong, Vreeken, & Faloutsos, 2012; Breunig, Kriegel, Ng, & Sander, 2000; He, Xu, Huang, & Deng, 2005; Liu, Ting, & Zhou, 2012) identify anomalies based on the assumption that anomalies lie in regions of low density/frequency, i.e., in regions occupied by data.…”

Section: Introduction (mentioning)

confidence: 99%

“…Frequency-based algorithms (Akoglu et al, 2012; He et al, 2005; Smets & Vreeken, 2011) need to conduct a subspace pattern searching which have time and space complexities that are at least quadratic in terms of the data dimensionality. ZERO++ involves no searching; thus it runs significantly faster.…”

Section: Introduction (mentioning)

confidence: 99%


ZERO++: Harnessing the Power of Zero Appearances to Detect Anomalies in Large-Scale Data Sets

Pang, Ting, Albrecht et al. 2016. JAIR


This paper introduces a new unsupervised anomaly detector called ZERO++ which employs the number of zero appearances in subspaces to detect anomalies in categorical data. It is unique in that it works in regions of subspaces that are not occupied by data; whereas existing methods work in regions occupied by data. ZERO++ examines only a small number of low-dimensional subspaces to successfully identify anomalies. Unlike existing frequency-based algorithms, ZERO++ does not involve subspace pattern searching. We show that ZERO++ is better than or comparable with the state-of-the-art anomaly detection methods over a wide range of real-world categorical and numeric data sets; and it is efficient, with linear time complexity and constant space complexity, which make it a suitable candidate for large-scale data sets.


There and back again: Outlier detection between statistical reasoning and data mining algorithms

Zimek, Filzmoser. 2018. WIREs Data Min & Knowl

141


Outlier detection has been a topic in statistics for centuries. Over mainly the last two decades, there has also been an increasing interest in the database and data mining community in developing scalable methods for outlier detection. Initially based on statistical reasoning, however, these methods soon lost the direct probabilistic interpretability of the derived outlier scores. Here, we detail from a joint point of view of data mining and statistics the roots and the path of development of statistical outlier detection and of database‐related data mining methods for outlier detection. We discuss their inherent meaning, review approaches to again find a statistically meaningful interpretation of outlier scores, and sketch related current research topics. This article is categorized under: Algorithmic Development > Statistics; Algorithmic Development > Scalable Statistical Methods; Technologies > Machine Learning
