Outlier detection in contingency tables using decomposable graphical models

Lindskou, Mads; Eriksen, Poul Svante; Tvedebrink, Torben

doi:10.1111/sjos.12407

Cited by 6 publications

(16 citation statements)

References 19 publications

“…, for i ∈ I, which is the maximum likelihood estimates of ( 16) as also exploited in the outlier detection model given in [22].…”

Section: Notation and The Likelihood Function (mentioning)

confidence: 99%

“…The likelihood ratio for the pure discrete part, Q D := L( p; n)/L( q; n), was investigated by [22]: Given a RIP ordering…”

Section: The Null Hypothesis and Deviance Test Statistic (mentioning)

confidence: 99%

“…[15] gave the following definition: "an observation which deviates so much from the other observations in the data-set as to arouse suspicions that it was generated by a different mechanism". In [22], this definition, was adapted by specifying a statistical hypothesis of an outlier being distributed differently than all other observations for discrete data sets. In this paper, we extend this definition to capture outliers in data sets with variables of mixed types, i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Outliers in High-dimensional Data with Mixed Variable Types using Conditional Gaussian Regression Models

Lindskou, Tvedebrink, Eriksen et al. 2021

Preprint

Self Cite

Outlier detection has gained increasing interest in recent years, due to newly emerging technologies and the huge amount of high-dimensional data that are now available. Outlier detection can help practitioners to identify unwanted noise and/or locate interesting abnormal observations. To address this, we developed a novel method for outlier detection for use in, possibly high-dimensional, datasets with both discrete and continuous variables. We exploit the family of decomposable graphical models in order to model the relationship between the variables and use this to form an exact likelihood ratio test for an observation that is considered an outlier. We show that our method outperforms the state-of-the-art Isolation Forest algorithm on a real data example.

“…In Lindskou et al (2019) the molic package was used to detect outliers in microhap data from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). This data contains DNA profiles from five different continental regions (CRs); Europe (EUR), America (AMR), East Asia (EAS), South Asia (SAS) and Africa (AFR).…”

Section: A Use Case In Forensic Science (mentioning)

confidence: 99%

“…An outlier is a case-specific unit since it may be interpreted as natural extreme noise in some applications, whereas in other applications it may be the most interesting observation. The molic package has been written to facilitate the novel outlier detection method in high-dimensional contingency tables (Lindskou, Eriksen, & Tvedebrink, 2019). In other words, the method works for data sets in which all variables are categorical, implying that they can only take on a finite set of values (also called levels). The software uses decomposable graphical models (DGMs), where the probability mass function can be associated with an interaction graph, from which conditional independences among the variables can be inferred.…”

mentioning

confidence: 99%

molic: An R package for multivariate outlier detection in contingency tables

Lindskou

2019

JOSS

Self Cite

Outlier detection is an important task in statistical analyses. An outlier is a case-specific unit since it may be interpreted as natural extreme noise in some applications, whereas in other applications it may be the most interesting observation. The molic package has been written to facilitate the novel outlier detection method in high-dimensional contingency tables (Lindskou, Eriksen, & Tvedebrink, 2019). In other words, the method works for data sets in which all variables are categorical, implying that they can only take on a finite set of values (also called levels). The software uses decomposable graphical models (DGMs), where the probability mass function can be associated with an interaction graph, from which conditional independences among the variables can be inferred. This gives a way to investigate the underlying nature of outliers. This is also called understandability in the literature.

Outlier detection has many applications including areas such as

• Fraud detection
• Medical and public health
• Anomaly detection in text data
• Fault detection (on critical systems)
• Forensic science

The Method

The method can be described by the outlier test procedure below. Assume we are interested in whether or not a new observation z is an outlier in some data set D. First an interaction graph G is fitted to the variables in D; a decomposable undirected graph that describes the association structure between variables in D. If the assumption that z belongs to D is true, z should be included in D. Denote by D_z the new data set including z. Finally the outlier model M is constructed using G and D_z from which we can query the p-value, p, for the test about z belonging to D. If p is less than some chosen threshold (significance level), say 0.05, z is declared an outlier in D.
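The test procedure described above can be illustrated with a minimal Python sketch. It is not the molic API and not the authors' implementation, and it rests on several assumptions. The graph-fitting step is skipped: the decomposable graph G is assumed to be already known and is passed in as its cliques and separators (tuples of column indices). The probability of a cell is taken to be the standard maximum likelihood estimate for a decomposable model, p(x) = ∏_C n_C(x_C) / (n ∏_S n_S(x_S)). The p-value is a simple empirical stand-in: the fraction of observations in D_z whose fitted probability is no larger than that of z. The function names `marginal_counts` and `outlier_p_value` are hypothetical.

```python
from collections import Counter
from math import log


def marginal_counts(rows, idx):
    """Count how often each configuration of the columns in idx occurs."""
    return Counter(tuple(r[i] for i in idx) for r in rows)


def outlier_p_value(z, data, cliques, separators):
    """Empirical p-value for 'z belongs to data' (illustrative stand-in only).

    cliques and separators describe an assumed, already-fitted decomposable
    graph; a graph with k cliques is expected to come with k - 1 separators
    (an empty tuple is allowed for disconnected components).
    """
    d_z = list(data) + [tuple(z)]          # D_z: the data set including z
    n = len(d_z)
    clique_tabs = [(c, marginal_counts(d_z, c)) for c in cliques]
    sep_tabs = [(s, marginal_counts(d_z, s)) for s in separators]

    def log_pmf(x):
        # log of prod_C n_C(x_C) / (n * prod_S n_S(x_S))
        value = -log(n)
        for idx, tab in clique_tabs:
            value += log(tab[tuple(x[i] for i in idx)])
        for idx, tab in sep_tabs:
            value -= log(tab[tuple(x[i] for i in idx)])
        return value

    scores = [log_pmf(x) for x in d_z]
    z_score = scores[-1]
    # Fraction of observations at least as improbable as z under the model.
    return sum(s <= z_score for s in scores) / n


# Toy data: three categorical variables forming the chain 0 - 1 - 2.
data = [("a", "x", "u")] * 40 + [("b", "y", "v")] * 40
cliques = [(0, 1), (1, 2)]
separators = [(1,)]

p = outlier_p_value(("a", "x", "v"), data, cliques, separators)
is_outlier = p < 0.05   # threshold (significance level) chosen as in the text
```

In this toy data the combination ("x", "v") never occurs in D, so z receives a small p-value and is flagged. The empirical ranking used here is only a convenient approximation for illustration; the paper's test is based on the distribution of a likelihood ratio statistic under the fitted model.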
