The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed kernel minimum regularized covariance determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel-induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
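The kernel trick mentioned above can be illustrated with a small sketch. It is not the KMRCD algorithm itself; it only shows how a squared distance to the mean of a subset in feature space can be computed from the kernel matrix alone, without ever forming the feature vectors. The function names `rbf_kernel` and `feature_space_sq_dist`, and the choice of a Gaussian kernel with parameter `gamma`, are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def feature_space_sq_dist(K, subset):
    """Squared distance of every point to the feature-space mean of `subset`.

    Uses only kernel evaluations:
    ||phi(x_i) - m||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl,
    where j and l range over `subset` and m is the mean of phi over `subset`.
    """
    subset = np.asarray(subset)
    cross = K[:, subset].mean(axis=1)            # mean_j K_ij
    within = K[np.ix_(subset, subset)].mean()    # mean_{j,l} K_jl
    return np.diag(K) - 2.0 * cross + within
```

With a linear kernel (`K = X @ X.T`) the result reduces to the ordinary squared Euclidean distance to the subset mean, which is a convenient sanity check.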
Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label and measurement noise, which often impairs the model's predictive ability. Robust estimators of location and scatter are resistant to this type of contamination. However, their computational cost is prohibitive for large-scale industrial experiments. We present a novel QDA method based on a recent real-time robust algorithm. We additionally integrate an anomaly detection step to classify the most atypical observations into a separate class of outliers. Finally, we introduce the label bias plot, a graphical display to identify label and measurement noise in the training data. The performance of the proposed approach is illustrated in a simulation study with huge datasets, and on real datasets about diabetes and fruit.
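As a rough illustration of the QDA rule described above, the following sketch fits one center and one scatter matrix per class and assigns each observation to the class with the highest quadratic discriminant score. It is not the method of the paper: the default `classical_estimator` is the empirical mean and covariance, and a robust estimator would be passed in through the hypothetical `estimator` argument. The optional `cutoff` on the squared Mahalanobis distance, and the `outlier_label` value, are likewise assumptions used to mimic a separate outlier class.

```python
import numpy as np

def classical_estimator(Z):
    """Empirical mean and covariance; a robust estimator could replace this."""
    return Z.mean(axis=0), np.cov(Z, rowvar=False)

def fit_qda(X, y, estimator=classical_estimator):
    """Estimate center, scatter and prior probability for each class."""
    classes = np.unique(y)
    params = []
    for c in classes:
        Z = X[y == c]
        center, scatter = estimator(Z)
        params.append((center, scatter, len(Z) / len(X)))
    return classes, params

def predict_qda(X, classes, params, cutoff=None, outlier_label=-1):
    """Assign each row of X to the class with the largest quadratic score."""
    scores, dists = [], []
    for center, scatter, prior in params:
        diff = X - center
        # Squared Mahalanobis distance of each row to this class.
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(scatter, diff.T).T)
        _, logdet = np.linalg.slogdet(scatter)
        scores.append(-0.5 * logdet - 0.5 * d2 + np.log(prior))
        dists.append(d2)
    scores, dists = np.vstack(scores), np.vstack(dists)
    best = np.argmax(scores, axis=0)
    labels = classes[best]
    if cutoff is not None:
        # Assumed rule: flag points that are far even from their assigned class.
        assigned_d2 = dists[best, np.arange(X.shape[0])]
        labels = np.where(assigned_d2 > cutoff, outlier_label, labels)
    return labels
```

A common choice for `cutoff` is a high quantile of the chi-squared distribution with as many degrees of freedom as there are variables, but that choice is left to the user here.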
Modern industrial machines can generate gigabytes of data in seconds, frequently pushing the boundaries of available computing power. Together with the time criticality of industrial processing this presents a challenging problem for any data analytics procedure. We focus on the deterministic minimum covariance determinant method (DetMCD), which detects outliers by fitting a robust covariance matrix. We construct a much faster version of DetMCD by replacing its initial estimators by two new methods and incorporating update-based concentration steps. The computation time is reduced further by parallel computing, with a novel robust aggregation method to combine the results from the threads. The speed and accuracy of the proposed real-time DetMCD method (RT-DetMCD) are illustrated by simulation and a real industrial application to food sorting.
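The concentration steps mentioned above can be sketched as follows. This is the plain concentration step (C-step) known from the minimum covariance determinant literature, not the update-based variant of RT-DetMCD, and it omits the initial estimators, the parallel aggregation and any reweighting. The function names `c_step` and `concentrate` and the `max_iter` parameter are illustrative assumptions.

```python
import numpy as np

def c_step(X, subset, h):
    """One concentration step: refit on `subset`, keep the h closest points."""
    center = X[subset].mean(axis=0)
    scatter = np.cov(X[subset], rowvar=False)
    diff = X - center
    # Squared Mahalanobis distances of all points to the current fit.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(scatter, diff.T).T)
    return np.argsort(d2)[:h]

def concentrate(X, subset, h, max_iter=100):
    """Iterate C-steps until the subset no longer changes."""
    subset = np.sort(np.asarray(subset))
    for _ in range(max_iter):
        new_subset = np.sort(c_step(X, subset, h))
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    # Log-determinant of the covariance of the final subset.
    _, logdet = np.linalg.slogdet(np.cov(X[subset], rowvar=False))
    return subset, logdet
```

When the starting subset already has size `h`, each C-step cannot increase the determinant of the subset covariance, which is why the iteration settles after finitely many steps.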
Abstract. This is an invited comment on the discussion paper "The power of monitoring: how to make the most of a contaminated multivariate sample" by A. Cerioli, M. Riani, A. Atkinson and A. Corbellini that will appear in Statistical Methods & Applications.

We would like to congratulate Cerioli, Riani, Atkinson and Corbellini (henceforth CRAC) on their well-written and lavishly illustrated exposition about the usage and benefits of monitoring, and thank the editors for inviting us to comment on this interesting work.

The problem of nearby contamination

The leading example in the paper is the geyser (Old Faithful) dataset. From the scatterplot of this bivariate dataset we see that it consists of two clusters, the smaller of which contains about 30% to 35% of the observations. If one interprets the smaller cluster as contamination this is a relatively high contamination level, though it should not be prohibitive since the estimators considered in the paper can be tuned to a breakdown value well above 35%. However, the contamination happens to lie quite close to the inlying data. Having a large fraction of contamination located fairly close by makes the geyser data particularly challenging, as illustrated by CRAC. If we use a scatter estimator with a breakdown value of e.g. 40%, replacing any 35% of clean data by data points positioned anywhere cannot completely destroy the scatter matrix (in the sense of making its first eigenvalue arbitrarily large or its last eigenvalue arbitrarily close to zero), but that does not imply that the scatter matrix will have a small bias. Indeed, it is known that the bias of the estimators under study is the largest for nearby contamination, as shown by Hubert et al. (2014).

The other real data example in the paper is the cows dataset with 4 variables. Our first instinct was to carry out a PCA to get some idea about the shape of the data.
Figure 1 shows the first two principal components of the cows data, which explain 96% of the total variance. (Here we used the ROBPCA method of Hubert et al. (2005), but classical PCA gave a very similar picture.) The plot shows that this dataset is equally challenging. CRAC interpret it as a well-behaved point cloud plus contamination. Again most of the contamination is nearby, and then it fans out. An alternative interpretation is that it could

arXiv:1803.04820v1 [stat.ME]