Abstract. This is an invited comment on the discussion paper "The power of monitoring: how to make the most of a contaminated multivariate sample" by A. Cerioli, M. Riani, A. Atkinson and A. Corbellini that will appear in Statistical Methods & Applications.
We would like to congratulate Cerioli, Riani, Atkinson and Corbellini (henceforth CRAC) on their well-written and lavishly illustrated exposition about the usage and benefits of monitoring, and thank the editors for inviting us to comment on this interesting work.
The problem of nearby contamination
The leading example in the paper is the geyser (Old Faithful) dataset. From the scatterplot of this bivariate dataset we see that it consists of two clusters, the smaller of which contains about 30% to 35% of the observations. If one interprets the smaller cluster as contamination this is a relatively high contamination level, though it should not be prohibitive since the estimators considered in the paper can be tuned to a breakdown value well above 35%. However, the contamination happens to lie quite close to the inlying data. Having a large fraction of contamination located fairly close by makes the geyser data particularly challenging, as illustrated by CRAC. If we use a scatter estimator with a breakdown value of e.g. 40%, replacing any 35% of clean data by data points positioned anywhere cannot completely destroy the scatter matrix (in the sense of making its first eigenvalue arbitrarily large or its last eigenvalue arbitrarily close to zero), but that does not imply that the scatter matrix will have a small bias. Indeed, it is known that the bias of the estimators under study is the largest for nearby contamination, as shown by Hubert et al. (2014).
The other real data example in the paper is the cows dataset with 4 variables. Our first instinct was to carry out a PCA to get some idea about the shape of the data. Figure 1 shows the first two principal components of the cows data, which explain 96% of the total variance. (Here we used the ROBPCA method of Hubert et al. (2005), but classical PCA gave a very similar picture.) The plot shows that this dataset is equally challenging. CRAC interpret it as a well-behaved point cloud plus contamination. Again most of the contamination is nearby, and then it fans out. An alternative interpretation is that it could arXiv:1803.04820v1 [stat.ME]