In phase I of statistical process control (SPC), control charts are often used as outlier detection methods to assess process stability. Many of these methods require estimation of the covariance matrix, are computationally infeasible, or have not been studied when the dimension of the data, p, is large. We propose the one-class peeling (OCP) method, a flexible framework that combines statistical and machine learning methods to detect multiple outliers in multivariate data. The OCP method can be applied to phase I of SPC, does not require covariance estimation, and is well suited to high-dimensional data sets with a high percentage of outliers. Our empirical evaluation suggests that the OCP method performs well in high dimensions and is computationally more efficient and more robust than existing methodologies. We motivate and illustrate the use of the OCP method in a phase I SPC application on a data set with N = 354 observations and p = 1917 dimensions containing Wikipedia search results for National Football League (NFL) players, teams, coaches, and managers. The example data set and the R functions OCP.R and OCPLimit.R, which compute the OCP distances and thresholds, respectively, are available in the supplementary materials.
The k-chart, based on support vector data description (SVDD), has received recent attention in the literature. We review four different methods for choosing the bandwidth parameter, s, when the k-chart is designed using the Gaussian kernel. We provide results of extensive Phase I and Phase II simulation studies varying the method of choosing the bandwidth parameter along with the size and distribution of sample data. The k-chart performed as desired in only very limited cases. In general, we are unable to recommend the k-chart, in its current form, for use in a Phase I or Phase II process monitoring study.
Inspired by support vector machines (SVM), SVDD is an unsupervised learning method used to give a description of (or produce a boundary around) a data set. Whereas SVM separates classes by maximizing the margin (the distance between the closest objects of two classes), SVDD seeks the minimum-volume boundary surrounding a data set and relies on user-supplied parameters to determine how large the boundary should be. In SVM, the boundary between the two classes is defined by only a few points of each class, called the support vectors. Similarly, in SVDD, the boundary surrounding a data set is defined only by the points farthest from the center of the data, which are likewise referred to as the support vectors. To obtain the SVDD hypersphere, defined by a center $\mathbf{a}$ and a radius $R$, we minimize $F(R, \mathbf{a}, \xi_i) = R^2 + C \sum_i \xi_i$
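To make the role of the user-supplied parameter concrete, the following minimal sketch evaluates the SVDD objective R^2 + C Σ ξ_i for a fixed, hand-picked center and radius. It is not the optimization itself and is not taken from the papers' code. It assumes the standard soft-margin constraints ||x_i - a||^2 <= R^2 + ξ_i with ξ_i >= 0, so each slack is the amount by which a point's squared distance exceeds R^2. The function name `svdd_objective`, the toy points, and the value C = 0.1 are illustrative assumptions.

```python
def svdd_objective(points, center, radius, C):
    """Evaluate F = R^2 + C * sum(xi_i) for a fixed center and radius.

    Assumes the usual soft-margin constraint ||x_i - a||^2 <= R^2 + xi_i,
    xi_i >= 0, so the smallest feasible slack for each point is
    max(0, ||x_i - a||^2 - R^2).
    """
    r2 = radius ** 2
    slacks = []
    for x in points:
        # Squared Euclidean distance from the point to the center.
        d2 = sum((xj - aj) ** 2 for xj, aj in zip(x, center))
        slacks.append(max(0.0, d2 - r2))
    return r2 + C * sum(slacks), slacks


# Toy data (hypothetical): three points near the origin and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
center = (0.0, 0.0)

# A tight sphere leaves the distant point outside and pays for it via slack.
tight, tight_slacks = svdd_objective(points, center, radius=1.0, C=0.1)

# A wide sphere encloses every point, so all slacks are zero.
wide, wide_slacks = svdd_objective(points, center, radius=3.0, C=0.1)

print(tight, tight_slacks)
print(wide, wide_slacks)
```

With a small C, the tight sphere that excludes the distant point has the lower objective value; raising C makes excluded points more expensive and favors the larger boundary. This is one way to read the statement that SVDD relies on user-supplied parameters to determine how large the boundary should be.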