In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm which depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data.
In this paper, we introduce a simple and efficient representation for natural images. We view an image (in either the spatial domain or the wavelet domain) as a collection of vectors in a high-dimensional space. We then fit a piece-wise linear model (i.e., a union of affine subspaces) to the vectors at each downsampling scale. We call this a multiscale hybrid linear model for the image. The model can be effectively estimated via a new algebraic method known as generalized principal component analysis (GPCA). The hybrid and hierarchical structure of this model allows us to effectively extract and exploit multimodal correlations among the imagery data at different scales. It conceptually and computationally remedies limitations of many existing image representation methods that are based on either a fixed linear transformation (e.g., DCT, wavelets), or an adaptive uni-modal linear transformation (e.g., PCA), or a multimodal model that uses only cluster means (e.g., VQ). We will justify both quantitatively and experimentally why and how such a simple multiscale hybrid model is able to reduce simultaneously the model complexity and computational cost. Despite a small overhead of the model, our careful and extensive experimental results show that this new model gives more compact representations for a wide variety of natural images under a wide range of signal-to-noise ratios than many existing methods, including wavelets. We also briefly address how the same (hybrid linear) modeling paradigm can be extended to be potentially useful for other applications, such as image segmentation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.