Consider a two-class clustering problem where we observe X_i = ℓ_i µ + Z_i, with Z_i iid ∼ N(0, I_p), 1 ≤ i ≤ n. The feature vector µ ∈ R^p is unknown but is presumably sparse. The class labels ℓ_i ∈ {−1, 1} are also unknown, and the main interest is to estimate them.

We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation between the Region of Impossibility and the Region of Possibility. In the former, useful features are too rare or too weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments.

We also extend the study of statistical limits for clustering to that for signal recovery and that for global testing. We compare the statistical limits for the three problems and expose some interesting insights.

We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold t > 0, IF-PCA clusters by applying classical PCA to all columns of X with an L^2-norm larger than t. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering.

We discover a phase transition for IF-PCA. For any threshold t > 0, let ξ^(t) be the first left singular vector of the post-selection data matrix. The phase space partitions into two different regions. In one region, there is a t such that cos(ξ^(t), ℓ) → 1 and IF-PCA yields successful clustering. In the other, cos(ξ^(t), ℓ) ≤ c_0 < 1 for all t > 0. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.
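The IF-PCA step described above (keep the columns of X whose L^2-norm exceeds t, then take the first left singular vector of what remains) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `if_pca` and `abs_cosine`, the rule of reading labels off the signs of the singular vector, and every simulation setting (n, p, the number s and strength tau of useful features, and the threshold 20.0) are hypothetical choices made for the example. Labels are recoverable only up to a global sign flip, so the error is measured accordingly.

```python
import numpy as np


def if_pca(X, t):
    """Sketch of IF-PCA for an n x p data matrix X and threshold t > 0."""
    col_norms = np.linalg.norm(X, axis=0)      # L^2-norm of each column
    selected = np.flatnonzero(col_norms > t)   # columns that pass the threshold
    if selected.size == 0:
        raise ValueError("no column has an L2-norm larger than t")
    # First left singular vector of the post-selection data matrix.
    U, _, _ = np.linalg.svd(X[:, selected], full_matrices=False)
    xi = U[:, 0]
    # Assumption: estimate the labels by the signs of the entries of xi.
    labels = np.where(xi >= 0, 1, -1)
    return labels, xi, selected


def abs_cosine(u, v):
    """|cos| of the angle between u and v (the sign of xi is arbitrary)."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical simulation of X_i = l_i * mu + Z_i with a sparse mu.
rng = np.random.default_rng(0)
n, p, s, tau = 200, 1000, 20, 1.5
ell = rng.choice([-1, 1], size=n)              # true class labels
mu = np.zeros(p)
mu[:s] = tau                                   # s useful features of strength tau
X = np.outer(ell, mu) + rng.standard_normal((n, p))

labels, xi, selected = if_pca(X, t=20.0)
error = np.mean(labels != ell)
error = min(error, 1 - error)                  # labels are defined up to a sign flip
print("selected columns:", selected.size)
print("clustering error:", error)
print("|cos(xi, ell)|:", abs_cosine(xi, ell))
```

In this toy setting the useful features are strong and the threshold sits well between the typical norm of a pure-noise column and that of a useful column, so it corresponds to an easy point of the Region of Possibility. It says nothing about behavior near the phase boundary, where the choice of t is the delicate part.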