We study the problem of detecting a structured, low-rank signal matrix corrupted with additive Gaussian noise. This includes clustering in a Gaussian mixture model, sparse PCA, and submatrix localization. Each of these problems is conjectured to exhibit a sharp information-theoretic threshold, below which the signal is too weak for any algorithm to detect. We derive upper and lower bounds on these thresholds by applying the first and second moment methods to the likelihood ratio between these "planted models" and null models where the signal matrix is zero. For sparse PCA and submatrix localization, we determine this threshold exactly in the limit where the number of blocks is large or the signal matrix is very sparse; for the clustering problem, our bounds differ by a factor of √ 2 when the number of clusters is large. Moreover, our upper bounds show that for each of these problems there is a significant regime where reliable detection is information-theoretically possible but where known algorithms such as PCA fail completely, since the spectrum of the observed matrix is uninformative. This regime is analogous to the conjectured 'hard but detectable' regime for community detection in sparse graphs.
A matrix A ∈ C n×n is diagonalizable if it has a basis of linearly independent eigenvectors. Since the set of nondiagonalizable matrices has measure zero, every A ∈ C n×n is the limit of diagonalizable matrices. We prove a quantitative version of this fact conjectured by E.B. Davies: for each δ ∈ (0, 1), every matrix A ∈ C n×n is at least δ A -close to one whose eigenvectors have condition number at worst c n /δ, for some constants c n dependent only on n. Our proof uses tools from random matrix theory to show that the pseudospectrum of A can be regularized with the addition of a complex Gaussian perturbation. Along the way, we explain how a variant of a theorem of Śniady implies a conjecture of Sankar, Spielman and Teng on the optimal constant for smoothed analysis of condition numbers.
We study the problem of detecting a structured, low-rank signal matrix corrupted with additive Gaussian noise. This includes clustering in a Gaussian mixture model, sparse PCA, and submatrix localization. Each of these problems is conjectured to exhibit a sharp information-theoretic threshold, below which the signal is too weak for any algorithm to detect. We derive upper and lower bounds on these thresholds by applying the first and second moment methods to the likelihood ratio between these "planted models" and null models where the signal matrix is zero. For sparse PCA and submatrix localization, we determine this threshold exactly in the limit where the number of blocks is large or the signal matrix is very sparse; for the clustering problem, our bounds differ by a factor of √ 2 when the number of clusters is large. Moreover, our upper bounds show that for each of these problems there is a significant regime where reliable detection is information-theoretically possible but where known algorithms such as PCA fail completely, since the spectrum of the observed matrix is uninformative. This regime is analogous to the conjectured 'hard but detectable' regime for community detection in sparse graphs.Keywords: First and second moment methods, clustering, information-theoretic bounds, sparse PCA, submatrix localization * J. Banks is with the
We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of m points in n dimensions, n, m → ∞ and α = m/n stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of α and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determine the accuracy achievable by the Bayes-optimal estimation algorithm. In particular, we find that when the number of clusters is sufficiently large, r > 4 + 2 √ α, there is a gap between the threshold for informationtheoretically optimal performance and the threshold at which known algorithms succeed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.