In regression analysis of multivariate data, it is tacitly assumed that response and predictor variables in each observed response-predictor pair correspond to the same entity or unit. In this paper, we consider the situation of "permuted data" in which this basic correspondence has been lost. Several recent papers have considered this situation without further assumptions on the underlying permutation. In applications, the latter is often known to have additional structure that can be leveraged. Specifically, we herein consider the common scenario of "sparsely permuted data" in which only a small fraction of the data is affected by a mismatch between response and predictors. However, an adverse effect already observed for sparsely permuted data is that the least squares estimator, as well as other estimators not accounting for such partial mismatch, is inconsistent. One approach studied in detail herein is to treat permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter. The resulting estimate can subsequently be used to recover the permutation. A notable benefit of the proposed approach is its computational simplicity, given the general lack of procedures for the above problem that are both statistically sound and computationally appealing.
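The idea of treating mismatched pairs as outliers can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes a Huber-type loss fitted by iteratively reweighted least squares, a MAD-based residual scale, and simulated Gaussian data in which 10% of the responses are cyclically shifted among themselves. The function names `huber_irls` and `simulate_sparsely_permuted`, the tuning constant 1.345, and all simulation settings are hypothetical choices made for illustration.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50):
    """Huber-type regression via iteratively reweighted least squares (illustrative)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale: median absolute deviation, rescaled for Gaussian noise
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, delta/|u| for large ones
        w = np.minimum(1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def simulate_sparsely_permuted(n=500, d=3, frac=0.1, sigma=0.1, seed=0):
    """Linear model in which a fraction `frac` of responses is mismatched (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    beta = np.arange(1, d + 1, dtype=float)
    y = X @ beta + sigma * rng.normal(size=n)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    y_perm = y.copy()
    y_perm[idx] = y[np.roll(idx, 1)]  # cyclic shift among the selected units only
    return X, y_perm, beta, idx

X, y_perm, beta_true, idx = simulate_sparsely_permuted()
beta_ls = np.linalg.lstsq(X, y_perm, rcond=None)[0]
beta_rob = huber_irls(X, y_perm)
err_ls = np.linalg.norm(beta_ls - beta_true)
err_rob = np.linalg.norm(beta_rob - beta_true)
print(f"least squares error: {err_ls:.3f}, robust error: {err_rob:.3f}")
```

In this setup the mismatched responses behave like gross outliers, so down-weighting large residuals should pull the estimate back toward the true parameter, whereas plain least squares is shrunk toward zero.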
An open problem in graphical Gaussian models is to determine the smallest number of observations needed to guarantee the existence of the maximum likelihood estimator of the covariance matrix with probability one. In this paper we formalize a closely related problem in which the existence of the maximum likelihood estimator is guaranteed for all generic observations. We call the number determined by this problem the Gaussian rank of the graph representing the model. We prove that the Gaussian rank is strictly between the subgraph connectivity number and the graph degeneracy number. These bounds are in general much sharper than the best bounds known in the literature and are furthermore computable in polynomial time.
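As an illustration of why one of these bounds is cheap to compute, the sketch below computes the degeneracy of an undirected graph using its standard definition: the smallest k such that every subgraph has a vertex of degree at most k. It uses the usual peeling procedure of repeatedly deleting a minimum-degree vertex. The function name `degeneracy` and the adjacency-dictionary input format are assumptions made for this example, and the sketch covers only the degeneracy number, not the subgraph connectivity number or the Gaussian rank itself.

```python
def degeneracy(adj):
    """Degeneracy of an undirected graph given as {vertex: iterable of neighbors}.

    Assumes the adjacency is symmetric and every vertex appears as a key.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a private copy
    k = 0
    while adj:
        # Peel off a vertex of minimum degree in the remaining graph
        v = min(adj, key=lambda u: len(adj[u]))
        k = max(k, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return k

# Example graphs: a 4-cycle, a path on 4 vertices, and the complete graph K4
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
print(degeneracy(cycle4), degeneracy(path4), degeneracy(k4))
```

This simple version takes quadratic time in the number of vertices; a bucket-based implementation would be linear in the graph size, but the quadratic one is enough to show that the quantity is polynomial-time computable.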
Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit, as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this paper, we present a method to adjust for such mismatches under "partial shuffling", in which a sufficiently large fraction of (predictors, response)-pairs are observed in their correct correspondence. The proposed approach is based on a pseudo-likelihood in which each term takes the form of a two-component mixture density. Expectation-Maximization schemes are proposed for optimization, which (i) scale favorably in the number of samples, and (ii) achieve excellent statistical performance relative to an oracle that has access to the correct pairings, as certified by simulations and case studies. In particular, the proposed approach can tolerate a considerably larger fraction of mismatches than existing approaches, and enables estimation of the noise level as well as the fraction of mismatches. Inference for the resulting estimator (standard errors, confidence intervals) can be based on established theory for composite likelihood estimation. Along the way, we also propose a statistical test for the presence of mismatches and establish its consistency under suitable conditions.
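A minimal sketch of how an Expectation-Maximization scheme for a two-component mixture might look is given below. It rests on assumptions that the abstract does not state: each pair is treated as correctly matched with probability 1 - alpha, in which case the response follows a Gaussian linear model, and as mismatched with probability alpha, in which case the response is drawn from its marginal distribution, approximated here by a Gaussian fitted to the observed responses. The names `em_mismatch`, `normal_pdf`, and `simulate_partially_shuffled`, the starting values, and the simulation settings are all hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_mismatch(X, y, n_iter=200):
    """EM for an assumed two-component pseudo-likelihood (illustrative sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares start
    sigma2 = np.mean((y - X @ beta) ** 2)
    alpha = 0.5                                  # initial mismatch fraction
    # Mismatch component: Gaussian approximation of the marginal of y, kept fixed
    marg = normal_pdf(y, y.mean(), y.var())
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is correctly matched
        match = (1.0 - alpha) * normal_pdf(y, X @ beta, sigma2)
        p = match / (match + alpha * marg)
        # M-step: weighted least squares, then noise variance and mismatch fraction
        sw = np.sqrt(p)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        sigma2 = max(np.sum(p * (y - X @ beta) ** 2) / np.sum(p), 1e-12)
        alpha = 1.0 - p.mean()
    return beta, np.sqrt(sigma2), alpha

def simulate_partially_shuffled(n=1000, d=3, frac=0.3, sigma=0.1, seed=0):
    """Linear model with a fraction `frac` of responses shuffled (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    beta = np.arange(1, d + 1, dtype=float)
    y = X @ beta + sigma * rng.normal(size=n)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    y_shuf = y.copy()
    y_shuf[idx] = y[np.roll(idx, 1)]  # every selected unit receives another unit's response
    return X, y_shuf, beta

X, y_shuf, beta_true = simulate_partially_shuffled()
beta_em, sigma_em, alpha_em = em_mismatch(X, y_shuf)
beta_ls = np.linalg.lstsq(X, y_shuf, rcond=None)[0]
print("EM error:", np.linalg.norm(beta_em - beta_true))
print("LS error:", np.linalg.norm(beta_ls - beta_true))
print("estimated noise level:", sigma_em, "estimated mismatch fraction:", alpha_em)
```

Each iteration costs one weighted least squares fit, which is in line with the favorable scaling in the number of samples mentioned above. The estimates of the noise level and mismatch fraction come out of the M-step as by-products.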