In high-dimensional linear regression, does increasing the effect sizes always improve model selection when all other conditions are held fixed, in particular the sparsity of the regression coefficients? In this paper, we answer this question in the negative in the regime of linear sparsity for the Lasso method, by introducing a new notion that we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes. From the viewpoint of this new measure, we prove that the false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal in a certain sense, and the worst trade-off when it is minimal, in the sense that all nonzero effect sizes are roughly equal. Moreover, we demonstrate that the first false selection occurs much earlier when effect size heterogeneity is minimal than when it is maximal. The underlying cause of both phenomena is, metaphorically speaking, the "competition" among variables with effect sizes of the same magnitude to enter the model. Taken together, our findings suggest that effect size heterogeneity should serve as an important measure, complementary to the sparsity of the regression coefficients, in the analysis of high-dimensional regression problems. Our proofs use techniques from approximate message passing theory as well as a novel technique for estimating the rank of the first false variable.
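The phenomenon described above can be probed numerically. The toy sketch below (our own construction, not the paper's experiments; all settings are illustrative) counts how many true variables have entered the Lasso path by the time the first false variable is selected, comparing equal nonzero effect sizes against geometrically decaying (highly heterogeneous) ones.

```python
# Toy illustration: when does the first false variable enter the Lasso path?
import numpy as np
from sklearn.linear_model import lasso_path

def true_discoveries_before_first_false(beta, n=200, seed=0):
    """Number of true variables active on the Lasso path at the step
    where the first false (null) variable becomes active."""
    rng = np.random.default_rng(seed)
    p = beta.size
    X = rng.standard_normal((n, p)) / np.sqrt(n)  # roughly unit-norm columns
    y = X @ beta + rng.standard_normal(n)
    _, coefs, _ = lasso_path(X, y, n_alphas=100)  # coefs: (p, n_alphas)
    support = np.flatnonzero(beta)
    for k in range(coefs.shape[1]):  # alphas decrease along the path
        active = np.flatnonzero(coefs[:, k])
        if np.setdiff1d(active, support).size > 0:  # first false selection
            return np.intersect1d(active, support).size
    return np.intersect1d(np.flatnonzero(coefs[:, -1]), support).size

p, s = 400, 10
beta_equal = np.zeros(p)
beta_equal[:s] = 5.0                              # minimal heterogeneity
beta_hetero = np.zeros(p)
beta_hetero[:s] = 5.0 * 0.5 ** np.arange(s)      # high heterogeneity
n_true_equal = true_discoveries_before_first_false(beta_equal)
n_true_hetero = true_discoveries_before_first_false(beta_hetero)
```

The abstract's result predicts that, under linear sparsity, the equal-magnitude configuration admits false selections earlier because same-magnitude variables "compete" to enter the model.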
Conformal prediction has received tremendous attention in recent years, with applications in both the health and social sciences. Recently, conformal inference has offered new solutions to problems in causal inference, leveraging advances in the modern discipline of semiparametric statistics to construct efficient tools for quantifying prediction uncertainty. In this paper, we consider the problem of obtaining distribution-free prediction regions that account for a shift in the distribution of the covariates between the training and test data. Under an explainable covariate shift assumption, namely that the conditional distribution of the outcome given the covariates is the same in the training and test samples, we propose three variants of a general framework for constructing well-calibrated prediction regions for the unobserved outcome in the test sample. Our approach is based on the efficient influence function for the quantile of the unobserved outcome in the test population, combined with an arbitrary machine learning prediction algorithm, without compromising asymptotic coverage. Next, we extend our approach to account for departures from the explainable covariate shift assumption through a semiparametric sensitivity analysis. In all cases, we establish that the resulting prediction sets are well-calibrated in the sense that they eventually attain nominal average coverage as the sample size increases. This guarantee is a consequence of the product-bias form of our proposal, which implies correct coverage if either the propensity score or the conditional distribution of the response is estimated sufficiently well. Our results also provide a framework for constructing doubly robust, well-calibrated prediction sets for individual treatment effects, both under an unconfoundedness condition and allowing for unmeasured confounding in the context of a sensitivity analysis.
Finally, we discuss aggregating prediction sets from different machine learning algorithms to obtain tighter prediction sets, and illustrate the performance of our methods on both synthetic and real data.
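As a simpler point of reference for the framework above, the standard likelihood-ratio-weighted split-conformal quantile can be sketched as follows. Here the covariate-shift weights w(x) = dP_test(x)/dP_train(x) are assumed known; estimating them, and achieving double robustness when either the weights or the conditional distribution of the response is misspecified, is precisely what the efficient-influence-function-based proposal addresses. Function names and settings are ours, for illustration only.

```python
# Weighted split-conformal quantile under known covariate-shift weights.
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha=0.1):
    """Weighted (1 - alpha) quantile of calibration conformity scores.

    scores:      nonconformity scores on the calibration set
    weights:     likelihood-ratio weights w(x_i) for calibration points
    test_weight: w(x) at the test covariate value
    """
    w = np.concatenate([weights, [test_weight]])
    w = w / w.sum()                       # normalize to a distribution
    s = np.concatenate([scores, [np.inf]])  # test point gets score +inf
    order = np.argsort(s)
    cum = np.cumsum(w[order])
    k = np.searchsorted(cum, 1 - alpha)   # smallest index covering 1 - alpha
    return s[order[k]]

# With uniform weights this reduces to ordinary split conformal: the
# prediction set is {y : score(x, y) <= q} for the returned quantile q.
scores = np.arange(1.0, 100.0)            # 99 calibration scores
q = weighted_conformal_quantile(scores, np.ones(99), 1.0, alpha=0.095)
```

With uniform weights and 99 calibration points, the returned threshold is the ceil(100 * 0.905) = 91st smallest score, matching the usual split-conformal rank.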