Abstract. We develop results for the use of Lasso and Post-Lasso methods to form first-stage predictions and estimate optimal instruments in linear instrumental variables (IV) models with many instruments, p. Our results apply even when p is much larger than the sample size, n. We show that the IV estimator based on using Lasso or Post-Lasso in the first stage is root-n consistent and asymptotically normal when the first-stage is approximately sparse; i.e.when the conditional expectation of the endogenous variables given the instruments can be wellapproximated by a relatively small set of variables whose identities may be unknown. We also show the estimator is semi-parametrically efficient when the structural error is homoscedastic.Notably our results allow for imperfect model selection, and do not rely upon the unrealistic "beta-min" conditions that are widely used to establish validity of inference following model selection. In simulation experiments, the Lasso-based IV estimator with a data-driven penalty performs well compared to recently advocated many-instrument-robust procedures. In an empirical example dealing with the effect of judicial eminent domain decisions on economic outcomes, the Lasso-based IV estimator outperforms an intuitive benchmark.Optimal instruments are conditional expectations. In developing the IV results, we establish a series of new results for Lasso and Post-Lasso estimators of nonparametric conditional expectation functions which are of independent theoretical and practical interest. We construct a modification of Lasso designed to deal with non-Gaussian, heteroscedastic disturbances which uses a data-weighted 1-penalty function. By innovatively using moderate deviation theory for self-normalized sums, we provide convergence rates for the resulting Lasso and Post-Lasso estimators that are as sharp as the corresponding rates in the homoscedastic Gaussian case under the condition that log p = o(n 1/3 ). We also provide a data-driven method for choosing the penalty level that must be specified in obtaining Lasso and Post-Lasso estimates and establish its asymptotic validity under non-Gaussian, heteroscedastic disturbances.
Abstract. We propose a pivotal method for estimating high-dimensional sparse linear regression models, where the overall number of regressors p is large, possibly much larger than n, but only s regressors are significant. The method is a modification of the lasso, called the square-root lasso. The method is pivotal in that it neither relies on the knowledge of the standard deviation σ or nor does it need to pre-estimate σ. Moreover, the method does not rely on normality or sub-Gaussianity of noise. It achieves near-oracle performance, attaining the convergence rate σ{(s/n) log p} 1/2 in the prediction norm, and thus matching the performance of the lasso with known σ. These performance results are valid for both Gaussian and non-Gaussian errors, under some mild moment restrictions. We formulate the square-root lasso as a solution to a convex conic programming problem, which allows us to implement the estimator using efficient algorithmic methods, such as interior-point and first-order methods.
We consider median regression and, more generally, a possibly infinite collection of quantile regressions in high-dimensional sparse models. In these models the number of regressors p is very large, possibly larger than the sample size n, but only at most s regressors have a non-zero impact on each conditional quantile of the response variable, where s grows more slowly than n. Since ordinary quan-tile regression is not consistent in this case, we consider ℓ1-penalized quantile regression (ℓ1-QR), which penalizes the ℓ1-norm of regression coefficients, as well as the post-penalized QR estimator (post-ℓ1-QR), which applies ordinary QR to the model selected by ℓ1-QR. First, we show that under general conditions ℓ1-QR is consistent at the near-oracle rate s/n log(p ∨ n), uniformly in the compact set U ⊂ (0, 1) of quantile indices. In deriving this result, we propose a partly pivotal, data-driven choice of the penalty level and show that it satisfies the requirements for achieving this rate. Second, we show that under similar conditions post-ℓ1-QR is consistent at the near-oracle rate s/n log(p ∨ n), uniformly over U, even if the ℓ1-QR-selected models miss some components of the true models, and the rate could be even closer to the oracle rate otherwise. Third, we characterize conditions under which ℓ1-QR contains the true model as a submodel, and derive bounds on the dimension of the selected model, uniformly over U; we also provide conditions under which hard-thresholding selects the minimal true model, uniformly over U.
Data with a large number of variables relative to the sample size—“high-dimensional data”—are readily available and increasingly common in empirical economics. Highdimensional data arise through a combination of two phenomena. First, the data may be inherently high dimensional in that many different characteristics per observation are available. For example, the US Census collects information on hundreds of individual characteristics and scanner datasets record transaction-level data for households across a wide range of products. Second, even when the number of available variables is relatively small, researchers rarely know the exact functional form with which the small number of variables enter the model of interest. Researchers are thus faced with a large set of potential variables formed by different ways of interacting and transforming the underlying variables. This paper provides an overview of how innovations in “data mining” can be adapted and modified to provide high-quality inference about model parameters. Note that we use the term “data mining” in a modern sense which denotes a principled search for “true” predictive power that guards against false discovery and overfitting, does not erroneously equate in-sample fit to out-of-sample predictive ability, and accurately accounts for using the same data to examine many different hypotheses or models.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.