Abstract. We study the problem of minimizing the expected loss of a linear predictor while constraining its sparsity, i.e., bounding the number of features used by the predictor. While the resulting optimization problem is generally NP-hard, we consider several approximation algorithms. We analyze the performance of these algorithms, focusing on characterizing the trade-off between the accuracy and the sparsity of the learned predictor in different scenarios.
Key words. sparsity, linear prediction
AMS subject classifications. 68T99, 68W40

DOI. 10.1137/090759574

1. Introduction. In statistical and machine learning applications, although many features might be available for use in a prediction task, it is often beneficial to use only a small subset of the available features. Predictors that use only a small subset of features require a smaller memory footprint and can be applied faster. Furthermore, in applications such as medical diagnostics, obtaining each possible "feature" (e.g., test result) can be costly, and so a predictor that uses only a small number of features is desirable, even at the cost of a small degradation in performance relative to a predictor that uses more features.

These applications lead to optimization problems with sparsity constraints. Focusing on linear prediction, it is generally NP-hard to find the best predictor subject to a sparsity constraint, i.e., a bound on the number of features used [7, 19]. In this paper we show that by compromising on prediction accuracy, one can compute sparse predictors efficiently. Our main goal is to understand the precise trade-off between accuracy and sparsity and how this trade-off depends on properties of the underlying optimization problem.

We now formally define our problem setting. A linear predictor is a mapping x → φ(⟨w, x⟩), where x ∈ X