Abstract. We study the problem of minimizing the expected loss of a linear predictor while constraining its sparsity, i.e., bounding the number of features used by the predictor. While the resulting optimization problem is generally NP-hard, we consider several approximation algorithms. We analyze the performance of these algorithms, focusing on characterizing the trade-off between the accuracy and the sparsity of the learned predictor in different scenarios.
Key words. sparsity, linear prediction
AMS subject classifications. 68T99, 68W40

DOI. 10.1137/090759574

1. Introduction. In statistical and machine learning applications, although many features might be available for use in a prediction task, it is often beneficial to use only a small subset of the available features. Predictors that use only a small subset of features require a smaller memory footprint and can be applied faster. Furthermore, in applications such as medical diagnostics, obtaining each possible "feature" (e.g., test result) can be costly, and so a predictor that uses only a small number of features is desirable, even at the cost of a small degradation in performance relative to a predictor that uses more features.

These applications lead to optimization problems with sparsity constraints. Focusing on linear prediction, it is generally NP-hard to find the best predictor subject to a sparsity constraint, i.e., a bound on the number of features used [7, 19]. In this paper we show that by compromising on prediction accuracy, one can compute sparse predictors efficiently. Our main goal is to understand the precise trade-off between accuracy and sparsity and how this trade-off depends on properties of the underlying optimization problem.

We now formally define our problem setting. A linear predictor is a mapping x → φ(⟨w, x⟩), where x ∈ X