We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank-Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. Our approach retains all favorable properties of conditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates as sparse convex combinations of a limited number of extreme points of P. The algorithm is lazy, making use of inexpensive inexact solutions of the linear programming subproblem that characterizes the conditional gradient approach. It decreases measures of optimality (primal and dual gaps) rapidly, both in the number of iterations and in wall-clock time, outperforming even the lazy conditional gradient algorithms of Braun et al. [2017]. We also present a streamlined version of the algorithm for the probability simplex.
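To make the "lazy" use of the linear programming subproblem concrete, the following is a minimal illustrative sketch, not the paper's implementation, of a weak-separation-style oracle that first scans a cache of previously discovered vertices and only calls the exact LP solver when no cached vertex makes enough progress. The names lp_solver, cache, and the accuracy target phi are assumptions for illustration, and points are assumed to be NumPy arrays.

```python
def lazy_lmo(gradient, x, cache, lp_solver, phi):
    """Return a vertex guaranteeing progress at least phi, or signal that
    the accuracy target phi should be halved (illustrative sketch only)."""
    # First scan previously discovered vertices: any cached v with
    # <gradient, x - v> >= phi is an acceptable, inexpensive answer.
    for v in cache:
        if gradient @ (x - v) >= phi:
            return v, phi
    # Only if no cached vertex qualifies, call the exact LP solver over P.
    v = lp_solver(gradient)          # argmin over vertices of P of <gradient, v>
    if gradient @ (x - v) >= phi:
        cache.append(v)
        return v, phi
    # No vertex improves by phi: certify this by halving the target.
    return None, phi / 2.0
```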
Contribution. Our contribution is summarized as follows:
Blended Conditional Gradients (BCG). The BCG approach blends different types of descent steps: Frank-Wolfe steps from Frank and Wolfe [1956], optionally lazified as in Braun et al. [2017], and gradient descent steps over the convex hull of the current active vertex set. It avoids projections and does not use away steps or pairwise steps, which are components of other popular CG variants. It achieves linear convergence for strongly convex functions (see Theorem 3.1) and O(1/t) convergence after t iterations for general smooth functions. While the linear convergence proof of the Away-step Frank-Wolfe Algorithm [Lacoste-Julien and Jaggi, 2015, Theorem 1, Footnote 4] requires the objective function f to be defined on the Minkowski sum P − P + P, BCG does not need f to be defined outside the polytope P. The algorithm has complexity comparable to the pairwise-step and away-step variants of conditional gradients, both in time measured as number of iterations and in space (size of the active set). It is affine-invariant and parameter-free; estimates of parameters such as smoothness, strong convexity, or the diameter of P are not required. It maintains iterates as (often sparse) convex combinations of vertices, typically much sparser than those of the baseline CG methods, a property that is important for some applications. This sparsity is due to the aggressive reuse of active vertices, with new vertices added only as a kind of last resort. Our computational results show that, both in wall-clock time and in per-iteration progress, BCG can be orders of magnitude faster than competing CG methods on some problems.
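As a rough illustration of how the two step types are blended, the following Python sketch compares the progress available over the current active set with the Frank-Wolfe gap. It is a simplification, not the algorithm of Section 3: it uses an exact linear-minimization oracle, no lazification, and no dual gap estimate, and lmo, simplex_descent_step, and line_search are caller-supplied placeholders rather than names from the paper.

```python
import numpy as np

def bcg_sketch(f, grad, lmo, simplex_descent_step, line_search, x0, iters=1000):
    """Illustrative blended loop: alternate between gradient-descent steps over
    the convex hull of the active vertices and Frank-Wolfe steps."""
    x = np.array(x0, dtype=float)
    active = [x.copy()]                                    # current active vertex set
    for _ in range(iters):
        g = grad(x)
        v_fw = lmo(g)                                      # global Frank-Wolfe vertex over P
        v_away = max(active, key=lambda v: float(g @ v))   # worst active vertex
        v_local = min(active, key=lambda v: float(g @ v))  # best active vertex
        fw_gap = float(g @ (x - v_fw))
        local_gap = float(g @ (v_away - v_local))
        if local_gap >= fw_gap:
            # Enough progress is available inside conv(active):
            # take a projection-free gradient step there.
            x, active = simplex_descent_step(f, grad, x, active)
        else:
            # Otherwise add a new vertex with a Frank-Wolfe step.
            gamma = line_search(f, x, v_fw - x)
            x = x + gamma * (v_fw - x)
            active.append(np.array(v_fw, dtype=float))
    return x
```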
Simplex Gradient Descent (SiGD). In Section 4, we describe a new projection-free gradient descent procedure for minimizing a smooth function over the probability simplex. This procedure can be used to implement the "simplex descent oracle" required by BCG, the module that performs the gradient descent steps; a minimal sketch of such a step is included at the end of this section.

Computational Experiments. We demonstrate the ex...
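Returning to the simplex descent oracle, the sketch below shows one projection-free descent step over the probability simplex. The key observation is that the direction d, the gradient minus its mean, sums to zero, so moving along -d keeps the coordinates summing to one, while capping the step size keeps them nonnegative; no projection is needed. The step-size rules of the SiGD procedure in Section 4 differ, so treat this as an illustration only; f, grad, and lam are assumed to be a callable objective, its gradient, and a NumPy array on the simplex.

```python
import numpy as np

def simplex_gradient_step(f, grad, lam):
    """One illustrative projection-free descent step over the probability simplex."""
    g = grad(lam)
    d = g - g.mean()                  # gradient component along the simplex (sums to zero)
    if np.allclose(d, 0.0):
        return lam                    # no in-simplex descent direction
    # Largest step size that keeps all coordinates nonnegative, so feasibility
    # is preserved without any projection.
    eta_max = float(np.min(lam[d > 0] / d[d > 0]))
    candidate = lam - eta_max * d
    if f(candidate) <= f(lam):
        return candidate              # full step: at least one coordinate drops to zero
    # Otherwise backtrack along the segment toward the candidate.
    eta = eta_max
    for _ in range(50):               # crude backtracking line search
        eta /= 2.0
        if f(lam - eta * d) <= f(lam):
            return lam - eta * d
    return lam                        # give up: keep the current point
```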