2020
DOI: 10.1007/s10107-020-01506-0
Stochastic quasi-gradient methods: variance reduction via Jacobian sketching

Abstract: We develop a new family of variance reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method, JacSketch, is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (…
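To make the Jacobian-sketching idea in the abstract concrete, here is a minimal sketch in which the "random linear measurement" is the simplest possible one: reading a single column of the true Jacobian, which yields a SAGA-style update as a special case. The interface (`grad_i`, step size, iteration count) is illustrative and not taken from the paper.

```python
import numpy as np

def jacsketch_saga(grad_i, x0, n, lr, n_iters, rng=None):
    """Minimal sketch of a JacSketch-style update with single-column
    (SAGA-like) sketches. `grad_i(x, i)` returns the gradient of the
    i-th function f_i at x; names and interface are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    d = x0.shape[0]
    J = np.zeros((d, n))          # running estimate of the Jacobian [grad f_1, ..., grad f_n]
    J_mean = J.mean(axis=1)       # (1/n) * J @ 1, kept up to date incrementally
    for _ in range(n_iters):
        i = int(rng.integers(n))               # "measurement": column i of the true Jacobian
        g_new = grad_i(x, i)
        g = g_new - J[:, i] + J_mean           # unbiased gradient estimate (SAGA form)
        J_mean = J_mean + (g_new - J[:, i]) / n
        J[:, i] = g_new                        # refresh the stored column
        x = x - lr * g
    return x
```

With `grad_i` computing, say, the gradient of the i-th least-squares loss, this loop reduces to plain SAGA; richer sketches would update several columns (or linear combinations of them) per iteration.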

Cited by 38 publications (58 citation statements)
References 24 publications
“…which depends on κ_mean := ((1/n) Σ_i L_i)/μ rather than on κ_max = (max_i L_i)/μ. This improved rate under non-uniform sampling has been shown for the basic VR methods SVRG (Xiao and Zhang, 2014), SDCA (Qu et al., 2015), and SAGA (Schmidt et al., 2015; Gower et al., 2018).…”
Section: Advanced Algorithms (mentioning)
confidence: 77%
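As a hedged illustration of the statement above, the snippet below compares κ_max with κ_mean for some made-up smoothness constants L_i and forms the L_i-proportional (importance) sampling probabilities that the cited non-uniform-sampling analyses rely on; the numbers are not from any of the cited papers.

```python
import numpy as np

# Hypothetical smoothness constants L_i and strong-convexity constant mu,
# chosen only to illustrate the gap between kappa_max and kappa_mean.
L = np.array([1.0, 2.0, 5.0, 100.0])
mu = 0.1

kappa_max = L.max() / mu      # governs rates under uniform sampling
kappa_mean = L.mean() / mu    # governs rates under L_i-proportional sampling

p = L / L.sum()               # importance-sampling probabilities p_i proportional to L_i
print(kappa_max, kappa_mean, p)
```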
“…is a mini-batch smoothness constant first defined in (Gower et al., 2018, 2019). This iteration complexity interpolates between the complexity of full-gradient descent, where L(n) = L, and that of VR methods, where L(1) = L_max.…”
Section: Advanced Algorithms (mentioning)
confidence: 99%
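For concreteness, one commonly quoted closed form of this mini-batch (expected) smoothness constant under τ-nice sampling is L(τ) = n(τ-1)/(τ(n-1)) · L + (n-τ)/(τ(n-1)) · L_max; it is reproduced below as an assumption, only to show the interpolation between L(n) = L and L(1) = L_max that the statement mentions.

```python
def L_batch(tau, n, L_full, L_max):
    """Expected smoothness constant under tau-nice sampling (assumed closed
    form, attributed to Gower et al. 2018/2019); interpolates between
    L_full at tau = n and L_max at tau = 1.  Requires n > 1."""
    return (n * (tau - 1)) / (tau * (n - 1)) * L_full + \
           (n - tau) / (tau * (n - 1)) * L_max

# e.g. with n = 100, L_full = 1.0, L_max = 50.0:
# L_batch(1, 100, 1.0, 50.0)   -> 50.0   (single-sample VR regime)
# L_batch(100, 100, 1.0, 50.0) -> 1.0    (full-gradient regime)
```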
“…The elements of A and y were sampled from the standard Gaussian distribution N(0, 1). Note that for ridge regression, L_i = (1/m)‖A(:, i)‖_2^2 + λ, where, following (Gower, Richtárik, and Bach 2018), we normalize the data such that ‖A(:, 1)‖_2 = 1 and ‖A(:, i)‖_2 = 1/m, i = 2, …”
Section: Ridge Regression On Synthetic Data (mentioning)
confidence: 99%
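A minimal sketch of that experimental setup, assuming the normalization constants as reconstructed above (unit norm for the first column, norm 1/m for the rest) and illustrative problem sizes and λ; none of these values are taken from the cited experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50        # illustrative sizes; the original experiment's sizes are not given here
lam = 0.1             # ridge parameter (value assumed)

# Standard Gaussian data, then column rescaling as described in the statement:
# column 1 has unit norm, the remaining columns have norm 1/m (as reconstructed).
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
A[:, 0] /= np.linalg.norm(A[:, 0])
for i in range(1, n):
    A[:, i] /= m * np.linalg.norm(A[:, i])

# Per-function smoothness constants L_i = (1/m) * ||A(:, i)||_2^2 + lambda;
# L_1 dominates the rest, which is what makes importance sampling pay off.
L = np.linalg.norm(A, axis=0) ** 2 / m + lam
```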
“…The SCSG (Stochastically Controlled Stochastic Gradient) algorithm computes the average gradient over a randomly selected subset of sample gradients and uses it in place of the full gradient; however, because the number of weight updates is chosen at random, the computation becomes more variable and tedious, and the computational cost is large [12]. Subsequently, a series of algorithms [13] based on the idea of variance reduction were developed, such as the novel Mini-Batch SCSG [14,15] and b-NICE SAGA [16,17]. However, machine learning also involves another structural risk minimization problem, composed of a "loss function + regularization term", where different forms of the regularization term lead to different composite problems, such as overlapping group lasso, graph-guided fused lasso, etc.…”
Section: Introduction (mentioning)
confidence: 99%
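The following is a rough, hedged sketch of an SCSG-style loop as described in the statement: a mini-batch "anchor" gradient, an SVRG-like correction, and a geometrically distributed (random) number of inner updates. Names, arguments, and sampling details are illustrative rather than a faithful reproduction of [12].

```python
import numpy as np

def scsg(grad_i, x0, n, batch_size, inner_prob, lr, n_epochs, rng=None):
    """Illustrative SCSG-style loop. `grad_i(x, i)` returns the gradient of
    the i-th sample loss at x; `inner_prob` controls the geometric number
    of inner steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_epochs):
        batch = rng.choice(n, size=batch_size, replace=False)
        x_anchor = x.copy()
        # Mini-batch "anchor" gradient used in place of the full gradient.
        g_anchor = np.mean([grad_i(x_anchor, j) for j in batch], axis=0)
        n_inner = int(rng.geometric(inner_prob))   # random number of inner updates
        for _ in range(n_inner):
            i = int(rng.integers(n))               # inner sample (uniform over the data)
            v = grad_i(x, i) - grad_i(x_anchor, i) + g_anchor   # SVRG-like correction
            x = x - lr * v
    return x
```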
“…[18], which are very complex for SGD-based theoretical approaches, whereas the ADMM algorithm applies to a wider range of models and its strong performance has proven it to be an effective optimization tool. Several variance reduction algorithms have been proposed in combination with ADMM, including SAG-ADMM [19], SDCA-ADMM [20], and SVRG-ADMM [21]. All three are improved algorithms built on the update strategy of ADMM.…”
Section: Introduction (mentioning)
confidence: 99%
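For reference, the update strategy those variants build on is the standard (scaled) ADMM iteration for a "loss + regularizer" split. The template below is the textbook form, not the cited SAG-/SDCA-/SVRG-ADMM methods; the proximal operators are assumed to be supplied by the user.

```python
import numpy as np

def admm(prox_f, prox_g, d, rho=1.0, n_iters=100):
    """Generic scaled-ADMM loop for min_x f(x) + g(x) via the split
    min f(x) + g(z) subject to x = z.  `prox_f(v, t)` and `prox_g(v, t)`
    are the proximal operators of f and g with step t (user-supplied)."""
    x = np.zeros(d)
    z = np.zeros(d)
    u = np.zeros(d)                     # scaled dual variable
    for _ in range(n_iters):
        x = prox_f(z - u, 1.0 / rho)    # x-update: argmin_x f(x) + (rho/2)||x - z + u||^2
        z = prox_g(x + u, 1.0 / rho)    # z-update: argmin_z g(z) + (rho/2)||x - z + u||^2
        u = u + x - z                   # dual ascent on the consensus constraint
    return x
```

The stochastic ADMM variants listed in the statement replace the exact x-update with a variance-reduced stochastic approximation, while keeping the same three-step structure.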