2020
DOI: 10.1007/s10107-020-01506-0
Stochastic quasi-gradient methods: variance reduction via Jacobian sketching

Abstract: We develop a new family of variance reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method, JacSketch, is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (…
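To make the Jacobian-sketching idea in the abstract concrete, here is a minimal sketch in which the "random linear measurement" is the simplest possible one: reading a single column of the true Jacobian, which yields a SAGA-style update as a special case. The interface (`grad_i`, step size, iteration count) is illustrative and not taken from the paper.

```python
import numpy as np

def jacsketch_saga(grad_i, x0, n, lr, n_iters, rng=None):
    """Minimal sketch of a JacSketch-style update with single-column
    (SAGA-like) sketches. `grad_i(x, i)` returns the gradient of the
    i-th function f_i at x; names and interface are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    d = x0.shape[0]
    J = np.zeros((d, n))          # running estimate of the Jacobian [grad f_1, ..., grad f_n]
    J_mean = J.mean(axis=1)       # (1/n) * J @ 1, kept up to date incrementally
    for _ in range(n_iters):
        i = int(rng.integers(n))               # "measurement": column i of the true Jacobian
        g_new = grad_i(x, i)
        g = g_new - J[:, i] + J_mean           # unbiased gradient estimate (SAGA form)
        J_mean = J_mean + (g_new - J[:, i]) / n
        J[:, i] = g_new                        # refresh the stored column
        x = x - lr * g
    return x
```

With `grad_i` computing, say, the gradient of the i-th least-squares loss, this loop reduces to plain SAGA; richer sketches would update several columns (or linear combinations of them) per iteration.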

Cited by 38 publications (58 citation statements)
References 24 publications
“…which depends on κ_mean := ((1/n) Σ_i L_i)/μ rather than on κ_max = (max_i L_i)/μ. This improved rate under non-uniform sampling has been shown for the basic VR methods SVRG (Xiao and Zhang, 2014), SDCA (Qu et al., 2015), and SAGA (Schmidt et al., 2015; Gower et al., 2018).…”
Section: Advanced Algorithms (mentioning)
confidence: 77%
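As a hedged illustration of the statement above, the snippet below compares κ_max with κ_mean for some made-up smoothness constants L_i and forms the L_i-proportional (importance) sampling probabilities that the cited non-uniform-sampling analyses rely on; the numbers are not from any of the cited papers.

```python
import numpy as np

# Hypothetical smoothness constants L_i and strong-convexity constant mu,
# chosen only to illustrate the gap between kappa_max and kappa_mean.
L = np.array([1.0, 2.0, 5.0, 100.0])
mu = 0.1

kappa_max = L.max() / mu      # governs rates under uniform sampling
kappa_mean = L.mean() / mu    # governs rates under L_i-proportional sampling

p = L / L.sum()               # importance-sampling probabilities p_i proportional to L_i
print(kappa_max, kappa_mean, p)
```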
“…is a mini-batch smoothness constant first defined in (Gower et al., 2018, 2019). This iteration complexity interpolates between the complexity of full-gradient descent, where L(n) = L, and that of VR methods, where L(1) = L_max.…”
Section: Advanced Algorithms (mentioning)
confidence: 99%
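For concreteness, one commonly quoted closed form of this mini-batch (expected) smoothness constant under τ-nice sampling is L(τ) = n(τ-1)/(τ(n-1)) · L + (n-τ)/(τ(n-1)) · L_max; it is reproduced below as an assumption, only to show the interpolation between L(n) = L and L(1) = L_max that the statement mentions.

```python
def L_batch(tau, n, L_full, L_max):
    """Expected smoothness constant under tau-nice sampling (assumed closed
    form, attributed to Gower et al. 2018/2019); interpolates between
    L_full at tau = n and L_max at tau = 1.  Requires n > 1."""
    return (n * (tau - 1)) / (tau * (n - 1)) * L_full + \
           (n - tau) / (tau * (n - 1)) * L_max

# e.g. with n = 100, L_full = 1.0, L_max = 50.0:
# L_batch(1, 100, 1.0, 50.0)   -> 50.0   (single-sample VR regime)
# L_batch(100, 100, 1.0, 50.0) -> 1.0    (full-gradient regime)
```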
“…The elements of A and y were sampled from the standard Gaussian distribution N(0, 1). Note that for ridge regression, L_i = (1/m)‖A(:, i)‖_2^2 + λ, where, following (Gower, Richtárik, and Bach 2018), we normalize the data such that ‖A(:, 1)‖_2 = 1 and ‖A(:, i)‖_2 = 1/m, i = 2, …”
Section: Ridge Regression On Synthetic Data (mentioning)
confidence: 99%
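A minimal sketch of that experimental setup, assuming the normalization constants as reconstructed above (unit norm for the first column, norm 1/m for the rest) and illustrative problem sizes and λ; none of these values are taken from the cited experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50        # illustrative sizes; the original experiment's sizes are not given here
lam = 0.1             # ridge parameter (value assumed)

# Standard Gaussian data, then column rescaling as described in the statement:
# column 1 has unit norm, the remaining columns have norm 1/m (as reconstructed).
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
A[:, 0] /= np.linalg.norm(A[:, 0])
for i in range(1, n):
    A[:, i] /= m * np.linalg.norm(A[:, i])

# Per-function smoothness constants L_i = (1/m) * ||A(:, i)||_2^2 + lambda;
# L_1 dominates the rest, which is what makes importance sampling pay off.
L = np.linalg.norm(A, axis=0) ** 2 / m + lam
```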
“…The SCSG (Stochastically Controlled Stochastic Gradient) algorithm computes the average gradient over a randomly selected subset of sample gradients and uses it in place of the full gradient; however, because the number of weight updates is chosen at random, the computation becomes more variable and tedious, and the computational cost is large [12]. Subsequently, a series of algorithms [13] based on the idea of variance reduction were developed, such as the novel Mini-Batch SCSG [14,15] and b-NICE SAGA [16,17]. However, machine learning also involves another structural risk minimization problem, composed of a "loss function + regularization term", where different forms of the regularization term lead to different composite problems, such as overlapping group lasso, graph-guided fused lasso, etc.…”
Section: Introduction (mentioning)
confidence: 99%
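The following is a rough, hedged sketch of an SCSG-style loop as described in the statement: a mini-batch "anchor" gradient, an SVRG-like correction, and a geometrically distributed (random) number of inner updates. Names, arguments, and sampling details are illustrative rather than a faithful reproduction of [12].

```python
import numpy as np

def scsg(grad_i, x0, n, batch_size, inner_prob, lr, n_epochs, rng=None):
    """Illustrative SCSG-style loop. `grad_i(x, i)` returns the gradient of
    the i-th sample loss at x; `inner_prob` controls the geometric number
    of inner steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_epochs):
        batch = rng.choice(n, size=batch_size, replace=False)
        x_anchor = x.copy()
        # Mini-batch "anchor" gradient used in place of the full gradient.
        g_anchor = np.mean([grad_i(x_anchor, j) for j in batch], axis=0)
        n_inner = int(rng.geometric(inner_prob))   # random number of inner updates
        for _ in range(n_inner):
            i = int(rng.integers(n))               # inner sample (uniform over the data)
            v = grad_i(x, i) - grad_i(x_anchor, i) + g_anchor   # SVRG-like correction
            x = x - lr * v
    return x
```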
“…[18], which are very complex for SGD-based theoretical approaches, whereas the ADMM algorithm applies to a wider range of models and its strong performance has proven it to be an effective optimization tool. Several variance reduction algorithms have been proposed in combination with ADMM, including SAG-ADMM [19], SDCA-ADMM [20], and SVRG-ADMM [21]. All three are improved algorithms built on the update strategy of ADMM.…”
Section: Introduction (mentioning)
confidence: 99%
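For reference, the update strategy those variants build on is the standard (scaled) ADMM iteration for a "loss + regularizer" split. The template below is the textbook form, not the cited SAG-/SDCA-/SVRG-ADMM methods; the proximal operators are assumed to be supplied by the user.

```python
import numpy as np

def admm(prox_f, prox_g, d, rho=1.0, n_iters=100):
    """Generic scaled-ADMM loop for min_x f(x) + g(x) via the split
    min f(x) + g(z) subject to x = z.  `prox_f(v, t)` and `prox_g(v, t)`
    are the proximal operators of f and g with step t (user-supplied)."""
    x = np.zeros(d)
    z = np.zeros(d)
    u = np.zeros(d)                     # scaled dual variable
    for _ in range(n_iters):
        x = prox_f(z - u, 1.0 / rho)    # x-update: argmin_x f(x) + (rho/2)||x - z + u||^2
        z = prox_g(x + u, 1.0 / rho)    # z-update: argmin_z g(z) + (rho/2)||x - z + u||^2
        u = u + x - z                   # dual ascent on the consensus constraint
    return x
```

The stochastic ADMM variants listed in the statement replace the exact x-update with a variance-reduced stochastic approximation, while keeping the same three-step structure.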