2020
DOI: 10.48550/arxiv.2006.02409
Preprint

On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs

Abstract: Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training deep neural networks. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to efficiently compute an approximate search direction, relies on the conjugate gradient …
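For intuition, here is a minimal sketch (not the authors' code) of one SGN-style step in JAX: a Gauss-Newton matrix-vector product built from one forward-mode and one reverse-mode product, handed to a few conjugate-gradient iterations. It assumes the mini-batch loss is a sum of squared residuals; residual_fn, params, and batch are hypothetical placeholders.

import jax
from jax.scipy.sparse.linalg import cg

def gauss_newton_mvp(residual_fn, params, batch, v):
    # (J^T J) v without forming J, where J is the Jacobian of the residuals w.r.t. params.
    r_fn = lambda p: residual_fn(p, batch)
    _, jv = jax.jvp(r_fn, (params,), (v,))          # forward mode: J v
    _, vjp_fn = jax.vjp(r_fn, params)
    (jtjv,) = vjp_fn(jv)                            # reverse mode: J^T (J v)
    return jtjv

def sgn_direction(residual_fn, params, batch, damping=1e-3, cg_iters=10):
    # Approximately solve (J^T J + damping * I) d = -g with a few CG iterations.
    r_fn = lambda p: residual_fn(p, batch)
    r, vjp_fn = jax.vjp(r_fn, params)
    (g,) = vjp_fn(r)                                # gradient of 0.5 * ||r||^2 w.r.t. params
    def damped_mvp(v):
        gv = gauss_newton_mvp(residual_fn, params, batch, v)
        return jax.tree_util.tree_map(lambda a, b: a + damping * b, gv, v)
    neg_g = jax.tree_util.tree_map(lambda x: -x, g)
    d, _ = cg(damped_mvp, neg_g, maxiter=cg_iters)
    return d

A parameter update would then be params + alpha * d for some step size alpha; the damping term plays the role of Levenberg-Marquardt regularization and keeps the CG system positive definite.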

Cited by 3 publications (5 citation statements) | References 4 publications

“…A possible reason is that when the residuals are big, doing more GN iterations may not lead to a better direction for minimizing (37). A similar observation has been made in [53] for training DNNs. It is experimentally shown that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from the mini-batch is not reliable due to non-representative batches and/or big residuals.…”
Section: Results of the Type II Model (mentioning)
confidence: 53%
“…It is experimentally shown that a higher number of CG iterations might not produce more accurate results if the Hessian obtained from the mini-batch is not reliable due to non-representative batches and/or big residuals. On the other hand, if the residuals are small, a higher number of CG iterations can produce more accurate results thanks to the curvature information [53].…”
Section: Results of the Type II Model (mentioning)
confidence: 99%
“…To alleviate this issue, the update direction should also be computed taking into account second-order information. Second-order methods are notably more robust to the step-size selection than first-order methods, since their update includes information on the local curvature (Agarwal et al., 2019; Gargiani et al., 2020). Noise annealing strategies.…”
Section: Conclusion, Limitations and Future Work (mentioning)
confidence: 99%
“…The above approximations are inspired by Gauss-Newton (GN) methods for nonlinear least-squares problems (see, e.g., [34]), where the Hessian matrix of the objective function $\sum_{i=1}^{p} (r_i - a_i)^2$ (in which each $r_i$ is a scalar function and $a_i$ a scalar) is approximated by $\sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}$, and also from the fact that the empirical risk of misclassification in ML is often a sum of non-negative terms matching a function to a scalar, which can then be considered in a least-squares fashion [3,15]. The resulting approximate adjoint equation $(\nabla_y f\, \nabla_y f)\, \lambda = -\nabla_y f\, u$ is most likely infeasible, and we suggest solving it in the least-squares sense.…”
Section: Contributions of the Paper (mentioning)
confidence: 99%
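For context, the standard Gauss-Newton reasoning behind this approximation (a worked expansion, not quoted from the cited paper) is:

\nabla^2 \sum_{i=1}^{p} (r_i - a_i)^2
  \;=\; 2 \sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}
  \;+\; 2 \sum_{i=1}^{p} (r_i - a_i)\, \nabla^2 r_i
  \;\approx\; 2 \sum_{i=1}^{p} \nabla r_i \nabla r_i^{\top}.

The dropped second term is weighted by the residuals $r_i - a_i$ (the constant factor of 2 is typically absorbed into the scaling of the objective), which is consistent with the observation in the statements above that extra CG iterations on the Gauss-Newton system pay off mainly in the small-residual regime.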