2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)
DOI: 10.1109/icmla.2018.00081

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Cited by 19 publications (14 citation statements). References 17 publications.
“…to (52), we obtain the desired result in (44). For a given constant ε satisfying 0 < ε < 1, the iteration number needed to guarantee…”
Section: Convergence Results (mentioning)
confidence: 64%
“…Stochastic optimization algorithms have been extensively studied over decades and can be traced back to the epochal work [22]; they have been widely employed in different areas, e.g., machine learning [23]-[25], [52], [53], power systems [51], wireless communication [5]-[7], and bioinformatics [50]. In particular, the classical stochastic approximation (SA) of the exact gradient, also known as stochastic gradient descent (SGD), has been widely applied to these stochastic optimization problems, where the gradient information is employed in finding the search direction.…”
Section: Introduction (mentioning)
confidence: 99%
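
The statement above frames classical SGD as a stochastic approximation (SA) of the exact gradient, with the sampled gradient supplying the search direction. A minimal NumPy sketch of that idea, for orientation only; the gradient oracle loss_grad and the array data are hypothetical stand-ins, not anything defined in the cited works:

    import numpy as np

    def sgd(w, loss_grad, data, lr=0.01, batch_size=32, steps=1000, seed=0):
        """Plain SGD: step along the negative gradient of a sampled mini-batch."""
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            idx = rng.choice(len(data), size=batch_size, replace=False)
            g = loss_grad(w, data[idx])  # stochastic approximation of the full gradient
            w = w - lr * g               # sampled gradient supplies the search direction
        return w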
“…Since this can potentially double the iteration complexity, an overlap batching strategy was proposed in [3] to reduce the computational cost, and was also tested in [4]. This strategy was further applied in [17,39]. Other stochastic quasi-Newton methods have been considered that employ a progressive batching approach in which the sample size is increased as the iteration progresses, see e.g.…”
Section: Literature Review (mentioning)
confidence: 99%
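
The overlap batching strategy cited above ([3], tested in [4]) avoids a second full gradient pass by forming the quasi-Newton curvature pair from gradients evaluated only on the samples shared by two consecutive mini-batches. A hedged sketch of that idea, assuming a hypothetical mini-batch gradient oracle grad and index-array batches:

    import numpy as np

    def overlap_curvature_pair(grad, w_old, w_new, batch_old, batch_new, data):
        """Form (s, y) for L-BFGS using only the overlap of two consecutive batches."""
        overlap = np.intersect1d(batch_old, batch_new)   # shared sample indices
        s = w_new - w_old
        # Gradient difference evaluated on the shared samples only, so neither
        # full mini-batch needs a second gradient evaluation.
        y = grad(w_new, data[overlap]) - grad(w_old, data[overlap])
        return s, y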
“…An initial approximation of the Hessian is obtained by solving an eigenvalue problem as proposed in Reference [65].…”
Section: Convolutional Network (mentioning)
confidence: 99%
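
Reference [65] in that statement is the paper tracked by this report, which replaces the common L-BFGS scaling gamma = (y'y)/(s'y) for the initial matrix B0 = gamma*I with a value computed from an eigenvalue problem. The sketch below is only a guess at the shape of such a computation: it assumes the small generalized eigenvalue problem (L + D + L^T)u = lambda (S^T S)u built from the curvature-pair matrices S and Y, and a safeguard below its smallest eigenvalue. Both details are assumptions of this sketch, not the paper's stated rule.

    import numpy as np
    from scipy.linalg import eigh

    def init_gamma(S, Y, shrink=0.9, floor=1e-8):
        """Initial L-BFGS scaling gamma for B0 = gamma * I, computed from a
        small generalized eigenvalue problem (assumed form, see lead-in)."""
        STY = S.T @ Y                    # m x m for memory size m, cheap when m << n
        L = np.tril(STY, k=-1)           # strictly lower-triangular part of S^T Y
        D = np.diag(np.diag(STY))        # diagonal part of S^T Y
        A = L + D + L.T                  # symmetric matrix built from curvature pairs
        lam = eigh(A, S.T @ S, eigvals_only=True)   # generalized eigenvalues
        # Hypothetical safeguard: keep gamma positive and below the smallest
        # eigenvalue so the initial matrix does not overestimate curvature.
        return max(shrink * lam.min(), floor)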