2019
DOI: 10.48550/arxiv.1909.06893
Preprint
Empirical study towards understanding line search approximations for training neural networks

Abstract: Choosing appropriate step sizes is critical for reducing the computational cost of training large-scale neural network models. Mini-batch sub-sampling (MBSS) is often employed for computational tractability. However, MBSS introduces a sampling error that can manifest as a bias or a variance in a line search, because MBSS can be performed either statically, where the mini-batch is updated only when the search direction changes, or dynamically, where the mini-batch is updated every time the function is evaluated.…
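For a concrete picture of the two sub-sampling modes described in the abstract, the sketch below evaluates losses along a fixed search direction under static and dynamic MBSS. It uses a toy 1-D least-squares problem; the function names, batch size, and step grid are illustrative assumptions rather than code from the paper.

```python
import numpy as np

# Toy 1-D least-squares problem: loss(w, batch) = mean((x*w - y)^2) over the batch.
# The problem, names, and batch size are illustrative assumptions, not the paper's code.
def toy_loss(w, batch):
    x, y = batch
    return np.mean((x * w - y) ** 2)

def sample_batch(x, y, batch_size, rng):
    idx = rng.choice(len(x), size=batch_size, replace=False)
    return x[idx], y[idx]

def line_search_losses(w, direction, steps, x, y, batch_size, loss_fn, mode, rng):
    """Evaluate the mini-batch loss along 'direction' at each trial step size.

    mode == "static":  draw one mini-batch per search direction and reuse it
                       for every evaluation (smooth but biased estimate).
    mode == "dynamic": draw a fresh mini-batch for every evaluation
                       (less biased on average, but noisy between evaluations).
    """
    batch = sample_batch(x, y, batch_size, rng)  # reused only in static mode
    losses = []
    for step in steps:
        if mode == "dynamic":
            batch = sample_batch(x, y, batch_size, rng)
        losses.append(loss_fn(w + step * direction, batch))
    return losses

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.1 * rng.normal(size=1000)

steps = np.linspace(0.0, 2.0, 11)
static_losses = line_search_losses(0.0, 1.0, steps, x, y, 64, toy_loss, "static", rng)
dynamic_losses = line_search_losses(0.0, 1.0, steps, x, y, 64, toy_loss, "dynamic", rng)
```

Under static MBSS the sampled loss curve is smooth in the step size but biased toward the fixed mini-batch, whereas under dynamic MBSS successive evaluations are drawn from different mini-batches and therefore fluctuate from point to point.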

Cited by 1 publication (1 citation statement)
References 33 publications
“…GOLS-I has also been demonstrated to outperform probabilistic line searches (Mahsereci and Hennig, 2017), provided mini-batch sizes are not too small (< 50 for investigated problems) (Kafka and Wilke, 2019). The gradient-only optimization paradigm has recently also shown promise in the construction of approximation models to conduct line searches (Chae and Wilke, 2019). Some of the most important factors governing the nature of the computed gradients are: 1) The neural network architecture, 2) the activation functions (AFs) used within the architecture, 3) the loss function implemented, and 4) the mini-batch size used to evaluate the loss, to name a few.…”
Section: Introduction (mentioning)
confidence: 99%
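The quoted passage refers to gradient-only line searches such as GOLS-I, which locate a step size where the directional derivative along the descent direction changes sign, rather than minimising a noisy loss value. The bracketing-and-bisection scheme below is a minimal sketch of that sign-change idea under dynamic mini-batch sampling; the helper names, growth factor, tolerances, and toy problem are assumptions for illustration and not the published GOLS-I algorithm.

```python
import numpy as np

def directional_derivative(grad_fn, w, direction, step, batch):
    """Directional derivative g(w + step*d)^T d evaluated on one mini-batch."""
    return np.dot(grad_fn(w + step * direction, batch), direction)

def sign_change_line_search(grad_fn, w, direction, batch_sampler,
                            step=1.0, grow=2.0, max_iter=30, tol=1e-8):
    """Approximate a step size where the directional derivative changes sign.

    The upper bound is grown while the derivative stays negative (still
    descending), then the bracket is bisected once a non-negative derivative
    is found. Each evaluation draws a fresh mini-batch (dynamic MBSS).
    """
    lo, hi = 0.0, step
    # Grow the upper bound until the directional derivative becomes non-negative.
    for _ in range(max_iter):
        d_hi = directional_derivative(grad_fn, w, direction, hi, batch_sampler())
        if d_hi >= 0.0:
            break
        lo, hi = hi, hi * grow
    # Bisect between lo (negative derivative) and hi (non-negative derivative).
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        d_mid = directional_derivative(grad_fn, w, direction, mid, batch_sampler())
        if d_mid < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy usage on a 1-D least-squares problem (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.1 * rng.normal(size=1000)

def toy_grad(w, batch):
    xb, yb = batch
    return np.array([np.mean(2.0 * xb * (xb * w[0] - yb))])

def sample_batch(size=64):
    idx = rng.choice(len(x), size=size, replace=False)
    return x[idx], y[idx]

alpha = sign_change_line_search(toy_grad, np.array([0.0]), np.array([1.0]), sample_batch)
```

Because each derivative evaluation uses its own mini-batch, the recovered step size fluctuates around the sign-change point of the underlying full-batch problem, which is the behaviour gradient-only line searches are designed to tolerate.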