2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2014.6853582

Mean-normalized stochastic gradient for large-scale deep learning

Abstract: Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a facto…
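The abstract states the central idea: a non-zero mean of the features hurts optimization, so updates are applied in a mean-normalized parametrization. The sketch below is a minimal NumPy illustration of how a layer could center its inputs with a running mean before a plain SGD step; the class name, momentum constant, and update rule are assumptions made for illustration and do not reproduce the paper's actual second-order algorithm.

```python
import numpy as np

class MeanNormalizedLinear:
    """Linear layer that keeps a running mean of its inputs and applies SGD
    in the mean-centered parametrization. Illustrative sketch only; the
    paper's second-order update is not reproduced here."""

    def __init__(self, n_in, n_out, momentum=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.mu = np.zeros(n_in)          # running mean of layer inputs
        self.momentum = momentum

    def forward(self, x):
        # Track the feature mean so the weights always see centered inputs.
        self.mu = self.momentum * self.mu + (1.0 - self.momentum) * x.mean(axis=0)
        self.x_centered = x - self.mu
        return self.x_centered @ self.W + self.b

    def backward(self, grad_out, lr=0.1):
        # Plain SGD step on the mean-normalized parameters.
        grad_W = self.x_centered.T @ grad_out / grad_out.shape[0]
        grad_b = grad_out.mean(axis=0)
        self.W -= lr * grad_W
        self.b -= lr * grad_b
        return grad_out @ self.W.T        # gradient w.r.t. the layer input
```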

Cited by 58 publications (58 citation statements)
References 13 publications

“…Neural networks with multiple linear bottlenecks cannot be trained well with SGD. In our previous work [15], we derived a stochastic algorithm called mean-normalized stochastic gradient descent (MN-SGD) and showed that it is capable of optimizing such bottleneck networks from scratch. For sequence training, the advantages of stochastic optimization are less compelling. First, because of their frequent model updates, the lattices deviate quickly from the search space of the current model.…”
Section: Optimization (mentioning)
confidence: 99%
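For readers unfamiliar with the term, a "linear bottleneck" in the excerpt above means a low-rank factorization of a weight matrix with no nonlinearity between the two factors; stacking several of these is what is reported as hard to train with plain SGD. A small illustrative sketch, with all dimensions made up:

```python
import numpy as np

def linear_bottleneck(n_in, n_out, rank, seed=0):
    """Replace one n_in x n_out weight matrix by two linear maps through a
    low-rank bottleneck of size `rank`, with no nonlinearity in between."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, rank))
    V = rng.normal(scale=1.0 / np.sqrt(rank), size=(rank, n_out))
    return U, V

# Example: a 2000 x 2000 layer (~4M weights) replaced by a rank-256
# bottleneck (~1M weights); dimensions are hypothetical.
U, V = linear_bottleneck(2000, 2000, 256)
x = np.random.default_rng(1).normal(size=(8, 2000))
y = (x @ U) @ V   # same map as x @ (U @ V), but with far fewer parameters
```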
“…In this paper, we present experiments on an English broadcast conversations recognition task with our implementation of sequence training in the RASR toolkit [15]. We compare different training criteria and optimization algorithms.…”
Section: Introduction (mentioning)
confidence: 99%
“…This simplifies the implementation of alternative optimization algorithms, which is currently an active research area for neural networks [13,14]. Currently, the supported estimators include basic SGD, SGD with momentum, and a new stochastic second-order algorithm developed in our group [15]. In addition, batch estimation with gradient descent and Rprop [16] is possible.…”
Section: Training (mentioning)
confidence: 99%
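The estimators named in the excerpt (plain SGD, SGD with momentum, Rprop) have standard textbook update rules; the NumPy sketch below shows the momentum and Rprop steps as a reference for what "estimator" means here. The hyperparameter values are conventional defaults, not RASR's, and the code is not drawn from the toolkit.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """One textbook SGD-with-momentum update."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One textbook Rprop update: per-parameter step sizes grown while the
    gradient sign is stable and shrunk when it flips (Riedmiller & Braun)."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)   # skip update where sign flipped
    w = w - np.sign(grad) * step
    return w, grad, step
```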
“…RASR also supports a power schedule that decays with every mini-batch and has been reported to perform slightly better than Newbob [18]. Mostly, we use a modification of Newbob which decays the learning rate less aggressively than the original Newbob [15].…”
Section: Training (mentioning)
confidence: 99%
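As a rough illustration of the two schedule families contrasted in the excerpt, the sketch below shows a generic per-mini-batch power decay and a Newbob-style validation-driven decay. The function names and constants are assumptions for illustration; the exact parametrizations used in RASR may differ.

```python
def power_schedule(lr0, t, decay=1e-6, power=1.0):
    """Per-mini-batch power decay: lr_t = lr0 / (1 + decay * t) ** power."""
    return lr0 / (1.0 + decay * t) ** power

def newbob_schedule(lr, prev_val_error, curr_val_error,
                    factor=0.5, threshold=0.005):
    """Newbob-style annealing: shrink the learning rate once the validation
    improvement drops below a threshold. A less aggressive variant, as in
    the excerpt, would use a milder `factor` (e.g. 0.8 instead of 0.5)."""
    if prev_val_error - curr_val_error < threshold:
        lr *= factor
    return lr
```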
“…The initial technique used for batch normalisation [28] was based on modifications of the inputs at each layer to achieve a constant distribution of features. This modification can be perceived as either creating a dependency on the activation values for the optimiser or a direct change in the overall network structure [30,31].…”
Section: Normalisation (mentioning)
confidence: 99%
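The batch-normalization transform the excerpt refers to is the standard one: normalize each feature over the mini-batch, then apply a learnable scale and shift. A minimal training-time sketch is shown below; inference-time running statistics are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization as in Ioffe & Szegedy (cited as [28]):
    normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift
```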