Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach

Karakida, Ryo; Akaho, Shotaro; Amari, Shun-ichi

doi:10.48550/arxiv.1806.01316

Cited by 21 publications

(26 citation statements)

References 25 publications

“…where $q^{(k)} = W^{(1)} x^{(k)} / \sqrt{d}$ and $D_1^{(k)} = \mathrm{diag}(\sigma(q^{(k)}))$. Our work is then to evaluate the asymptotics of the right side of (25). Toward this end, we first show that we can replace $Q(w^{(2)})$ in the latter by the simpler matrix…”

Section: Proof Of Theorem 310 (mentioning)

confidence: 99%

Lower Bounds on the Generalization Error of Nonlinear Learning Models

Seroussi, Zeitouni

2021

Preprint

We study in this paper lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of samples in the training data. We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime. We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks. In the linear case the bound is asymptotically tight. In the nonlinear case, we provide a comparison of our bounds with an empirical study of the stochastic gradient descent algorithm. The analysis uses elements from the theory of large random matrices.

“…The spectrum of the Fisher information matrix at initialization for one hidden layer is calculated in [40]. The Fisher matrix for deep neural network in the mean field limit is studied in [25].…”

Section: Related Literature (mentioning)

confidence: 99%

Lower Bounds on the Generalization Error of Nonlinear Learning Models

Seroussi, Zeitouni

2021

Preprint

“…Gaussian weights and biases. In this section, we provide background information and briefly recall the formalism of Karakida et al (2018) which first computes spectral properties of the Fisher Information of a neural network and then relates it to the maximal stable learning rate.…”

Section: Preliminaries (mentioning)

confidence: 99%

“…This assumption holds true for a large class of losses including squared loss and cross-entropy loss. Let $I_\theta$ denote the Fisher Information Matrix (FIM) associated with the parametric family induced by the loss, If $\theta$ is initialized in a sufficiently small neighborhood of $\theta^*$, then by expanding the population loss $L(\theta)$ to quadratic order about $\theta^*$ one can show that a necessary condition for convergence is that the step size is bounded from above by (LeCun et al, 2012; Karakida et al, 2018)…”

Section: Fisher Information Matrix and Learning Dynamics (mentioning)

confidence: 99%

Mean-field Analysis of Batch Normalization

Wei, Stokes, Schwab

2019

Preprint

Batch Normalization (BatchNorm) is an extremely useful component of modern neural network architectures, enabling optimization using higher learning rates and achieving faster convergence. In this paper, we use mean-field theory to analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers. We show that it has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These findings are then used to justify the use of larger learning rates for networks that use BatchNorm, and we provide quantitative characterization of the maximal allowable learning rate to ensure convergence. Experiments support our theoretically predicted maximum learning rate, and furthermore suggest that networks with smaller values of the BatchNorm parameter γ achieve lower loss after the same number of epochs of training.
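
The connection the abstract above draws between the largest curvature eigenvalue and the maximal usable learning rate can be illustrated on a toy problem. The sketch below is not taken from any of the cited papers: it assumes a purely quadratic loss L(θ) = ½ Σ cᵢ θᵢ², written in the eigenbasis of its curvature matrix, for which plain gradient descent is known to converge only when the step size is below 2/λ_max. The function name `gradient_descent_quadratic` and the curvature values are hypothetical choices made for illustration.

```python
def gradient_descent_quadratic(curvatures, step_size, steps=200):
    """Run gradient descent on L(theta) = 0.5 * sum_i c_i * theta_i**2.

    `curvatures` stands in for the eigenvalues of the curvature matrix
    (illustrative assumption: the loss is exactly quadratic).
    Returns the Euclidean distance from the minimum at theta = 0.
    """
    theta = [1.0 for _ in curvatures]
    for _ in range(steps):
        # The gradient of the quadratic along each eigen-direction is c_i * theta_i.
        theta = [t - step_size * c * t for t, c in zip(theta, curvatures)]
    return sum(t * t for t in theta) ** 0.5


# Hypothetical curvature spectrum; only the largest value sets the stability limit.
curvatures = [0.1, 1.0, 4.0]
lambda_max = max(curvatures)
critical_step = 2.0 / lambda_max

# Slightly below the critical step size the iterates shrink toward zero;
# slightly above it the steepest direction oscillates with growing amplitude.
stable = gradient_descent_quadratic(curvatures, 0.9 * critical_step)
unstable = gradient_descent_quadratic(curvatures, 1.1 * critical_step)
```

In this toy setting, lowering `lambda_max` (a flatter landscape) raises `critical_step`, which is the qualitative effect the abstract attributes to BatchNorm; the quantitative bounds in the cited work are derived for real networks and are not reproduced here.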

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Osawa, Tsuji, Ueno, et al.

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or by some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took only 978 iterations.
