2020
DOI: 10.48550/arxiv.2001.05992
Preprint

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proo…

Cited by 24 publications (18 citation statements)
References 16 publications
“…The evidence of the study suggests that the empirical Fisher information matrix is of order L/N. This observation is consistent with a depth-dependent learning rate, which is empirically observed in [19] and required for the convergence of training in deep linear networks [27].…”
Section: Discussion (supporting)
confidence: 90%
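As a rough illustration of the depth-dependent learning rate mentioned in the statement above, the sketch below scales the step size of a plain deep linear network by 1/L. The 1/L rule, the network sizes, and the SGD setup are assumptions made for this sketch, not the exact prescription of the cited works.

```python
import torch
import torch.nn as nn

# Illustrative only: a deep linear network of depth L trained with a learning
# rate scaled as 1/L, reflecting the depth-dependent step size associated with
# stable convergence in deep linear networks. The 1/L scaling is an assumption.
L, d = 16, 64                      # depth and width (arbitrary values)
net = nn.Sequential(*[nn.Linear(d, d, bias=False) for _ in range(L)])

base_lr = 1e-2
lr = base_lr / L                   # depth-dependent learning rate
opt = torch.optim.SGD(net.parameters(), lr=lr)

x = torch.randn(128, d)
y = torch.randn(128, d)
loss = ((net(x) - y) ** 2).mean()  # simple regression objective
loss.backward()
opt.step()
```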
“…We use the Adam optimizer [69] with learning rate η = 0.00001 and batch size 128. The parameters of the encoder and generator networks are initialized with orthogonal initialization [70,71]. We set the hyperparameter λ = 0.01 and sample β from a uniform distribution, with a different random value for each training example.…”
Section: Implementation Details (mentioning)
confidence: 99%
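A minimal sketch of the training configuration described in this statement, assuming PyTorch and placeholder encoder/generator architectures: orthogonal initialization of the weights, Adam with learning rate 1e-5, batch size 128, λ = 0.01, and one uniformly sampled β per example. The module shapes and the way λ and β enter the loss are not specified in the quote and are left as placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the cited setup; the Encoder/Generator architectures below are
# placeholders, not the cited paper's models.
def init_orthogonal(module):
    # Apply orthogonal initialization to weight matrices, zero the biases.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
encoder.apply(init_orthogonal)
generator.apply(init_orthogonal)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(generator.parameters()), lr=1e-5
)

lam = 0.01                          # hyperparameter lambda from the quote
batch_size = 128
beta = torch.rand(batch_size)       # one uniform beta per training example
```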
“…Low-rank deep networks reduce parameter counts (thus saving memory) as well as the number of ops required for matrix-vector multiplication: (d + m) · r vs. d · m. Khodak et al. [2021] demonstrate that if one pays attention to proper initialization and regularization, low-rank methods outperform sparse pruning approaches in many domains, contrary to existing beliefs that sparse methods outperform low-rank methods in parameter count savings. In particular, a low-rank initialization scheme called spectral initialization is crucial to achieving better performance; initialization schemes are in general quite important for achieving good performance in neural network training [Bachlechner et al., 2020, Choromanski et al., 2018, Dauphin and Schoenholz, 2019, Hu et al., 2020, Huang et al., 2020, Mishkin and Matas, 2015, Pennington et al., 2017, Xiao et al., 2018, Zhang et al., 2021]. Spectral initialization samples a full-rank matrix W ∈ R^{d×m} from a known init distribution, factorizes W as AΣ^{1/2}, Σ^{1/2}B via singular value decomposition (SVD), and initializes U and V with these factors.…”
Section: Low-rank Factorized Network (mentioning)
confidence: 99%
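A small sketch of spectral initialization as described in the quote, assuming PyTorch: sample a full-rank W, take its SVD, and initialize the low-rank factors U and V from AΣ^{1/2} and Σ^{1/2}B restricted to the top r singular directions. The dimensions and the Kaiming-style sampling of W are assumptions for illustration.

```python
import torch

# Spectral initialization sketch for a rank-r factorization W ≈ U @ V:
# sample a full-rank W from a standard init, take its SVD, and split the
# top-r singular values evenly between the two factors.
d, m, r = 512, 256, 32
W = torch.randn(d, m) * (2.0 / d) ** 0.5      # full-rank sample (assumed init)
A, S, Bt = torch.linalg.svd(W, full_matrices=False)

sqrt_S = S[:r].sqrt()
U = A[:, :r] * sqrt_S                         # U = A_r Σ_r^{1/2}, shape (d, r)
V = sqrt_S[:, None] * Bt[:r, :]               # V = Σ_r^{1/2} B_r, shape (r, m)

# U @ V is the best rank-r approximation of W in Frobenius norm.
approx_err = torch.linalg.norm(W - U @ V) / torch.linalg.norm(W)
```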