2020
DOI: 10.48550/arxiv.2001.05992
Preprint

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proo…

Cited by 24 publications (18 citation statements)
References 16 publications
“…The evidence of the study suggests that the empirical Fisher information matrix is of order L/N. This observation is consistent with a depth-dependent learning rate, which is empirically observed in [19] and required for the convergence of training in deep linear networks [27].…”
Section: Discussion (supporting)
confidence: 90%
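As a rough illustration of the depth-dependent learning rate mentioned in the statement above, the sketch below scales the step size of a plain deep linear network by 1/L. The 1/L rule, the network sizes, and the SGD setup are assumptions made for this sketch, not the exact prescription of the cited works.

```python
import torch
import torch.nn as nn

# Illustrative only: a deep linear network of depth L trained with a learning
# rate scaled as 1/L, reflecting the depth-dependent step size associated with
# stable convergence in deep linear networks. The 1/L scaling is an assumption.
L, d = 16, 64                      # depth and width (arbitrary values)
net = nn.Sequential(*[nn.Linear(d, d, bias=False) for _ in range(L)])

base_lr = 1e-2
lr = base_lr / L                   # depth-dependent learning rate
opt = torch.optim.SGD(net.parameters(), lr=lr)

x = torch.randn(128, d)
y = torch.randn(128, d)
loss = ((net(x) - y) ** 2).mean()  # simple regression objective
loss.backward()
opt.step()
```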
“…We use the Adam optimizer [69] with learning rate η = 0.00001 and batch size 128. The parameters of the encoder and generator networks are initialized with orthogonal initialization [70,71]. We set the hyperparameter λ = 0.01 and sample β from a uniform distribution, with a different random value for each training example.…”
Section: Implementation Details (mentioning)
confidence: 99%
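A minimal sketch of the training configuration described in this statement, assuming PyTorch and placeholder encoder/generator architectures: orthogonal initialization of the weights, Adam with learning rate 1e-5, batch size 128, λ = 0.01, and one uniformly sampled β per example. The module shapes and the way λ and β enter the loss are not specified in the quote and are left as placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the cited setup; the Encoder/Generator architectures below are
# placeholders, not the cited paper's models.
def init_orthogonal(module):
    # Apply orthogonal initialization to weight matrices, zero the biases.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
encoder.apply(init_orthogonal)
generator.apply(init_orthogonal)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(generator.parameters()), lr=1e-5
)

lam = 0.01                          # hyperparameter lambda from the quote
batch_size = 128
beta = torch.rand(batch_size)       # one uniform beta per training example
```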
“…Low-rank deep networks reduce parameter counts (thus saving memory) as well as the number of ops required for matrix-vector multiplication: (d + m) · r vs. d · m. Khodak et al. [2021] demonstrate that if one pays attention to proper initialization and regularization, low-rank methods outperform sparse pruning approaches in many domains, contrary to existing beliefs that sparse methods outperform low-rank methods in parameter count savings. In particular, a low-rank initialization scheme called spectral initialization is crucial to achieving better performance; initialization schemes are in general quite important for achieving good performance in neural network training [Bachlechner et al., 2020, Choromanski et al., 2018, Dauphin and Schoenholz, 2019, Hu et al., 2020, Huang et al., 2020, Mishkin and Matas, 2015, Pennington et al., 2017, Xiao et al., 2018, Zhang et al., 2021]. Spectral initialization samples a full-rank matrix W ∈ R^{d×m} from a known init distribution, factorizes W as AΣ^{1/2}, Σ^{1/2}B via singular value decomposition (SVD), and initializes U and V with these factors.…”
Section: Low-rank Factorized Network (mentioning)
confidence: 99%
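A small sketch of spectral initialization as described in the quote, assuming PyTorch: sample a full-rank W, take its SVD, and initialize the low-rank factors U and V from AΣ^{1/2} and Σ^{1/2}B restricted to the top r singular directions. The dimensions and the Kaiming-style sampling of W are assumptions for illustration.

```python
import torch

# Spectral initialization sketch for a rank-r factorization W ≈ U @ V:
# sample a full-rank W from a standard init, take its SVD, and split the
# top-r singular values evenly between the two factors.
d, m, r = 512, 256, 32
W = torch.randn(d, m) * (2.0 / d) ** 0.5      # full-rank sample (assumed init)
A, S, Bt = torch.linalg.svd(W, full_matrices=False)

sqrt_S = S[:r].sqrt()
U = A[:, :r] * sqrt_S                         # U = A_r Σ_r^{1/2}, shape (d, r)
V = sqrt_S[:, None] * Bt[:r, :]               # V = Σ_r^{1/2} B_r, shape (r, m)

# U @ V is the best rank-r approximation of W in Frobenius norm.
approx_err = torch.linalg.norm(W - U @ V) / torch.linalg.norm(W)
```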