“…Low-rank deep networks reduce parameter counts (thus saving memory) as well as the number of operations required for matrix-vector multiplication: (d + m) · r vs. d · m for a weight matrix W ∈ R^{d×m} factored into U ∈ R^{d×r} and V ∈ R^{r×m}. Khodak et al. [2021] demonstrate that, with proper initialization and regularization, low-rank methods outperform sparse pruning approaches in many domains, contrary to the existing belief that sparse methods yield better parameter-count savings than low-rank methods. In particular, a low-rank initialization scheme called spectral initialization is crucial for achieving better performance; initialization schemes are in general quite important for achieving good performance in neural network training [Bachlechner et al., 2020, Choromanski et al., 2018, Dauphin and Schoenholz, 2019, Hu et al., 2020, Huang et al., 2020, Mishkin and Matas, 2015, Pennington et al., 2017, Xiao et al., 2018, Zhang et al., 2021]. Spectral initialization samples a full-rank matrix W ∈ R^{d×m} from a known init distribution, factorizes W as (AΣ^{1/2})(Σ^{1/2}B) via singular value decomposition (SVD), and initializes U and V with these factors.…”
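As a rough illustration of the procedure described above, here is a minimal NumPy sketch of spectral initialization. The rank r, the He-style normal sampling of W, and the function name `spectral_init` are assumptions for illustration, not the authors' exact recipe.

```python
import numpy as np

def spectral_init(d, m, r, seed=0):
    """Sketch of spectral initialization for a rank-r factorization W ≈ U @ V.

    Samples a full-rank W from an init distribution (assumed here: He/Kaiming
    normal), computes its SVD W = A @ diag(s) @ B, and splits the singular
    values evenly between the two low-rank factors.
    """
    rng = np.random.default_rng(seed)
    # Full-rank sample from the chosen init distribution (assumption: He-style normal).
    W = rng.normal(scale=np.sqrt(2.0 / d), size=(d, m))

    # Thin SVD: A is d×k, s has length k, B is k×m, with k = min(d, m).
    A, s, B = np.linalg.svd(W, full_matrices=False)

    # Keep the top-r singular triplets and split Σ as Σ^{1/2} · Σ^{1/2}.
    sqrt_s = np.sqrt(s[:r])
    U = A[:, :r] * sqrt_s            # d×r factor: A Σ^{1/2}
    V = sqrt_s[:, None] * B[:r, :]   # r×m factor: Σ^{1/2} B
    return U, V

# Usage: U @ V is the best rank-r approximation of the sampled W, and the
# parameter count drops from d·m to (d + m)·r.
U, V = spectral_init(d=512, m=256, r=32)
print(U.shape, V.shape)  # (512, 32) (32, 256)
```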