2022
DOI: 10.1109/tpami.2020.3004354

Scalable and Practical Natural Gradient for Large-Scale Deep Learning

Abstract: Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance…
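As background, here is a minimal NumPy sketch of a damped natural-gradient update of the kind the abstract refers to. All names, sizes, and values are illustrative assumptions; this is not the authors' SP-NGD implementation.

```python
import numpy as np

# Illustrative damped natural-gradient step:
#   theta <- theta - lr * (F + damping * I)^{-1} * grad
# where F approximates the Fisher information matrix.
rng = np.random.default_rng(0)
n_params = 10
theta = rng.normal(size=n_params)
grad = rng.normal(size=n_params)                 # stand-in for a mini-batch gradient
per_example = rng.normal(size=(64, n_params))    # stand-in per-example gradient rows
fisher = per_example.T @ per_example / per_example.shape[0]  # empirical Fisher estimate

damping, lr = 1e-3, 0.1
nat_grad = np.linalg.solve(fisher + damping * np.eye(n_params), grad)
theta = theta - lr * nat_grad
```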

Cited by 22 publications (15 citation statements); references 13 publications.

Citation statements:
“…Similarly, the other two modified CXR-Nets based on ResNet-50 (Supplementary Fig. 4) were trained for abnormality description and disease diagnosis on CXR images [31]. To mimic the diagnostic routine of thoracic clinicians, we modified the diagnosis task networks (mentioned above) to design a two-stream disease diagnosis network architecture to perform image feature extraction using the trained backbone of abnormality prediction.…”
Section: Results
confidence: 99%
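A hedged PyTorch sketch of the two-stream design this citation describes, with a trained abnormality backbone reused as a frozen feature extractor. The class name, layer sizes, and fusion scheme are assumptions for illustration, not the cited paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDiagnosisNet(nn.Module):
    """One stream reuses a backbone trained for abnormality prediction (frozen);
    a second stream learns diagnosis-specific features; both are fused for classification."""
    def __init__(self, abnormality_backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.abnormality_stream = abnormality_backbone
        for p in self.abnormality_stream.parameters():
            p.requires_grad = False                      # keep the trained backbone fixed
        self.diagnosis_stream = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_abn = self.abnormality_stream(x)               # features from the trained backbone
        f_dx = self.diagnosis_stream(x)
        return self.classifier(torch.cat([f_abn, f_dx], dim=1))

# Usage with a toy stand-in backbone that outputs feat_dim features per image:
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = TwoStreamDiagnosisNet(backbone, feat_dim=64, n_classes=5)
logits = model(torch.randn(2, 3, 224, 224))              # shape (2, 5)
```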
“…Sun [44] achieved a scaling efficiency of up to 80% on ImageNet training with ResNet-50 on 512 Nvidia V100 GPUs. Osawa [45] achieved a scaling efficiency of about 75% on ImageNet training with ResNet-50 using K-FAC on 1024 Nvidia V100 GPUs. In their implementations, GPU Direct RDMA (GDR) is used to enable direct data exchange between GPUs on different nodes.…”
Section: Related Work
confidence: 99%
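For context, scaling efficiency in such comparisons is typically the measured multi-GPU throughput divided by perfect linear scaling of the single-GPU throughput. The sketch below uses illustrative numbers, not the cited papers' measurements.

```python
def scaling_efficiency(throughput_n_gpus: float, throughput_1_gpu: float, n_gpus: int) -> float:
    """Measured throughput on n_gpus relative to ideal linear scaling from one GPU."""
    return throughput_n_gpus / (n_gpus * throughput_1_gpu)

# e.g. ~400,000 images/s on 512 GPUs vs ~975 images/s on one GPU -> ~0.80 (80%)
print(scaling_efficiency(400_000, 975, 512))
```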
“…The Kronecker-Factored Approximate Curvature (KFAC) has been successfully used as an approximate FIM to precondition the gradient through layer-wise block-diagonalization and Kronecker factorization for training large-scale convolutional neural networks (CNNs) [8,9,12]. Osawa et al. [13,17] show that distributed KFAC (D-KFAC) can reach the target accuracy of the ResNet-50 [18] model on the ImageNet [19] data set in one-third the number of epochs required by standard training with SGD.…”
Section: Introduction
confidence: 99%
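A minimal NumPy sketch of the layer-wise K-FAC preconditioning this citation refers to, for a single fully connected layer. The shapes, random data, and damping value are illustrative assumptions.

```python
import numpy as np

# For a linear layer with weights W (d_out x d_in), K-FAC approximates the layer's
# Fisher block by a Kronecker product of A = E[a a^T] (inputs) and G = E[g g^T]
# (pre-activation gradients), so the natural gradient is G^{-1} dW A^{-1}.
rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 8, 4
a = rng.normal(size=(batch, d_in))        # layer inputs
g = rng.normal(size=(batch, d_out))       # gradients w.r.t. pre-activations
grad_W = g.T @ a / batch                  # ordinary weight gradient (d_out x d_in)

A = a.T @ a / batch                       # Kronecker factor from activations
G = g.T @ g / batch                       # Kronecker factor from gradients

damping = 1e-2                            # keeps the factor inverses well conditioned
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

nat_grad_W = G_inv @ grad_W @ A_inv       # preconditioned (natural-gradient) direction
```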
“…The existing state-of-the-art distributed KFAC (MPD-KFAC) [13,17,20]-[22] uses model parallelism across multiple GPUs to compute the inverses of different layers' approximate FIMs in parallel and thereby reduce the computing time. However, besides the communication needed to aggregate Kronecker factors, MPD-KFAC introduces significant additional communication overhead in collecting the inverted matrices.…”
Section: Introduction
confidence: 99%
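A minimal sketch of the model-parallel inverse computation described above, assuming a round-robin layer-to-worker assignment (an assumption for illustration, not MPD-KFAC's actual scheduling). The worker loop is simulated sequentially.

```python
import numpy as np

# Each worker inverts the damped Kronecker factors of only its assigned layers; the
# inverted matrices must then be gathered by all workers (the extra communication
# overhead the text mentions). Sizes and values are illustrative.
rng = np.random.default_rng(0)
n_layers, n_workers, dim, damping = 8, 4, 16, 1e-2
raw = [rng.normal(size=(dim, dim)) for _ in range(n_layers)]
factors = [m @ m.T / dim for m in raw]                 # symmetric PSD, like real Kronecker factors

inverses = [None] * n_layers
for worker in range(n_workers):
    for layer in range(worker, n_layers, n_workers):   # round-robin layer assignment
        inverses[layer] = np.linalg.inv(factors[layer] + damping * np.eye(dim))

# In a real run, each inverse would now be broadcast/allgathered so every worker can
# precondition the gradients of every layer.
assert all(inv is not None for inv in inverses)
```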