2022
DOI: 10.1109/tpami.2020.3004354

Scalable and Practical Natural Gradient for Large-Scale Deep Learning

Abstract: Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance…
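As background, here is a minimal NumPy sketch of a damped natural-gradient update of the kind the abstract refers to. All names, sizes, and values are illustrative assumptions; this is not the authors' SP-NGD implementation.

```python
import numpy as np

# Illustrative damped natural-gradient step:
#   theta <- theta - lr * (F + damping * I)^{-1} * grad
# where F approximates the Fisher information matrix.
rng = np.random.default_rng(0)
n_params = 10
theta = rng.normal(size=n_params)
grad = rng.normal(size=n_params)                 # stand-in for a mini-batch gradient
per_example = rng.normal(size=(64, n_params))    # stand-in per-example gradient rows
fisher = per_example.T @ per_example / per_example.shape[0]  # empirical Fisher estimate

damping, lr = 1e-3, 0.1
nat_grad = np.linalg.solve(fisher + damping * np.eye(n_params), grad)
theta = theta - lr * nat_grad
```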

Cited by 22 publications (15 citation statements); references 13 publications.

Citation statements:
“…Similarly, the other two modified CXR-Nets based on ResNet-50 (Supplementary Fig. 4) were trained for abnormality description and disease diagnosis on CXR images [31]. To mimic the diagnostic routine of thoracic clinicians, we modified the diagnosis task networks (mentioned above) to design a two-stream disease diagnosis network architecture to perform image feature extraction using the trained backbone of abnormality prediction.…”
Section: Results
confidence: 99%
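A hedged PyTorch sketch of the two-stream design this citation describes, with a trained abnormality backbone reused as a frozen feature extractor. The class name, layer sizes, and fusion scheme are assumptions for illustration, not the cited paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDiagnosisNet(nn.Module):
    """One stream reuses a backbone trained for abnormality prediction (frozen);
    a second stream learns diagnosis-specific features; both are fused for classification."""
    def __init__(self, abnormality_backbone: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.abnormality_stream = abnormality_backbone
        for p in self.abnormality_stream.parameters():
            p.requires_grad = False                      # keep the trained backbone fixed
        self.diagnosis_stream = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_abn = self.abnormality_stream(x)               # features from the trained backbone
        f_dx = self.diagnosis_stream(x)
        return self.classifier(torch.cat([f_abn, f_dx], dim=1))

# Usage with a toy stand-in backbone that outputs feat_dim features per image:
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = TwoStreamDiagnosisNet(backbone, feat_dim=64, n_classes=5)
logits = model(torch.randn(2, 3, 224, 224))              # shape (2, 5)
```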
“…Sun [44] achieved a scaling efficiency of up to 80% on ImageNet training with ResNet-50 on 512 Nvidia V100 GPUs. Osawa [45] achieved a scaling efficiency of about 75% on ImageNet training with ResNet-50 using K-FAC on 1024 Nvidia V100 GPUs. In their implementations, GPU Direct RDMA (GDR) is used to enable direct data exchange between GPUs on different nodes.…”
Section: Related Work
confidence: 99%
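For context, scaling efficiency in such comparisons is typically the measured multi-GPU throughput divided by perfect linear scaling of the single-GPU throughput. The sketch below uses illustrative numbers, not the cited papers' measurements.

```python
def scaling_efficiency(throughput_n_gpus: float, throughput_1_gpu: float, n_gpus: int) -> float:
    """Measured throughput on n_gpus relative to ideal linear scaling from one GPU."""
    return throughput_n_gpus / (n_gpus * throughput_1_gpu)

# e.g. ~400,000 images/s on 512 GPUs vs ~975 images/s on one GPU -> ~0.80 (80%)
print(scaling_efficiency(400_000, 975, 512))
```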
“…The Kronecker-Factored Approximate Curvature (KFAC) has been successfully used as an approximate FIM to precondition the gradient through layer-wise block-diagonalization and Kronecker factorization for training large-scale convolutional neural networks (CNNs) [8,9,12]. Osawa et al. [13,17] show that distributed KFAC (D-KFAC) can reach the target accuracy of the ResNet-50 [18] model on the ImageNet [19] data set in one-third the number of epochs required by standard training with SGD.…”
Section: Introduction
confidence: 99%
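A minimal NumPy sketch of the layer-wise K-FAC preconditioning this citation refers to, for a single fully connected layer. The shapes, random data, and damping value are illustrative assumptions.

```python
import numpy as np

# For a linear layer with weights W (d_out x d_in), K-FAC approximates the layer's
# Fisher block by a Kronecker product of A = E[a a^T] (inputs) and G = E[g g^T]
# (pre-activation gradients), so the natural gradient is G^{-1} dW A^{-1}.
rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 8, 4
a = rng.normal(size=(batch, d_in))        # layer inputs
g = rng.normal(size=(batch, d_out))       # gradients w.r.t. pre-activations
grad_W = g.T @ a / batch                  # ordinary weight gradient (d_out x d_in)

A = a.T @ a / batch                       # Kronecker factor from activations
G = g.T @ g / batch                       # Kronecker factor from gradients

damping = 1e-2                            # keeps the factor inverses well conditioned
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

nat_grad_W = G_inv @ grad_W @ A_inv       # preconditioned (natural-gradient) direction
```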
“…The existing state-of-the-art distributed KFAC (MPD-KFAC) [13,17,20]-[22] uses model parallelism across multiple GPUs to compute the inverses of different layers' approximate FIMs in parallel and thereby reduce the computing time. However, besides the communication needed to aggregate Kronecker factors, MPD-KFAC introduces significant additional communication overhead in collecting the inverted matrices.…”
Section: Introduction
confidence: 99%
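A minimal sketch of the model-parallel inverse computation described above, assuming a round-robin layer-to-worker assignment (an assumption for illustration, not MPD-KFAC's actual scheduling). The worker loop is simulated sequentially.

```python
import numpy as np

# Each worker inverts the damped Kronecker factors of only its assigned layers; the
# inverted matrices must then be gathered by all workers (the extra communication
# overhead the text mentions). Sizes and values are illustrative.
rng = np.random.default_rng(0)
n_layers, n_workers, dim, damping = 8, 4, 16, 1e-2
raw = [rng.normal(size=(dim, dim)) for _ in range(n_layers)]
factors = [m @ m.T / dim for m in raw]                 # symmetric PSD, like real Kronecker factors

inverses = [None] * n_layers
for worker in range(n_workers):
    for layer in range(worker, n_layers, n_workers):   # round-robin layer assignment
        inverses[layer] = np.linalg.inv(factors[layer] + damping * np.eye(dim))

# In a real run, each inverse would now be broadcast/allgathered so every worker can
# precondition the gradients of every layer.
assert all(inv is not None for inv in inverses)
```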