Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC

Ueno, Yoshio; Osawa, Kazuki; Tsuji, Yohei; Naruse, Akira; Yokota, Rio

doi:10.1145/3394486.3403265

Cited by 9 publications

(16 citation statements)

References 8 publications


“…Compared to S-SGD, D-KFAC requires six extra time-consuming operations at each layer: four computing operations (compute Kronecker factors $A^{p}_{l-1}$ and $G^{p}_{l}$ and their inverses) and two communication operations (aggregation of $A^{p}_{l-1}$ and $G^{p}_{l}$). Due to the high computational cost of inverting matrices, recent work has proposed the distributed algorithm to reduce the computation time of inverting matrices [13,20,22]. As shown in Eq.…”

Section: B. Distributed KFAC (D-KFAC). Citation type: mentioning

confidence: 99%
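
The six per-layer operations listed in the statement above can be written out as a short worked set of equations. This is a sketch of the standard K-FAC formulation, not a transcription of the citing paper's Eq. (13), which is not reproduced on this page. The following are assumptions made for illustration: the superscript $p$ indexes one of $P$ workers, aggregation is a plain average over workers, $a_{l-1}$ and $g_l$ are the layer input and the gradient with respect to the layer's pre-activation output, $W_l$ has shape $d_{out} \times d_{in}$, and $\gamma$ is a damping constant.

```latex
% Two local computations per layer on worker p (assumed notation)
A^{p}_{l-1} = \mathbb{E}_p\!\left[a_{l-1} a_{l-1}^{\top}\right], \qquad
G^{p}_{l}   = \mathbb{E}_p\!\left[g_{l} g_{l}^{\top}\right]

% Two communication operations: aggregate the factors across P workers
A_{l-1} = \frac{1}{P}\sum_{p=1}^{P} A^{p}_{l-1}, \qquad
G_{l}   = \frac{1}{P}\sum_{p=1}^{P} G^{p}_{l}

% Kronecker-factored approximation of the layer's Fisher block
F_{l} \approx A_{l-1} \otimes G_{l}

% Two inversions per layer, then preconditioning of the gradient
\tilde{\nabla} W_{l}
  = \left(G_{l} + \gamma I\right)^{-1} \nabla W_{l} \left(A_{l-1} + \gamma I\right)^{-1}
```

Counting the lines gives two factor computations, two aggregations, and two inversions, which matches the "four computing operations and two communication operations" in the statement.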

“…(13), the inverse operations of Kronecker factors at different layers have no dependency with each other. In existing state-of-the-art solutions [13,20,22], the workloads of different layers in computing inverses are distributed to multiple GPUs (with a concept of model parallelism), and their results are finally gathered to all GPUs for preconditioning gradients. An example is shown in the right hand side of Fig.…”

Section: B. Distributed KFAC (D-KFAC). Citation type: mentioning

confidence: 99%
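
Because the per-layer inversions are independent, the statement above describes spreading them over GPUs by layer. A minimal sketch of one such placement follows, assuming a simple round-robin rule and a cubic cost model for inversion. Both are illustrative assumptions, and the function names are hypothetical; the cited works may use a different placement.

```python
from typing import Dict, List, Tuple


def assign_inverses_round_robin(
    factor_dims: List[Tuple[int, int]], num_workers: int
) -> Dict[int, List[int]]:
    """Assign layer l to worker l % num_workers (assumed rule).

    factor_dims[l] holds the side lengths of the two square Kronecker
    factors of layer l. The owning worker inverts both factors.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    placement: Dict[int, List[int]] = {w: [] for w in range(num_workers)}
    for layer in range(len(factor_dims)):
        placement[layer % num_workers].append(layer)
    return placement


def inversion_cost(dims: Tuple[int, int]) -> int:
    """Rough cost of inverting both factors, assuming O(n^3) per matrix."""
    a_dim, g_dim = dims
    return a_dim ** 3 + g_dim ** 3


def worker_loads(
    factor_dims: List[Tuple[int, int]], placement: Dict[int, List[int]]
) -> Dict[int, int]:
    """Sum the estimated inversion cost of the layers owned by each worker."""
    return {
        worker: sum(inversion_cost(factor_dims[layer]) for layer in layers)
        for worker, layers in placement.items()
    }


# Hypothetical three-layer model: (A side length, G side length) per layer.
example_dims = [(4, 8), (8, 8), (8, 2)]
example_placement = assign_inverses_round_robin(example_dims, num_workers=2)
example_loads = worker_loads(example_dims, example_placement)
print(example_placement)
print(example_loads)
```

After each worker inverts its own layers, the results still have to be gathered on every GPU before gradients can be preconditioned, which is the extra communication step the statement mentions. Round-robin ignores factor sizes, so loads can be uneven when layer dimensions differ widely.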

“…Note, however, that D-KFAC requires extensive computations to calculate preconditioning matrices compared to the first-order gradients, and also introduces significant communication costs on GPU clusters [13,20].…”

Section: Introduction. Citation type: mentioning

confidence: 99%

“…The existing state-of-the-art distributed KFAC (MPD-KFAC) [13,17,20]-[22] makes use of the concept of model parallelism with multiple GPUs to calculate the inverses of different layers' approximate FIMs in parallel to reduce the computing time. However, besides the communication in aggregating Kronecker factors, MPD-KFAC further introduces significant communication overheads in collecting inverted matrices.…”

Section: Introduction. Citation type: mentioning

confidence: 99%

“…However, besides the communication in aggregating Kronecker factors, MPD-KFAC further introduces significant communication overheads in collecting inverted matrices. There were a number of attempts [13,20,22] to alleviate the communication overhead, but they fail to capture the parallelism between computing and communication tasks, which results in low throughput in a distributed system. It has been shown that communication tasks and computing tasks can be scheduled in S-SGD so as to hide some communication overheads to improve the system throughput [2,23]-[26].…”

Section: Introduction. Citation type: mentioning

confidence: 99%


Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Shi; Zhang. 2021. Preprint.


Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.

