swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

Li, Liandeng; Fang, Jiayuan; Fu, Haohuan; Jiang, Jinlei; Zhao, Wenlai; He, Conghui; You, Xiao‐Zeng; Yang, Guangwen

doi:10.1109/cluster.2018.00087

Cited by 30 publications

(14 citation statements)

References 12 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…A er initialization, the traces in a CDP are processed in sequence (line 9) and the data halfpoints is prefetched before a new trace is processed (line 10-12). For the current trace, the memory addresses of the data accesses are calculated for each sample-NMO velocity pair and kept in the k1 array (line [13][14][15]. en, the maximum and minimum memory address in k1 array is identi ed (line [16][17][18] and used to determine the memory range (len th) of data accesses (line 19).…”

Section: 32mentioning

confidence: 99%

Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Yang

Luan

et al. 2020

IEEE Trans. Parallel Distrib. Syst.

Self Cite

View full text Add to dashboard Cite

Common Midpoint (CMP) and Common Re ection Surface (CRS) are widely used methods for improving the signal-to-noise ratio in the eld of seismic processing. ese methods are computationally intensive and require high performance computing. is paper optimizes these methods on the Sunway many-core architecture and implements large-scale seismic processing on the Sunway Taihulight supercomputer. We propose the following three optimization techniques: 1) we propose a so ware cache method to reduce the overhead of memory accesses, and share data among CPEs via the register communication; 2) we re-design the semblance calculation procedure to further reduce the overhead of memory accesses; 3) we propose a vectorization method to improve the performance when processing the small volume of data within short loops. e experimental results show that our implementations of CMP and CRS methods on Sunway achieve 3.50× and 3.01× speedup on average compared to the-state-of-the-art implementations on CPU. In addition, our implementation is capable to run on more than one million cores of Sunway TaihuLight with good scalability.

show abstract

Section: 32mentioning

confidence: 99%

Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Yang

Luan

et al. 2020

IEEE Trans. Parallel Distrib. Syst.

Self Cite

View full text Add to dashboard Cite

show abstract

“…To guide the performance model, swDNN was designed for supporting efficient CNN implementation on Sunway TaihuLight, combining computing and memory resources together (Fang et al 2017). The later swCaffe, equipped with swDNN with Caffe, is the first deep learning frameworks for this supercomputer and obtains 4X speedup for the complete training process of the VGG-16 network (Li et al 2018). swCaffe on SW26010 has nearly half the performance of K40m in single precision and have 1.8× speedup over K40m in double precision.…”

Section: Large Scale Deep Learning On Sunway Taihulightmentioning

confidence: 99%

Distributed deep learning system for cancerous region detection on Sunway TaihuLight

et al. 2020

CCF Trans. HPC

View full text Add to dashboard Cite

To explore the potential of distributed training on deep neural networks, we implement several distributed algorithms with the basis of swFlow on the world-leading supercomputer, Sunway TaihuLight. Based on two naive designs of parameter server and ring all-reduce, we present the limitation of the communication model and discuss the optimizations for adapting the five-level interconnect architecture of Sunway system. To reduce the communication bottleneck on large scale system, multi-severs and hierarchical ring all-reduce models are introduced. With a benchmark from deep learning-based cancerous region detection algorithm, the average parallel efficiency obtains over 80% for at most 1024 processors. It reveals the great opportunity for joint combination of deep learning and HPC system.

show abstract

“…Performing reduction tree operations is thus both more efficient and scalable than the traditional parameter server approach. Several prior works [8], [20]- [23] all implement 'allreduce' operations, customized by cluster interconnect features, to optimize the transmission process.…”

Section: Communication Optimizationmentioning

confidence: 99%

“…In order to reduce the I/O overhead, S-Caffe [29] provides DL frameworks with parallel reading capabilities in order to take advantage of parallel file systems such as Lustre. swCaffe [20] improves the aggregated bandwidth of disk arrays by adjusting the data layout, enabling it to better fit the hardware architecture.…”

Section: I/o Optimizationmentioning

confidence: 99%

An Efficient Method for Training Deep Learning Networks Distributed

Wang

Chen

et al. 2020

IEICE Trans. Inf. & Syst.

View full text Add to dashboard Cite

Training deep learning (DL) is a computationally intensive process; as a result, training time can become so long that it impedes the development of DL. High performance computing clusters, especially supercomputers, are equipped with a large amount of computing resources, storage resources, and efficient interconnection ability, which can train DL networks better and faster. In this paper, we propose a method to train DL networks distributed with high efficiency. First, we propose a hierarchical synchronous Stochastic Gradient Descent (SGD) strategy, which can make full use of hardware resources and greatly increase computational efficiency. Second, we present a two-level parameter synchronization scheme which can reduce communication overhead by transmitting parameters of the first layer models in shared memory. Third, we optimize the parallel I/O by making each reader read data as continuously as possible to avoid the high overhead of discontinuous data reading. At last, we integrate the LARS algorithm into our system. The experimental results demonstrate that our approach has tremendous performance advantages relative to unoptimized methods. Compared with the native distributed strategy, our hierarchical synchronous SGD strategy (HSGD) can increase computing efficiency by about 20 times.

show abstract

swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

Cited by 30 publications

References 12 publications

Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

Distributed deep learning system for cancerous region detection on Sunway TaihuLight

An Efficient Method for Training Deep Learning Networks Distributed

Contact Info

Product

Resources

About