Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This is especially true when training models whose computation parallelizes effectively, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO, an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
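For context, the sketch below shows how a framework's data layer typically scans an LMDB database with the liblmdb C API, which is the read pattern whose inefficiencies the abstract analyzes. The database path "./train_db" and the printed output are illustrative assumptions rather than details from the paper, and error checking is omitted for brevity.

    /* Minimal sketch: sequential scan over an LMDB database with the
     * liblmdb C API. Path and output are illustrative assumptions. */
    #include <lmdb.h>
    #include <stdio.h>

    int main(void) {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_cursor *cursor;
        MDB_val key, data;

        /* Open the environment read-only; LMDB memory-maps the file,
         * so reads are served through the OS page cache. */
        mdb_env_create(&env);
        mdb_env_open(env, "./train_db", MDB_RDONLY, 0664);

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_cursor_open(txn, dbi, &cursor);

        /* Iterate records in key order, as a data layer would when
         * streaming serialized training samples. */
        while (mdb_cursor_get(cursor, &key, &data, MDB_NEXT) == 0)
            printf("sample key %.*s: %zu bytes\n",
                   (int)key.mv_size, (char *)key.mv_data, data.mv_size);

        mdb_cursor_close(cursor);
        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }

Because every reader performs this memory-mapped, page-cache-bound traversal, many processes scanning the same database at scale contend for the same I/O path, which is the behavior an optimized plugin such as LMDBIO targets.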
DCFA-MPI is an MPI library implementation for Intel Xeon Phi co-processor clusters, where each compute node consists of an Intel Xeon Phi co-processor card attached to the host via PCI Express, with InfiniBand as the interconnect. DCFA-MPI enables direct data transfer between Intel Xeon Phi co-processors without assistance from the host. Because DCFA, a direct communication facility for many-core-based accelerators, provides InfiniBand communication to Xeon Phi user-space programs through the same interface as on the host processor, direct InfiniBand communication between Xeon Phi co-processors can be built on it with little effort. Building on DCFA, an MPI library capable of direct inter-node communication between Xeon Phi co-processors has been designed and implemented. The implementation is based on the Mellanox InfiniBand HCA and the pre-production version of the Intel Xeon Phi co-processor. DCFA-MPI delivers 3 times greater bandwidth than the 'Intel MPI on Xeon Phi co-processors' mode and a 2- to 12-fold speedup over the 'Intel MPI on the host Xeon, offloading computation to Xeon Phi co-processors' mode in communication between 2 MPI processes. It also shows a 2- to 4-fold speedup over that offload mode on a five-point stencil computation parallelized with MPI + OpenMP using 8 processes × 56 threads.
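To make the benchmarked workload concrete, below is a minimal sketch of a five-point stencil parallelized with MPI + OpenMP, the pattern the abstract evaluates. The grid size, iteration count, and 1-D row decomposition are illustrative assumptions, not the paper's actual benchmark configuration.

    /* Minimal sketch: five-point (Jacobi) stencil with MPI + OpenMP.
     * Sizes and decomposition are illustrative assumptions. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N     1024   /* global grid is N x N (assumed size) */
    #define STEPS 100    /* number of stencil sweeps (assumed)   */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* 1-D row decomposition: each rank owns "rows" interior rows
         * plus one halo row above and one below. */
        int rows = N / size;
        double (*u)[N]  = malloc((rows + 2) * sizeof *u);
        double (*un)[N] = malloc((rows + 2) * sizeof *un);
        for (int i = 0; i < rows + 2; i++)
            for (int j = 0; j < N; j++)
                u[i][j] = un[i][j] = 0.0;

        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int t = 0; t < STEPS; t++) {
            /* Exchange halo rows with neighboring ranks. */
            MPI_Sendrecv(u[1],    N, MPI_DOUBLE, up,   0,
                         u[rows + 1], N, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u[rows], N, MPI_DOUBLE, down, 1,
                         u[0],    N, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Five-point update, threaded across rows with OpenMP. */
            #pragma omp parallel for
            for (int i = 1; i <= rows; i++)
                for (int j = 1; j < N - 1; j++)
                    un[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                       u[i][j-1] + u[i][j+1]);

            double (*tmp)[N] = u; u = un; un = tmp;  /* swap buffers */
        }

        free(u); free(un);
        MPI_Finalize();
        return 0;
    }

Each sweep exchanges one halo row with each neighboring rank before the threaded update; this per-iteration inter-node exchange is the kind of communication that DCFA-MPI routes directly between co-processors instead of staging through the host.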