Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This is especially true when training models whose computation parallelizes effectively, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO, an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
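For context, the sketch below shows how a framework's data layer typically scans an LMDB database with the liblmdb C API, which is the read pattern whose inefficiencies the abstract analyzes. The database path "./train_db" and the printed output are illustrative assumptions rather than details from the paper, and error checking is omitted for brevity.

    /* Minimal sketch: sequential scan over an LMDB database with the
     * liblmdb C API. Path and output are illustrative assumptions. */
    #include <lmdb.h>
    #include <stdio.h>

    int main(void) {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_cursor *cursor;
        MDB_val key, data;

        /* Open the environment read-only; LMDB memory-maps the file,
         * so reads are served through the OS page cache. */
        mdb_env_create(&env);
        mdb_env_open(env, "./train_db", MDB_RDONLY, 0664);

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_cursor_open(txn, dbi, &cursor);

        /* Iterate records in key order, as a data layer would when
         * streaming serialized training samples. */
        while (mdb_cursor_get(cursor, &key, &data, MDB_NEXT) == 0)
            printf("sample key %.*s: %zu bytes\n",
                   (int)key.mv_size, (char *)key.mv_data, data.mv_size);

        mdb_cursor_close(cursor);
        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }

Because every reader performs this memory-mapped, page-cache-bound traversal, many processes scanning the same database at scale contend for the same I/O path, which is the behavior an optimized plugin such as LMDBIO targets.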
DCFA-MPI is an MPI library implementation for Intel Xeon Phi co-processor clusters, where each compute node consists of an Intel Xeon Phi co-processor card attached to the host via PCI Express, with InfiniBand as the interconnect. DCFA-MPI enables direct data transfer between Intel Xeon Phi co-processors without assistance from the host. Because DCFA, a direct communication facility for many-core-based accelerators, provides InfiniBand communication to Xeon Phi user-space programs through the same interface as on the host processor, direct InfiniBand communication between Xeon Phi co-processors can be built on it with little effort. Building on DCFA, an MPI library capable of direct inter-node communication between Xeon Phi co-processors has been designed and implemented. The implementation is based on the Mellanox InfiniBand HCA and the pre-production version of the Intel Xeon Phi co-processor. DCFA-MPI delivers 3 times greater bandwidth than the 'Intel MPI on Xeon Phi co-processors' mode and a 2- to 12-fold speedup over the 'Intel MPI on the host Xeon, offloading computation to Xeon Phi co-processors' mode in communication between 2 MPI processes. It also shows a 2- to 4-fold speedup over that offload mode on a five-point stencil computation parallelized with MPI + OpenMP using 8 processes × 56 threads.
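To make the benchmarked workload concrete, below is a minimal sketch of a five-point stencil parallelized with MPI + OpenMP, the pattern the abstract evaluates. The grid size, iteration count, and 1-D row decomposition are illustrative assumptions, not the paper's actual benchmark configuration.

    /* Minimal sketch: five-point (Jacobi) stencil with MPI + OpenMP.
     * Sizes and decomposition are illustrative assumptions. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N     1024   /* global grid is N x N (assumed size) */
    #define STEPS 100    /* number of stencil sweeps (assumed)   */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* 1-D row decomposition: each rank owns "rows" interior rows
         * plus one halo row above and one below. */
        int rows = N / size;
        double (*u)[N]  = malloc((rows + 2) * sizeof *u);
        double (*un)[N] = malloc((rows + 2) * sizeof *un);
        for (int i = 0; i < rows + 2; i++)
            for (int j = 0; j < N; j++)
                u[i][j] = un[i][j] = 0.0;

        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int t = 0; t < STEPS; t++) {
            /* Exchange halo rows with neighboring ranks. */
            MPI_Sendrecv(u[1],    N, MPI_DOUBLE, up,   0,
                         u[rows + 1], N, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u[rows], N, MPI_DOUBLE, down, 1,
                         u[0],    N, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Five-point update, threaded across rows with OpenMP. */
            #pragma omp parallel for
            for (int i = 1; i <= rows; i++)
                for (int j = 1; j < N - 1; j++)
                    un[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                       u[i][j-1] + u[i][j+1]);

            double (*tmp)[N] = u; u = un; un = tmp;  /* swap buffers */
        }

        free(u); free(un);
        MPI_Finalize();
        return 0;
    }

Each sweep exchanges one halo row with each neighboring rank before the threaded update; this per-iteration inter-node exchange is the kind of communication that DCFA-MPI routes directly between co-processors instead of staging through the host.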