Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
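A large part of the survey's taxonomy concerns data parallelism, where each worker computes gradients over its own mini-batch and the gradients are then averaged across workers. The following is a minimal sketch of that communication step only, written against plain MPI; the model size and the stand-in gradient values are placeholders, not anything prescribed by the survey.

```c
/* Minimal data-parallel gradient-averaging sketch (illustrative only).
 * Each rank computes a local gradient, then all ranks average it with
 * MPI_Allreduce -- the core communication step of synchronous data
 * parallelism discussed in the survey.
 * Compile/run: mpicc dp_sketch.c -o dp_sketch && mpirun -np 4 ./dp_sketch */
#include <mpi.h>
#include <stdio.h>

#define NPARAMS 4  /* placeholder model size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stand-in for a locally computed mini-batch gradient. */
    double local_grad[NPARAMS], avg_grad[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        local_grad[i] = (double)(rank + 1) * (i + 1);

    /* Sum gradients across all workers, then divide by the worker count. */
    MPI_Allreduce(local_grad, avg_grad, NPARAMS, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        avg_grad[i] /= size;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", avg_grad[0]);

    MPI_Finalize();
    return 0;
}
```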
The steadily increasing number of nodes in high-performance computing systems, together with technology and power constraints, leads to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is supported in parallel programming frameworks such as MPI, but is often not well implemented. We show that the topology mapping problem is NP-complete and analyze and compare different practical topology mapping heuristics. We demonstrate an efficient and fast new heuristic based on graph similarity and show its utility with application communication patterns on real topologies. Our mapping strategies support heterogeneous networks and show significant reduction of congestion on torus, fat-tree, and PERCS network topologies for irregular communication patterns. We also demonstrate that the benefit of topology mapping grows with the network size and show how our algorithms can be used in a practical setting to optimize communication performance. Our efficient topology mapping strategies are shown to reduce network congestion by up to 80%, reduce average dilation by up to 50%, and improve benchmarked communication performance by 18%.

MOTIVATION

The number of nodes in the largest computing systems, and, hence, the size of their interconnection networks, is increasing rapidly: the Jaguar system at ORNL has over 18,000 nodes, and larger systems are expected in the near future. These networks are built by interconnecting nodes (switches and processors) with links. Pin count, power, and gate count constraints restrict the number of links per switch; typical sizes are 24 (InfiniBand), 36 (Myrinet, InfiniBand), or 6 (SeaStar or BlueGene/P). Different topologies are used to construct large-scale networks from crossbars, e.g., k-ary n-cubes (hypercube, torus), k-ary n-trees (fat-trees), or folded Clos networks. Networks also differ in their routing protocols. As the number of nodes grows, the diameter of the network (i.e., the maximum distance between two processors) increases; for many topologies, the bisection bandwidth (i.e., the minimum total bandwidth of links that need to be cut in order to divide the processors into two equal sets) decreases relative to the number of nodes. This effect is well understood, and it is generally accepted that dense communication patterns (such as an all-to-all communication, where each node communicates with every other node) are hard to scale beyond petascale systems. Luckily, the communication patterns of many applications are relatively sparse (each node communicates with a few others), and dense communications can be replaced by repeated sparse communications (e.g., the all-to-all communication used for the transpose in a parallel Fast Fourier Transform can be replaced by two phases of group transposes, each involving only Θ(√P) processors [17]). Furthermore, the communication pattern often has significant locality, e.g., when most communication occurs between adjacent cell...
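To make the paper's mapping metrics concrete, the toy program below scores one candidate mapping by its weighted average dilation, i.e., the number of network hops traveled per unit of traffic. The ring topology, the application edges, and the candidate mapping are invented inputs for illustration; they are not the heuristics or data sets from the paper.

```c
/* Illustrative sketch: evaluating a topology mapping by its weighted
 * average dilation, one of the metrics discussed above.  The 4-node ring
 * topology, the application graph, and the candidate mapping are made-up
 * toy inputs. */
#include <stdio.h>

#define N 4

/* Hop distance on a 4-node ring: min(|i-j|, N-|i-j|). */
static int dist(int i, int j) {
    int d = i > j ? i - j : j - i;
    return d < N - d ? d : N - d;
}

int main(void) {
    /* Application communication edges: (src, dst, traffic volume). */
    int edges[][3] = { {0, 1, 10}, {1, 2, 10}, {2, 3, 10}, {0, 3, 1} };
    int nedges = (int)(sizeof(edges) / sizeof(edges[0]));

    /* Candidate mapping: process p runs on node map[p]. */
    int map[N] = { 0, 2, 1, 3 };   /* a deliberately poor placement */

    double weighted = 0.0, total = 0.0;
    for (int e = 0; e < nedges; e++) {
        int u = edges[e][0], v = edges[e][1], w = edges[e][2];
        weighted += (double)w * dist(map[u], map[v]);
        total    += w;
    }
    printf("weighted average dilation = %.2f hops\n", weighted / total);
    /* Re-running with map = {0,1,2,3} shows how a better mapping lowers
     * dilation for this communication pattern. */
    return 0;
}
```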
We introduce a high-performance, cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed cost and power model. Slim Fly enables constructing cost-effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
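The degree-diameter problem mentioned above asks for the largest graph with a given maximum degree d and diameter k; its classical upper limit is the Moore bound, 1 + d * sum_{i=0}^{k-1} (d-1)^i, which for diameter 2 reduces to d^2 + 1. The short sketch below computes that bound; the router radices are arbitrary examples, not configurations from the paper.

```c
/* Illustrative sketch: the Moore bound, the theoretical upper limit on the
 * number of vertices of a graph with maximum degree d and diameter k.
 * Slim Fly targets diameter-2 graphs that come close to this bound. */
#include <stdio.h>

/* Moore bound: 1 + d * sum_{i=0}^{k-1} (d-1)^i. */
static unsigned long long moore_bound(unsigned d, unsigned k) {
    unsigned long long sum = 0, term = 1;          /* term = (d-1)^i */
    for (unsigned i = 0; i < k; i++) {
        sum += term;
        term *= (d - 1);
    }
    return 1ULL + (unsigned long long)d * sum;
}

int main(void) {
    unsigned degrees[] = { 17, 29, 43 };           /* example router radices */
    for (int i = 0; i < 3; i++) {
        unsigned d = degrees[i];
        printf("degree %2u, diameter 2: at most %llu routers (d^2+1 = %u)\n",
               d, moore_bound(d, 2), d * d + 1);
    }
    return 0;
}
```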
This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.

I. MOTIVATION AND BACKGROUND

The performance impact of operating system and architectural overheads (system noise) at massive scale is increasingly of concern. Even small local delays on compute nodes, which can be caused by interrupts, operating system daemons, or even cache or page misses, can affect global application performance significantly [1]. Such local delays often cause less than 1% overhead per process, but severe performance losses can occur if noise is propagated (amplified) through communication or global synchronization. Previous analyses generally assume that the performance impact of system noise grows at scale, and Tsafrir et al. [2] even suggest that the impact of very low frequency noise scales linearly with the system size.

A. Related Work

Petrini, Kerbyson, and Pakin [1] report that the parallel performance of SAGE on a fixed number of ASCI Q nodes was highest when SAGE used only three of the four CPUs per node. It turned out that "resonance" between the application's collective communication and the misconfigured system caused delays during each iteration. Jones, Brenner, and Fier [3] observed similar effects with collective communication and also report that, under certain circumstances, it is beneficial to leave one CPU idle. A theoretical analysis of the influence of noise on collective communication [4] suggests that the impact of noise depends on the type of distribution and its parameters and can, in the worst case (exponential distribution), scale linearly with the number of processes. Ferreira, Bridges, and Brightwell use noise-injection techniques to assess the impact of noise on several applications [5]. Beckman et al. [6] analyzed the performance on BlueGene/L, concluding that most sources of noise can be avoided in very specialized systems. Previous work was either limited to experimental analysis on specific architectures with injection of artificially generated noise (fixed frequency), or to purely theoretical analyses that assume a particular collective pattern [4]. These previou...
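The amplification effect described above can be illustrated with a toy model (this is not the paper's LogGPS-based toolchain): every process does the same amount of work per iteration, occasionally suffers a random detour, and a synchronizing collective forces each iteration to last as long as the slowest process. All parameters below are invented for illustration.

```c
/* Toy simulation of noise amplification through global synchronization.
 * Per-process overhead is only ~0.5%, but with many synchronized
 * processes almost every iteration contains at least one delayed rank,
 * so the whole application slows down far more than 0.5%. */
#include <stdio.h>
#include <stdlib.h>

#define ITERS   1000
#define COMPUTE 100.0     /* us of work per iteration              */
#define P_NOISE 0.001     /* probability of a detour per iteration */
#define DETOUR  500.0     /* us, length of one detour              */

static double simulate(int nprocs) {
    double total = 0.0;
    for (int it = 0; it < ITERS; it++) {
        double slowest = COMPUTE;
        for (int p = 0; p < nprocs; p++) {
            double t = COMPUTE;
            if ((double)rand() / RAND_MAX < P_NOISE)
                t += DETOUR;                  /* local delay on this rank */
            if (t > slowest)
                slowest = t;                  /* collective waits for it  */
        }
        total += slowest;
    }
    return total / (ITERS * COMPUTE);         /* slowdown vs. noiseless run */
}

int main(void) {
    srand(42);
    int sizes[] = { 16, 256, 4096, 65536 };
    for (int i = 0; i < 4; i++)
        printf("%6d processes: slowdown factor %.3f\n",
               sizes[i], simulate(sizes[i]));
    return 0;
}
```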
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
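The overlap that LibNBC enables follows a simple start/compute/wait pattern. The sketch below expresses it with the MPI-3 non-blocking collective interface that later standardized this functionality (LibNBC itself exposes analogous NBC_-prefixed calls layered over MPI-1); the independent-work routine is a placeholder.

```c
/* Sketch of communication/computation overlap with a non-blocking
 * collective, using the standard MPI-3 interface.  The independent
 * work is a stand-in for application computation that does not depend
 * on the reduction result. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

static void independent_work(double *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send[N], recv[N], other[N];
    for (int i = 0; i < N; i++) { send[i] = rank + i; other[i] = i; }

    MPI_Request req;
    /* Start the collective ... */
    MPI_Iallreduce(send, recv, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    /* ... overlap it with independent computation ... */
    independent_work(other, N);
    /* ... and complete it before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("recv[0] = %f\n", recv[0]);
    MPI_Finalize();
    return 0;
}
```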