On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24,28].There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection.In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, lowdiameter, fully-connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -no application code is necessary.Our design is implemented in a few hundred lines of X10. On the binomial tree described in [26], the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/sec. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.966 GNodes/s. All numbers are relative to single core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on k-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our frame...
The X10 programming language is organized around the notion of places (an encapsulation of data and activities operating on the data), partitioned global address space (PGAS), and asynchronous computation and communication. This paper introduces an expressive subset of X10, Flat X10, designed to permit efficient execution across multiple single-threaded places with a simple runtime and without compromising on the productivity of X10. We present the design, implementation and evaluation of a compiler and runtime system for Flat X10. The Flat X10 compiler translates programs into C++ SPMD programs communicating using an active messaging infrastructure. It uses novel techniques to transform explicitly parallel programs into SPMD programs. The runtime system is based on IBM's LAPI (Low-level API) and is easily portable to other libraries such as GASNet and ARMCI. Our implementation realizes performance comparable to hand-written MPI programs for well-known HPC benchmarks such as Random Access, Stream, and FFT, on a Federation-based cluster of Power5 SMPs (with hundreds of processors) and the Blue Gene (with thousands of processors). Submissions based on the work presented in this paper were co-winners of the 2007 and 2008 HPC Challenge Type II Awards.
On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k -dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -- no application code is necessary. Our design is implemented in a few hundred lines of X10. On the binomial tree described in olivier:08}, the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/sec. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.966 GNodes/s. All numbers are relative to single core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes. In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on k-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
The X10 programming language is organized around the notion of places (an encapsulation of data and activities operating on the data), partitioned global address space (PGAS), and asynchronous computation and communication.This paper introduces an expressive subset of X10, FLAT X10, designed to permit efficient execution across multiple singlethreaded places with a simple runtime and without compromising on the productivity of X10. We present the design, implementation and evaluation of a compiler and runtime system for FLAT X10. The FLAT X10 compiler translates programs into C++ SPMD programs communicating using an active messaging infrastructure. It uses novel techniques to transform explicitly parallel programs into SPMD programs. The runtime system is based on IBM's LAPI (Low-level API) and is easily portable to other libraries such as GASNet and ARMCI.Our implementation realizes performance comparable to handwritten MPI programs for well-known HPC benchmarks such as Random Access, Stream, and FFT, on a Federation-based cluster of Power5 SMPs (with hundreds of processors) and the Blue Gene (with thousands of processors). Submissions based on the work presented in this paper were co-winners of the 2007 and 2008 HPC Challenge Type II Awards.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.