CCL: a portable and tunable collective communication library for scalable parallel computers

Bala, Vasanth; Bruck, Jehoshua; Cypher, Robert; Elustondo, P.; Ho, A.; Ho, Ching-Tien; Kipnis, Shlomo; Snir, Marc

doi:10.1109/ipps.1994.288208

Cited by 37 publications

(12 citation statements)

References 22 publications


“…Several algorithms for improving the performance of collective communications have been proposed for decades [1], [2], [3], [4], [5]. More recently, some researchers have focused their efforts on taking advantage of existing algorithms in order to find techniques for selecting the most efficient algorithm for any given system/workload configuration.…”

Section: Related Work (mentioning)

confidence: 99%

Effect of dynamic algorithm selection of Alltoall communication on environments with unstable network speed

Nanri

Kurokawa

2011

2011 International Conference on High Performance Computing & Simulation


As HPC systems grow in size, the performance of collective communications is becoming an important issue. The choice of algorithm for these communications is usually based on statically specified thresholds on the message size and the number of processes. However, on recent HPC systems that use a Fat Tree or Torus topology as their interconnect, the network speed has become unpredictable. The main reason is contention, whose effect depends heavily on the relative locations of the compute nodes. Meanwhile, to reduce the number of idle nodes, there are attempts to build job schedulers that allocate compute nodes flexibly, without considering their positions relative to each other. With this policy, network performance becomes unstable. As an approach for finding an appropriate algorithm even in such an environment, a dynamic method, STAR-MPI, has been proposed. This method examines each algorithm at runtime and uses the empirical data to choose a suitable one for the given situation. This paper first examined the effect of STAR-MPI in an environment with unstable network speed. The experimental results showed that the dynamic approach was effective, but the cost of testing slow algorithms limited the benefit. The authors then proposed an enhancement in which algorithms predicted to be relatively slow are discarded from the list of candidates. The predictions use performance models of the algorithms together with the latency and bandwidth measured at the first call of the collective communication. At this point, the effect of this enhancement in the experimental results was not significant. However, the results showed that better performance may be achievable with a more cost-effective prediction method and by tuning the thresholds and factors used in the enhancement.
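The candidate-pruning idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two cost models are simplified latency/bandwidth estimates chosen here for illustration, and the names `pairwise_cost`, `bruck_cost`, `prune_candidates`, and the `slack` factor are all hypothetical.

```python
import math

# Illustrative latency/bandwidth cost models for Alltoall (assumptions, not
# the models used in the paper). latency is in seconds, bandwidth in bytes/s.
def pairwise_cost(latency, bandwidth, msg_bytes, nprocs):
    # nprocs - 1 exchange steps, each moving one message of msg_bytes.
    return (nprocs - 1) * (latency + msg_bytes / bandwidth)

def bruck_cost(latency, bandwidth, msg_bytes, nprocs):
    # ceil(log2 nprocs) steps, each moving about half of the local data.
    steps = math.ceil(math.log2(nprocs))
    return steps * (latency + (nprocs / 2) * msg_bytes / bandwidth)

def prune_candidates(models, latency, bandwidth, msg_bytes, nprocs, slack=2.0):
    """Keep only algorithms predicted to be within `slack` times the best."""
    predicted = {
        name: model(latency, bandwidth, msg_bytes, nprocs)
        for name, model in models.items()
    }
    best = min(predicted.values())
    return [name for name, cost in predicted.items() if cost <= slack * best]

MODELS = {"pairwise": pairwise_cost, "bruck": bruck_cost}

# Latency and bandwidth would be measured at the first call of the collective.
small = prune_candidates(MODELS, 1e-5, 1e9, 8, 64)
large = prune_candidates(MODELS, 1e-5, 1e9, 1_000_000, 64)
```

Under these assumed models, the surviving candidates would then be timed at runtime as in the dynamic approach, so slow algorithms are never executed just to be measured.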



“…Figure 7 measures the average skew per frame in milliseconds. The total skew is calculated by taking the square root of the sum of the square of the differences between retrieval times of items from the two consumers. Clearly channel groups lead to significantly lower skew and, although these results are somewhat obvious, the measurements simply provide a compact characterization of the performance of channel groups and show channel groups do provide viable synchronization in a realistic application scenario.…”

Section: Preliminary Performance (mentioning)

confidence: 99%

Stampede RT: Programming Abstractions for Live Streaming Applications

Hilley

Ramachandran

2007

27th International Conference on Distributed Computing Systems (ICDCS '07)


We present Stampede RT, middleware designed to provide a natural programming model appropriate for live streaming applications. Such applications require pervasive access to multiple streaming data sources for distributed online analysis. One motivating example is a distributed robotics application that analyzes live camera feeds for control and planning. Most existing middleware for streaming data focuses on media streams and low-level transport characteristics such as delivery latency and efficient transfer, but does not define a programming model to succinctly express applications that manipulate and analyze the streaming content. Stampede RT provides for straightforward transport and manipulation of temporally ordered data streams, enabling simple synchronization and correlation of data sources. We present an abstract programming model to support this class of applications and then describe a concrete realization of the model as a distributed middleware architecture. We also evaluate our implementation of the architecture and present several motivating applications that Stampede RT is designed to support.
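The total-skew metric described in the citation statement for this paper (square root of the sum of squared differences between the two consumers' retrieval times) can be sketched in Python as follows. The function name `total_skew` and the sample timestamps are hypothetical, and the sketch assumes both consumers retrieve the same items in the same order.

```python
import math

def total_skew(times_a, times_b):
    """Square root of the sum of squared retrieval-time differences.

    times_a and times_b hold the retrieval times (e.g. in milliseconds)
    of the same sequence of items at two consumers.
    """
    if len(times_a) != len(times_b):
        raise ValueError("both consumers must report the same number of items")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(times_a, times_b)))

# Two items: differences of -3 ms and 4 ms.
skew = total_skew([1.0, 5.0], [4.0, 1.0])
```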


“…Indirect algorithms for collective communication have also been addressed in [1]. Here, interprocessor data communication is performed in a "combine and forward" manner.…”

Section: Cyclic(x) (mentioning)

confidence: 99%

“…The communication overheads can be represented using an analytical model of typical distributed memory machines, the General purpose Distributed Memory (GDM) model [24]. Similar models are reported in the literature [1], [3], [4]. The GDM model represents the communication time of a message passing operation using two parameters: the start-up time T_d and the unit data transmission time τ_d.…”

Section: The Cost Of Redistribution (mentioning)

confidence: 99%
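A two-parameter model of this kind is commonly written as a linear cost, start-up time plus per-unit time multiplied by message length. The Python sketch below assumes that linear form, which the quoted statement does not spell out; the function names and sample values are hypothetical.

```python
def gdm_message_time(startup, per_unit, units):
    """Estimated time for one message: start-up cost plus per-unit transfer cost.

    startup plays the role of T_d, per_unit the role of tau_d (assumed linear form).
    """
    return startup + per_unit * units

def schedule_time(startup, per_unit, units_per_step):
    """Total time for a sequence of communication steps, one message per step."""
    return sum(gdm_message_time(startup, per_unit, u) for u in units_per_step)

one_message = gdm_message_time(10.0, 0.5, 100)
two_steps = schedule_time(10.0, 0.5, [100, 200])
```

Under this assumed form, reducing the number of communication steps saves one start-up cost per step, which is why step counts matter when the start-up term dominates.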


Efficient Algorithms for Block-Cyclic Redistribution of Arrays

1999


The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation, wherein the destination locations for individual data blocks are calculated, and data communication, wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct, indirect, and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K, the redistribution can be performed in O(log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations.
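To make the index set computation concrete, the following Python sketch enumerates which elements move between which processors when a one-dimensional array goes from a cyclic(x) layout to a cyclic(Kx) layout. It is a brute-force illustration of the standard block-cyclic ownership rule, not the circulant-matrix algorithm of the paper; the names `owner` and `send_sets` are hypothetical.

```python
def owner(index, block, nprocs):
    """Processor that owns a global index under a cyclic(block) layout."""
    return (index // block) % nprocs

def send_sets(n, x, k, nprocs):
    """Map (source, destination) processor pairs to the global indices that
    move when an n-element array changes from cyclic(x) to cyclic(k * x)."""
    sets = {}
    for i in range(n):
        src = owner(i, x, nprocs)
        dst = owner(i, k * x, nprocs)
        sets.setdefault((src, dst), []).append(i)
    return sets

# 8 elements, 2 processors, block size 1 -> 2 (K = 2).
moves = send_sets(8, 1, 2, 2)
```

Each processor can evaluate the ownership rule locally, which is consistent with the abstract's point that index sets can be computed without interprocessor communication, although this sketch walks every element instead of using a closed form.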

