Practical parallel algorithms for personalized communication and integer sorting

Bader, David A.; Helman, David R.; JáJá, Joseph F.

doi:10.1145/235141.235148

Cited by 27 publications

(22 citation statements)

References 36 publications


“…This improves aggregate performance to 23,910 MByte/s (46.7 MByte/s per processor), or 62% of the nominal bisection bandwidth. 2 The nominal bisection bandwidth is based on a link speed of 75 MByte/s. Experimentally, 78.9 MByte/s can be achieved for unidirectional traffic and 73.2 MByte/s for simultaneous bidirectional traffic between two nodes, each consisting of two processors.…”

Section: Adapting To the Routing Table

mentioning

confidence: 99%
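The figures in the statement above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the excerpt; the last step assumes, purely for illustration, that the nominal bisection bandwidth counts one 75 MByte/s link per processor, which the excerpt does not state.

```python
# Figures quoted in the citation statement above.
aggregate_mb_s = 23_910.0       # aggregate throughput, MByte/s
per_processor_mb_s = 46.7       # throughput per processor, MByte/s
link_mb_s = 75.0                # nominal link speed, MByte/s
reported_fraction = 0.62        # reported share of nominal bisection bandwidth

# Processor count implied by aggregate / per-processor throughput.
processors = round(aggregate_mb_s / per_processor_mb_s)

# Nominal bisection bandwidth implied by the 62% figure.
implied_nominal_mb_s = aggregate_mb_s / reported_fraction

# ASSUMPTION (not in the excerpt): one nominal link counted per processor.
assumed_nominal_mb_s = processors * link_mb_s
assumed_fraction = aggregate_mb_s / assumed_nominal_mb_s

print(processors, round(implied_nominal_mb_s), assumed_nominal_mb_s)
print(f"{assumed_fraction:.1%}")
```

Under that assumption the quoted aggregate rate, per-processor rate, and percentage are mutually consistent to within rounding.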

“…Most practical methods use a priori knowledge about the communication pattern, and attempt to minimize congestion and contention in the network. For further details, we refer the reader to a good description of the history of AAPC and a survey of algorithms [2].…”

Section: Introduction

mentioning

confidence: 99%

“…For many modern parallel machines, the fastest sorting algorithms are based on counting algorithms (e.g., radix sorts). Again, we refer to previous surveys of sorting algorithms and implementations [3,10,2,5].…”

Section: Introduction

mentioning

confidence: 99%

“…A previously reported portable implementation of radix sort written in Split-C achieves a sorting performance of 4 million 32-bit keys in just under 6 seconds on an 8-processor T3D [2], for a sorting performance of approximately 330 kBytes/s per processor. This is equivalent to two 16-bit counting sort passes, each with a memory performance of approximately 660 kBytes/s per processor.…”

mentioning

confidence: 99%
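The statement above treats a 32-bit radix sort as two 16-bit counting sort passes and derives per-processor rates from the reported run. The sequential sketch below illustrates that decomposition and the arithmetic. It is not the Split-C implementation the statement refers to; the function names are hypothetical, and the 6-second figure is used as a round stand-in for "just under 6 seconds".

```python
def counting_sort_pass(keys, shift, bits=16):
    """Stable counting sort on the `bits`-wide digit starting at bit `shift`."""
    radix = 1 << bits
    mask = radix - 1

    # Histogram of digit values.
    counts = [0] * radix
    for key in keys:
        counts[(key >> shift) & mask] += 1

    # Exclusive prefix sum: counts[d] becomes the first output slot for digit d.
    total = 0
    for digit in range(radix):
        count = counts[digit]
        counts[digit] = total
        total += count

    # Place keys in input order so that equal digits keep their relative order.
    out = [0] * len(keys)
    for key in keys:
        digit = (key >> shift) & mask
        out[counts[digit]] = key
        counts[digit] += 1
    return out


def radix_sort_32(keys):
    """Sort unsigned 32-bit keys with two 16-bit passes, low digit first."""
    return counting_sort_pass(counting_sort_pass(keys, 0), 16)


def per_processor_rate(num_keys, key_bytes, seconds, processors):
    """Bytes sorted per second per processor."""
    return num_keys * key_bytes / seconds / processors


# 4 million 32-bit keys in about 6 s on 8 processors.
whole_sort = per_processor_rate(4_000_000, 4, 6.0, 8)
# Each of the two passes moves the same data in about half the time.
single_pass = per_processor_rate(4_000_000, 4, 3.0, 8)
print(round(whole_sort / 1000), round(single_pass / 1000))  # kBytes/s
```

The computed rates, roughly 333 and 667 kBytes/s, agree with the approximate 330 and 660 kBytes/s quoted in the statement.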


From AAPC algorithms to high performance permutation routing and sorting

Stricker

Hardwick

1996

Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures - SPAA '96


Several recent papers have proposed or analyzed optimal algorithms to route all-to-all personalized communication (AAPC) over communication networks such as meshes, hypercubes and omega switches. However, the constant factors of these algorithms are often an obscure function of system parameters such as link speed, processor clock rate, and memory access time. In this paper we investigate these architectural factors, showing the impact of the communication style, the network routing table, and most importantly, the local memory system, on AAPC performance and permutation routing on the Cray T3D.

The fast hardware barriers on the T3D permit a straightforward AAPC implementation using routing phases separated by barriers, which improve performance by controlling congestion. However, we found that a practical implementation was difficult, and the resulting AAPC performance was less than expected. After detailed analysis, several corrections were made to the AAPC algorithm and to the machine's routing table, raising the performance from 41% to 74% of the nominal bisection bandwidth of the network.

Most AAPC performance measurements are for permuting large, contiguous blocks of data (i.e., every processor has an array of P contiguous elements to be sent to every other processor). In practice, sorting and true h,h permutation routing require data elements to be gathered from their source location into a buffer, transferred over the network, and scattered into their final location in a destination array. We obtain an optimal T3D implementation by chaining local and remote memory operations together. We quantify the implementation's efficiency both experimentally and theoretically, using the recently-introduced copy transfer model, and present results for a counting sort based on this AAPC implementation.
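The abstract describes two ideas that a small sequential simulation can make concrete: AAPC organised as routing phases separated by barriers, and permutation routing as gather, transfer, scatter. The sketch below is a toy model under stated assumptions, not the T3D implementation. The cyclic-shift schedule, the function names, and the use of Python lists for processor memories are all hypothetical choices made for illustration.

```python
def phased_aapc(send):
    """Simulate AAPC in P phases; send[i][j] is the block from processor i to j.

    ASSUMED schedule: in phase k, processor i sends to (i + k) % P. Each phase
    is a permutation, so every processor sends and receives exactly one block.
    The end of each loop iteration stands in for a hardware barrier.
    """
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    for phase in range(p):
        for src in range(p):
            dst = (src + phase) % p
            recv[dst][src] = send[src][dst]
        # (barrier: no processor starts phase + 1 before all finish this phase)
    return recv


def route_permutation(local_data, dest):
    """Route every element to its (processor, index) destination.

    local_data[i][k] is the k-th element held by processor i, and dest[i][k]
    is its target as a (processor, index) pair. ASSUMPTION: the indices that
    arrive at each processor are exactly 0..n-1 for that processor.
    """
    p = len(local_data)

    # Gather: collect elements into one outgoing buffer per destination.
    send = [[[] for _ in range(p)] for _ in range(p)]
    for i in range(p):
        for value, (proc, index) in zip(local_data[i], dest[i]):
            send[i][proc].append((index, value))

    # Transfer: exchange the buffers with the phased AAPC above.
    recv = phased_aapc(send)

    # Scatter: write each received element into its final slot.
    result = []
    for j in range(p):
        size = sum(len(block) for block in recv[j])
        slots = [None] * size
        for block in recv[j]:
            for index, value in block:
                slots[index] = value
        result.append(slots)
    return result


routed = route_permutation(
    [["a", "b"], ["c", "d"]],
    [[(1, 0), (0, 1)], [(0, 0), (1, 1)]],
)
print(routed)
```

In this model the gather and scatter steps touch only local memory and the transfer step is the only communication, which mirrors the separation the abstract draws between local memory operations and network transfer.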


“…Proof. The proof of the maximum message sizes is given in [4]. In the following, we give a proof for the minimum message sizes.…”

Section: Algorithm 1 Balancedrouting (From [4])

mentioning

confidence: 99%

Reducing I/O complexity by simulating coarse grained parallel algorithms

Dehne

Hutchinson

Maheshwari

et al.

Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SP


A Randomized Parallel Sorting Algorithm with an Experimental Study

Helman

Bader

JáJá

1998

Journal of Parallel and Distributed Computing

Self Cite


Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. In this paper, we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, it seems to outperform all similar algorithms known to the authors on these platforms, and its performance is invariant over the set of input distributions unlike previous efficient algorithms. Our results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark. © 1998 Academic Press, Inc.
