Proceedings of the 9th International Conference on Supercomputing - ICS '95 1995
DOI: 10.1145/224538.224539
Decoupling synchronization and data transfer in message passing systems of parallel computers

Cited by 26 publications (16 citation statements)
References 13 publications
“…It does not incur any copying/buffering during a data transfer, since low communication overhead is critical for sparse code with mixed granularities. RMA is available in modern multiprocessor architectures such as Cray-T3D [34], T3E [32], and Meiko CS-2 [15]. Since the RMA directly writes data to a remote address, it is possible that the content at the remote address is still being used by other tasks and, then, the execution at the remote processor could be incorrect.…”
Section: Scheduling and Run-time Support for 1D Methods
confidence: 99%
“…The communication network of the T3D is a 3D torus. Cray provides a shared memory access library called shmem, which can achieve 126 Mbytes/s bandwidth and 2.7ms communication overhead using the shmem_put() primitive [34]. We have used shmem_put() for the communications in all the implementations.…”
Section: Experimental Studies
confidence: 99%
“…Based on this first series of experiments alone, it cannot be concluded whether overlap of computation and communication is beneficial or detrimental to the performance and scalability of CHARMM on a particular platform. Decoupling computation, synchronization, and data transfer resulted in better performance for certain compiled parallel programs on the Cray T3D and other machines [21].…”
Section: Previous Work
confidence: 99%
“…While the data elements are stored in a distributed array, the permutation itself is specified by a table of index pairs, where each table entry contains a source index and a destination index. Using the direct deposit model [7], synchronization and consistency are guaranteed by the use of hardware barriers, and the data transfers are performed by remote stores using the messaging system. For distributed memory systems the index relation table must be ordered so that all transfers for a given source-destination pair are grouped together.…”
Section: Impact of Memory System Performance
confidence: 99%