Global combine on mesh architectures with wormhole routing

Barnett, Mike; Littlefield, Richard J.; Payne, D. G.; Geijn, Robert van de

doi:10.1109/ipps.1993.262873

Cited by 62 publications

(36 citation statements)

References 2 publications

“…Our results also indicate that the performance of an implementation is influenced by the relationship among parameters of the parallel machine, as well as by the relationship of the parameters to the amount of data involved. This agrees with other research done on the implementation of communication operations [1,2,4,19].…”

Section: Validation Through Communication Operations (supporting)

confidence: 92%

C3: A Parallel Model for Coarse-Grained Machines

Hambrusch

Khokhar

1996

Journal of Parallel and Distributed Computing

“…Ideas from our previous work on performing the global combine can be used to obtain an alternative tradeoff between the startup cost and the transfer cost [4]. We first present a simple algorithm for one-dimensional meshes, and then extend it to the two-dimensional case.…”

Section: Alternative Algorithm: Scatter-collect (mentioning)

confidence: 99%

Broadcasting on Meshes with Wormhole Routing

Barnett

Payne

Geijn

et al. 1996

Journal of Parallel and Distributed Computing

“…Reduction collectives entail both communication (data transfer) and processing (data reduction operations), and therefore efficient implementations must consider the characteristics of the network, the processor, and the interactions between them. Over the years, many researchers have dedicated significant effort to derive optimal and scalable algorithms [1,2,3,4,5,8]. However, with respect to the underlying system characteristics, all of this work commonly assumed reduction processing must be performed by the host CPU.…”

Section: Introduction (mentioning)

confidence: 99%

NIC-based reduction algorithms for large-scale clusters

Petrini

Moody

Fernández

et al. 2006

IJHPCN

Efficient algorithms for reduction operations across a group of processes are crucial for good performance in many large-scale, parallel scientific applications. While previous algorithms limit processing to the host CPU, we utilize the programmable processors and local memory available on modern cluster network interface cards (NICs) to explore a new dimension in the design of reduction algorithms. In this paper, we present the benefits and challenges, design issues and solutions, analytical models, and experimental evaluations of a family of NIC-based reduction algorithms. Performance and scalability evaluations were conducted on the ASCI Linux Cluster (ALC), a 960-node, 1920-processor machine at Lawrence Livermore National Laboratory, which uses the Quadrics QsNet interconnect. We find NIC-based reductions on modern interconnects to be more efficient than host-based implementations in both scalability and consistency. In particular, at large scale (1812 processes), NIC-based reductions of small integer and floating-point arrays provided respective speedups of 121% and 39% over the host-based, production-level MPI implementation.
