Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather

Kandalla, Krishna; Subramoni, Hari; Vishnu, Abhinav; Panda, Dhabaleswar K.

doi:10.1109/ipdpsw.2010.5470853

Cited by 60 publications

(42 citation statements)

References 12 publications


“…Most of the previous work [2,12,27,25,26,18,13,16] addresses congestion in the core (switches) of HPC networks. As our experimental evaluation shows, the advent of multicore processors introduces congestion at the edge of these networks and mechanisms to handle Concurrency Congestion are required for best performance on contemporary hardware.…”

Section: Discussion (mentioning)

confidence: 99%

“…Dvorak et al [13] described techniques for topology aware scheduling of many-to-many collective operations. Kandalla et al [16] discussed topology aware scatter and gather for large scale InfiniBand clusters. Thakur et al [22] discussed the scalability of MPI collectives and described implementations that use multiple algorithms in order to alleviate congestion in data intensive operations such as all-to-all.…”

Section: Related Work (mentioning)

confidence: 99%

“…With more cores per node, the likelihood of any node sending messages to multiple nodes at any time is higher, thus making "all-to-all" patterns the norm. These patterns are dynamic, while the whole body of work in algorithmic scheduling [25,26,18,13,16,22] addresses only static patterns. Second, the low level congestion control mechanisms [2] already require non-trivial extensions to handle multiple concurrent flows and to deal with runtime software artifacts such as multiplexing processes, pthreads on multiple endpoints or "interfaces".…”

Section: Node Level Proactive Congestion Avoidance (mentioning)

confidence: 99%


Congestion avoidance on manycore high performance computing systems

Luo, Panda, Ibrahim, et al. 2012

Proceedings of the 26th ACM International Conference on Supercomputing

Self Cite


Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo, which employs a reactive approach, e.g. congestion control mechanisms are activated only when resources have been exhausted. We present a core stateless optimization approach based on open loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and the Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvements, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.
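The idea of throttling the number of messages in flight at an endpoint can be illustrated with a small sketch. This is not the authors' runtime: the class `ThrottledEndpoint`, the function `run_demo`, and the fixed limit `max_in_flight` are hypothetical names chosen for illustration, and a short sleep stands in for the network transfer. The limit is static (no feedback from the network), which is one plausible reading of "open loop" throttling.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledEndpoint:
    """Hypothetical endpoint that caps the number of concurrent sends."""

    def __init__(self, max_in_flight):
        # Each in-flight message holds one semaphore slot.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak_in_flight = 0
        self.sent = 0

    def send(self, payload):
        # Block until a slot is free, so at most max_in_flight sends overlap.
        with self._slots:
            with self._lock:
                self.in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            time.sleep(0.001)  # stand-in for the actual network transfer
            with self._lock:
                self.in_flight -= 1
                self.sent += 1


def run_demo(num_messages=32, senders=8, max_in_flight=2):
    """Issue sends from several threads through one throttled endpoint."""
    endpoint = ThrottledEndpoint(max_in_flight)
    with ThreadPoolExecutor(max_workers=senders) as pool:
        list(pool.map(endpoint.send, range(num_messages)))
    return endpoint
```

In this sketch eight sender threads share one endpoint, yet the semaphore keeps the observed peak at or below the configured limit; a real runtime would apply the same cap to network operations rather than to sleeps.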


“…Other related studies focused on optimizing the performance of MPI collective communication by proposing topology aware mechanisms (Gong et al, 2013;Subramoni et al, 2011;Kandalla et al, 2010) and process arrival patterns aware mechanisms (Qian and Afsahi, 2009;Patarasuk and Yuan, 2008) to achieve the best performance in terms of time.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Analysis of Message Passing Interface Collective Communication on Intel Xeon Quad-Core Gigabit Ethernet and Infiniband Clusters

Ismail, Hamid, Othman, et al. 2013

Journal of Computer Science


The performance of MPI implementation operations still presents critical issues for high performance computing systems, particularly for more advanced processor technology. Consequently, this study concentrates on benchmarking MPI implementation on multi-core architecture by measuring the performance of Open MPI collective communication on Intel Xeon dual quad-core Gigabit Ethernet and InfiniBand clusters using SKaMPI. It focuses on well known collective communication routines such as MPI-Bcast, MPI-AlltoAll, MPI-Scatter and MPI-Gather. From the collection of results, MPI collective communication on InfiniBand clusters had distinctly better performance in terms of latency and throughput. The analysis indicates that the algorithm used for collective communication performed very well for all message sizes except for MPI-Bcast and MPI-Alltoall operation of inter-node communication. However, InfiniBand provides the lowest latency for all operations since it provides applications with an easy to use messaging service, compared to Gigabit Ethernet, which still requests the operating system for access to one of the server communication resources with the complex dance between an application and a network.


“…Recent works [3], [4], [5], [6], [8], [9], [10], [11] have shown substantial communication performance improvement on large parallel machines by suitable assignment of processes or tasks to nodes of the machine. Earlier works on graph embedding are usually not suitable for modern machines because the earlier works used metrics suitable for a store-and-forward communication mechanism.…”

Section: Introduction (mentioning)

confidence: 99%

Optimization of the hop-byte metric for effective topology aware mapping

Sudheer, Srinivasan 2012

2012 19th International Conference on High Performance Computing


Suitable mapping of processes to the nodes of a massively parallel machine can substantially improve communication performance by reducing network congestion. The hop-byte metric has been used as a measure of the quality of such a mapping by several recent works. Optimizing this metric is NP hard, and thus heuristics are applied. However, the heuristics proposed so far do not directly try to optimize this metric. Rather, they use some intuitive methods for reducing congestion and use the metric just to evaluate the quality of the mapping. In fact, heuristics intending to optimize other metrics too don't directly optimize for them, but, rather, use the metric to evaluate the results of the heuristic. In contrast, we pose the mapping problem with the hop-byte metric as a quadratic assignment problem and use a heuristic to directly optimize for this metric. We evaluate our approach on realistic node allocations obtained on the Kraken system at NICS. Our approach yields values for the metric that are up to 75% lower than the default mapping and 66% lower than existing heuristics. However, the time taken to produce the mapping can be substantially more, which makes this suitable for somewhat static, though possibly irregular, communication patterns. We introduce new heuristics that reduce the time taken to be comparable to that of existing fast heuristics, while still producing mappings of higher quality than existing ones. We also use theoretical lower bounds to suggest that our mapping may be close to optimal, at least for medium sized problems. Consequently, our work can also provide insight into the tradeoff between mapping quality and time taken by other mapping heuristics.
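For readers unfamiliar with the hop-byte metric, a minimal sketch of how it can be evaluated for a given mapping follows. It uses the common definition (bytes exchanged by each process pair, multiplied by the number of network hops between the nodes they are mapped to, summed over all pairs). The torus topology, the function names `torus_hops` and `hop_bytes`, and the traffic values are assumptions made for illustration; they are not taken from the paper, and the sketch only evaluates the metric rather than optimizing it.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus of size dims."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(traffic, mapping, dims):
    """Sum of bytes * hops over all communicating process pairs.

    traffic: {(src_rank, dst_rank): bytes}
    mapping: {rank: node coordinate tuple}
    """
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in traffic.items()
    )


# Hypothetical example: four processes exchanging 100 bytes in a ring,
# placed on a 4x4 torus in two different ways.
dims = (4, 4)
traffic = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}
compact = {0: (0, 0), 1: (0, 1), 2: (0, 2), 3: (0, 3)}
scattered = {0: (0, 0), 1: (2, 2), 2: (0, 1), 3: (2, 3)}

compact_cost = hop_bytes(traffic, compact, dims)      # every pair is 1 hop apart
scattered_cost = hop_bytes(traffic, scattered, dims)  # pairs are 3 or 4 hops apart
```

A mapping heuristic would search over `mapping` to lower this value; the compact placement above scores lower than the scattered one because every communicating pair sits on adjacent nodes.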
