Scaling all-to-all multicast on fat-tree networks

Kumar, Sameer; Kalé, Laxmikant V.

doi:10.1109/icpads.2004.1316097

Cited by 22 publications

(21 citation statements)

References 20 publications


“…Most of the previous work [2,12,27,25,26,18,13,16] addresses congestion in the core (switches) of HPC networks. As our experimental evaluation shows, the advent of multicore processors introduces congestion at the edge of these networks and mechanisms to handle Concurrency Congestion are required for best performance on contemporary hardware.…”

Section: Discussion

mentioning

confidence: 99%

“…Yang and Wang [25,26] discussed algorithms for near optimal all-to-all broadcast on meshes and tori. Kumar and Kale [18] discussed algorithms to optimize all-to-all multicast on fat-tree networks. Dvorak et al [13] described techniques for topology aware scheduling of many-to-many collective operations.…”

Section: Related Work

mentioning

confidence: 99%

“…With more cores per node, the likelihood of any node sending messages to multiple nodes at any time is higher, thus making "all-to-all" patterns the norm. These patterns are dynamic, while the whole body of work in algorithmic scheduling [25,26,18,13,16,22] addresses only static patterns. Second, the low level congestion control mechanisms [2] already require non-trivial extensions to handle multiple concurrent flows and to deal with runtime software artifacts such as multiplexing processes, pthreads on multiple endpoints or "interfaces".…”

Section: Node Level Proactive Congestion Avoidance

mentioning

confidence: 99%


Congestion avoidance on manycore high performance computing systems

Luo, Panda, Ibrahim et al. 2012

Proceedings of the 26th ACM International Conference on Supercomputing

Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo which employs a reactive approach, e.g. congestion control mechanisms are activated only when resources have been exhausted. We present a core stateless optimization approach based on open loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and the Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvements, while throttling the number of active cores per node can provide additional 40% and 6X performance improvement for UPC and MPI respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.


“…Implementations can be designed to exploit the host machine's native network architecture, but a poor MPI implementation can be a source of serious performance problems in large-scale applications. For example, even on a high-bandwidth InfiniBand network, an implementation of collective operations such as multicast must avoid congestion to achieve good performance (Kumar and Kale, 2004).…”

mentioning

confidence: 99%

Scalable Performance Measurement and Analysis

Gamblin 2009

TODD GAMBLIN: Scalable Performance Measurement and Analysis. (Under the direction of Daniel A. Reed.)

Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Modern machines may contain 100,000 or more microprocessor cores, and the largest of these, IBM's Blue Gene/L, contains over 200,000 cores. Future systems are expected to support millions of concurrent tasks. In this dissertation, we focus on efficient techniques for measuring and analyzing the performance of applications running on very large parallel machines.

Tuning the performance of large-scale applications can be a subtle and time-consuming task because application developers must measure and interpret data from many independent processes. While the volume of the raw data scales linearly with the number of tasks in the running system, the number of tasks is growing exponentially, and data for even small systems quickly becomes unmanageable. Transporting performance data from so many processes over a network can perturb application performance and make measurements inaccurate, and storing such data would require a prohibitive amount of space. Moreover, even if it were stored, analyzing the data would be extremely time-consuming.

In this dissertation, we present novel methods for reducing performance data volume. The first draws on multi-scale wavelet techniques from signal processing to compress systemwide, time-varying load-balance data. The second uses statistical sampling to select a small subset of running processes to generate low-volume traces. A third approach combines sampling and wavelet compression to stratify performance data adaptively at run-time and to reduce further the cost of sampled tracing. We have integrated these approaches into Libra, a toolset for scalable load-balance analysis. We present Libra and show how it can be used to analyze data from large scientific applications scalably.

Without the values that they, along with my grandparents, instilled in me, I would not be the person I am today.

Thanks to Dan Reed, my advisor, for sticking with me to the end, even at times when I was unsure whether I would finish. Despite his busy schedule, he was available for advice when I needed it. Even if our typical meetings were short, the advice Dan provided was always excellent, and his well-timed words of encouragement kept me going even when I was on the brink of ditching this whole Ph.D. gig.

Thanks to Rob Fowler for his constant advice while I was at RENCI. His extensive input on my papers and on this dissertation has been invaluable. Thanks also to Niki Fowler for her assistance in proofreading my final draft, and to Allan Porterfield for the many useful technical discussions we had at RENCI. I am grateful to Bronis de Supinski and Martin Schulz at Lawrence Livermore National Laboratory for their research insights, constant availability, and for giving me the opportunity to continue working with them after graduation as a postdoctoral scholar. I learn something new every day I work at the la...


“…Many all-to-all broadcast algorithms were designed for specific network topologies that are used in parallel machines, including hypercube [8,20], mesh [15,18,21], torus [21], k-ary n-cube [20], fat tree [10], and star [14]. Work in [9] optimizes MPI collective communications, including MPI_Allgather, on wide area networks.…”

Section: Related Work

mentioning

confidence: 99%

Bandwidth Efficient All-to-All Broadcast on Switched Clusters

Faraj, Patarasuk, Zhong 2005

2005 IEEE International Conference on Cluster Computing

We develop an all-to-all broadcast scheme that achieves maximum bandwidth efficiency for clusters with tree topologies. Using our scheme for clusters with cut-through switches, any tree topology can support all-to-all broadcast as efficiently as a single switch connecting all machines when the message size is sufficiently large. Since a tree topology can be embedded in almost any connected network, it follows that efficient all-to-all broadcast can be achieved in almost all topologies, regular or irregular. To perform all-to-all broadcast efficiently on clusters with store-and-forward switches, the algorithm must minimize the communication path lengths in addition to maximizing bandwidth efficiency. This turns out to be a harder algorithmic problem. We develop schemes that give solutions to common cases for such systems. The performance of our algorithms is evaluated on Ethernet switched clusters with different topologies. The results confirm our theoretical finding. Furthermore, depending on the topology, our algorithms sometimes out-perform the topology-unaware algorithms used in MPI libraries, including MPICH and LAM/MPI, to a very large degree.

