Abstract: Suitable mapping of processes to the nodes of a massively parallel machine can substantially improve communication performance by reducing network congestion. Several recent works have used the hop-byte metric as a measure of the quality of such a mapping. Optimizing this metric is NP-hard, and thus heuristics are applied. However, the heuristics proposed so far do not try to optimize this metric directly. Rather, they use intuitive methods for reducing congestion and use the metric only to evaluate the quality of the resulting mapping. In fact, heuristics intended to optimize other metrics likewise do not optimize them directly, but use the metric only to evaluate the results of the heuristic. In contrast, we pose the mapping problem with the hop-byte metric as a quadratic assignment problem and use a heuristic to optimize this metric directly. We evaluate our approach on realistic node allocations obtained on the Kraken system at NICS. Our approach yields values for the metric that are up to 75% lower than the default mapping and 66% lower than existing heuristics. However, the time taken to produce the mapping can be substantially longer, which makes this approach suitable for somewhat static, though possibly irregular, communication patterns. We introduce new heuristics that reduce the time taken to be comparable to that of existing fast heuristics, while still producing mappings of higher quality than existing ones. We also use theoretical lower bounds to suggest that our mapping may be close to optimal, at least for medium-sized problems. Consequently, our work can also provide insight into the tradeoff between mapping quality and time taken by other mapping heuristics.
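To make the hop-byte metric concrete, the following is a minimal sketch, not the authors' code. It assumes the usual definition of the metric: for every communicating pair of processes, multiply the bytes exchanged by the number of network hops between the nodes they are mapped to, then sum over all pairs. The torus topology, the function names `torus_hops` and `hop_bytes`, and the dictionary-based inputs are all assumptions made for illustration.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus.

    Assumption: the network is a torus with wraparound links in every
    dimension; dims gives the size of each dimension.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm, mapping, dims):
    """Sum of bytes * hops over all communicating process pairs.

    comm    : dict {(src_rank, dst_rank): bytes_sent}   (hypothetical input)
    mapping : dict {rank: node_coordinate_tuple}        (hypothetical input)
    """
    total = 0
    for (src, dst), nbytes in comm.items():
        total += nbytes * torus_hops(mapping[src], mapping[dst], dims)
    return total


# Example: three processes on a 4 x 4 x 4 torus.
dims = (4, 4, 4)
comm = {(0, 1): 100, (1, 2): 50}
mapping = {0: (0, 0, 0), 1: (3, 0, 0), 2: (1, 2, 0)}
print(hop_bytes(comm, mapping, dims))
```

In this form the quadratic-assignment structure is visible: the cost is a sum of products of a flow term (bytes between processes) and a distance term (hops between the nodes they are assigned to), and the optimization variable is the assignment `mapping`.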
The Cell is a heterogeneous multicore processor that has attracted much attention in the HPC community.
Diffusion Monte Carlo is a highly accurate Quantum Monte Carlo method for the electronic structure of materials, but it requires frequent load balancing or population redistribution steps to maintain efficiency on parallel machines. This step can be a significant factor affecting performance, and will become more important as the number of processing elements increases. We propose a new dynamic load balancing algorithm, the Alias Method, and evaluate it theoretically and empirically. An important feature of the new algorithm is that the load can be perfectly balanced with each process receiving at most one message. It is also optimal in the maximum size of messages received by any process. We also optimize its implementation to reduce network contention, a process facilitated by the low messaging requirement of the algorithm: a simple renumbering of the MPI ranks based on proximity and a space-filling curve significantly improves the MPI_Allgather performance. Empirical results on the petaflop Cray XT Jaguar supercomputer at ORNL show up to 30% improvement in performance on 120,000 cores. The load balancing algorithm may be straightforwardly implemented in existing codes. The algorithm may also be employed by any method with many near-identical computational tasks that requires load balancing.
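The "at most one message per receiver" property can be illustrated with a small sketch. This is not the authors' implementation; it is a hypothetical serial planner modeled on the pairing step of Walker's alias-table construction, and the function name `alias_balance` and its return format are assumptions. Each under-loaded process is filled to its target by a single over-loaded process; a sender that drops below its own target after sending then becomes a receiver itself.

```python
def alias_balance(loads):
    """Plan transfers so that every process receives at most one message.

    loads: list of integer task counts, one per process (hypothetical input).
    Returns (messages, targets), where messages is a list of
    (sender, receiver, count) tuples and targets is the balanced load.
    """
    n = len(loads)
    q, r = divmod(sum(loads), n)
    # Assumption: when the total is not divisible by n, the first r
    # processes take one extra task.
    targets = [q + 1 if i < r else q for i in range(n)]
    current = list(loads)
    small = [i for i in range(n) if current[i] < targets[i]]
    large = [i for i in range(n) if current[i] > targets[i]]
    messages = []
    while small:
        s = small.pop()   # under-loaded process, filled in one message
        l = large.pop()   # over-loaded process that supplies the deficit
        amount = targets[s] - current[s]
        messages.append((l, s, amount))
        current[s] = targets[s]
        current[l] -= amount
        if current[l] > targets[l]:
            large.append(l)   # still has surplus to give away
        elif current[l] < targets[l]:
            small.append(l)   # gave away too much; will receive once
    return messages, targets


messages, targets = alias_balance([5, 4, 0])
for sender, receiver, count in messages:
    print(f"process {sender} sends {count} task(s) to process {receiver}")
```

Because the total load equals the total target throughout, an over-loaded process always exists whenever an under-loaded one remains, so the loop terminates with every process on target. In a real MPI code each process would need the full load vector to compute this plan locally, which is consistent with the abstract's mention of an allgather step.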
The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. In order to exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that distributed tag directories can be a greater bottleneck than the ring for small messages when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvement. We demonstrate these ideas by optimizing MPI collective calls. We obtain a speedup of 9x on barrier and a speedup of 10x on broadcast, when compared with Intel's MPI implementation. We also show the usefulness of our collectives in two realistic codes: particle transport and the load balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques used with programmer-controlled caches, such as double buffering, are also useful on MIC. These results can help optimize other communication-intensive codes running on MIC.