Shared memory architectures have recently experienced a large increase in thread-level parallelism, leading to complex memory hierarchies with multiple cache memory levels and memory controllers. These new designs created a Non-Uniform Memory Access (NUMA) behavior, where the performance and energy consumption of memory accesses depend on the place where the data is located in the memory hierarchy. Accesses to local caches or memory controllers are generally more efficient than accesses to remote ones. A common way to improve the locality and balance of memory accesses is to determine the mapping of threads to cores and data to memory controllers based on the affinity between threads and data. Such mapping techniques can operate at different hardware and software levels, which impacts their complexity, applicability, and the resulting performance and energy consumption gains. In this article, we introduce a taxonomy to classify different mapping mechanisms and provide a comprehensive overview of existing solutions.
High-Performance Computing (HPC) in the cloud has reached the mainstream and is currently a hot topic in the research community and the industry. The attractiveness of cloud for HPC is the capability to run large applications on powerful, scalable hardware without needing to actually own or maintain this hardware. In this paper, we conduct a detailed comparison of HPC applications running on three cloud providers, Amazon EC2, Microsoft Azure and Rackspace. We analyze three important characteristics of HPC, deployment facilities, performance and cost efficiency and compare them to a cluster of machines.For the experiments, we used the well-known NAS parallel benchmarks as an example of general scientific HPC applications to examine the computational and communication performance. Our results show that HPC applications can run efficiently on the cloud. However, care must be taken when choosing the provider, as the differences between them are large. The best cloud provider depends on the type and behavior of the application, as well as the intended usage scenario. Furthermore, our results show that HPC in the cloud can have a higher performance and cost efficiency than a traditional cluster, up to 27% and 41%, respectively. C OpenMP, MPI LU Lower and Upper Triangular Regular communication Fortran OpenMP, MPI MG Multigrid Regular communication Fortran OpenMP, MPI SP Scalar Pentadiagonal Floating point performance Fortran OpenMP, MPI UA Unstructured Adaptive Irregular communication Fortran OpenMP B. Machines
Abstract. Process placement is a technique widely used on parallel machines with heterogeneous interconnections to reduce the overall communication time. For instance, two processes which communicate frequently are mapped close to each other. Finding the optimal mapping between threads and cores in a shared-memory environment (for example, OpenMP and Pthreads) is an even more complex task due to implicit communication. In this work, we examine data sharing patterns between threads in different workloads and use those patterns in a similar way as messages are used to map processes in cluster computers. We evaluated our technique on two state-of-the-art multi-core processors and achieved moderate improvements in the common case and considerable improvements in some cases, reducing execution time by up to 45%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.