Abstract: The performance of collective communication operations is known to have a significant impact on the scalability of some applications. Indeed, the global, synchronous nature of some collective operations directly implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. This fact has led many researchers to try to improve the efficiency of collective operations. One popular approach improves the implementation of MPI collective operations by using intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s). Such implementations have shown significant improvement for microbenchmarks that isolate collective communication performance, but these results have not been shown to translate to significant increases in performance for real applications. In order for collective offload implementations to benefit real applications, a greater understanding of application behavior is needed. In this paper, we describe several characteristics of applications and application benchmarks that impact collective communication performance. We analyze network resource usage data in order to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.
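The potential benefit of non-blocking collectives is that communication latency can be hidden behind computation that does not depend on the collective's result. The pattern can be sketched without real MPI; in this toy Python illustration, a background thread and the made-up `offloaded_allreduce` helper stand in for a NIC-offloaded collective and for the post/wait pair (analogous to the `MPI_Iallreduce`/`MPI_Wait` interface later standardized in MPI-3):

```python
import threading
import time

# Hypothetical stand-in for a NIC-offloaded, non-blocking collective:
# the "network" computes the reduction while the host keeps working.
def offloaded_allreduce(values, out):
    time.sleep(0.05)          # simulated network/collective latency
    out.append(sum(values))   # reduction result delivered asynchronously

values = [1.0, 2.0, 3.0, 4.0]
out = []

t = threading.Thread(target=offloaded_allreduce, args=(values, out))
t.start()                                  # post the collective and return immediately
independent = sum(v * v for v in values)   # overlap: work not needing the result
t.join()                                   # block only when the result is required

print(out[0], independent)  # → 10.0 30.0
```

The host only pays for the collective's latency to the extent that it runs out of independent work to overlap with it, which is exactly the application characteristic the analysis above seeks to quantify.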
Understanding the message-passing behavior and network resource usage of distributed-memory message-passing parallel applications is critical to achieving high performance and scalability. While much research has focused on how applications use critical compute-related resources, relatively little attention has been devoted to characterizing the usage of network resources, specifically those needed by the network interface. This paper discusses the importance of understanding network interface resource usage requirements for parallel applications and describes an initial attempt to gather network resource usage data for several real-world codes. The results show widely varying usage patterns between processes in the same parallel job and indicate that resource requirements can change dramatically as process counts increase and input data changes. This suggests that general network resource management strategies may not be widely applicable, and that adaptive strategies or more fine-grained controls may be necessary for environments where network interface resources are severely constrained.
We are developing parallel programming models that are complementary to related projects and respond to unaddressed needs in the parallel computing community. These needs include incremental or partial migration of applications and their expert programmers from MPI, and efficient support for high-volume, random, fine-grained parallelism.

A programming model provides an abstraction for expressing parallelism in applications. This abstraction must be at an appropriate level such that inherent parallelism can be mapped to the capabilities of the underlying hardware. MPI is the de facto standard for high-performance computing, mainly because its abstraction closely matches distributed-memory architectures. However, certain types of parallelism, such as parallel graph algorithms, are difficult to express directly in MPI. Meanwhile, PetaFLOP-scale hardware is approaching and vendors are developing multi-core processors; MPI may not be a suitable programming model for these new architectures.

GAS (global address space) models are more expressive than MPI. These models can be realized as libraries (such as SHMEM, MPI-2, and Portals) that are callable from conventional languages, or as language extensions (such as UPC and Co-Array Fortran). Existing GAS models typically support one of two levels of abstraction: one-sided communication, which allows a processor to access another processor's memory without the remote processor's cooperation, or distributed shared memory, which provides a logically global view of the data. Accesses to shared data on other processors require communication, which is more expensive than access to local data, and too much fine-grained communication incurs a significant performance penalty from the latency of each separate transaction. For good performance, users must manage data locality carefully to minimize fine-grained communication. Without system-level support, the task of data locality management can diminish the convenience intended by this programming model.
Offering ad hoc support for random communication patterns is not enough; it leads to a large, ever-increasing number of such utilities, and again undermines the programming ease intended by the model. Therefore, a higher level of abstraction is desirable. Careful evaluation of the issues listed above and our in-depth study of Sandia applications suggest that the next appropriate level of abstraction should support high-volume, random, fine-grained parallel data access. Our work has three parts: BEC, a bootstrap approach to add GAS capabilities to MPI; PRAM C, a C language extension to support parallel random access and maximal expression of parallelism in virtual processors; and translation, a new scheme that statically compiles fine-grained parallelism into coarse-grained parallelism. Specifically, BEC (Bundle-Exchange-Compute) is an abstraction formalized from well-practiced MPI programming techniques. In dealing with high-volume, fine...

Copyright is held by the author/owner(s). SPAA'06, July 30-August 2, 2006, Cambridge, Massachusetts, USA. ACM 1-59593-452-9/06/0007.
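The latency argument behind bundling can be sketched with a toy linear latency-plus-bandwidth cost model. The constants and the `transfer_cost` helper below are made up for illustration; the point is only that aggregating many fine-grained remote accesses into one exchange, as a Bundle-Exchange-Compute cycle does, amortizes the fixed per-message cost:

```python
# Toy latency/bandwidth cost model (hypothetical units and constants)
LATENCY = 5.0      # fixed per-message startup cost
PER_BYTE = 0.01    # incremental cost per byte transferred

def transfer_cost(num_messages, total_bytes):
    """Linear model: cost = messages * latency + bytes * per-byte cost."""
    return num_messages * LATENCY + total_bytes * PER_BYTE

# 1000 independent 8-byte remote accesses vs. one bundled exchange of the same data
fine_grained = transfer_cost(1000, 1000 * 8)   # 1000*5.0 + 8000*0.01 = 5080.0
bundled = transfer_cost(1, 1000 * 8)           #    1*5.0 + 8000*0.01 =   85.0

print(fine_grained, bundled)  # → 5080.0 85.0
```

Under this model the bundled exchange moves the same 8000 bytes at a small fraction of the cost, because only one message startup is paid instead of a thousand.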
Sandia is a multiprogram laboratory, 2403 San Mateo NE, Albuquerque, New Mexico 87110.

Abstract: The items discussed in this report reflect the work in progress during FY98. As a way to bootstrap the DISCOM' Distance Computing Program, the SP2 Pilot Project was launched in March 1998. The Pilot was directed toward creating an environment that would allow Sandia users to run their applications on the Accelerated Strategic Computing Initiative's (ASCI) Blue Pacific computation platform, the unclassified IBM SP2 platform at Lawrence Livermore National Laboratory (LLNL). The DISCOM' Pilot leverages the ASCI PSE (Problem Solving Environment) efforts in networking and services to baseline the performance of the current system. Efforts in the following areas of the pilot are documented: applications, services, networking, visualization, and the system model. The report details not only the running of two Sandia codes, CTH and COYOTE, on the Blue Pacific platform, but also the building of the Sandia National Laboratories (SNL) proxy environment of RS6000 platforms to support Sandia users.