The Blue Gene/Q machine is the next generation in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and sixteen million threads. With each BG/Q node having 68 hardware threads, hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, are ideal and will enable applications to achieve high throughput on BG/Q. At such unprecedented parallelism and scale, this paper is a groundbreaking effort to explore the challenges of designing a communication library that can match and exploit that massive parallelism. In particular, we present the Parallel Active Messaging Interface (PAMI) library as our BG/Q solution to the many challenges that come with a machine at this scale. PAMI provides (1) novel techniques to partition the application communication overhead into many contexts that can be accelerated by communication threads; (2) client and context objects to support multiple and different programming paradigms; (3) lockless algorithms to speed up the MPI message rate; and (4) novel techniques that leverage new BG/Q architectural features, such as the scalable atomic primitives implemented in the L2 cache, the highly parallel hardware messaging unit that supports both point-to-point and collective operations, and the collective hardware acceleration for operations such as broadcast, reduce, and allreduce. We experimented with PAMI on 2048 BG/Q nodes, and the results show high message rates as well as low latencies and high throughputs for collective communication operations.
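As an illustration of the client/context model described above, the following is a minimal sketch in C against the pami.h interface. It assumes a BG/Q-style PAMI software stack; exact signatures can differ slightly between PAMI releases, and error checking is omitted.

```c
/* Minimal sketch: create a PAMI client and several communication
 * contexts, one per application thread, so that message progress can
 * be driven independently (and potentially offloaded to communication
 * threads). Assumes the pami.h C interface; error handling elided. */
#include <pami.h>

#define NUM_CONTEXTS 4  /* e.g., one context per application thread */

int main(void)
{
    pami_client_t  client;
    pami_context_t contexts[NUM_CONTEXTS];

    /* A "client" isolates one programming paradigm (e.g., MPI or UPC). */
    PAMI_Client_create("EXAMPLE", &client, NULL, 0);

    /* Each context carries its own injection/reception resources, so
     * threads can post and advance communication without sharing locks. */
    PAMI_Context_createv(client, NULL, 0, contexts, NUM_CONTEXTS);

    /* Each thread would advance (poll) its own context independently. */
    for (int i = 0; i < NUM_CONTEXTS; ++i)
        PAMI_Context_advance(contexts[i], 100 /* poll iterations */);

    PAMI_Context_destroyv(contexts, NUM_CONTEXTS);
    PAMI_Client_destroy(&client);
    return 0;
}
```

Because each context owns its communication resources, a thread advancing one context never contends with threads advancing the others, which is what allows the per-context partitioning (and communication-thread acceleration) to scale.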
Different programming paradigms utilize a variety of collective communication operations, often with different semantics. We present the component collective messaging interface (CCMI), which supports asynchronous nonblocking collectives and is extensible to different programming paradigms and architectures. CCMI is built from components written in C++, making it reusable and extensible. Collective algorithms are embodied in topological schedules and the executors that run them. Portability across architectures is enabled by the multisend data movement component. CCMI also includes a programming-language adaptor used to implement different APIs with different semantics for different paradigms. We study the effectiveness of CCMI on 16K nodes of the Blue Gene/P machine and evaluate its performance for the barrier, broadcast, and allreduce collective operations and several application benchmarks. We also present the performance of the barrier collective on the Abe Infiniband cluster.
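The schedule/executor/multisend separation can be illustrated with a small, purely hypothetical sketch. These are not the actual CCMI components (which are C++ classes); the names below are invented for illustration. The schedule answers which peer participates in each phase, the executor walks the phases, and a multisend callback stands in for the architecture-specific data movement.

```c
/* Hypothetical sketch of a schedule/executor split with a pluggable
 * multisend callback; names are illustrative, not CCMI's own. */
#include <stdio.h>

/* Schedule: in phase p of a recursive-doubling exchange (as used by
 * barrier/allreduce), which peer do I exchange with? */
static int recursive_doubling_peer(int rank, int nranks, int phase)
{
    int mask = 1 << phase;
    if (mask >= nranks) return -1;   /* schedule finished */
    return rank ^ mask;              /* exchange partner for this phase */
}

/* Multisend: hides the architecture-specific transport. */
typedef void (*multisend_fn)(int dst, const void *buf, size_t len);

static void demo_multisend(int dst, const void *buf, size_t len)
{
    (void)buf;
    printf("send %zu bytes to rank %d\n", len, dst);
}

/* Executor: walks the schedule phase by phase, issuing multisends
 * (the matching receive/combine step is omitted for brevity). */
static void execute_allreduce_like(int rank, int nranks,
                                   const void *buf, size_t len,
                                   multisend_fn send)
{
    for (int phase = 0; ; ++phase) {
        int peer = recursive_doubling_peer(rank, nranks, phase);
        if (peer < 0) break;
        if (peer >= nranks) continue;  /* non-power-of-two hole */
        send(peer, buf, len);
    }
}

int main(void)
{
    char payload[64] = {0};
    execute_allreduce_like(0 /* rank */, 16 /* nranks */,
                           payload, sizeof payload, demo_multisend);
    return 0;
}
```

The point of the split is that a new collective algorithm only needs a new schedule, a new network only needs a new multisend implementation, and a new programming paradigm only needs a new adaptor on top of the same executors.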
Summary
Recent Cray-authored Lustre modifications known as Lustre Lockahead show significantly improved write performance for collective, shared-file I/O workloads. Initial tests show write performance improvements of more than 200% for small transfer sizes and over 100% for larger transfer sizes compared to traditional Lustre locking. Standard Lustre shared-file locking mechanisms limit the scaling of shared-file I/O performance on modern high-performance Lustre servers. The new Lockahead feature provides a mechanism for applications (or libraries) with knowledge of their I/O patterns to overcome this limitation by explicitly requesting locks. MPI-IO is able to use this feature to dramatically improve shared-file collective I/O performance, achieving more than 80% of file-per-process performance. This paper discusses our early experience using Lockahead with applications. We also present application and synthetic performance results and discuss performance considerations for applications that benefit from Lockahead.

INTRODUCTION
POSIX I/O file access behavior is usually categorized as file-per-process or single-shared-file. Historically, file-per-process access has provided higher throughput than single-shared-file access because of the file system overhead required to ensure consistency during shared write accesses. However, optimal I/O throughput is never the only, and rarely the primary, consideration when writing or using an application. Despite the performance downside, shared files are widely used for a variety of data management and ease-of-use reasons. To avoid long I/O times caused by shared-file performance, an application may reduce its data output, checkpoint frequency, or number of jobs.

Shared-file performance characterization is largely specific to each file system implementation and I/O library. This work focuses on modifications to the Lustre file system for applications using collective MPI-IO operations. This paper describes a design and implementation in Lustre and the Cray MPI libraries that improves shared-file write performance by introducing a new Lustre locking scheme. Because MPI-IO allows requesting different file system lock modes via environment variables, this work provides a path for applications to improve performance without application code modifications. Code changes within Lustre and MPICH/ROMIO to support Lustre Lockahead have been contributed back to the upstream communities.1,2

This paper first describes application I/O behavior and the standard libraries used for shared-file access. The current Lustre shared-file locking implementation is also described to motivate the need for an improved locking mechanism. Second, the implementation of Lustre Lockahead for collective MPI-IO operations is described for both the Lustre file system and the collective MPI-IO library. Next, comparative I/O performance of current file-per-process, independent, and collective shared-file I/O is presented to evaluate the benefit of this new locking method for Lustre. Finally, we describe early experience using Lustre Lockahead...
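For context, the kind of workload Lockahead targets is a collective write to a single shared file, as in the minimal MPI-IO sketch below. Selecting the Lustre lock mode itself happens outside the application code, e.g. through Cray MPI-IO hints (such as those passed via the MPICH_MPIIO_HINTS environment variable); the specific hint keys and values are implementation-specific and are not shown here.

```c
/* Minimal sketch of a collective, shared-file MPI-IO write of the
 * kind that benefits from Lustre Lockahead. The locking mode is
 * requested via MPI-IO hints outside this code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset block = 1 << 20;          /* 1 MiB per rank */
    char *buf = calloc(1, block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own 1 MiB slice of the single shared file;
     * the collective call lets the MPI-IO layer aggregate requests and,
     * with Lockahead, ask Lustre for the locks it will need up front. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block,
                          buf, (int)block, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Because the lock mode is chosen through hints or environment variables rather than code changes, an application that already performs collective writes like the one above can adopt Lockahead without modification.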