This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining a global index space and a programming syntax similar to that available when programming on a single processor. The goal of GA is to free programmers from low-level management of communication and allow them to work on their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables programmers to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
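To illustrate the global index space model, the sketch below (not taken from the paper; the array name, dimensions, and patch indices are made up) creates a distributed two-dimensional array and fetches an arbitrary patch by global indices through the GA C interface:

```c
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();                      /* GA runs on top of MPI */

    /* Create a 1000 x 1000 distributed array of doubles; GA chooses the
       data distribution across processes (chunk = -1 means default). */
    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
    GA_Zero(g_a);

    /* Any process can address any patch with global indices, regardless
       of which process actually owns the data. */
    double buf[10 * 10];
    int lo[2] = {100, 200}, hi[2] = {109, 209}, ld[1] = {10};
    NGA_Get(g_a, lo, hi, buf, ld);        /* one-sided fetch of the patch */

    if (GA_Nodeid() == 0)
        printf("buf[0] = %f\n", buf[0]);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```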
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable, high-performance remote memory access communication interface, developed originally under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as part of the run-time layer of the DOE Programming Models for Scalable Parallel Computing project. The paper discusses the model, addresses the challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on latency hiding mechanisms and the ability to optimize noncontiguous data transfers.
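As a rough sketch of the two features emphasized above, the example below (buffer sizes, strides, and the "other work" are illustrative, not from the paper) uses ARMCI's strided nonblocking get to fetch a noncontiguous sub-block from a remote process while overlapping the transfer with local computation:

```c
#include <stdio.h>
#include <mpi.h>
#include "armci.h"

/* Each process owns a 100x100 block of doubles; process 0 fetches the top-left
   10x10 sub-block (a strided, noncontiguous region) from process 1 with a
   nonblocking get and can overlap the transfer with unrelated work. */
int main(int argc, char **argv) {
    int me, nproc;
    MPI_Init(&argc, &argv);
    ARMCI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Collective allocation of remotely accessible memory. */
    void *ptrs[nproc];
    ARMCI_Malloc(ptrs, 100 * 100 * sizeof(double));

    /* Fill the local block with this process's rank so remote reads are defined. */
    double *mine = (double *)ptrs[me];
    for (int i = 0; i < 100 * 100; i++) mine[i] = (double)me;
    ARMCI_Barrier();

    if (me == 0 && nproc > 1) {
        double local[10][10];
        int src_stride[1] = {100 * (int)sizeof(double)};  /* remote row stride */
        int dst_stride[1] = {10  * (int)sizeof(double)};  /* local row stride  */
        int count[2] = {10 * (int)sizeof(double), 10};    /* bytes/row, #rows  */

        armci_hdl_t hdl;
        ARMCI_INIT_HANDLE(&hdl);
        ARMCI_NbGetS(ptrs[1], src_stride, &local[0][0], dst_stride,
                     count, 1, 1, &hdl);
        /* ... unrelated computation could proceed here ... */
        ARMCI_Wait(&hdl);                 /* complete the transfer */
        printf("got %f from process 1\n", local[0][0]);
    }

    ARMCI_Barrier();
    ARMCI_Free(ptrs[me]);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
```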
This paper describes a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon's algorithm. It is suitable for clusters and scalable shared memory systems. The current approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBBLAS suite, the leading implementation of parallel matrix multiplication in use today. In the best case, on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared memory communication on matrix multiplication performance on clusters is also investigated.

Previously reported experiences, in comparison to pure MPI implementations, were not encouraging. The conceptual architectural model for which our algorithm was designed is a cluster of multiprocessor nodes connected by a network that supports remote memory access (put/get) communication between the nodes. RMA is a simple communication model and, on modern systems, is often the fastest communication protocol available, especially when implemented in hardware as zero-copy RMA write/read operations (e.g., InfiniBand, Giganet, and Myrinet). RMA is often used to implement the point-to-point MPI send/receive calls [27, 28]. To address the historically growing gap between processor and network speed, our implementation relies on the nonblocking mode of RMA operation as the primary latency hiding mechanism, overlapping communication with computation [29]. In addition, each cluster node is assumed to provide efficient load/store operations that allow direct access to the data; in other words, a node of the cluster represents a shared memory communication domain. Our algorithm is explicitly aware of the mapping of tasks to shared memory domains, i.e., it uses shared memory to access parts of the matrix held by processes on the same SMP node and nonblocking RMA operations to access parts of the matrix outside the local shared memory domain (i.e., in the RMA domain); a sketch of this overlap pattern follows below. Note that the shared memory domain might not match the underlying SMP node configuration used as a hardware building block in many systems. For example, the entire 128-processor SGI Altix system available to us was used as a single shared memory domain, even though underneath it is built from 2-processor SMP modules whose processors share the memory in the module ("brick") and access the remainder of system memory through an interconnect network ("NUMAlink").
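The sketch below shows only the double-buffered latency hiding pattern described above, expressed with GA's nonblocking get (NGA_NbGet/NGA_NbWait); it is not the authors' code. The block index lists, the buffer management, and the local multiply routine are assumptions introduced for illustration, and a real implementation would call BLAS dgemm rather than the naive loop used here:

```c
#include "ga.h"
#include "macdecls.h"

/* Naive local block multiply standing in for a BLAS dgemm call:
   C += A * B for n x n blocks stored in row-major order. */
static void local_multiply(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += a[i*n + k] * b[k*n + j];
            c[i*n + j] += s;
        }
}

/* Double-buffered block loop: issue nonblocking fetches of the blocks needed
   for iteration k+1 while multiplying the blocks already received for
   iteration k. The block coordinate lists (alo/ahi, blo/bhi) are assumed to
   be computed elsewhere; this shows only the overlap pattern, not the full
   SRUMMA algorithm. */
void multiply_my_blocks(int g_a, int g_b, int nblocks, int n,
                        int alo[][2], int ahi[][2],
                        int blo[][2], int bhi[][2],
                        double *abuf[2], double *bbuf[2], double *c)
{
    ga_nbhdl_t ha, hb;
    int ld = n, cur = 0;

    /* Prefetch the first pair of blocks. */
    NGA_NbGet(g_a, alo[0], ahi[0], abuf[cur], &ld, &ha);
    NGA_NbGet(g_b, blo[0], bhi[0], bbuf[cur], &ld, &hb);

    for (int k = 0; k < nblocks; k++) {
        NGA_NbWait(&ha);                  /* blocks for iteration k are ready */
        NGA_NbWait(&hb);

        if (k + 1 < nblocks) {            /* start fetching blocks for k+1 */
            NGA_NbGet(g_a, alo[k+1], ahi[k+1], abuf[1-cur], &ld, &ha);
            NGA_NbGet(g_b, blo[k+1], bhi[k+1], bbuf[1-cur], &ld, &hb);
        }

        /* This local computation overlaps with the transfers issued above. */
        local_multiply(n, abuf[cur], bbuf[cur], c);
        cur = 1 - cur;                    /* swap buffers */
    }
}
```

For blocks on the same SMP node, an implementation following the paper's approach would bypass the get entirely and read the data directly through shared memory.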