SUMMARY

As the number of nodes in high-performance computing (HPC) systems increases, parallel I/O becomes an important issue: collective I/O is the specialized form of parallel I/O that provides single-file-based parallel I/O. Collective I/O in most message passing interface (MPI) libraries follows a two-phase I/O scheme, in which particular processes, namely I/O aggregators, play an important role by performing the communication and I/O operations. This approach, however, was designed for single-core architectures. Because modern HPC systems use multi-core computational nodes, the role of the I/O aggregators needs to be re-evaluated. Although many previous studies have focused on improving the performance of collective I/O, it is difficult to find a study on an assignment scheme for I/O aggregators that considers multi-core architectures. In this research, it was found that the communication cost in collective I/O differs according to the placement of the I/O aggregators when each node hosts multiple I/O aggregators. The performance under two processor affinity rules was measured, and the results demonstrated that the distributed affinity rule, which places the I/O aggregators in different sockets, is appropriate for collective I/O. Because some applications cannot use the distributed affinity rule, the collective I/O scheme was modified to guarantee an appropriate placement of the I/O aggregators under the accumulated affinity rule. The performance of the proposed scheme was examined on two Linux cluster systems, and the results showed that the performance improvements were more pronounced when the computational node of a given cluster system had a complicated architecture. Under the accumulated affinity rule, the proposed scheme improved on the original MPI-IO by up to approximately 26.25% for the read operation and up to approximately 31.27% for the write operation.

key words: collective I/O, parallel I/O, processor affinity
Introduction

As the size of a problem increases, many scientific applications generate a large number of file-I/O operations. Today's parallel programming paradigms provide several I/O methods for scientific applications, and previous studies [2]-[4] have demonstrated the importance of single-file-based parallel I/O, especially collective I/O.

Collective I/O in the message passing interface (MPI) follows the two-phase I/O scheme, which consists of an I/O phase and a data exchange phase [5]. In the two-phase I/O, specialized processes called I/O aggregators are engaged in both phases. In other words, because the role of an I/O aggregator is to collect I/O data from, or distribute it to, the other clients, collective I/O performance can be affected by the capability of the I/O aggregators. In this study, we describe the effect of processor affinity on collective I/O in multi-core cluster systems. In particular, we explain the relationship between the placement of the I/O aggregators in each node and the communication cost in collective I/O.
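
To make the two-phase scheme concrete, the following minimal sketch (not taken from the paper) shows a collective write in C with MPI-IO. The ROMIO hint cb_nodes, which sets the number of I/O aggregators, is a standard hint in MPICH-derived libraries; the file name, buffer size, and hint value here are illustrative assumptions.

    /* Minimal collective-write sketch (illustrative, not the paper's code). */
    #include <mpi.h>

    #define LOCAL_COUNT 1024            /* integers per process (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf[LOCAL_COUNT];
        for (int i = 0; i < LOCAL_COUNT; i++)
            buf[i] = rank;

        /* ROMIO hint: cb_nodes sets how many I/O aggregators perform the
           actual file accesses in the two-phase scheme. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "2");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Collective write: each process first exchanges its data with the
           aggregators (data exchange phase), and the aggregators then issue
           large contiguous file requests (I/O phase). */
        MPI_Offset offset = (MPI_Offset)rank * LOCAL_COUNT * sizeof(int);
        MPI_File_write_at_all(fh, offset, buf, LOCAL_COUNT, MPI_INT,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }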
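As a rough illustration of the processor affinity rules discussed above, the following sketch pins the calling process to the cores of one socket on Linux. The linear core-to-socket mapping and the CORES_PER_SOCKET value are assumptions made only for illustration; a real implementation should query the hardware topology (e.g., with hwloc) instead.

    /* Affinity sketch (illustrative): pin the calling process to one socket,
       assuming socket s owns cores s*CORES_PER_SOCKET .. s*CORES_PER_SOCKET+3. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    #define CORES_PER_SOCKET 4           /* machine-dependent; assumed here */

    static int pin_to_socket(int socket_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = 0; c < CORES_PER_SOCKET; c++)
            CPU_SET(socket_id * CORES_PER_SOCKET + c, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
    }

    int main(void)
    {
        /* Under the distributed rule, each aggregator process on a node
           would be pinned to a different socket; here we pin to socket 0. */
        if (pin_to_socket(0) != 0)
            perror("sched_setaffinity");
        return 0;
    }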