In the field of high-performance computing, there is a growing need to process large, complex datasets. Many of these applications have file-intensive workloads, performing a large number of reads from and writes to a small number of files. When executing these workloads on cluster-based systems, performance cannot be scaled simply by increasing the number of compute nodes; to effectively exploit parallel resources, we also need to parallelize file I/O. The potential impact of exploiting parallel I/O grows as the gap between CPU and disk speeds continues to widen.

While parallel I/O middleware systems (e.g., MPI I/O) provide users with environments where large datasets can be shared among multiple distributed processes, the performance of file-intensive applications depends heavily on how the data is accessed and where the data is physically located on disk. I/O operations need to be parallelized both at the application level (using middleware) and at the disk level (using partitioning).

In this paper, we present a new profile-guided greedy partitioning algorithm to parallelize I/O access for file-intensive applications run on cluster-based systems. We use MPI and MPI I/O to provide parallelization at the application level, and we use I/O profiling to capture relevant information about the I/O stream. These profiles then guide file partitioning across multiple disks to significantly improve I/O throughput.
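To make the idea concrete, the sketch below shows one plausible form of a profile-guided greedy partitioner: file blocks are sorted by their profiled access counts and assigned, hottest first, to the currently least-loaded disk. The block granularity, the `BlockProfile` layout, and the load metric are illustrative assumptions, not the paper's actual implementation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical profile entry: access count observed for one file block. */
typedef struct {
    int block_id;       /* logical block within the file        */
    long access_count;  /* reads + writes recorded by profiling */
} BlockProfile;

/* Sort blocks by descending access count. */
static int by_hotness(const void *a, const void *b) {
    long d = ((const BlockProfile *)b)->access_count -
             ((const BlockProfile *)a)->access_count;
    return (d > 0) - (d < 0);
}

/* Greedy partitioning: assign each block (hottest first) to the disk
 * with the smallest accumulated load, balancing expected accesses. */
void partition(BlockProfile *blocks, int nblocks, int ndisks, int *disk_of) {
    long *load = calloc(ndisks, sizeof(long));
    qsort(blocks, nblocks, sizeof(BlockProfile), by_hotness);
    for (int i = 0; i < nblocks; i++) {
        int best = 0;
        for (int d = 1; d < ndisks; d++)
            if (load[d] < load[best]) best = d;
        disk_of[blocks[i].block_id] = best;
        load[best] += blocks[i].access_count;
    }
    free(load);
}

int main(void) {
    BlockProfile p[] = { {0, 900}, {1, 100}, {2, 850}, {3, 120} };
    int disk_of[4];
    partition(p, 4, 2, disk_of);
    for (int b = 0; b < 4; b++)
        printf("block %d -> disk %d\n", b, disk_of[b]);
    return 0;
}
```

Placing the hottest blocks first is the standard greedy heuristic for load balancing: it spreads the few heavily accessed blocks across disks before the long tail of cold blocks fills in the remaining capacity.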
The main goal of parallel I/O is to increase I/O parallelism by providing multiple, independent data channels between processors and disks. To realize this goal, I/O streams need to be parallelized and partitioned at multiple system layers; contention at any level can dramatically decrease performance and limit scalability. To address this disk-contention bottleneck, it is important to carefully study disk access patterns.

From our previous work on I/O profiling, we found that the I/O access patterns of parallel scientific applications are usually very regular and highly predictable, so it is possible to detect them statically at compile time. Large datasets are logically linearized in file space on disk, and these intensive data accesses follow a linear traversal of that space. In this paper, we present our recent work on compiler-directed I/O partitioning based on Linear Disk Access Descriptors (LDADs). We use the SUIF compiler infrastructure to perform data-flow analysis and recognize LDADs, and we then use these LDADs to guide I/O data partitioning across multiple disks to significantly increase I/O throughput.
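As an illustration of the kind of information such a descriptor might carry, the sketch below models an LDAD as a (base offset, stride, element size, count) tuple describing a linear traversal of file space, and uses it to map each access to a disk under round-robin striping. The field layout and the striping policy are our assumptions for illustration; the paper derives its descriptors via SUIF data-flow analysis.

```c
#include <stdio.h>

/* Assumed shape of a Linear Disk Access Descriptor: a compile-time
 * summary of a regular, linear traversal of the file's address space. */
typedef struct {
    long base;      /* starting byte offset in the file   */
    long stride;    /* bytes between consecutive accesses */
    long elem_size; /* bytes read/written per access      */
    long count;     /* number of accesses in the stream   */
} LDAD;

/* Given an LDAD and a stripe size, report which disk serves each
 * access under simple round-robin striping across ndisks disks. */
void map_accesses(const LDAD *d, long stripe, int ndisks) {
    for (long i = 0; i < d->count; i++) {
        long offset = d->base + i * d->stride;
        int disk = (int)((offset / stripe) % ndisks);
        printf("access %ld: offset %ld -> disk %d\n", i, offset, disk);
    }
}

int main(void) {
    /* e.g., a loop reading 4 KB records every 64 KB, 8 iterations */
    LDAD d = { .base = 0, .stride = 65536, .elem_size = 4096, .count = 8 };
    map_accesses(&d, 65536, 4);
    return 0;
}
```

Because the descriptor is known statically, a compiler can choose a stripe size that matches the stride, so that consecutive accesses land on different disks and proceed in parallel.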
In the area of Grid computing, there is a growing need to process large amounts of data. To support this trend, we need efficient parallel storage systems that can deliver high performance for data-intensive applications. To overcome I/O bottlenecks and increase I/O parallelism, data streams need to be parallelized at both the application level and the storage-device level.

In this paper, we propose a novel Peer-to-Peer (P2P) storage architecture for MPI applications on Grid systems. We first present an analytic model of our P2P storage architecture. Next, we describe a profile-guided data allocation algorithm that increases the degree of I/O parallelism in the system and balances I/O load in a heterogeneous environment. We present results from an actual implementation. Our experimental results show that by partitioning data across all available storage devices and carefully tuning I/O workloads in the Grid system, our Peer-to-Peer scheme can deliver scalable, high-performance I/O for I/O-intensive workloads.
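One plausible reading of such a profile-guided allocator, sketched below, distributes file stripes across peers in proportion to each peer's measured I/O bandwidth, so slower devices in a heterogeneous Grid are not overloaded. The proportional-to-bandwidth policy and the structure names are assumptions made for illustration, not the paper's exact algorithm.

```c
#include <stdio.h>

/* Hypothetical peer descriptor: measured bandwidth drives allocation. */
typedef struct {
    const char *name;
    double bandwidth_mbs; /* profiled sequential I/O bandwidth (MB/s) */
} Peer;

/* Allocate nstripes file stripes across peers in proportion to
 * bandwidth, so each peer's share of data matches its I/O capacity. */
void allocate(const Peer *peers, int npeers, long nstripes) {
    double total = 0.0;
    for (int i = 0; i < npeers; i++) total += peers[i].bandwidth_mbs;

    long assigned = 0;
    for (int i = 0; i < npeers; i++) {
        /* last peer absorbs the rounding remainder */
        long share = (i == npeers - 1)
            ? nstripes - assigned
            : (long)(nstripes * peers[i].bandwidth_mbs / total);
        assigned += share;
        printf("%s: %ld of %ld stripes\n", peers[i].name, share, nstripes);
    }
}

int main(void) {
    Peer peers[] = { {"node-a", 120.0}, {"node-b", 60.0}, {"node-c", 20.0} };
    allocate(peers, 3, 1000);
    return 0;
}
```

Under this policy, the time each device spends serving its share is roughly equal, which is the balance condition a heterogeneous system needs for the aggregate throughput to scale with the number of peers.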