Scalable Parallel I/O on a Blue Gene/Q Supercomputer Using Compression, Topology-Aware Data Aggregation, and Subfiling

Bui, Huy; Finkel, Hal; Vishwanath, Venkatram; Habib, Salma; Heitmann, Katrin; Leigh, Jason; Papka, Michael E.; Harms, Kevin

doi:10.1109/pdp.2014.60

Cited by 20 publications

(9 citation statements)

References 16 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Parallel I/O is an active research topic because of the increasing requirements of applications for data movement to memory or storage [3]. While I/O tuning is probably the first step to increase I/O bandwidth on new architectures [4], [5], [6], improvements at different layers of the I/O software stack are also necessary. From a file system perspective, GPFS [7] [9] evaluate various collective I/O write algorithms.…”

Section: Related Workmentioning

confidence: 99%

TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers

Tessier

Vishwanath

Jeannot

2017

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

View full text Add to dashboard Cite

Reading and writing data efficiently from storage system is necessary for most scientific simulations to achieve good performance at scale. Many software solutions have been developed to decrease the I/O bottleneck. One well-known strategy, in the context of collective I/O operations, is the twophase I/O scheme. This strategy consists of selecting a subset of processes to aggregate contiguous pieces of data before performing reads/writes. In this paper, we present TAPIOCA, an MPI-based library implementing an efficient topology-aware twophase I/O algorithm. We show how TAPIOCA can take advantage of double-buffering and one-sided communication to reduce as much as possible the idle time during data aggregation. We also introduce our cost model leading to a topology-aware aggregator placement optimizing the movements of data. We validate our approach at large scale on two leadership-class supercomputers: Mira (IBM BG/Q) and Theta (Cray XC40). We present the results obtained with TAPIOCA on a micro-benchmark and the I/O kernel of a large-scale simulation. On both architectures, we show a substantial improvement of I/O performance compared with the default MPI I/O implementation. On BG/Q+GPFS, for instance, our algorithm leads to a performance improvement by a factor of twelve while on the Cray XC40 system associated with a Lustre filesystem, we achieve an improvement of four.

show abstract

Section: Related Workmentioning

confidence: 99%

TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers

Tessier

Vishwanath

Jeannot

2017

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

View full text Add to dashboard Cite

show abstract

“…In our previous works in [13][14][15] we focused on data movement for relatively dense communication patterns. Our work in this paper extends our previous work [11] to deal with sparse data movement patterns and extends our work in [14] by employing a pipeline technique to reduce overhead and large memory usage caused by copying and injecting large messages.…”

Section: Related Workmentioning

confidence: 99%

Improving sparse data movement performance using multiple paths on the Blue Gene/Q supercomputer

et al. 2016

Self Cite

View full text Add to dashboard Cite

In situ analysis has been proposed as a promising solution to glean faster insights and reduce the amount of data to storage. A critical challenge here is that the reduced dataset is typically located on a subset of the nodes and needs to be written out to storage. Data coupling in multiphysics codes also exhibits a sparse data movement pattern wherein data movement occurs among a subset of nodes. We evaluate the performance of data movement for sparse data patterns on the IBM Blue Gene/Q supercomputing system "Mira" and identify performance bottlenecks. We propose a multipath data movement algorithm for sparse data patterns based on an adaptation of a maximum flow algorithm together with breadth-first search that fully exploits all the underlying data paths and I/O nodes to improve data movement. We demonstrate the efficacy of our solutions through a set of microbenchmarks and application benchmarks on Mira scaling up to 131,072 compute cores. The results show that our approach achieves up to 5X improvement in achievable throughput compared with the default mechanisms.

show abstract

“…In [18], the feasibility of data compression in the I/O forwarding layer was shown through extensive experiments on various compression libraries and datasets in the context of highperformance computing clusters. In [19], efficient data forwarding algorithms using data compression in supercomputers have been proposed. In [20], a framework incorporating various compression and decompression as well as customized compression algorithms for scientific datasets was presented.…”

Section: Data Compressionmentioning

confidence: 99%

“…The number of data flows at each layer i is defined as n i , and bandwidth function is defined by n i as in Eq. (19).…”

Section: Problem Formulation For Enhanced Data Compressionmentioning

confidence: 99%

Cluster-to-cluster data transfer with data compression over wide-area networks

Jung

Kettimuthu

Vishwanath

2015

Journal of Parallel and Distributed Computing

Self Cite

View full text Add to dashboard Cite

h i g h l i g h t s• We model all the system components involved in end-to-end data transfer as a graph.• We formulate the problem to optimize data transfer throughput using parallel flows.• We propose a novel software I/O stack with variable parallel data flows per layer. • Optimized parallel data flows improve overall data transfer throughput.• The novel I/O stack with data compression alleviates network bottleneck situations. a b s t r a c tThe recent emergence of ultra high-speed networks up to 100 Gb/s has posed numerous challenges and has led to many investigations on efficient protocols to saturate 100 Gb/s links. However, end-toend data transfers involve many components, not only protocols, affecting overall transfer performance. These components include disk I/O subsystem, additional computation associated with data streams, and network adapters. For example, achievable bandwidth by TCP may not be implementable if disk I/O or CPU becomes a bottleneck in end-to-end data transfer. In this paper, we first model all the system components involved in end-to-end data transfer as a graph. We then formulate the problem whose goal is to achieve maximum data transfer throughput using parallel data flows. We also propose a variable data flow GridFTP XIO stack to improve data transfer with data compression. Our contributions lie in how to optimize data transfers considering all the system components involved rather than in accurately modeling all the system components involved. Our proposed formulations and solutions are evaluated through experiments on the ESnet 100G testbed and a wide-area cluster-to-cluster testbed. The experimental results on the ESnet 100G testbed show that our approach is several times faster than Globus Online-8 × faster for datasets with many 10 MB files and 3-4 × faster for other datasets of larger size files. The experimental results on the cluster-to-cluster testbed show that our variable data flow approach is up to 4 × faster than a normal cluster data transfer.

show abstract

Scalable Parallel I/O on a Blue Gene/Q Supercomputer Using Compression, Topology-Aware Data Aggregation, and Subfiling

Cited by 20 publications

References 16 publications

TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers

TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers

Improving sparse data movement performance using multiple paths on the Blue Gene/Q supercomputer

Cluster-to-cluster data transfer with data compression over wide-area networks

Contact Info

Product

Resources

About