30th IEEE International Performance Computing and Communications Conference 2011
DOI: 10.1109/pccc.2011.6108062
Enhancing I/O throughput via efficient routing and placement for large-scale parallel file systems

Abstract: As storage systems get larger to meet the demands of petascale systems, careful planning must be applied to avoid congestion points and extract the maximum performance. In addition, the large data sets generated by such systems make it desirable for all compute resources to have common access to this data without needing to copy it to each machine. This paper describes a method of placing I/O close to the storage nodes to minimize contention on Cray's SeaStar2+ network, and extends it to a routed Lustre confi…

Cited by 11 publications (9 citation statements); references 13 publications.
“…We used half (120 GB/s) of the available storage from Spider (Widow1). The achievable aggregate I/O bandwidth is further limited due to congestion on the Cray 3D torus and the InfiniBand fabric, resulting from the Lustre routing algorithms in use during our measurement period [9].…”
Section: Output Absorption On Jaguar
Mentioning confidence: 99%
“…This paper characterizes output burst absorption on Jaguar, a 2.33-petaflop Cray XK6 housed at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). Storage for Jaguar is provided by Spider [9], the 10-petabyte, 240 GB/s Lustre [10] file system at OLCF. The key contribution of our study is to enhance understanding of performance behaviors for state-of-the-art software as currently deployed in a leadership-class facility.…”
Section: Introduction
Mentioning confidence: 99%
“…In total, it contains 18,688 compute nodes, each powered by two quad-core AMD CPUs. The average I/O bandwidth of the whole system is about 80GB/s, and each node can achieve about 4.67MB/s of bandwidth [6]. In a production run of the GTC application at the scale of 16,384 cores on the Jaguar XT5 platform, the application would output 260GB of particle data every 120 seconds [20], with each core producing about 16.25MB per 120 seconds (roughly 1.08MB per second per eight-core node).…”
Section: Theoretical Analysis Results
Mentioning confidence: 99%
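
As a back-of-envelope check of the GTC figures quoted above (260GB per output step, 16,384 cores, two quad-core CPUs per node, 120-second output interval), the short Python sketch below reproduces the per-core and per-node rates. It is only an illustration of the arithmetic, not code from the cited papers; all input values are taken from the citation statements.

# Back-of-envelope check of the GTC output figures quoted above.
# All inputs come from the citation statements; nothing here is measured.

output_gb = 260           # particle data per output step (GB)
cores = 16_384            # cores used in the production run
cores_per_node = 8        # two quad-core AMD CPUs per node
interval_s = 120          # seconds between output steps

output_mb = output_gb * 1024
nodes = cores // cores_per_node            # 2048 nodes

per_core_mb = output_mb / cores            # ~16.25 MB per core per output
per_node_mb = output_mb / nodes            # ~130 MB per node per output
per_node_mb_s = per_node_mb / interval_s   # ~1.08 MB/s per node, averaged

print(f"{per_core_mb:.2f} MB/core per output, "
      f"{per_node_mb:.0f} MB/node per output, "
      f"{per_node_mb_s:.2f} MB/s per node")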
“…In a production run at the scale of 16,384 cores, each core can output roughly two million particles per 120 seconds, resulting in 260GB of particle data per output (130MB per node) [20]. However, the average I/O throughput of its running platform, Jaguar (now Titan) at Oak Ridge National Laboratory, is around 4.7MB/s per node [6]. This difference presents a gap between the application's requirement and system capability.…”
Section: GTC Fusion Modeling Code
Mentioning confidence: 99%
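
Continuing the same arithmetic with the per-node figures quoted above (130MB per output step, ~4.7MB/s average per-node throughput), the sketch below estimates how long one node would take to drain its output burst. The assumption that a node writes its share at the average per-node rate for the whole burst is an illustrative simplification, not a claim from the cited papers.

# Time for one node to drain a 130 MB output burst at the quoted
# ~4.7 MB/s average per-node throughput (assumption: the node writes
# at that average rate for the entire burst).

burst_mb = 130.0        # per-node output per step, from the quote
avg_node_mb_s = 4.7     # average per-node I/O throughput, from the quote
interval_s = 120.0      # output interval

drain_s = burst_mb / avg_node_mb_s      # ~27.7 s
fraction = drain_s / interval_s         # ~23% of the output interval

print(f"drain time ~{drain_s:.1f} s, ~{fraction:.0%} of the output interval")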
“…When an I/O request was issued, it was relayed over multiple hops from peer compute nodes to an I/O router, then traversed the SION network and an Object Storage Server (OSS) before eventually arriving at an OST. Despite the high network bandwidth along the critical path, the extra data copies and data-processing overhead at each hop caused additional delays [15]. Overall, the bandwidth utilization of one OST was 75.6% when there were only two concurrent processes, but dropped to 53.5% when there were 32 processes.…”
Section: Degraded Bandwidth Utilization Due To Contention
Mentioning confidence: 99%
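
To make the quoted utilization figures concrete, a minimal sketch converting them into effective per-OST bandwidth is shown below. The 400MB/s peak per-OST value is a hypothetical placeholder chosen for illustration only; the cited statements give the utilization percentages but not an absolute per-OST peak.

# Effective per-OST bandwidth implied by the quoted utilization figures.
# peak_ost_mb_s is a hypothetical placeholder, NOT a number from the
# cited papers; only the utilization percentages come from the quote.

peak_ost_mb_s = 400.0                   # assumed peak bandwidth of one OST
utilization = {2: 0.756, 32: 0.535}     # concurrent processes -> utilization

for procs, util in utilization.items():
    effective = peak_ost_mb_s * util
    print(f"{procs:>2} processes: {util:.1%} utilization "
          f"-> ~{effective:.0f} MB/s effective per OST")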