SC14: International Conference for High Performance Computing, Networking, Storage and Analysis 2014
DOI: 10.1109/sc.2014.23
Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems

Cited by 37 publications (15 citation statements). References 5 publications.
“…Titan's file system, called Spider 2, is based on Lustre, an object-based parallel file system software that is deployed on ∼75% of the top 100 systems [36]. Spider 2 has 32 PB of data storage and above 1 TB/s peak I/O bandwidth [31]. This section summarizes Titan/Spider 2 based on materials from [13,30,39,40].…”
Section: Titan and Its Lustre File System
confidence: 99%
“…Even though fprof is designed to run with multiple processes on multiple nodes to scale, there exist practical concerns and constraints for deploying and running it on a production system. For instance, due to the centralized metadata management architecture in Lustre [22], excessive metadata scanning operations might adversely impact foreground file system operations. To this end, OLCF ran fprof on a single client node to profile the Lustre-based Spider II file system, while at LC, fprof was run on multiple nodes, resulting in a significant performance improvement.…”
Section: Deployment
confidence: 99%
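The deployment tradeoff described above — a single scanning client to spare a centralized metadata server versus many parallel scanners for throughput — can be sketched with a minimal parallel directory walk. This is not fprof's actual implementation; `profile_tree` and its `max_workers` knob are hypothetical illustrations of how worker count caps the metadata-request load placed on the file system.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def scan_dir(path):
    """Scan one directory non-recursively; return (subdirs, file_count).
    Each os.scandir call issues metadata requests, so the number of
    concurrent workers bounds the load on a centralized metadata server."""
    subdirs, n_files = [], 0
    try:
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    n_files += 1
    except PermissionError:
        pass  # skip unreadable directories, as a profiler must
    return subdirs, n_files

def profile_tree(root, max_workers=1):
    """Breadth-first parallel walk counting files under root.
    max_workers=1 mimics the conservative single-client deployment;
    a larger value mimics the multi-node deployment (hypothetical
    knob, not an fprof option)."""
    total, frontier = 0, [root]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            results = list(pool.map(scan_dir, frontier))
            frontier = [d for subdirs, _ in results for d in subdirs]
            total += sum(n for _, n in results)
    return total
```

Either setting returns the same count; the choice only trades scan speed against metadata-server pressure, which is the constraint the citing authors describe.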
“…We ran fprof on the OLCF's center-wide Spider II file system [22] and the lscratche file system in LC [6] in May 2017. Note the difference in file system architectures of the two HPC centers outlined in Table 1.…”
Section: Profiling and Analysis
confidence: 99%
“…Extrapolating from here, the expected Spider 2 performance should be at most 250 GB/s under such bursty production workloads [10].…”
Section: Spider 2 Usage Stats
confidence: 99%