Kathryn Mohror scite author profile

Kathryn Mohror

5 Publications

165 Citation Statements Received

76 Citation Statements Given


Affiliations

Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory, Portland State University

Publications


VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Nicolae

Moody

Gonsiorowski

et al. 2019


Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits. Index Terms: parallel I/O; checkpoint-restart; immutable data; adaptive multilevel asynchronous I/O


Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems

Zhu

Chowdhury

et al. 2018


Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

Moody

Bronevetsky

Mohror

et al. 2010

134


High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. A potential solution to this problem is to use multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to design lightweight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production system context. To this end we have designed a system, called the Scalable Checkpoint/Restart (SCR) library, that writes checkpoints to storage on the compute nodes utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.


I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Chowdhury

Zhu

Heer

et al. 2019


Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.


A large-scale study of MPI usage in open-source HPC applications

Laguna

Marshall

Mohror

et al. 2019


Understanding the state-of-the-practice in MPI usage is paramount for many aspects of supercomputing, including optimizing the communication of HPC applications and informing standardization bodies and HPC systems procurements regarding the most important MPI features. Unfortunately, no previous study has characterized the use of MPI on applications at a significant scale; previous surveys focus either on small data samples or on MPI jobs of specific HPC centers. This paper presents the first comprehensive study of MPI usage in applications. We survey more than one hundred distinct MPI programs covering a significantly large space of the population of MPI applications. We focus on understanding the characteristics of MPI usage with respect to the most used features, code complexity, and programming models and languages. Our study corroborates certain findings previously reported on smaller data samples and presents a number of interesting, previously unreported insights. CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Parallel programming languages.



