McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Islam, Tanzima; Mohror, Kathryn; Bagchi, Saurabh; Moody, Adam; Supinski, Bronis R. de; Eigenmann, Rudolf

doi:10.1155/2013/341672

Cited by 25 publications

(24 citation statements)

References 23 publications


“…Islam et al [39] present a checkpoint-restart library that works to coalesce requests from the multiple processes to the PFS. They use information - variables name and type - to place similar data close together.…”

Section: Requests Aggregation and Reordering (mentioning)

confidence: 99%

A Checkpoint of Research on Parallel I/O for High-Performance Computing

et al. 2018


We present a comprehensive survey on parallel I/O in the high performance computing (HPC) context. This is an important field for HPC because of the historic gap between processing power and storage latencies, which causes applications performance to be impaired when accessing or generating large amounts of data. As the available processing power and amount of data increase, I/O remains a central issue for the scientific community. In this survey, we focus on a traditional I/O stack, with a POSIX parallel file system. We present background concepts everyone could benefit from. Moreover, through the comprehensive study of publications from the most important conferences and journals in a five-year time window, we discuss the state of the art of I/O optimization approaches, access pattern extraction techniques, and performance modeling, in addition to general aspects of parallel I/O research. Through this approach, we aim at identifying the general characteristics of the field and the main current and future research topics.
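The citation statement above describes coalescing checkpoint data from multiple processes and using variable name and type to place similar data close together. A minimal sketch of that idea follows, assuming a hypothetical in-memory layout where each process exposes a dict keyed by `(name, type_name)`; the function names are illustrative and are not McrEngine's API, and `zlib` is only a stand-in compressor.

```python
import zlib
from collections import defaultdict


def aggregate_by_variable(process_checkpoints):
    """Group payloads from several processes by (variable name, type name).

    process_checkpoints is a list with one dict per process, mapping
    (name, type_name) -> bytes. This layout is a hypothetical stand-in
    for real checkpoint metadata.
    """
    groups = defaultdict(list)
    for checkpoint in process_checkpoints:
        for key, payload in checkpoint.items():
            groups[key].append(payload)
    return groups


def write_aggregated(process_checkpoints):
    """Compress each variable group as one blob.

    Assumes every process holds the same set of variables, so the i-th
    length in each group belongs to the i-th process.
    """
    aggregated = {}
    groups = aggregate_by_variable(process_checkpoints)
    for key, payloads in groups.items():
        lengths = [len(p) for p in payloads]
        # Similar data sits side by side before the compressor sees it.
        aggregated[key] = (lengths, zlib.compress(b"".join(payloads)))
    return aggregated


def read_aggregated(aggregated, key):
    """Decompress one variable group and split it back per process."""
    lengths, blob = aggregated[key]
    data = zlib.decompress(blob)
    parts, offset = [], 0
    for n in lengths:
        parts.append(data[offset:offset + n])
        offset += n
    return parts
```

Grouping before compression means one output object per variable rather than one per process, which is the request-coalescing effect the statement refers to. Whether it also improves the compression ratio depends on the data.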


“…If an algorithm lacks a simple checking method or invariant, the Checker can be provided through comparison with a checksum over the data that was computed beforehand and stored in a safe region. The Recover method can be supplied through the forward recovery phase in ABFT methods, or simply by restoring a light-weight deduplicated [1] or compressed [17] checkpoint of the data.…”

Section: Assumptions (mentioning)

confidence: 99%

DINO: Divergent Node Cloning for Sustained Redundancy in HPC

Rezaei, Mueller 2015

2015 IEEE International Conference on Cluster Computing


A plethora of resilience techniques have been investigated ranging from checkpoint/restart over redundancy to algorithm-based fault tolerance. Each technique works well for a different subset of application kernels, and depending on the kernel, has different overheads, resource requirements, and fault masking capabilities. If, however, such techniques are combined and they interact across kernels, new vulnerability windows are created. This work contributes the idea of end-to-end resilience by protecting windows of vulnerability between kernels guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data structure. The work further promotes end-to-end application protection across kernels via a pragma-based specification, implemented as an extension to OpenMP, for diverse resilience schemes with minimal programming effort. This lifts the data protection burden from application programmers allowing them to focus solely on algorithms and performance while resilience is specified and subsequently embedded into the code through the compiler/library and supported by the runtime system. Two case studies demonstrate that end-to-end resilience meshes well with different execution paradigms and assess its overhead and effectiveness for different codes. In experiments, end-to-end resilience has an overhead over kernel-specific resilience of 1% on average.
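The citation statement for this entry pairs a Checker (a checksum computed beforehand and kept in a safe region) with a Recover method (restoring a compressed checkpoint). A minimal sketch of that pairing follows, assuming SHA-256 as the checksum and `zlib` as the compressor; both are illustrative choices, and the function names are hypothetical rather than taken from the cited work.

```python
import hashlib
import zlib


def make_checkpoint(data: bytes) -> dict:
    """Store a digest and a compressed copy of the data.

    In a real system the returned dict would live in the 'safe region';
    here it is an ordinary Python object.
    """
    return {
        "digest": hashlib.sha256(data).hexdigest(),
        "blob": zlib.compress(data),
    }


def check(data: bytes, checkpoint: dict) -> bool:
    """Checker: compare the live data against the stored digest."""
    return hashlib.sha256(data).hexdigest() == checkpoint["digest"]


def recover(checkpoint: dict) -> bytes:
    """Recover: restore the data from the compressed checkpoint."""
    data = zlib.decompress(checkpoint["blob"])
    if not check(data, checkpoint):
        raise ValueError("checkpoint does not match its stored digest")
    return data
```

A caller would run `check` after a kernel finishes and fall back to `recover` only when the comparison fails.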


“…In [28], it showed that data compression had the potential to significantly reduce the checkpointing file sizes. If multiple applications run concurrently, a data-aware compression scheme [29] was proposed to improve the overall checkpointing efficiency. Recent study [30] shows that combining failure detection and proactive checkpointing could improve 30% efficiency compared to classical periodical checkpointing.…”

Section: Related Work (mentioning)

confidence: 99%

Virtual chunks: On supporting random accesses to scientific data in compressible storage systems

Zhao, Yin, Qiao et al. 2014

2014 IEEE International Conference on Big Data (Big Data)


Data compression could ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional wisdom of naively applying data compression to the file or block brings the dilemma between efficient random accesses and high compression ratios. File-level compression can barely support efficient random accesses to the compressed data: any retrieval request must trigger the decompression from the beginning of the compressed file. Block-level compression provides flexible random accesses to the compressed data, but introduces extra overhead when applying the compressor to each and every block, which results in a degraded overall compression ratio. This paper introduces a concept called virtual chunks aiming to support efficient random accesses to the compressed scientific data without sacrificing its compression ratio. In essence, virtual chunks are logical blocks identified by appended references without breaking the physical continuity of the file content. These additional references allow the decompression to start from an arbitrary position (efficient random access), and retain the file's physical entirety to achieve high compression ratio on par with file-level compression. One potential concern of virtual chunks lies in its space overhead (from the additional references) that degrades the compression ratio, but our analytic study and experimental results demonstrate that such overhead is negligible. We have implemented virtual chunks in two forms: a middleware to the GPFS parallel file system, and a module in the FusionFS distributed file system. Large-scale evaluations on up to 1,024 cores showed that virtual chunks could help improve the I/O throughput by 2X.
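The abstract above describes logical blocks marked by appended references, so that decoding can start at an intermediate position while the encoded content stays contiguous. A toy illustration of that idea follows, assuming a simple delta encoder as a stand-in for the real compressor and a non-empty list of integers as input; the names `encode` and `read_at` are hypothetical and this is not the authors' implementation.

```python
def encode(values, chunk):
    """Delta-encode values as one contiguous stream and append references.

    One reference (the original value) is kept for the first element of
    every logical chunk. Assumes values is non-empty and chunk >= 1.
    """
    deltas = [values[0]]
    for i in range(1, len(values)):
        deltas.append(values[i] - values[i - 1])
    refs = [values[i] for i in range(0, len(values), chunk)]
    return {"deltas": deltas, "refs": refs, "chunk": chunk}


def read_at(encoded, index):
    """Decode one element, starting from the nearest preceding reference."""
    chunk = encoded["chunk"]
    start = (index // chunk) * chunk
    value = encoded["refs"][index // chunk]
    # Only the deltas inside the current logical chunk are replayed.
    for i in range(start + 1, index + 1):
        value += encoded["deltas"][i]
    return value
```

Without the references, reading element `i` would replay all `i` deltas from the start of the stream; with them, at most `chunk - 1` deltas are replayed, at the cost of one extra stored value per chunk.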
