FlipIt: An LLVM Based Fault Injector for HPC

Calhoun, Jon C.; Olson, Luke N.; Snir, Marc

doi:10.1007/978-3-319-14325-5_47

Cited by 43 publications

(27 citation statements)

References 15 publications


“…To capture and extract these patterns, however, a new method is required. While some methods exist to inject faults and statistically quantify their manifestation, such as random fault injection [2], [9], [10], [11], [12], and to use program analysis [13], [14], [15], [16], [17] to track errors on individual instructions, these methods miss the fine-grained information on error propagation as well as the context needed to explain, at a fine granularity, how errors propagate and consequently how natural resilient computations occur. In other words, these approaches do not provide the needed reasoning about how multiple computations work together to make an error disappear or to diminish its impact.…”

Section: Introduction (mentioning)

confidence: 99%

FlipTracker: Understanding Natural Error Resilience in HPC Applications

Guo

Liu

Laguna

et al. 2018

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis


As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns (combinations or sequences of computations) that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of resilience properties of these codes, but also can guide future application designs towards patterns with natural resilience.


“…Recently, a number of compiler-based FI frameworks have been developed using the LLVM compiler [19]; some examples are LLFI [36], KULFI [32], VULFI [31], and FlipIt [6]. These methods are easy to use in large-scale parallel programs, and cooperate well with error propagation analysis frameworks since these frameworks typically operate at the compiler level too.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…The way in which most LLVM IR-based FI tools inject faults into IR instructions is by adding a function call to the target instruction [6,31,32,36]. This function call performs large changes to the value of the result of the instruction or its arguments, and it may get inlined after optimization passes.…”

Section: (C) X64 Assembly Including FI Instrumentation (mentioning)

confidence: 99%
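The mechanism described in the statement above (a call attached to a target instruction that corrupts its result) can be illustrated with a minimal Python sketch. This is not the code of FlipIt or of any cited tool; the names `flip_bit` and `instrumented_add` are hypothetical, and a single-bit flip in an IEEE 754 double is assumed as the fault model.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the chosen bit and reinterpret the result as a double.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def instrumented_add(a: float, b: float, inject: bool = False, bit: int = 0) -> float:
    """Hypothetical stand-in for an instrumented instruction.

    The original operation runs first; the injector call then
    optionally corrupts the result before it is used.
    """
    result = a + b
    return flip_bit(result, bit) if inject else result
```

Flipping the sign bit, for example, turns the result of `instrumented_add(1.0, 2.0, inject=True, bit=63)` into `-3.0`, while a flip in a low mantissa bit changes the value only slightly. This difference in magnitude is why the choice of bit matters in fault injection studies.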

“…On the downside, though, compiler-based FI has three disadvantages compared to binary-level FI, all caused by the fact that existing methods perform injection at the compiler intermediate representation (IR) [6,31,32,36]: (1) not all low-level dynamic binary instructions are available at the IR level for FI; (2) instrumentation at the IR level interferes with code generation and optimizations: even if FI instrumentation is done after all IR optimizations are applied, the code that is input to the compiler backend can be significantly different from the original non-faulty code, which may generate very different (many times unoptimized) machine binary code; and (3) since code cannot be fully optimized (because of (2)), most frameworks incur significant (unnecessary) overhead, increasing the time to complete FI studies in large applications.…”

Section: Introduction (mentioning)

confidence: 99%

“…It is important to get an accurate picture of these proportions; for example, an application that experiences a large percentage of SDCs may require algorithmic error detection mechanisms, at the expense of runtime overhead. A concern in the HPC community is that a significant number of resilience studies have been based on this FI method [3,4,6,17,18,25,31,32,34-36] (including our own work), which can potentially skew FI results and, in some cases, lead to incorrect conclusions. There has been research done in showing these inaccuracies.…”

Section: Introduction (mentioning)

confidence: 99%
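The outcome proportions mentioned in the statement above are typically estimated by running many fault injection trials and classifying each run against a fault-free reference. The following is a minimal Python sketch of such a campaign under stated assumptions: the workload (`kernel`), the three outcome labels, the relative tolerance, and the uniform choice of fault site and bit are all hypothetical choices for illustration, not the methodology of any cited study.

```python
import math
import random
import struct
from collections import Counter


def flip_bit(value, bit):
    """Flip one bit of the IEEE 754 double encoding of `value`."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def kernel(data, fault=None):
    """Toy workload: mean of `data`. `fault` is an optional (index, bit) pair."""
    total = 0.0
    for i, x in enumerate(data):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])  # corrupt one operand
        total += x
    return total / len(data)


def classify(golden, observed, tol=1e-9):
    """Label one run by comparing it with the fault-free (golden) output."""
    if not math.isfinite(observed):
        return "detected"  # NaN or inf stands in for a crash or detected error
    if abs(observed - golden) <= tol * abs(golden):
        return "benign"    # the fault was masked within the tolerance
    return "sdc"           # silent data corruption: wrong but plausible output


def campaign(data, trials, seed=0):
    """Run `trials` single-fault injections and count the outcomes."""
    rng = random.Random(seed)  # seeded so the campaign is repeatable
    golden = kernel(data)
    outcomes = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        outcomes[classify(golden, kernel(data, fault))] += 1
    return outcomes
```

Calling `campaign([1.0, 2.0, 3.0, 4.0], trials=1000)` returns a `Counter` whose values divided by the number of trials give the estimated proportions. The exact split depends on the workload, the tolerance, and the fault model, which is the sensitivity the quoted passage warns about.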


Refine

Georgakoudis

Laguna

Nikolopoulos

et al. 2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis


Compiler-based fault injection (FI) has become a popular technique for resilience studies to understand the impact of soft errors in supercomputing systems. Compiler-based FI frameworks inject faults at a high intermediate-representation level. However, they are less accurate than machine code, binary-level FI because they lack access to all dynamic instructions, thus they fail to mimic certain fault manifestations. In this paper, we study the limitations of current practices in compiler-based FI and how they impact the interpretation of results in resilience studies. We propose REFINE, a novel framework that addresses these limitations, performing FI in a compiler backend. Our approach provides the portability and efficiency of compiler-based FI, while keeping accuracy comparable to binary-level FI methods. We demonstrate our approach in 14 HPC programs and show that, due to our unique design, its runtime overhead is significantly smaller than state-of-the-art compiler-based FI frameworks, reducing the time for large FI experiments.

CCS Concepts: • Computing methodologies → Simulation tools; Model verification and validation; • Software and its engineering → Compilers; • Hardware → Analysis and design of emerging devices and systems


FINJ: A Fault Injection Tool for HPC Systems

Netti

Kızıltan

Babaoğlu

et al. 2018

Lecture Notes in Computer Science


We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly-complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.
