Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that they are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should therefore be supplemented with verifications to detect silent errors: when a verification is successful, only the last checkpoint needs to be kept in memory, because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (Mean Time Between Failures) is large compared to the other parameters (checkpointing, verification, and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared to the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
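For intuition, the following C sketch estimates the waste of the base pattern (p = q = 1) under a standard first-order model: a silent error is assumed to be detected only at the end-of-pattern verification, so the whole chunk is re-executed after a recovery. All parameter values (MTBF mu, checkpoint cost C, verification cost V, recovery cost R) are illustrative, and the formula is a generic first-order estimate, not the exact expressions derived in the paper.

```c
/* First-order waste of the base pattern (p = q = 1): a work chunk of
 * length T, followed by a verification (cost V) and a checkpoint (cost C).
 * A silent error (MTBF mu) is detected only at the verification, so the
 * chunk is re-executed after a recovery (cost R).  Illustrative values. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double mu = 3600.0;   /* platform MTBF (s), assumed     */
    const double C  = 10.0;     /* checkpoint cost (s), assumed   */
    const double V  = 5.0;      /* verification cost (s), assumed */
    const double R  = 10.0;     /* recovery cost (s), assumed     */

    /* waste(T) ~ (V + C)/T + (T + R)/mu, keeping first-order terms only */
    for (double T = 100.0; T <= 500.0; T += 100.0) {
        double waste = (V + C) / T + (T + R) / mu;
        printf("T = %5.0f s  ->  waste ~ %.4f\n", T, waste);
    }

    /* balancing the two dominant terms gives T* = sqrt((V + C) * mu) */
    double Topt = sqrt((V + C) * mu);
    printf("first-order optimum: T* = %.0f s, waste ~ %.4f\n",
           Topt, 2.0 * sqrt((V + C) / mu) + R / mu);
    return 0;
}
```

With these made-up numbers the first-order optimum lands around T* ≈ 230 s; the balanced (p, q) pattern studied in the paper refines this single-chunk estimate.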
Silent errors, or silent data corruptions, constitute a major threat to very large-scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of the pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use low-cost but less accurate verifications in the middle of a periodic computing pattern that ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of using partial verifications under certain scenarios, when compared to the baseline algorithm that uses only guaranteed verifications.
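The following Monte Carlo sketch illustrates the trade-off described above. It simulates a pattern with k equally spaced partial verifications followed by a guaranteed verification and a checkpoint, and estimates a first-order waste. The error model (uniform error position as a stand-in for exponential arrivals), the equidistant placement, and all numerical values are simplifying assumptions for illustration; the paper derives the optimal pattern analytically.

```c
/* Monte Carlo sketch: a pattern of W seconds of work split into (k + 1)
 * equal chunks separated by k partial verifications (recall r, cost Vp),
 * ending with a guaranteed verification (cost Vg) and a checkpoint (C).
 * A silent error strikes at a uniformly random point of the work; each
 * later partial verification catches it with probability r, and the final
 * guaranteed verification catches it for sure.  Illustrative values only. */
#include <stdio.h>
#include <stdlib.h>

/* time wasted by one error: re-executed work plus re-run verifications */
static double wasted_time(double W, int k, double r, double Vp, double Vg) {
    double chunk = W / (k + 1);
    double where = W * ((double)rand() / ((double)RAND_MAX + 1.0)); /* error position  */
    int first = (int)(where / chunk);                               /* chunk hit       */
    for (int j = first; j < k; j++)               /* partial verif. j sits after chunk j+1 */
        if ((double)rand() / RAND_MAX < r)
            return ((j + 1) * chunk - where) + (j - first + 1) * Vp;
    return (W - where) + (k - first) * Vp + Vg;   /* caught by the guaranteed verif. */
}

int main(void) {
    const double mu = 3600.0, C = 10.0, Vg = 5.0, Vp = 0.5, r = 0.8, R = 10.0;
    const double W  = 300.0;          /* work per pattern (s), assumed */
    const int trials = 1000000;

    srand(42);
    for (int k = 0; k <= 5; k++) {
        double lost = 0.0;
        for (int t = 0; t < trials; t++)
            lost += wasted_time(W, k, r, Vp, Vg);
        lost /= trials;
        /* first-order waste: pattern overhead per unit of work + error loss rate */
        double waste = (k * Vp + Vg + C) / W + (lost + R) / mu;
        printf("k = %d partial verifications -> waste ~ %.4f\n", k, waste);
    }
    return 0;
}
```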
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (the fraction of all errors that are actually detected). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios.
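As a toy illustration of the cost/recall trade-off that such a greedy selection has to navigate, the sketch below ranks a few detector types by a deliberately simple stand-in criterion (recall per unit cost). This is not the criterion or the algorithm analyzed in the paper; detector names and numbers are hypothetical.

```c
/* Toy ranking of detector types by an illustrative figure of merit
 * (recall / cost).  NOT the paper's selection criterion; hypothetical data. */
#include <stdio.h>

typedef struct { const char *name; double cost; double recall; } detector_t;

int main(void) {
    detector_t d[] = {                 /* hypothetical detector types */
        { "lightweight detector A", 0.5, 0.70  },
        { "mid-range detector B",   2.0, 0.95  },
        { "expensive detector C",   6.0, 0.999 },
    };
    int n = (int)(sizeof d / sizeof d[0]), best = 0;
    for (int i = 0; i < n; i++) {
        double merit = d[i].recall / d[i].cost;   /* illustrative criterion */
        printf("%-24s cost=%4.1f recall=%5.3f recall/cost=%5.3f\n",
               d[i].name, d[i].cost, d[i].recall, merit);
        if (merit > d[best].recall / d[best].cost)
            best = i;
    }
    printf("toy greedy pick: %s\n", d[best].name);
    return 0;
}
```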
We present techniques for accelerating the floating-point computation of x/y when y is known before x. The proposed algorithms are oriented towards architectures with an available fused-MAC operation. The goal is to get exactly the same result as with the usual division with rounding to nearest. These techniques can be used by compilers to accelerate some numerical programs without loss of accuracy.

1 Motivation of this research

We wish to provide methods for accelerating floating-point divisions of the form x/y, when y is known before x, either at compile time or at run time. We assume that a fused multiply-accumulator is available, and that division is done in software (this happens for instance on RS6000, PowerPC, or Itanium architectures). The computed result must be the correctly rounded result.

A naive approach consists in computing the reciprocal of y (with rounding to nearest), and then, once x is available, multiplying the obtained result by x. It is well known that this "naive method" does not always produce a correctly rounded result. One might then conclude that, since the result should always be correct, there is no interest in investigating that method. And yet, if the probability of getting an incorrect rounding were small enough, one could imagine the following strategy:

• the computations that follow the naive division are performed as if the division were correct;
• in parallel, using holes in the pipeline, a remainder is computed, to check whether the division was correctly rounded;
• if it turns out that the division was not correctly rounded, the result of the division is corrected using the computed remainder, and the computation is restarted at that point.

To investigate whether that strategy is worth applying, it is of theoretical and practical interest to have at least a rough estimate of the probability of getting an incorrect rounding. Also, one could imagine that there exist some values of y for which the naive method always works (for any x); these values could be stored. Last but not least, some properties of the naive method are used to design better algorithms. For these reasons, we have decided to dedicate a section to the analysis of the naive method.

Another approach starts as previously: once x is known, it is multiplied by the precomputed reciprocal of y. Then a remainder is computed and used to correct the final result. This does not require any testing. That approach resembles the final steps of a Newton-Raphson division. It is clear from the literature that the iterative algorithms for division require an initial approximation of the reciprocal of the divisor, and that the number of iterations is reduced by having a more accurate initial approximation. Of course, this initial approximation can be computed in advance if the divisor is known. The problem is to always get correctly rounded results, at very low cost.
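To make the two approaches concrete, here is a minimal C99 sketch using the standard fma() function: the naive quotient x * RN(1/y), followed by the remainder-based correction step. Whether the corrected value is always the correctly rounded quotient is precisely what the paper's analysis addresses; the sketch only illustrates the mechanism, and the input values are arbitrary.

```c
/* Naive quotient and remainder-based correction, using the C99 fma().
 * Illustration of the mechanism only: the conditions under which the
 * corrected result is always correctly rounded are studied in the paper. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double y = 3.0;            /* divisor, known in advance (arbitrary) */
    double x = 10.0;           /* dividend, known later (arbitrary)     */

    double recip = 1.0 / y;    /* precomputed reciprocal, rounded to nearest */

    /* naive method: a single multiplication once x is known */
    double q0 = x * recip;     /* may differ from the correctly rounded x / y */

    /* correction: fma computes x - q0 * y with a single rounding, and the
     * remainder is used to refine q0 (as in a final Newton-Raphson step) */
    double rem = fma(-q0, y, x);
    double q1  = fma(rem, recip, q0);

    printf("naive     : %.17g\n", q0);
    printf("corrected : %.17g\n", q1);
    printf("x / y     : %.17g\n", x / y);
    return 0;
}
```

Compile with, e.g., `cc -std=c99 div_sketch.c -lm`; on architectures with a hardware fused-MAC the two fma() calls map to single instructions.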