Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

Jaulmes, Luc; Casas, Marc; Moretó, Miquel; Ayguadé, Eduard; Labarta, Jesús; Valero, Mateo

doi:10.1145/2807591.2807599

Cited by 33 publications

(22 citation statements)

References 31 publications

“…• This paper further extends the original conference manuscript [25] with an in-depth study of the effect of page sizes, from 4KB up to 2MB, on the overheads of the techniques. Our algorithmic methods outperform the state-of-the-art on average up to 512KB page sizes.…”

Section: Introduction (mentioning)

confidence: 65%

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

Jaulmes, Moretó, Ayguadé et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite

Abstract: Current trends and projections show that faults in computer systems are becoming increasingly common. Such errors may be detected, and possibly corrected transparently, e.g. by Error Correcting Codes (ECC). For a program to be fault-tolerant, it also needs to handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, sets back any progress done since then. Alternatively, application-specific techniques exist, such as restarting an iterative program with its latest iteration's values as initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach allows us to recover any part of three widely used Krylov subspace methods: CG, GMRES and BiCGStab, and their preconditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware, and study the effect of varying the memory page size to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC). When compared to checkpointing and to the state-of-the-art algorithmic restart technique, from small (8 cores) to large scale (1024 cores), our methods show less overhead. A trade-off arises between our straightforward and asynchronous approaches, based on the rate at which faults happen. At the lowest considered rate and page size, overlapping recoveries decreases their average cost from 5.40% to 2.24% of the ideal faultless execution time. Our methods generally outperform the state-of-the-art even with increased overheads on big page sizes, and perform similarly on edge cases. These results also indicate that our techniques are increasingly efficient as the matrix size increases.
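The idea of restoring lost data from relations between solver variables can be illustrated with one relation that holds in CG in exact arithmetic, r = b - A x. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dense list-of-lists storage, and the small 4x4 system are hypothetical choices made for illustration. A lost block of r is recomputed directly, while a lost block of x is obtained by inverting the relation, which requires solving a small system with the diagonal block of A (nonsingular when A is symmetric positive definite).

```python
def matvec_rows(A, x, rows, cols):
    """Return sum over j in cols of A[i][j] * x[j], for each i in rows."""
    return [sum(A[i][j] * x[j] for j in cols) for i in rows]

def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def recover_residual_block(A, x, b, lost):
    """Recompute the lost entries of r from the relation r = b - A x."""
    Ax = matvec_rows(A, x, lost, range(len(b)))
    return [b[i] - v for i, v in zip(lost, Ax)]

def recover_iterate_block(A, x, b, r, lost):
    """Invert r = b - A x for the lost entries of x.

    Entries x[i] with i in `lost` are treated as unknown and never read.
    """
    kept = [j for j in range(len(b)) if j not in lost]
    Ax_kept = matvec_rows(A, x, lost, kept)
    rhs = [b[i] - r[i] - v for i, v in zip(lost, Ax_kept)]
    block = [[A[i][j] for j in lost] for i in lost]  # diagonal block of A
    return solve_dense(block, rhs)

def demo():
    # Hypothetical SPD test system (assumption, not from the paper).
    A = [[4.0, 1.0, 0.0, 0.0],
         [1.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 1.0],
         [0.0, 0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0, 4.0]
    x = [0.5, -1.0, 2.0, 0.25]
    r = [bi - v for bi, v in zip(b, matvec_rows(A, x, range(4), range(4)))]
    lost = [1, 2]                      # indices of the "page" that was lost
    damaged = list(x)
    for i in lost:
        damaged[i] = None              # lost values are unavailable
    x_rec = recover_iterate_block(A, damaged, b, r, lost)
    r_rec = recover_residual_block(A, x, b, lost)
    return x_rec, r_rec
```

In this sketch the recovered values match the originals up to rounding, so the solver could continue from the same state instead of restarting or rolling back. A sparse, page-sized version would restrict the sums to the nonzeros of the affected rows.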

“…This manuscript is the journal extension of a previously published conference paper [25]. This work has been par- .…”

Section: Acknowledgments (mentioning)

confidence: 98%

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

Jaulmes, Moretó, Ayguadé et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite

“…It computes the solution by building a basis of orthogonal vectors each iteration. We use a sparse matrix version with the task decomposition described by Jaulmes et al [19]. The manual scheduling assigns tasks to sockets in a round-robin fashion.…”

Section: Tested Applications (mentioning)

confidence: 99%

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies

Barrera, Moretó, Ayguadé et al. 2018

Proceedings of the 2018 International Conference on Supercomputing

Self Cite

Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages or both and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52× and average improvements of 1.12× with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28× on average with respect to the state-of-the-art.

CCS Concepts: Computing methodologies → Parallel computing methodologies; Computer systems organization → Multicore architectures; Mathematics of computing → Graph algorithms

“…The inputs are selected to balance between simulation time and LLC footprint. Finally, we use benchmark CG, a conjugate gradient method [23], implemented in OmpSs by Jaulmes et al [10]. The input is the matrix qa8fm from The University of Florida Sparse Matrix Collection [8].…”

Section: Benchmarks (mentioning)

confidence: 99%

Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals

Dimić, Moretó, Casas et al. 2017

Lecture Notes in Computer Science

Self Cite

Abstract. Processor speed is improving at a faster rate than the speed of main memory, which makes memory accesses increasingly expensive. One way to solve this problem is to reduce the miss ratio of the processor's last level cache by improving its replacement policy. We approach the problem by co-designing the runtime system and hardware and exploiting the semantics of the applications written in data-flow task-based programming models to provide hardware with information about the task types and task data-dependencies. We propose the Task-Type aware Insertion Policy, TTIP, which uses the runtime system to dynamically determine the best probability per task type for bimodal insertion in the recency stack, and the static Dependency-Type aware Insertion Policy, DTIP, which inserts cache lines in the optimal position taking into account the dependency types of the current task. TTIP and DTIP perform similarly to or better than state-of-the-art replacement policies, while requiring less hardware.
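Bimodal insertion with a per-task-type probability, as mentioned in this abstract, can be sketched as follows. This is an illustrative model only, not the TTIP hardware or runtime design: the class name `BimodalSet`, the fixed probability table, and the task type labels are assumptions, whereas the abstract describes probabilities chosen dynamically by the runtime system. A missing line is inserted at the most recently used (MRU) position with the probability assigned to its task type, and at the least recently used (LRU) position otherwise.

```python
import random

class BimodalSet:
    """One cache set as a recency stack: index 0 is MRU, last index is LRU."""

    def __init__(self, ways, mru_probability, rng=None):
        self.ways = ways
        # Hypothetical table: task type -> probability of MRU insertion.
        self.mru_probability = mru_probability
        self.rng = rng or random.Random(0)
        self.stack = []

    def access(self, line, task_type):
        """Access a line on behalf of a task type; return True on a hit."""
        if line in self.stack:
            self.stack.remove(line)
            self.stack.insert(0, line)      # hit: promote to MRU
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()                # miss on a full set: evict LRU
        p = self.mru_probability.get(task_type, 0.0)
        if self.rng.random() < p:
            self.stack.insert(0, line)      # insert at MRU with probability p
        else:
            self.stack.append(line)         # otherwise insert at LRU
        return False

def demo():
    # Extreme probabilities keep the example deterministic.
    probs = {"reuse": 1.0, "stream": 0.0}
    s = BimodalSet(ways=2, mru_probability=probs)
    hits = [
        s.access("a", "reuse"),
        s.access("s1", "stream"),
        s.access("s2", "stream"),
        s.access("a", "reuse"),
    ]
    return hits, s.stack
```

With these settings the streaming lines replace each other at the LRU position, so the line brought in by the reuse-heavy task type is still present on its second access.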
