Graphics processing units (GPUs) are the reference architecture for accelerating high-performance computing applications and the training/inference of convolutional neural networks. In both domains, performance and reliability are two of the main constraints. It is commonly believed that the only way to increase reliability is to sacrifice performance, e.g., by adding redundancy. In this paper we show that this is not always the case. As a very promising result, we found that most GPU performance improvements also increase the number of executions correctly completed before a silent data corruption (SDC) occurs. We consider four common GPU performance optimizations: architectural solutions, software implementations, compiler optimizations, and the degree of thread-level parallelism. We compare different implementations of a variety of parallel codes and, through beam experiments and application profiling, show that a performance improvement typically (but not necessarily) increases the GPU SDC rate. Nevertheless, for the vast majority of configurations the performance gain is much larger than the SDC rate increase, allowing a greater amount of correct data to be processed. As we show, the programmer's choices can increase the number of correctly completed executions by up to 25× without redesigning the algorithm or adding specific hardening solutions.
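The abstract's core argument is that what matters is not the SDC rate alone but how many executions finish correctly between SDCs. A minimal sketch of that trade-off metric, with purely illustrative numbers (the function name and all figures are assumptions, not values from the paper):

```python
# Hedged sketch: comparing two hypothetical GPU configurations by the
# expected number of executions correctly completed before an SDC.
# All numbers below are illustrative, not measurements from the paper.

def correct_execs_before_sdc(execs_per_hour: float, sdc_rate_per_hour: float) -> float:
    """Expected number of executions completed correctly before the first SDC.

    If SDCs occur at `sdc_rate_per_hour` and the application finishes
    `execs_per_hour` runs per hour, the mean time between SDCs is
    1 / sdc_rate_per_hour hours, during which
    execs_per_hour / sdc_rate_per_hour executions complete correctly.
    """
    return execs_per_hour / sdc_rate_per_hour

# Baseline configuration (illustrative).
baseline = correct_execs_before_sdc(execs_per_hour=100.0, sdc_rate_per_hour=1.0)

# Optimized configuration: 4x the throughput, but the SDC rate only doubles.
optimized = correct_execs_before_sdc(execs_per_hour=400.0, sdc_rate_per_hour=2.0)

print(baseline)   # 100.0 correct executions per SDC
print(optimized)  # 200.0 — the optimization still comes out ahead
```

This captures why, per the abstract, an optimization can raise the SDC rate yet still increase the amount of correct data processed: the gain matters whenever throughput grows faster than the SDC rate.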
Silent data corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Since traditional techniques are not effective at detecting these errors, problems caused by SDCs in processors are very hard to address; for the same reason, knowledge about SDCs in the wild is limited. In this article, we conduct an extensive study of CPU SDCs in a large production CPU population encompassing over one million processors. In addition to collecting overall statistics, we perform a detailed study to understand (1) whether certain processor features are particularly vulnerable and their potential impacts on applications; (2) the reproducibility of CPU SDCs and the triggering conditions (e.g., temperature) of the less-reproducible SDCs; and (3) the challenges of mitigating and handling CPU SDCs. We further investigate the implications of these observations for SDC fault models, SDC mitigation strategies, and future research directions. In addition, we design an efficient SDC mitigation approach called Farron, which uses prioritized testing to detect highly reproducible SDCs and temperature control to mitigate less-reproducible ones. Our experimental results indicate that Farron achieves better coverage of CPU SDCs with lower overall overhead than the baseline used in Alibaba Cloud, demonstrating that our observations can assist in SDC mitigation.
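The abstract describes Farron's split strategy only at a high level: directed testing for highly reproducible SDCs, temperature control for the rest. A minimal sketch of that split, where the function name, the reproducibility threshold, and the budget are all assumptions for illustration and not Farron's actual implementation:

```python
# Hedged sketch of the "prioritized testing" idea from the abstract:
# spend a limited testing budget on the most reproducible SDCs, and fall
# back to temperature control for the rest. Thresholds and identifiers
# below are hypothetical, not taken from Farron.

def plan_mitigation(sdc_reports, test_budget, repro_threshold=0.5):
    """Split SDC reports between directed testing and temperature control.

    sdc_reports: list of (cpu_id, reproducibility) pairs, reproducibility in [0, 1].
    Returns (tested, temperature_controlled) as two lists of cpu_ids.
    """
    # Most reproducible first: these are the cheapest to catch with tests.
    ranked = sorted(sdc_reports, key=lambda r: r[1], reverse=True)
    tested, temp_controlled = [], []
    for cpu_id, repro in ranked:
        if repro >= repro_threshold and len(tested) < test_budget:
            tested.append(cpu_id)
        else:
            temp_controlled.append(cpu_id)
    return tested, temp_controlled

reports = [("cpu-a", 0.9), ("cpu-b", 0.2), ("cpu-c", 0.7), ("cpu-d", 0.05)]
tested, cooled = plan_mitigation(reports, test_budget=2)
print(tested)  # ['cpu-a', 'cpu-c']
print(cooled)  # ['cpu-b', 'cpu-d']
```

The design choice the abstract motivates is visible here: testing is only worth its cost on SDCs reproducible enough to be caught, so the less-reproducible remainder is handled by controlling the triggering condition (temperature) instead.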