As HPC applications grow in scale, interruptions caused by hardware failures have become more frequent. The marked decrease in Mean Time Between Failures (MTBF) in current systems motivates research into suitable fault tolerance solutions. Message logging combined with uncoordinated checkpointing provides a scalable rollback-recovery solution. However, message logging techniques usually account for most of the overhead during failure-free executions. With this in mind, this paper proposes Hybrid Message Pessimistic Logging (HMPL), which combines the fast recovery of pessimistic receiver-based message logging with the low failure-free overhead of pessimistic sender-based message logging. HMPL manages messages using a distributed controller and storage to avoid harming the system's scalability. Experiments show that HMPL reduces overhead by 34% during failure-free executions and by 20% in faulty executions when compared with pessimistic receiver-based message logging.
This research has been supported by the MINECO (MICINN) Spain under contracts TIN2011-24384 and TIN2014-53172-P.
Peer reviewed. Postprint (author's final draft).
Abstract. Computational Science and Engineering is an inherently multidisciplinary field, the increasingly important partner of theory and experimentation in the development of knowledge. The Computer Architecture and Operating Systems department of the Universitat Autònoma de Barcelona has created an innovative new master's degree programme with the aim of introducing students to core concepts in this field, such as large-scale simulation and high-performance computing. An innovative course model allows students without a computational science background to enter this arena. Students from different fields have already completed the first edition of the new course, and positive feedback has been received from students and professors alike. The second edition is in development.