Extending the OpenCHK Model with advanced checkpoint features

Maroñas, Marcos; Mateo, Sergi; Keller, Kai; Bautista-Gomez, Leonardo; Ayguadé, Eduard; Beltrán, Vicenç

doi:10.1016/j.future.2020.06.003

Cited by 6 publications

(6 citation statements)

References 12 publications


“…For system-level fault tolerance, most approaches rely on rollback recovery. A system-level approach is given in [15], proposing compiler instructions for allowing users to specify checkpoint/restart operations, supporting from basic to advanced mechanisms currently available on dedicated libraries and the using of fault-tolerance-dedicated threads. Another system-level strategy proposes extensions to the Distem emulator [16], enabling it to evaluate fault tolerance and load balancing mechanisms in real HPC Runtimes Charm++, MPICH, and OpenMPI.…”

Section: Current Fault Tolerance Approaches (mentioning)

confidence: 99%

Application-Based Fault Tolerance for Numerical Linear Algebra at Large Scale

González

2022

Euro-Par 2021: Parallel Processing Workshops


“…Similarly to Maroñas et al [57], BT's MPI implementation was expanded to support scalable checkpoint recovery (SCR). That is, at the end of each iteration, the solver inner state is checkpointed.…”

Section: DCPMM as Checkpoint/Restart (C/R) Storage (mentioning)

confidence: 99%
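
The per-iteration checkpointing mentioned in the citation statement above can be illustrated with a minimal sketch. This is a generic file-based example, not the SCR library and not OpenCHK: the function names, the pickle file format, and the `fail_at` parameter (which simulates a crash) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_solver(path, total_iters, fail_at=None):
    """Toy iterative solver that checkpoints its state after every iteration.

    fail_at is a hypothetical knob: it raises before that iteration runs,
    standing in for a node failure.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"iteration": 0, "value": 0.0}
    for i in range(state["iteration"], total_iters):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["value"] += i            # stand-in for one solver step
        state["iteration"] = i + 1
        save_checkpoint(path, state)   # checkpoint at the end of the iteration
    return state
```

Calling `run_solver` again with the same path after a failure resumes from the last completed iteration instead of from zero. The write-then-rename step is one common way to avoid leaving a half-written checkpoint behind if the process dies during the write.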

Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing

Yehonatan, Snir, Rusanovsky et al.

2021

Preprint


As the High Performance Computing (HPC) world moves towards the Exa-Scale era, huge amounts of data must be analyzed, manipulated and stored. In the traditional storage/memory hierarchy, each compute node retains its data objects in its local volatile DRAM. Whenever the DRAM's capacity becomes insufficient for storing this data, the computation must either be distributed between several compute nodes, or some portion of these data objects must be stored in a non-volatile block device such as a hard disk drive (HDD) or an SSD storage device. These standard block devices offer large and relatively cheap non-volatile storage, but their access times are orders of magnitude slower than those of DRAM. Optane™ DataCenter Persistent Memory Module (DCPMM) [1], a new technology introduced by Intel, provides non-volatile memory that can be plugged into standard memory bus slots (DDR DIMMs) and can therefore be accessed much faster than standard storage devices. In this work, we present and analyze the results of a comprehensive performance assessment of several ways in which DCPMM can 1) replace standard storage devices, and 2) replace or augment DRAM to improve the performance of HPC scientific computations. To achieve this goal, we have configured an HPC system such that DCPMM can service I/O operations of scientific applications, replace standard storage devices and file systems (specifically for diagnostics and checkpoint-restarting), and serve to expand applications' main memory. We focus on changing the scientific codes as little as possible, while allowing them to access the NVM transparently as if they were accessing persistent storage. Our results show that DCPMM allows scientific applications to fully utilize nodes' locality by providing them with sufficiently large main memory. Moreover, it can also provide a high-performance replacement for persistent storage. Thus, DCPMM has the potential to replace standard HDD and SSD storage devices in HPC architectures and to enable a more efficient platform for modern supercomputing applications. The source code used by this work, as well as the benchmarks and other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/StoringStorage.


“…x[0], x[5], x[10], x[15] #pragma oss taskloop inout(x[i]) grainsize(5) for (i = 0; i < 20; i++) {...}…”

Section: Methods (mentioning)

confidence: 99%
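
The taskloop fragment quoted above lists the elements x[0], x[5], x[10] and x[15] next to a loop of 20 iterations with grainsize(5). A minimal sketch of that arithmetic follows, under two assumptions that the quote suggests but does not state: each task receives a consecutive chunk of at most `grainsize` iterations, and each task's `inout(x[i])` dependence is evaluated with `i` set to the first iteration of its chunk. The helper names are hypothetical, and this is not the OmpSs-2 runtime.

```python
def taskloop_chunks(lower, upper, grainsize):
    """Split [lower, upper) into consecutive chunks of at most grainsize iterations."""
    return [range(start, min(start + grainsize, upper))
            for start in range(lower, upper, grainsize)]


def chunk_dependences(lower, upper, grainsize, array_name="x"):
    """Name the element each chunk would register, taking i = its first iteration."""
    return [f"{array_name}[{chunk.start}]"
            for chunk in taskloop_chunks(lower, upper, grainsize)]


# Loop bounds and grainsize taken from the quoted fragment.
print(chunk_dependences(0, 20, 5))
```

Under these assumptions the 20 iterations form four tasks of five iterations each, one per listed element.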

“…Some of the most important are Intel TBB [65], OpenMP [84], CUDA [69] or MPI [117]. They can be classified in several different ways: shared or distributed memory, sup- In this thesis, we contribute to two programming models: OmpSs-2 [14] and OpenCHK [15]. OmpSs-2 is an already existing programming model that we enhanced with novel features.…”

Section: Programming Models (mentioning)

confidence: 99%


On the design and development of programming models for exascale systems

Maroñas Bravo


High Performance Computing (HPC) systems have been evolving over time to adapt to the requirements of the scientific community. We are currently approaching the Exascale era. Exascale systems will incorporate a large number of nodes, each containing many computing resources. In addition to the computing resources, memory hierarchies are becoming deeper and more complex. Overall, Exascale systems will present several challenges in terms of performance, programmability and fault tolerance. Regarding programmability, the more complex a system architecture is, the harder it is to exploit the system properly. Programmability is closely related to performance, because the performance a system can deliver is useless if users are not able to write programs that obtain it. This stresses the importance of programming models as a tool to easily write programs that can reach the peak performance of the system. Finally, it is well known that more components lead to more errors. The combination of long executions with a low Mean Time To Failure (MTTF) may jeopardize application progress. Thus, all the efforts made to improve performance become pointless if applications can hardly finish. To prevent that, we must apply fault tolerance. The main goal of this thesis is to enable non-expert users to exploit complex Exascale systems. To that end, we have enhanced state-of-the-art parallel programming models to cope with three key Exascale challenges: programmability, performance and fault tolerance. The first set of contributions focuses on the efficient management of modern multicore/manycore processors. We propose a new kind of task that combines the key advantages of tasks with the key advantages of worksharing techniques. The use of this new task type alleviates granularity issues, thereby enhancing performance in several scenarios.
We also propose the introduction of dependences in the taskloop construct so that programmers can easily apply blocking techniques. Finally, we extend the taskloop construct to support the creation of the new kind of tasks instead of regular tasks. The second set of contributions focuses on the efficient management of modern memory hierarchies, with an emphasis on NUMA domains. By using the information that users provide in the dependence annotations, we build a system that tracks data location. Later, we use this information to make scheduling decisions that maximize data locality. Our last set of contributions focuses on fault tolerance. We propose a programming model that provides application-level checkpoint/restart in an easy and portable way. Our programming model offers a set of compiler directives to abstract users from system-level nuances. It also leverages state-of-the-art libraries to deliver high performance and includes several redundancy schemes.

