Julien Jaeger scite author profile

Julien Jaeger

5 Publications

30 Citation Statements Received

44 Citation Statements Given


Affiliations

CEA DAM Île-de-France, University of Paris-Saclay, Maison de la Simulation

Publications


Checkpoint/restart approaches for a thread-based MPI runtime

Adam, Kermarquer, Besnard, et al. 2019

Parallel Computing


Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to a dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed.


Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH

Jaeger, Saillard, Carribault, et al. 2015


MPI-3 provides functions for non-blocking collectives. To help programmers introduce non-blocking collectives into existing MPI programs, we improve the PARCOACH tool for checking the correctness of MPI call sequences. These enhancements focus on correct call sequences for all flavors of collective calls, and on the presence of completion calls for all non-blocking communications. The evaluation shows an overhead under 10% of the original compilation time.


An MPI Halo-Cell Implementation for Zero-Copy Abstraction

Besnard, Malony, Shende, et al. 2015


Automatic efficient data layout for multithreaded stencil codes on CPUs and GPUs

Jaeger, Barthou 2012



Transparent High-Speed Network Checkpoint/Restart in MPI

Adam, Besnard, Malony, et al. 2018


