The EU-funded XtreemOS project implements a grid operating system that transparently exploits the resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires saving and restoring jobs that execute in a distributed, heterogeneous grid environment. Such an environment may span millions of grid nodes (PCs, clusters, and mobile devices) that use different system-specific checkpointers to save and restore the application and kernel data structures of the processes executing on a grid node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid checkpointer and the system-specific checkpointers. We then discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.
Concurrency control in distributed and parallel applications has been studied for many years but remains an ongoing research topic. Transactional memory addresses this challenge for multicore processors by executing critical sections as restartable transactions combined with optimistic synchronization. Thus the programmer does not have to reason about complex lock management and deadlocks. We believe that some of these ideas are also useful for distributed systems. Therefore, we are developing the Object Sharing Service (OSS), which provides transparent data sharing for clusters and grids. OSS supports different consistency models for replica management within one application. In this paper we present the design and implementation of different transaction commit protocols that support transactional consistency. The main challenge of the resulting distributed transactional memory (DTM) is how to mask network latency so that transactions can commit quickly. Experiments with synthetic micro-benchmarks and a MapReduce application on the Grid'5000 platform show that a DTM can efficiently provide strong consistency for shared data.
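The idea of restartable transactions with optimistic synchronization can be illustrated with a minimal single-process sketch. It is not the OSS implementation or any of its commit protocols; the names `VersionedStore`, `Transaction`, `Conflict`, and `atomically` are hypothetical, and the sketch assumes a simple per-key version counter that is validated at commit time.

```python
class Conflict(Exception):
    """Raised when a key read by the transaction changed before commit."""


class VersionedStore:
    """Toy shared store (assumption): each key maps to (value, version)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Unknown keys read as (None, 0).
        return self._data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        # Validation: every key that was read must still have the version seen.
        for key, seen in read_versions.items():
            if self.read(key)[1] != seen:
                raise Conflict(key)
        # Write phase: publish the buffered writes and bump their versions.
        for key, value in writes.items():
            self._data[key] = (value, self.read(key)[1] + 1)


class Transaction:
    """Records versions of keys read and buffers writes until commit."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}
        self.writes = {}

    def get(self, key):
        if key in self.writes:  # read-your-own-writes
            return self.writes[key]
        value, version = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        self.writes[key] = value


def atomically(store, body, max_retries=10):
    """Run body(tx) and restart it from scratch if validation fails."""
    for _ in range(max_retries):
        tx = Transaction(store)
        result = body(tx)
        try:
            store.commit(tx.read_versions, tx.writes)
            return result
        except Conflict:
            continue  # discard buffered writes and re-execute
    raise RuntimeError("transaction did not commit")
```

A caller would pass the critical section as a function, for example `atomically(store, lambda tx: tx.put("x", (tx.get("x") or 0) + 1))`, with no explicit locks. In a distributed setting, the validation step would require network communication, which is where the latency-masking challenge described above arises.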