Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one-sided Put/Get calls to an abstract global shared space, enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines replication, checkpointing, and host selection. This combination presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application-specific host selection. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade-offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure-free performance.
INTRODUCTION
Distributed computing is widely deployed for resource-hungry applications that need massive computational power, large storage capacity, high processing speed, or quick access to data. Grid computing addresses this need with geographically distributed, networked, loosely coupled clusters. Popular grid computing middleware systems include Globus Toolkit [1] and UNICORE [2,3] (Uniform Interface to COmpute REsources). Ordinary PCs have been employed successfully for large-scale scientific computing, most commonly using BOINC [4] as middleware. Web browsers can also host middleware for distributed computing, a popular example being Weevilscout [5].
The BOINC middleware uses volunteered public PCs for scientific applications when they are idle. It has been remarkably successful, managing as much as 5 petaflops of aggregate compute power and supporting over 60 scientific research projects. However, this is a tiny fraction of the volunteer compute power that could be exploited.
The research presented in this paper aims to advance the state of the art for parallel computing on volunteer PC grids. We refer to PCs made available for scientific computing when idle as volunteer nodes (or hosts, or PCs), and to an execution environment composed of PCs connected by a LAN or the Internet as a volunteer PC grid (VPG), even if the PCs are managed hosts in an organization. Volunteer PCs represent a potentially immense but volatile resource; they are heterogeneous in terms of architecture, networking, and operating system, prone to failure, and their availability to execute guest scientific applications...