Abstract-Computer grids have attracted great attention from both the academic and enterprise communities. They have become an attractive alternative for executing applications that demand huge computational power, and they allow the integration of computational resources spread across different administrative domains. The dynamic nature of the grid infrastructure, its high scalability, and its great heterogeneity increase the likelihood of errors, making fault tolerance a major requirement for grid middleware. This paper describes a flexible fault-tolerance mechanism, implemented in the Integrade grid middleware, that allows the customization of several failure-handling parameters and the combination of different failure-handling techniques. It also presents experiments that measure the benefits of our approach across several execution environment scenarios.
I. INTRODUCTION
A computer grid comprises a hardware and software infrastructure that allows the integration and sharing of distributed resources, such as software, data, and peripherals, within and among institutions. This computational infrastructure has attracted great attention from the academic and enterprise communities. It has become an attractive alternative for executing applications that demand huge computational power, and it allows the integration of computational resources spread across different administrative domains.
Computational grids have been used to solve problems in varied areas of scientific, enterprise, and industrial activity, such as computational biology, image processing for medical diagnosis, weather forecasting, high-energy physics, marketing simulations, and oil prospecting. Grid computing has enabled a new generation of applications that combine computations, experiments, observations, and data obtained in real time. The phenomena modeled by these applications require diverse software components whose compositions and interactions are extremely dynamic. Moreover, the grid infrastructure is itself heterogeneous and dynamic, aggregating a large number of computation and communication resources, databases and, sometimes, sensors and specialized peripherals. This dynamism appears as high variation in resource availability, node instability, and workload variations in nodes and network links.
The dynamic nature of the grid infrastructure, its high scalability, and its great heterogeneity have made impracticable its