SUMMARYA variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long-running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost to performance ratio; however, due to being created using commodity components they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated; however, it was found that rollback-recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include; providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails.