Fault tolerance for large-scale applications has long been an area of active research, as the size of computations keeps growing. One key component of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary networks used in high-performance computing, and many researchers believe that they will be widely deployed in the Cloud as costs decrease. Existing approaches often rely on a solution specific to a particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time and reconnect it at restart time. Such schemes are difficult to incorporate into new parallel programming models, and they also imply higher checkpoint overhead.

In this dissertation, we present the first transparent, system-initiated checkpoint-restart solution that directly supports RDMA networks. This new approach does not depend on a specific parallel programming model and does not require any modification to the operating system. In addition, network connections remain active during checkpointing, making checkpointing more efficient.

Conceptually, this dissertation can be divided into three parts. First, we introduce a new, generic model for RDMA networks by extracting the key components for checkpointing an RDMA network. These components are the essential states that must be saved in order to restore the network connections on restart. The model is then applied to two distinct RDMA networks: InfiniBand and Intel Omni-Path. This work demonstrates the generality of the model, and it also describes the variations needed to adapt to InfiniBand or Omni-Path.

Second, we demonstrate the performance of the proposed approach.
Moving from a medium-sized academic computer cluster to a petascale supercomputer, we show which issues are exposed as the application scales up, and how these issues are addressed. In particular, different strategies for draining the network at checkpoint time are investigated, based on the underlying network protocol. As a result, the failure-free overhead is reduced to below 1%, even at the largest scale demonstrated: 32,752 processes.

Third, we show how to retrofit transparent checkpointing into the Cloud, as RDMA networks are also becoming more popular there. We present a Checkpointing-as-a-Service approach, which employs checkpointing to provide fault tolerance as a service in the Cloud and enables application migration in heterogeneous cloud environments.