Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce, an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. Our experiments show a 1.3× speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and a 1.5× speedup when training ALBERT-large from scratch on preemptible compute nodes.

Running large-scale distributed training in these circumstances requires fault- and latency-tolerant algorithms (Lian et al., 2017; Assran et al., 2019b). Most of these algorithms replace all-reduce averaging with gossip: each participant periodically downloads the latest parameters from its neighbors and averages them with its own.
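To make the convergence claim above concrete, the following is a minimal NumPy sketch of iterative group averaging: peers are repeatedly shuffled into groups and replace their values with the group average, which drives every peer toward the global mean at an exponential rate. The peer count, group size, and random regrouping scheme are illustrative assumptions for this toy simulation, not the exact procedure of Moshpit All-Reduce.

```python
import numpy as np


def group_average_round(values, group_size, rng):
    """Shuffle peers into groups of `group_size` and replace each peer's
    value with the average of its group (toy model, not the paper's protocol)."""
    n = len(values)
    order = rng.permutation(n)
    new_values = values.copy()
    for start in range(0, n, group_size):
        idx = order[start:start + group_size]
        new_values[idx] = values[idx].mean()
    return new_values


rng = np.random.default_rng(0)
n_peers, group_size = 64, 8                 # assumed sizes for illustration
values = rng.normal(size=n_peers)           # each peer holds one scalar "parameter"
global_mean = values.mean()                 # group averaging preserves this quantity

for step in range(5):
    values = group_average_round(values, group_size, rng)
    max_err = np.abs(values - global_mean).max()
    print(f"round {step + 1}: max deviation from global mean = {max_err:.2e}")
```

Running the loop prints a maximum deviation that shrinks by roughly an order of magnitude per round, illustrating the kind of exponential convergence to the global average that the protocol description refers to.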