“…In the deep learning regime, decentralized SGD, which was shown in [30] to achieve the same linear speedup in convergence rate as parallel SGD, has attracted considerable attention. Many efforts have been made to extend the algorithm to directed topologies [3,42], time-varying topologies [25,42], asynchronous settings [31], and data-heterogeneous scenarios [57,62,32,67]. Techniques such as quantization/compression [2,8,26,24,58,36], periodic updates [55,25,64], and lazy communication [37,38,13] have also been integrated into decentralized SGD to further reduce communication overheads.…”
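As a minimal sketch of the decentralized SGD update discussed above, each worker mixes its parameters with its neighbours via a doubly-stochastic mixing matrix (gossip averaging) and then takes a local gradient step. The NumPy toy example below illustrates this on quadratic local losses; the ring topology, uniform neighbour weights, learning rate, and loss functions are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly-stochastic mixing matrix for a ring topology (illustrative):
    each node averages itself and its two neighbours with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def decentralized_sgd(grads, x0, W, lr=0.1, steps=200):
    """One possible form of decentralized SGD: gossip-average parameters
    over the mixing matrix W, then take a local gradient step.

    grads: list of per-worker gradient functions g_i(x)
    x0:    shared initial parameter vector
    W:     n x n doubly-stochastic mixing matrix
    """
    n = len(grads)
    X = np.tile(x0, (n, 1))           # row i holds worker i's parameters
    for _ in range(steps):
        X = W @ X                     # gossip / consensus step
        G = np.stack([g(X[i]) for i, g in enumerate(grads)])
        X = X - lr * G                # local gradient step
    return X.mean(axis=0)             # averaged iterate across workers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 5
    targets = rng.normal(size=(n, d))                 # heterogeneous local data
    grads = [lambda x, t=t: x - t for t in targets]   # grad of 0.5*||x - t||^2
    x_avg = decentralized_sgd(grads, np.zeros(d), ring_mixing_matrix(n))
    # The averaged iterate approaches the minimizer of the global objective,
    # i.e. the mean of the local targets in this toy problem.
    print(np.allclose(x_avg, targets.mean(axis=0), atol=1e-2))
```

In this sketch the individual workers retain a small stepsize-dependent bias under data heterogeneity, while their average converges to the global minimizer; the heterogeneity-robust variants cited above modify the update precisely to remove such bias at each worker.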