1-bit communication is an effective method to scale up model training, and has been studied extensively for SGD. Its benefits, however, remain an open question for Adam-based model training (e.g., BERT and GPT). In this paper, we propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel designs: (1) adaptive variance state freezing, which eliminates the requirement of running expensive full-precision communication in the early stage of training; (2) 1-bit sync, which allows skipping communication rounds with bit-free synchronization over Adam's optimizer states, i.e., the momentum and variance. In theory, we provide a convergence analysis for 0/1 Adam on smooth non-convex objectives, and show that its complexity bound is better than that of the original Adam under certain conditions. On various benchmarks such as BERT-Base/Large pretraining and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 90% and communication rounds by 54%, and achieves up to 2× higher throughput compared to the state-of-the-art 1-bit Adam, while enjoying the same statistical convergence speed and end-to-end model accuracy on the GLUE benchmark and the ImageNet validation set.
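
To make the two designs concrete, the following is a minimal single-worker sketch in NumPy, not the paper's implementation: the names `one_bit_compress`, `zero_one_adam_step`, `freeze_step`, and `sync_interval` are illustrative assumptions, the cross-worker allreduce is omitted, and compression uses a simple sign-plus-scale scheme with error feedback.

```python
import numpy as np

def one_bit_compress(tensor, error):
    """Sign-based 1-bit compression with error feedback (illustrative only)."""
    corrected = tensor + error
    scale = np.mean(np.abs(corrected))        # one scalar scale per tensor
    compressed = scale * np.sign(corrected)   # would be sent as 1 bit per entry
    new_error = corrected - compressed        # residual kept locally
    return compressed, new_error

def zero_one_adam_step(param, grad, state, step,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                       freeze_step=400, sync_interval=4):
    """One step sketching the two ideas: freeze the variance after
    `freeze_step`, and only perform (1-bit compressed) momentum
    synchronization every `sync_interval` steps afterwards."""
    m, v, err = state["m"], state["v"], state["err"]

    m = beta1 * m + (1 - beta1) * grad
    if step <= freeze_step:
        # Warmup: the variance still adapts (full-precision sync would occur here).
        v = beta2 * v + (1 - beta2) * grad ** 2
    # Afterwards v is frozen, so it requires no further communication.

    if step > freeze_step and step % sync_interval == 0:
        # Rounds where workers would exchange the 1-bit compressed momentum.
        m, err = one_bit_compress(m, err)
    # Other rounds skip communication; each worker keeps using its local m.

    update = lr * m / (np.sqrt(v) + eps)
    state.update(m=m, v=v, err=err)
    return param - update, state

# Usage sketch: state = dict(m=np.zeros_like(p), v=np.zeros_like(p), err=np.zeros_like(p))
```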