Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing (PODC 2018)
DOI: 10.1145/3212734.3212763
The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

Abstract: Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm i…
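The abstract refers to the basic SGD update whose asynchronous, shared-memory execution the paper analyzes. For orientation, here is a minimal sketch of one sequential SGD step; the least-squares objective, the learning rate, and every name below are illustrative assumptions, not taken from the paper.

    import numpy as np

    def sgd_step(theta, minibatch, lr=0.05):
        # One plain SGD step: theta <- theta - lr * grad(theta; minibatch).
        # Assumed example objective: mean squared error of a linear model.
        X, y = minibatch
        residual = X @ theta - y
        grad = X.T @ residual / len(y)
        return theta - lr * grad

    # Usage: a few hundred steps on synthetic data.
    rng = np.random.default_rng(0)
    theta = np.zeros(5)
    for _ in range(200):
        X = rng.normal(size=(32, 5))
        y = X @ np.ones(5) + 0.1 * rng.normal(size=32)
        theta = sgd_step(theta, (X, y))
    print(theta)  # should be close to the all-ones vector used to generate y

The asynchronous setting studied in the paper differs from this sequential loop only in how the read of theta and the write of the update interleave across concurrent threads.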

Cited by 52 publications (69 citation statements); references 24 publications. The citing statements span 2019–2023.
“…For example, if both threads sample the same θ and additionally sample minibatches that are 100% biased towards the same label, the gradients derived by both threads will be maximally similar. We show evidence to support this claim in the evaluation. Specifically, we show that as minibatch bias increases, fewer attack threads are required to move the model out of its converged state.…”
Section: Challenge: Crafting Constructive Gradient Updates (mentioning)
confidence: 57%
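A small simulation can make the quoted claim concrete: the sketch below trains a model to (near) convergence, then computes gradients for two simulated "threads" at the same θ on minibatches with a controllable label bias and reports their average cosine similarity. The logistic-regression model, the bias parameter, the batch size, and all names are assumptions for illustration only; this is not the cited paper's attack code.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 10, 2000
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) > 0).astype(float)   # binary labels in {0, 1}
    flip = rng.random(n) < 0.1                       # 10% label noise so the optimum is finite
    y[flip] = 1 - y[flip]

    def logistic_grad(theta, Xb, yb):
        # Gradient of the average logistic loss on a (mini)batch (assumed model).
        p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
        return Xb.T @ (p - yb) / len(yb)

    def biased_minibatch(bias, size=64):
        # Sample a minibatch in which a `bias` fraction of the examples has label 1.
        ones, zeros = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        k = int(bias * size)
        idx = np.concatenate([rng.choice(ones, k), rng.choice(zeros, size - k)])
        return X[idx], y[idx]

    # Train to (near) convergence with full-batch gradient descent.
    theta = np.zeros(d)
    for _ in range(500):
        theta -= 0.5 * logistic_grad(theta, X, y)

    # At the converged theta, unbiased minibatch gradients are mostly noise,
    # while label-biased minibatch gradients share a large common component.
    for bias in (0.5, 0.8, 1.0):
        sims = []
        for _ in range(200):
            g1 = logistic_grad(theta, *biased_minibatch(bias))
            g2 = logistic_grad(theta, *biased_minibatch(bias))
            sims.append(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))
        print(f"label bias {bias:.1f}: mean cosine similarity of the two gradients = {np.mean(sims):.2f}")

The similarity should rise with the bias, which is the mechanism the quoted statement relies on: more aligned gradients from concurrent threads reinforce each other instead of averaging out.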
“…In the A-SGD literature, the closest work to that presented in our paper is [6]. That work discusses how an adversary can slow convergence by influencing scheduling.…”
Section: Related Work (mentioning)
confidence: 97%
“…The results in Theorem 6 and Corollary 3 are related to the results presented in [10] and [4]. The main differences are that in our analysis we tighten the bound by a factor (2 − θ)⁻¹, expand the allowed step-size interval, relax the maximum-staleness assumption, and reduce the magnitude of the bound from linear in the maximum staleness, O(τ), to linear in the expected staleness, O(E[τ]).…”
Section: Convex Convergence Analysis (mentioning)
confidence: 58%
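For orientation, the kind of comparison being made can be sketched schematically in LaTeX; the symbols T, τ, f, and the overall shape below are assumptions chosen to illustrate the change from worst-case to expected staleness, not the exact statements of [10], [4], or the citing paper.

    % Assumed generic shape of an asynchronous SGD bound for a convex objective f
    % after T updates with gradient staleness tau; NOT a quoted result.
    \[
      \mathbb{E}\bigl[f(\bar{x}_T)\bigr] - f^\star
        \;\le\; \mathcal{O}\!\Bigl(\tfrac{1 + \tau_{\max}}{\sqrt{T}}\Bigr)
      \qquad\longrightarrow\qquad
      \mathbb{E}\bigl[f(\bar{x}_T)\bigr] - f^\star
        \;\le\; \mathcal{O}\!\Bigl(\tfrac{1 + \mathbb{E}[\tau]}{\sqrt{T}}\Bigr)
    \]
    % i.e., the dependence on staleness is weakened from the worst case tau_max
    % to its expectation, which is the relaxation the statement above describes.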
“…Properties of Async-PSGD with sparse or component-wise updates have since been rigorously studied in recent literature due to the performance benefits of lock-freedom [28], [24], [10]. The gradient-sparsity assumption was relaxed in the recent work [4], which magnified the convergence-time bound by a factor on the order of ~√d, d being the problem dimensionality. Delayed optimization in completely asynchronous first-order optimization algorithms was analyzed initially in [2], where Agarwal et al. introduce step sizes which diminish over the progression of SGD, depending on the maximum staleness allowed in the system, but not adaptive to the actual delays observed.…”
Section: Related Work (mentioning)
confidence: 99%
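To make the lock-free, component-wise update pattern discussed above concrete, here is a minimal Hogwild-style sketch; the data, model, step size, and thread count are illustrative assumptions, and CPython's global interpreter lock means this only demonstrates the unsynchronized access pattern, not genuine parallel speedup.

    import threading
    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 1000, 50
    X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.1)   # sparse features
    y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n)
    theta = np.zeros(d)                                        # shared parameters, no lock

    def worker(seed, steps=2000, lr=0.02):
        # Each thread repeatedly: pick a sample, read only the coordinates of theta
        # that the sample touches (without synchronization), compute a sparse
        # least-squares gradient, and write back only those coordinates.
        local_rng = np.random.default_rng(seed)
        for _ in range(steps):
            i = local_rng.integers(n)
            xi = X[i]
            support = np.flatnonzero(xi)               # coordinates this sample touches
            if support.size == 0:
                continue
            snapshot = theta[support].copy()           # possibly inconsistent read
            err = xi[support] @ snapshot - y[i]
            theta[support] -= lr * err * xi[support]   # unsynchronized sparse write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("training MSE after lock-free updates:", np.mean((X @ theta - y) ** 2))

Because each sample touches only a few coordinates, concurrent writes rarely collide, which is the intuition behind the sparsity assumptions that the quoted statement says later work relaxed at the cost of a ~√d factor.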