2021
DOI: 10.1007/s11704-021-0445-2
DRPS: efficient disk-resident parameter servers for distributed machine learning

Abstract: Parameter server (PS) as the state-of-the-art distributed framework for large-scale iterative machine learning tasks has been extensively studied. However, existing PS-based systems often depend on memory implementations. With memory constraints, machine learning (ML) developers cannot train large-scale ML models in their rather small local clusters. Moreover, renting large-scale cloud servers is always economically infeasible for research teams and small companies. In this paper, we propose a disk-resident pa…

Cited by 11 publications (2 citation statements)
References 17 publications
“…In both asynchronous and synchronous training, aggregated gradients can be shared between GPUs through the two basic data-parallel training architectures: parameter server architecture and AllReduce architecture. Parameter server architecture [14] is a centralized architecture where all GPUs communicate to a dedicated GPU for gradients aggregation and updates. Alternately, AllReduce architecture [20] is a decentralized architecture where the GPUs share parameter updates in a ring network topology manner through the Allreduce operation.…”
Section: Data Parallelism (mentioning)
confidence: 99%
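The centralized pattern described in the statement above can be illustrated with a minimal, single-process sketch: a simulated server holds the authoritative parameters, each worker pushes a gradient, and the server averages and applies them. All names here (NUM_WORKERS, worker_gradient, and so on) are illustrative placeholders, not part of DRPS or any cited system; a real deployment would exchange these tensors over the network between worker GPUs and the server.

```python
import numpy as np

# Minimal single-process sketch of parameter-server-style aggregation.
# Names are illustrative placeholders; a real system would move these
# tensors over the network between worker GPUs and a dedicated server.

NUM_WORKERS = 4
DIM = 8
LEARNING_RATE = 0.1

# The "server" holds the authoritative copy of the model parameters.
server_params = np.zeros(DIM)

def worker_gradient(worker_id, params):
    """Stand-in for one worker's backward pass on its data shard."""
    rng = np.random.default_rng(worker_id)
    return rng.normal(size=params.shape)

for step in range(3):
    # Each worker pulls the current parameters and pushes its gradient.
    gradients = [worker_gradient(w, server_params) for w in range(NUM_WORKERS)]
    # The server aggregates (here: averages) the gradients and updates.
    server_params -= LEARNING_RATE * np.mean(gradients, axis=0)

print("parameters after 3 steps:", server_params)
```

In the disk-resident setting the abstract describes, the server's parameter state would presumably be kept on disk rather than entirely in memory, while the same pull/aggregate/update cycle applies.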
“…(iii) Distribution of the new parameters among the workers, and retraining of the DNN [71]. To aggregate and update gradients, either a centralized architecture such as parameter server architecture [72], or a decentralized architecture such as All-Reduce [73] is used.…”
mentioning
confidence: 99%
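For the decentralized alternative mentioned here, the sketch below simulates a ring all-reduce (reduce-scatter followed by all-gather) over in-memory NumPy arrays. The function name ring_allreduce and the worker/chunk bookkeeping are assumptions for illustration only, standing in for the GPU-to-GPU exchange that a collective-communication library would perform in practice.

```python
import numpy as np

def ring_allreduce(grads):
    """Toy ring all-reduce: grads[w] is worker w's gradient vector;
    returns one fully summed copy per worker."""
    n = len(grads)
    # Split every worker's gradient into n chunks, one per ring step.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter phase: after n-1 steps, worker w holds the complete
    # sum for chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [chunks[w][(w - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w - step) % n] += sends[w]

    # All-gather phase: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [chunks[w][(w + 1 - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w + 1 - step) % n] = sends[w]

    return [np.concatenate(c) for c in chunks]

# Usage: four "workers" with random gradients; every copy ends up equal
# to the elementwise sum of all gradients, with no central server.
rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

The design point the citing papers contrast is visible here: the parameter-server sketch funnels all traffic through one node, while the ring variant spreads communication evenly across peers at the cost of more exchange steps.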