2019
DOI: 10.48550/arxiv.1909.02061
Preprint

Performance Analysis and Comparison of Distributed Machine Learning Systems

Abstract: Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance m…
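
As a point of reference for the synchronous SGD setting named in the abstract, the per-iteration update and a generic first-order cost decomposition can be written as follows (an illustrative formulation, not the paper's specific performance model):

$$
w_{t+1} = w_t - \frac{\eta}{P}\sum_{p=1}^{P} \nabla L_p(w_t),
\qquad
T_{\text{iter}} \approx \max_{p} T^{\text{comp}}_{p} + T^{\text{comm}}(P),
$$

where $P$ is the number of workers, $\nabla L_p$ is the gradient computed on worker $p$'s data shard, and $T^{\text{comm}}(P)$ stands for the architecture-dependent communication cost whose effect on cycle time and scalability the abstract refers to.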

Cited by 3 publications (4 citation statements) · References 19 publications

“…Various methods for distributing Machine Learning (ML) workloads have been discussed in the literature [1], and most ML frameworks expose multiple distribution schemes through a consistent API. This section highlights some common distribution paradigms, focusing on the techniques used to scale DeepWalk using commodity hardware (which we refer to as HUGE-CPU) and TPUs (HUGE-TPU).…”
Section: Common ML Distribution Strategies
confidence: 99%
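
The "consistent API" point above is easiest to see in code. The sketch below is purely illustrative (it assumes TensorFlow's tf.distribute strategies and a toy Keras model, neither of which is prescribed by the cited works): swapping the strategy object changes the distribution scheme while the model-building and training code stay the same.

```python
# Illustrative only: a framework-level strategy object selects the
# distribution scheme; the code under strategy.scope() is unchanged.
import tensorflow as tf

def build_model():
    # Toy model used only to demonstrate the API shape.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

strategy = tf.distribute.MirroredStrategy()               # one host, all local devices
# strategy = tf.distribute.MultiWorkerMirroredStrategy()  # several hosts (CPU/GPU)
# strategy = tf.distribute.TPUStrategy(tpu_resolver)      # TPU slice (resolver assumed)

with strategy.scope():
    model = build_model()
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```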
“…In turn, the computed gradients are passed back to the PS, which uses them to update the weights again. In spite of its simplicity, this architecture shows poor scalability because all workers must communicate with the PS, and thus the PS easily becomes a bottleneck when there is a large number of workers in the cluster [4], [5], [18].…”
Section: A Distributed DNN Training
confidence: 99%
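
To make the all-to-one traffic pattern concrete, here is a minimal single-process sketch of the parameter-server (PS) pattern described above, written in plain NumPy with an assumed toy linear model and synthetic data (it is not the architecture or workload evaluated in the cited papers): every worker sends its gradient to the single PS, which performs the synchronous SGD step.

```python
# Toy, single-process simulation of synchronous SGD with a parameter server.
# Every worker's gradient goes through the single PS -- the all-to-one
# traffic that makes the PS a bottleneck as the worker count grows.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # shared model weights
        self.lr = lr

    def update(self, worker_grads):
        # Synchronous step: average the gradients from all workers, then apply.
        self.w -= self.lr * np.mean(worker_grads, axis=0)
        return self.w

def local_gradient(w, x, y):
    # Mean-squared-error gradient for a linear model (illustrative only).
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(8)]  # 8 workers

ps = ParameterServer(dim=4)
for step in range(100):
    w = ps.w
    grads = [local_gradient(w, x, y) for x, y in shards]  # every worker talks to the PS
    ps.update(grads)
```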
“…Each worker applies this aggregated gradient to its weights for the next iteration. Since communication occurs only between neighboring workers, network traffic is decentralized and, consequently, higher scalability can be obtained [5].…”
Section: A Distributed DNN Training
confidence: 99%
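
The neighbor-only pattern described above is what ring all-reduce implements. The sketch below is a single-process NumPy illustration under an assumed even chunking of the gradient (real systems pipeline these exchanges and overlap them with computation): each worker exchanges exactly one chunk with its ring neighbor per step, so per-link traffic does not grow with the number of workers.

```python
# Toy, single-process ring all-reduce: gradients are averaged using only
# neighbor-to-neighbor chunk exchanges (reduce-scatter, then all-gather).
import numpy as np

def ring_allreduce(grads):
    """Average equal-length gradient vectors across len(grads) workers."""
    n = len(grads)
    # Each worker splits its local gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                      # chunk worker i forwards this step
            chunks[(i + 1) % n][c] += chunks[i][c]  # neighbor accumulates it

    # All-gather: circulate the completed chunks once around the ring.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                  # completed chunk worker i forwards
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) / n for c in chunks]  # every worker ends with the average

# Usage: four workers' local gradients reduce to the same averaged gradient.
rng = np.random.default_rng(0)
local = [rng.normal(size=10) for _ in range(4)]
averaged = ring_allreduce(local)
assert np.allclose(averaged[0], np.mean(local, axis=0))
```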