IEEE INFOCOM 2022 - IEEE Conference on Computer Communications
DOI: 10.1109/infocom48880.2022.9796820
Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training

Cited by 7 publications (3 citation statements)
References 13 publications
“…Communication acceleration: Existing communication acceleration techniques include, but are not limited to: (1) leveraging high-throughput, low-latency communication links, such as RDMA [34], [35], [36], InfiniBand, Intel Omni-Path, and NVIDIA's NVLink; (2) utilizing the message passing interface (MPI) and MPI-like implementations such as OpenMPI and Gloo [37]; (3) using high-performance communication collectives, such as NCCL and BLink [38], which support efficient communication between GPUs and many popular deep learning frameworks; (4) reducing data communication during the synchronization process, e.g., gradient quantization, compression, and sparsification [39], [40], [41], [42], [43], [44]; (5) using stale parameter updates to reduce the number of synchronized parameters, such as parameter freezing [45], [46], [47], Round-Robin Synchronous Parallel [48], and Bounded Staleness Parallel [49]; (6) tuning deep learning hyper-parameters, such as AutoByte [50]; (7) minimizing user-level overhead by conducting parameter aggregation at the transport layer [13]; (8) improving network-layer performance, such as network-level flow scheduling [51], [52] and congestion control [53]. Communication scheduling: Owing to the layer-wise and tensor-wise structure of DNNs, several works explore how to maximize the overlap of communication and computation.…”
Section: Related Work (mentioning)
Confidence: 99%
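Among the techniques the quoted passage lists, gradient sparsification (item 4) is the easiest to illustrate concretely. The following is a minimal sketch of top-k sparsification in PyTorch; the function names and the 1% ratio are illustrative assumptions, not taken from Mercury or the cited works [39]–[44].

```python
# Minimal top-k gradient sparsification sketch (illustrative names and ratio).
# Only the k largest-magnitude entries and their indices would be transmitted;
# dropped entries are simply discarded here, whereas real systems usually
# accumulate them locally and add them back in later steps.
import math
import torch

def sparsify_topk(grad: torch.Tensor, ratio: float = 0.01):
    """Return (values, indices) covering the top `ratio` fraction of |grad|."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def densify(values: torch.Tensor, indices: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient tensor from the sparse (values, indices) pair."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

if __name__ == "__main__":
    g = torch.randn(4, 1024)
    vals, idx = sparsify_topk(g)
    g_hat = densify(vals, idx, g.shape)
    print(f"transmitting {vals.numel()} of {g.numel()} gradient entries")
```

The trade-off is that less data crosses the network per step at the cost of a lossy gradient, which is why the cited compression works pair this step with error feedback or similar corrections.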
“…There are various factors that limit the scaling efficiency of data parallelism, such as energy consumption [7], [8], dataset privacy [9], [10], [11], and imbalance of computing resources [12]. In addition, with large-scale datasets, models, and clusters, intensive data communication due to gradient aggregation among nodes introduces a huge additional time overhead for data parallelism and limits its scalability [4], [5], [13]. Consequently, communication performance has become a significant bottleneck in distributed DNN training.…”
Section: Introduction (mentioning)
Confidence: 99%
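The gradient aggregation this passage refers to is typically realized as an all-reduce over each worker's gradients every training step. The sketch below is a hypothetical illustration only, assuming an already initialized torch.distributed process group; the helper name average_gradients is ours, not from the paper.

```python
# Illustrative sketch of synchronous gradient aggregation with non-blocking
# all-reduce calls. Assumes torch.distributed has been initialized elsewhere
# (e.g. launched with torchrun); this snippet does not set that up itself.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum each parameter's gradient across workers, then divide by world size."""
    world_size = dist.get_world_size()
    handles = []
    # Launch the all-reduces as non-blocking ops so transfers for different
    # tensors can proceed concurrently rather than strictly one after another.
    for param in model.parameters():
        if param.grad is not None:
            handles.append((param, dist.all_reduce(param.grad, async_op=True)))
    # Wait for every transfer to finish, then average in place.
    for param, handle in handles:
        handle.wait()
        param.grad.div_(world_size)
```

The per-step cost of these transfers is what the cited works, and transport-layer scheduling approaches like the one this paper proposes, try to hide by overlapping communication with the remaining computation.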