2022
DOI: 10.48550/arxiv.2202.00433
Preprint
TopoOpt: Optimizing the Network Topology for Distributed DNN Training

Abstract: We explore a novel approach for building DNN training clusters using commodity optical devices. Our proposal, called TOPOOPT, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TOPOOPT uses a novel alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with parallelization strategy, for distributed DNN training. To motivate our proposal, we measure the communicati…
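The alternating optimization described in the abstract can be pictured with a small toy sketch (not the authors' code): fix a topology, pick the parallelization strategy that minimizes an estimated iteration time, derive the traffic demands that strategy induces, rebuild the topology around those demands, and repeat. All strategies, cost models, and helper names below are illustrative assumptions rather than TOPOOPT's actual implementation.

```python
# Toy sketch of an alternating topology/parallelization optimization loop.
# Everything here (strategy space, cost model, greedy topology builder) is a
# simplified stand-in for the components described in the TOPOOPT abstract.

N_WORKERS = 4
STRATEGIES = ["data_parallel", "model_parallel"]  # toy strategy space


def traffic_matrix(strategy):
    """Toy communication demands (GB/iteration) induced by a strategy."""
    if strategy == "data_parallel":
        # all-reduce style: ring traffic between neighbouring workers
        return {(i, (i + 1) % N_WORKERS): 1.0 for i in range(N_WORKERS)}
    # model parallel: heavier point-to-point traffic between pipeline stages
    return {(i, i + 1): 2.0 for i in range(N_WORKERS - 1)}


def best_topology(demands, degree=1):
    """Greedy stand-in: give direct links to the largest demands first."""
    links = set()
    for pair, _gb in sorted(demands.items(), key=lambda kv: -kv[1]):
        if len(links) < N_WORKERS * degree:
            links.add(pair)
    return links


def iteration_time(links, demands):
    """Toy cost model: direct links are fast, detours pay a 3x penalty."""
    return sum(gb * (1.0 if pair in links else 3.0)
               for pair, gb in demands.items())


def alternating_optimization(rounds=3):
    links = best_topology(traffic_matrix(STRATEGIES[0]))  # arbitrary start
    best = None
    for _ in range(rounds):
        # Step 1: with the topology fixed, search the parallelization space.
        strategy = min(STRATEGIES,
                       key=lambda s: iteration_time(links, traffic_matrix(s)))
        # Step 2: the chosen strategy induces communication demands.
        demands = traffic_matrix(strategy)
        # Step 3: re-optimize the topology for those demands.
        links = best_topology(demands)
        cost = iteration_time(links, demands)
        if best is None or cost < best[0]:
            best = (cost, strategy, links)
    return best


if __name__ == "__main__":
    cost, strategy, links = alternating_optimization()
    print(f"strategy={strategy}, est. time={cost:.1f}, links={sorted(links)}")
```

In the paper's own terms, the parallelization search and the topology/routing construction are far richer (the abstract mentions a group theory-inspired algorithm); the sketch only illustrates the alternating loop structure.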

Cited by 4 publications (7 citation statements) | References 23 publications
“…Motivated by the optical switch based topologies in [33], the authors in [70] proposed TOPOOPT, a system that co-optimizes the network topology and the parallelization strategy for ANN training. The proposed scheme searches over the parallelization strategy space with a fixed topology and returns the communication demands to the system.…”
Section: Off-chip Communication for Neural Network Accelerators
Mentioning, confidence: 99%
“…Recent years have seen a surge of interest in developing methods to distribute machine learning (ML) tasks across multiple devices (Ben-Nun and Hoefler, 2019; Mayer and Jacobsen, 2020). One approach has been to optimise the physical plane of the distributed cluster such as its compute and network devices and architectures (Parsonson et al., 2020; Khani et al., 2021; Wang et al., 2022; Ottino et al., 2022). In this work, we instead focus on optimising the virtual plane, which determines how physical layer resources are allocated to execute a job.…”
Section: Related Work
Mentioning, confidence: 99%
“…Figure 1: How the network overhead of six distributed deep learning jobs (encompassing object tracking, recommendation, natural language processing, and image recognition) increases with the number of workers used in Meta's GPU cluster (Wang et al., 2022).…”
Section: Introduction
Mentioning, confidence: 99%