2022
DOI: 10.48550/arxiv.2202.00433
Preprint
TopoOpt: Optimizing the Network Topology for Distributed DNN Training

Abstract: We explore a novel approach for building DNN training clusters using commodity optical devices. Our proposal, called TOPOOPT, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TOPOOPT uses a novel alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with parallelization strategy, for distributed DNN training. To motivate our proposal, we measure the communicati…
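The alternating optimization described in the abstract can be pictured with a small toy sketch (not the authors' code): fix a topology, pick the parallelization strategy that minimizes an estimated iteration time, derive the traffic demands that strategy induces, rebuild the topology around those demands, and repeat. All strategies, cost models, and helper names below are illustrative assumptions rather than TOPOOPT's actual implementation.

```python
# Toy sketch of an alternating topology/parallelization optimization loop.
# Everything here (strategy space, cost model, greedy topology builder) is a
# simplified stand-in for the components described in the TOPOOPT abstract.

N_WORKERS = 4
STRATEGIES = ["data_parallel", "model_parallel"]  # toy strategy space


def traffic_matrix(strategy):
    """Toy communication demands (GB/iteration) induced by a strategy."""
    if strategy == "data_parallel":
        # all-reduce style: ring traffic between neighbouring workers
        return {(i, (i + 1) % N_WORKERS): 1.0 for i in range(N_WORKERS)}
    # model parallel: heavier point-to-point traffic between pipeline stages
    return {(i, i + 1): 2.0 for i in range(N_WORKERS - 1)}


def best_topology(demands, degree=1):
    """Greedy stand-in: give direct links to the largest demands first."""
    links = set()
    for pair, _gb in sorted(demands.items(), key=lambda kv: -kv[1]):
        if len(links) < N_WORKERS * degree:
            links.add(pair)
    return links


def iteration_time(links, demands):
    """Toy cost model: direct links are fast, detours pay a 3x penalty."""
    return sum(gb * (1.0 if pair in links else 3.0)
               for pair, gb in demands.items())


def alternating_optimization(rounds=3):
    links = best_topology(traffic_matrix(STRATEGIES[0]))  # arbitrary start
    best = None
    for _ in range(rounds):
        # Step 1: with the topology fixed, search the parallelization space.
        strategy = min(STRATEGIES,
                       key=lambda s: iteration_time(links, traffic_matrix(s)))
        # Step 2: the chosen strategy induces communication demands.
        demands = traffic_matrix(strategy)
        # Step 3: re-optimize the topology for those demands.
        links = best_topology(demands)
        cost = iteration_time(links, demands)
        if best is None or cost < best[0]:
            best = (cost, strategy, links)
    return best


if __name__ == "__main__":
    cost, strategy, links = alternating_optimization()
    print(f"strategy={strategy}, est. time={cost:.1f}, links={sorted(links)}")
```

In the paper's own terms, the parallelization search and the topology/routing construction are far richer (the abstract mentions a group theory-inspired algorithm); the sketch only illustrates the alternating loop structure.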

Cited by 4 publications (7 citation statements) | References 23 publications
“…Motivated by the optical switch based topologies in [33], the authors in [70] proposed TOPOOPT, a system that co-optimizes the network topology and the parallelization strategy for ANN training. The proposed scheme searches over the parallelization strategy space with a fixed topology and returns the communication demands to the system.…”
Section: Off-chip Communication for Neural Network Accelerators
Mentioning, confidence: 99%
“…Recent years have seen a surge of interest in developing methods to distribute machine learning (ML) tasks across multiple devices (Ben-Nun and Hoefler, 2019; Mayer and Jacobsen, 2020). One approach has been to optimise the physical plane of the distributed cluster such as its compute and network devices and architectures (Parsonson et al., 2020; Khani et al., 2021; Wang et al., 2022; Ottino et al., 2022). In this work, we instead focus on optimising the virtual plane, which determines how physical layer resources are allocated to execute a job.…”
Section: Related Work
Mentioning, confidence: 99%
“…Figure 1: How the network overhead of six distributed deep learning jobs (encompassing object tracking, recommendation, natural language processing, and image recognition) increases with the number of workers used in Meta's GPU cluster (Wang et al., 2022).…”
Section: Introduction
Mentioning, confidence: 99%