Scaling Distributed Machine Learning with In-Network Aggregation

Sapio, Amedeo; Canini, Marco; Ho, Chen-Yu; Nelson, Jacob L.; Kalnis, Panos; Kim, Changhoon; Moshref, Masoud; Ports, Dan R. K.; Richtárik, Peter

doi:10.48550/arxiv.1903.06701

Cited by 16 publications

(23 citation statements)

References 0 publications

“…However, due to the locality of this observation, the collected information can not be leveraged for strategic decisions by itself. This information needs to be collected and somehow integrated in more comprehensive analysis in order to make strategic decisions based on a global view of the acquired information [8], [11].…”

Section: Discussion and Insights (mentioning)

confidence: 99%

“…These headers enable differentiated processing and information gathering within FDs, or applying custom hash functions in order to distinguish specifics sets of packets [7]. On the other hand, data aggregation helps to reduce the amount of data that needs to be transmitted in the network from FDs to Control Plane, for the execution of complex operations there [8].…”

Section: Machine Learning At the Programmable Data Plane For Improved... (mentioning)

confidence: 99%

Watching Smartly from the Bottom: Intrusion Detection revamped through Programmable Networks and Artificial Intelligence

Gutiérrez, Branch, Gaspary et al. 2021

Preprint

A recent line of research has explored the possibility of leveraging functionalities of Programmable Data Planes to offload part of Machine Learning algorithms to the data plane, which might contribute to increasing their accuracy and responsiveness by giving them more detailed visibility of the traffic. This approach introduces a significant opportunity for evolution in the critical field of Intrusion Detection. In this paper, we discuss how Programmable Data Planes might complement different stages of an Intrusion Detection System based on Machine Learning. We present two use cases that make evident the feasibility of this approach and highlight aspects that must be considered when addressing the non-straightforward task of deploying solutions leveraging data-plane functionalities.

“…Besides gradient compression, there are other application-layer and system-layer optimizations. For example, ByteScheduler [4] orders the gradient transmission of different layers to better overlap with forward computation; and SwitchML [6] uses a programmable switch to aggregate gradients and reduce the communication size. These proposals all suggest significant reduction on the training time.…”

Section: Discussion and Future Work (mentioning)

confidence: 99%

“…In response to this, there has been a surge of research from machine learning and systems communities on improving the communication efficiency of distributed training in recent years [4][5][6][7][8][9][10][11][12][13][14][15][16]. These works are primarily done at the application layer, assuming that the network has done its best to maximize communication efficiency.…”

Section: Introduction (mentioning)

confidence: 99%

Is Network the Bottleneck of Distributed Training?

Zhang, Chang, Lin et al. 2020

Proceedings of the Workshop on Network Meets AI & ML

Recently there has been a surge of research on improving the communication efficiency of distributed training. However, little work has been done to systematically understand whether the network is the bottleneck and to what extent. In this paper, we take a first-principles approach to measure and analyze the network performance of distributed training. As expected, our measurement confirms that communication is the component that blocks distributed training from linear scale-out. However, contrary to the common belief, we find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one. Moreover, while many recent proposals on gradient compression advocate over 100× compression ratio, we show that under full network utilization, there is no need for gradient compression in a 100 Gbps network. On the other hand, a lower-speed network like 10 Gbps requires only a 2×-5× gradient compression ratio to achieve almost linear scale-out. Compared to application-level techniques like gradient compression, network-level optimizations do not require changes to applications and do not hurt the performance of trained models. As such, we advocate that the real challenge of distributed training is for the network community to develop high-performance network transport to fully utilize the network capacity and achieve linear scale-out.

“…However, performing gradient compression to reduce the communicated data size is not free. Some recent works (Xu et al., 2020; Sapio et al., 2019; Li et al., 2018b; Gupta et al., 2020) noticed that gradient compression harms the scalability of distributed training in some cases and suggested that these compression techniques are only beneficial for training over slow networks (Lim et al., 2018).…”

Section: Introduction (mentioning)

confidence: 99%

MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Wang, Wu 2021

Preprint

Large-scale distributed training is increasingly becoming communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability. However, it has been observed that in some cases gradient compression may even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler to optimize the scalability of communication-efficient distributed training. It automatically schedules the compression operations to optimize the performance of compression algorithms without the knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83× without losing accuracy. It can even achieve a scaling factor of distributed training up to 99% over high-speed networks.
