Exploiting Offload Enabled Network Interfaces

Di Girolamo, Salvatore; Jolivet, Pierre; Underwood, Keith D.; Hoefler, Torsten

doi:10.1109/hoti.2015.21

Cited by 17 publications

(12 citation statements)

References 25 publications

“…Offloading the schedule execution to the network interface card (NIC) can provide different advantages such as asynchronous execution, lower latency, and streaming processing. Di Girolamo et al [16] show how solo collectives can be offloaded to Portals 4 [7] NICs by using triggered operations. This approach is limited by the amount of NIC resources that bounds the number of times a persistent schedule can be executed without application intervention.…”

Section: Discussion (mentioning)

confidence: 99%

Taming unbalanced training workloads in deep learning with partial collective operations

Ben-Nun, Girolamo, et al. 2020

Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Self Cite

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27× speedup over the state-of-the-art synchronous SGD, without losing accuracy.
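The solo and majority semantics described in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the function names, the scalar gradients, the explicit arrival times, and the choice to let late processes contribute zero are all assumptions made for illustration.

```python
from typing import Sequence


def partial_allreduce(
    gradients: Sequence[float],
    arrival_times: Sequence[float],
    min_contributors: int,
) -> float:
    """Sum the gradients of the processes that arrived by the trigger time.

    The collective fires as soon as `min_contributors` processes have arrived.
    Processes that arrive later contribute nothing (an assumption of this
    sketch; a real system has to decide what late processes contribute).
    """
    if len(gradients) != len(arrival_times):
        raise ValueError("one arrival time per gradient is required")
    if not 1 <= min_contributors <= len(gradients):
        raise ValueError("min_contributors must be in [1, number of processes]")
    # The trigger time is the arrival time of the k-th earliest process.
    trigger_time = sorted(arrival_times)[min_contributors - 1]
    return sum(g for g, t in zip(gradients, arrival_times) if t <= trigger_time)


def solo_allreduce(gradients: Sequence[float], arrival_times: Sequence[float]) -> float:
    """Fire as soon as a single process reaches the collective."""
    return partial_allreduce(gradients, arrival_times, 1)


def majority_allreduce(gradients: Sequence[float], arrival_times: Sequence[float]) -> float:
    """Fire once at least half of the processes have reached the collective."""
    half = (len(gradients) + 1) // 2  # ceil(P / 2)
    return partial_allreduce(gradients, arrival_times, half)
```

With gradients `[1.0, 2.0, 4.0, 8.0]` and arrival times `[3.0, 1.0, 2.0, 4.0]`, the solo variant fires at time 1.0 and returns 2.0, while the majority variant waits for two of the four processes and returns 6.0.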

“…A solo collective [16] is a wait-free operation, which forces the slow processes to execute the collective as soon as there is one process executing it. This process, called initiator, is in charge of informing the others to join the collective.…”

Section: Solo Collectives (mentioning)

confidence: 99%

Taming unbalanced training workloads in deep learning with partial collective operations

Ben-Nun, Girolamo, et al. 2020

Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Self Cite

“…However, this network interface emulates limited processing capabilities [60]. A general solution was provided by Voltaire [61] which included processing support in the router for collectives; this work differs from ours in that the offload is to an in-router CPU rather than a hardware augmentation of the switch.…”

Section: Related Work (mentioning)

confidence: 99%

Reconfigurable switches for high performance and flexible MPI collectives

Haghi, Guo, Xiong, et al. 2021

Concurrency and Computation

There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well-studied, offloading their processing into the switching fabric, despite numerous advantages, has been much more limited. A major problem with fixed logic implementations is that either only a fraction of the possible collective communication is accelerated or that logic is wasted in the applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the level of desired performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives and over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined vectorized aggregation logic units. Essential to this work is providing support for sub-communicator collectives that enables communicators of arbitrary shape, and that is scalable to large systems. A streaming interface improves the performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.

“…In this scenario, even a single delayed process affects the job's training time. In contrast to the synchronous mode, in MPI, there is a wait-free operation, which is called partial collective communication [16]. It forces the slow processes to execute the collective communication as soon as there is one process executing it.…”

Section: Randomized Partial Collectives (mentioning)

confidence: 99%