A Reconfigurable Compute-in-the-Network FPGA Assistant for High-Level Collective Support with Distributed Matrix Multiply Case Study

Haghi, Pouya; Guo, Anqi; Geng, Tong; Broaddus, Justin T.; Schafer, Derek; Skjellum, Anthony; Herbordt, Martin C.

doi:10.1109/icfpt51103.2020.00030

Cited by 6 publications

(2 citation statements)

References 18 publications

“…MPI offers various primitives; among them collectives are integral part of MPI and they are frequently invoked in a spectrum of HPC applications [2]. Offloading MPI collectives to network devices (NICs and switches) is gaining much interest as an effective mechanism to improve the application performance [3]-[10]. More specifically, in-network processing unlocks higher application performance by reducing inter- and intra-node communication and bypassing MPI software layers.…”

Section: Introduction Message Passing Interface (MPI); citation type: mentioning

confidence: 99%

“…More specifically, in-network processing unlocks higher application performance by reducing inter- and intra-node communication and bypassing MPI software layers. As new classes of devices including programmable NICs/switches [11], [12], Data Processing Units (DPUs) [13], and accelerators (FPGAs, GPUs) [14]-[16] are emerging in the datacenters [17], [18], we posit that there is an unrevealed opportunity to further improve the performance by extending in-network collective processing to a new class of complex collectives.…”

Section: Introduction Message Passing Interface (MPI); citation type: mentioning

confidence: 99%


Reconfigurable switches for high performance and flexible MPI collectives

Haghi, Guo, Xiong, et al. 2021

Concurrency and Computation


There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well-studied, offloading their processing into the switching fabric, despite numerous advantages, has been much more limited. A major problem with fixed logic implementations is that either only a fraction of the possible collective communication is accelerated or that logic is wasted in the applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the level of desired performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives and over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined vectorized aggregation logic units. Essential to this work is providing support for sub-communicator collectives that enables communicators of arbitrary shape, and that is scalable to large systems. A streaming interface improves the performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.
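To make the idea of in-switch aggregation concrete, the following is a minimal software sketch of a sum-allreduce performed along a switch tree: each switch combines the vectors arriving from its children and forwards a single vector upward, and the final result is sent back to every rank. It illustrates the general technique only and is not the MPI-FPGA design; the names `in_switch_allreduce`, `elementwise_sum`, the tree layout, and the rank data are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Union

Vector = List[float]
Node = Union[str, int]  # assumption: switches are named by str, ranks by int


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Combine equal-length vectors element by element (the reduction op)."""
    return [sum(values) for values in zip(*vectors)]


def in_switch_allreduce(
    tree: Dict[str, List[Node]],
    root: str,
    rank_data: Dict[int, Vector],
) -> Dict[int, Vector]:
    """Simulate a sum-allreduce where every switch aggregates its children.

    `tree` maps each switch name to its children, which are either other
    switch names or integer ranks (the leaves holding the input vectors).
    """

    def reduce_up(node: Node) -> Vector:
        if isinstance(node, int):
            # Leaf: a rank contributes its own vector.
            return rank_data[node]
        # Switch: aggregate whatever arrives from the children,
        # so only one vector travels further up the tree.
        return elementwise_sum([reduce_up(child) for child in tree[node]])

    total = reduce_up(root)
    # Downward phase: every rank receives a copy of the reduced vector.
    return {rank: list(total) for rank in rank_data}


if __name__ == "__main__":
    # Hypothetical two-level topology: one root switch, two leaf switches.
    tree = {"root": ["s0", "s1"], "s0": [0, 1], "s1": [2, 3]}
    rank_data = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}
    print(in_switch_allreduce(tree, "root", rank_data))
```

In this sketch the root switch receives two vectors instead of four, which is the traffic reduction that motivates placing aggregation logic in the switching fabric.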


Section: Introduction Message Passing Interface (MPI); citation type: mentioning

confidence: 99%