Parallel graph algorithms have become one of the principal applications of high-performance computing, alongside numerical simulations and machine learning workloads. However, due to their highly unstructured nature, graph algorithms remain extremely challenging for most parallel systems, with large gaps between observed performance and theoretical limits. Furthermore, most mainstream architectures rely heavily on single instruction multiple data (SIMD) processing to achieve high floating-point rates, which offers little benefit for graph processing, which instead requires high memory bandwidth, low memory latency, and efficient processing of unstructured data. On the other hand, we are currently observing an explosion of new hardware architectures, many of which are adapted to specific purposes and diverge from traditional designs. A notable example is the Graphcore Intelligence Processing Unit (IPU), which was developed to meet the needs of upcoming machine intelligence applications. Its design eschews the traditional cache hierarchy, relying on SRAM as its main memory instead. The result is an extremely high-bandwidth, low-latency memory, at the cost of capacity. In addition, the IPU consists of a large number of independent cores, allowing for true multiple instruction multiple data (MIMD) processing. Together, these features suggest that such a processor is well suited for graph processing. We test the limits of graph processing on multiple IPUs by implementing a low-level, high-performance code for breadth-first search (BFS), following the specifications of Graph500, the most widely used benchmark for parallel graph processing. Despite the simplicity of the BFS algorithm, implementing efficient parallel codes for it has proven challenging in the past. We show that our implementation scales well on a system with 8 IPUs and attains roughly twice the performance of an equal number of NVIDIA V100 GPUs running state-of-the-art CUDA code.
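For reference, the kernel benchmarked by Graph500 is level-synchronous BFS producing a parent array for validation. The sketch below is a minimal sequential illustration of that frontier expansion on a CSR graph; it is not the paper's IPU implementation, and the graph layout and identifiers are assumptions chosen for clarity. Parallel codes distribute exactly this per-level frontier work across processors.

```cpp
// Illustrative level-synchronous BFS on a CSR graph (sequential reference).
// Not the IPU code described in the paper; shown only to fix terminology.
#include <cstdint>
#include <iostream>
#include <vector>

struct CsrGraph {
    std::vector<int64_t> row_ptr;  // size = num_vertices + 1
    std::vector<int64_t> col_idx;  // concatenated neighbor lists
};

// Returns the parent of each vertex in the BFS tree rooted at `source`
// (-1 marks unreached vertices), the output format Graph500 validates.
std::vector<int64_t> bfs(const CsrGraph& g, int64_t source) {
    const int64_t n = static_cast<int64_t>(g.row_ptr.size()) - 1;
    std::vector<int64_t> parent(n, -1);
    parent[source] = source;

    std::vector<int64_t> frontier{source};
    while (!frontier.empty()) {
        std::vector<int64_t> next;
        // One BFS level: expand every frontier vertex. Parallel versions
        // process the vertices of `frontier` concurrently.
        for (int64_t u : frontier) {
            for (int64_t e = g.row_ptr[u]; e < g.row_ptr[u + 1]; ++e) {
                int64_t v = g.col_idx[e];
                if (parent[v] == -1) {
                    parent[v] = u;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);
    }
    return parent;
}

int main() {
    // Tiny example: undirected edges 0-1, 1-2, 1-3 stored in CSR form.
    CsrGraph g;
    g.row_ptr = {0, 1, 4, 5, 6};
    g.col_idx = {1, 0, 2, 3, 1, 1};
    for (int64_t p : bfs(g, 0)) std::cout << p << ' ';
    std::cout << '\n';  // prints: 0 0 1 1
    return 0;
}
```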