Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory

Pearce, Roger; Gokhale, Maya; Amato, Nancy M.

doi:10.1109/sc.2010.34

Cited by 123 publications

(82 citation statements)

References 25 publications

“…Other distributed memory implementations include the threaded 1D approach using active messages of Edmonds et al [25], and the partitioned global address space (PGAS) implementation of Cong et al [26]. Pierce et al [27] investigate BFS implementations, among other graph algorithms, on semi-external memory.…”

Section: Related Work (mentioning)

confidence: 99%

Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search

Beamer

Buluç

Asanović

et al. 2013

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum

Abstract: Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
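To make the top-down versus bottom-up distinction in this abstract concrete, here is a minimal single-threaded Python sketch of a direction-optimizing BFS. It is not the authors' distributed 2D implementation: the function name, the `alpha` threshold, and the edge-count switching rule are illustrative assumptions, and the graph is assumed to be undirected (symmetric adjacency lists).

```python
def direction_optimizing_bfs(adj, source, alpha=14):
    """Return {vertex: BFS depth} for an undirected graph.

    adj   : dict mapping each vertex to a list of its neighbours (symmetric).
    alpha : hypothetical tuning knob; larger values switch to bottom-up sooner.
    """
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        unvisited = [v for v in adj if v not in level]
        # Edges a top-down step would scan vs. edges a bottom-up step may scan.
        frontier_edges = sum(len(adj[v]) for v in frontier)
        remaining_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = set()
        if frontier_edges * alpha > remaining_edges:
            # Bottom-up: each unvisited vertex looks for any parent in the
            # frontier and stops at the first one it finds.
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        level[v] = depth
                        next_frontier.add(v)
                        break
        else:
            # Top-down: each frontier vertex scans all of its neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v not in level:
                        level[v] = depth
                        next_frontier.add(v)
        frontier = next_frontier
    return level
```

Setting `alpha=0` forces every step to run top-down, which gives a simple way to check that both directions produce the same depths on a given graph.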

“…In both scenarios, we compare the performance of standard Linux mmap of a file with DI-MMAP. For the two experiments, we use the following input sets: first a synthetic metagenome derived from a human gut sample (HC1) and second, three real-world collections of metagenomic samples.…”

Section: Metagenomic Classification (mentioning)

confidence: 99%

“…In this work, we target a data-intensive node architecture with direct I/O-bus-attached Non-Volatile RAM, such as attached Flash arrays today, and STT-RAM, PCM, or memristor in the future. These persistent memory technologies provide new opportunities for extending the memory hierarchy by supporting highly concurrent read and write operations that can be exploited by throughput driven (latency tolerant) algorithms such as parallel graph traversal [1].…”

Section: Introduction (mentioning)

confidence: 99%

DI-MMAP: A High Performance Memory-Map Runtime for Data-Intensive Applications

Essen

Hsieh

Ames

et al. 2012

2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Self Cite

Abstract: We present DI-MMAP, a high-performance runtime that memory-maps large external data sets into an application's address space and shows significantly better performance than the Linux mmap system call. Our implementation is particularly effective when used with high performance locally attached Flash arrays on highly concurrent, latency-tolerant data-intensive HPC applications. We describe the kernel module and show performance results on a benchmark test suite and on a new bioinformatics metagenomic classification application. For the complex metagenomics classification application, DI-MMAP performs up to 4.88× better than standard Linux mmap.
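For readers unfamiliar with the baseline this abstract compares against, the sketch below shows what memory-mapping a file into an application's address space looks like using Python's standard `mmap` module, which wraps the operating system's mmap facility. It illustrates the standard mechanism only, not DI-MMAP; the function name `read_records` and the fixed-width record layout are assumptions made for the example.

```python
import mmap


def read_records(path, offsets, width):
    """Return the `width`-byte records found at the given byte offsets.

    The file is mapped read-only; pages are faulted in on demand as the
    slices below touch them, so no explicit read() calls are needed.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file (an empty file raises ValueError).
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [mm[o:o + width] for o in offsets]
```

Random-offset access of this kind is the pattern where the page-cache behaviour of the underlying mapping matters most, which is the aspect the abstract says DI-MMAP targets.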

“…Ajwani and Meyer [2,3] discuss the state-of-the-art algorithms for BFS and related graph traversal problems, and present performance results on large-scale graphs from several families. Recent work by Pierce et al [29] investigates implementations of semi-external BFS, shortest paths, and connected components.…”

Section: Parallel BFS: Prior Work (mentioning)

confidence: 99%

Parallel Breadth-First Search on Distributed Memory Systems

Buluç

Madduri

2011

Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
