2017 IEEE High Performance Extreme Computing Conference (HPEC), 2017
DOI: 10.1109/hpec.2017.8091048

Design and implementation of parallel PageRank on multicore platforms

Cited by 18 publications (7 citation statements)
References 11 publications

Citation statements (Order By: Relevance):
“…Leveraging sparse linear algebra for graph processing is the focus of the GraphBLAS project, which aims at defining operations on graphs through the language of linear algebra [11], and it offers early implementations for both CPU and GPU [4,22]. Highly tuned implementations of PPR exploit the graph data layout to maximize cache usage [25], or employ multi-machine setups to process trillions of edges [26]. Green-Marl [8] and GraphIt [24] implement PPR using Domain-Specific Languages (DSLs) that abstract the intricacies of graph processing and are optimized to fully exploit the CPU hardware.…”
Section: CPU and GPU Implementations (mentioning)
confidence: 99%
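
The linear-algebra formulation this statement refers to can be made concrete with a short sketch. The C++ code below is not taken from the cited paper or from any GraphBLAS release; it is a minimal, single-threaded illustration of one PageRank power-iteration step written as a sparse matrix-vector product over a CSR structure, with the struct name CSR and the damping factor alpha = 0.85 chosen for the example.

#include <cstddef>
#include <vector>

// Hypothetical CSR container for the (column-stochastic) transition structure.
struct CSR {
    std::size_t n;                        // number of vertices
    std::vector<std::size_t> row_ptr;     // size n + 1
    std::vector<std::size_t> col_idx;     // in-neighbours of each vertex
    std::vector<double> val;              // 1.0 / out_degree(in-neighbour)
};

// One power-iteration step: next = alpha * A * x + (1 - alpha) / n.
std::vector<double> pagerank_step(const CSR& A, const std::vector<double>& x,
                                  double alpha = 0.85) {
    std::vector<double> next(A.n, (1.0 - alpha) / A.n);
    for (std::size_t v = 0; v < A.n; ++v) {           // plain SpMV, row by row
        double sum = 0.0;
        for (std::size_t e = A.row_ptr[v]; e < A.row_ptr[v + 1]; ++e)
            sum += A.val[e] * x[A.col_idx[e]];
        next[v] += alpha * sum;
    }
    return next;
}

A GraphBLAS formulation would express the same update through its matrix-vector product primitive; the point of the sketch is only that one iteration reduces to an SpMV plus a uniform teleport term.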
“…1 for our analysis. Our reasons are twofold: 1) the two key types of memory accesses, which we define below, are also found in other optimized algorithms, and 2) several of the optimized algorithms, such as [26], [4], have been designed to reduce the number of random memory accesses, which makes it harder to stress and evaluate the memory system with this type of algorithm.…”
Section: A. Experiments Setup (mentioning)
confidence: 99%
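
The distinction between access types drawn in this statement can be illustrated with a pull-style PageRank loop. The sketch below is not the cited algorithm; assuming a plain CSR layout, it only marks which arrays are read sequentially (streamed) and which are accessed at graph-dependent, effectively random locations.

#include <cstddef>
#include <vector>

// Sketch: one pull-style PageRank sweep over a CSR graph, annotated with the
// two broad classes of memory accesses such analyses typically distinguish.
void pull_iteration(std::size_t n,
                    const std::vector<std::size_t>& row_ptr,   // size n + 1
                    const std::vector<std::size_t>& col_idx,   // in-neighbours
                    const std::vector<double>& weight,         // 1 / out_degree
                    const std::vector<double>& rank,
                    std::vector<double>& next, double alpha) {
    for (std::size_t v = 0; v < n; ++v) {
        double sum = 0.0;
        // Sequential (streaming) accesses: row_ptr, col_idx and weight are
        // scanned in order, so hardware prefetchers handle them well.
        for (std::size_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            // Random accesses: rank[col_idx[e]] jumps across the rank array
            // following the graph structure; these cause most cache misses.
            sum += weight[e] * rank[col_idx[e]];
        }
        next[v] = (1.0 - alpha) / n + alpha * sum;
    }
}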
“…Algorithmic optimizations have been developed to improve the spatial locality of graph analytics kernels by reducing the number of cache misses [4], [26], [25], [6], but these approaches are typically application-dependent.…”
Section: Related Work (mentioning)
confidence: 99%
“…Binning can be used in conjunction with both the Vertex-centric and Edge-centric paradigms. Zhou et al. [43,44] use a custom sorted edge list with Edge-centric processing to reduce DRAM row activations and improve memory performance. However, their sorting mechanism introduces a non-trivial pre-processing cost and imposes the use of the COO format.…”
Section: Related Work (mentioning)
confidence: 99%
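
To make the trade-off described here concrete, the following sketch shows a generic edge-centric PageRank pass over a COO edge list that is first sorted by destination vertex. It is not Zhou et al.'s scheme, which uses a custom sorted layout aimed at DRAM row activations; it only illustrates why sorting groups the writes to the output vector and why the sort itself is the pre-processing cost the statement mentions. The Edge struct and the contrib array (each vertex's rank divided by its out-degree) are assumptions made for this example.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Edge { std::size_t src, dst; };    // COO edge representation

void edge_centric_pass(std::vector<Edge>& edges,
                       const std::vector<double>& contrib,  // rank / out_degree
                       std::vector<double>& next,
                       double alpha, std::size_t n) {
    // Pre-processing cost: sort by destination so updates to `next` are grouped.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.dst < b.dst; });
    std::fill(next.begin(), next.end(), (1.0 - alpha) / n);
    for (const Edge& e : edges)                     // the edge list is streamed
        next[e.dst] += alpha * contrib[e.src];      // reads of contrib stay random
}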