2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00010
Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling

Abstract: Graph processing is increasingly bottlenecked by main memory accesses. On-chip caches are of little help because the irregular structure of graphs causes seemingly random memory references. However, most real-world graphs offer significant potential locality; it is just hard to predict ahead of time. In practice, graphs have well-connected regions where relatively few vertices share edges with many common neighbors. If these vertices were processed together, graph processing would enjoy significant data reuse. …
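
The idea can be illustrated with a minimal software sketch. The code below schedules vertices of a CSR graph in a bounded depth-first order so that vertices in the same well-connected region are processed close together in time, letting their shared neighbors stay cache-resident. It is an illustrative assumption, not the paper's hardware mechanism or exact algorithm: the CSRGraph layout, the boundedDFSOrder name, and the maxDepth bound are hypothetical choices made only for this example.

// Illustrative sketch of locality-aware traversal scheduling in software.
// Assumptions: CSR adjacency layout, a simple depth bound to keep the
// traversal inside a local region. Not the paper's hardware design.
#include <cstdint>
#include <utility>
#include <vector>
#include <stack>

struct CSRGraph {
    std::vector<uint64_t> rowPtr;   // offsets into colIdx, size = numVertices + 1
    std::vector<uint32_t> colIdx;   // concatenated neighbor lists
    uint32_t numVertices() const { return static_cast<uint32_t>(rowPtr.size() - 1); }
};

// Produce a processing order that explores each unvisited vertex's
// neighborhood depth-first up to maxDepth before moving on to the next
// vertex in sequence, grouping well-connected vertices together.
std::vector<uint32_t> boundedDFSOrder(const CSRGraph& g, uint32_t maxDepth) {
    std::vector<bool> visited(g.numVertices(), false);
    std::vector<uint32_t> order;
    order.reserve(g.numVertices());

    for (uint32_t root = 0; root < g.numVertices(); ++root) {
        if (visited[root]) continue;
        std::stack<std::pair<uint32_t, uint32_t>> stk;  // (vertex, depth)
        stk.push({root, 0});
        while (!stk.empty()) {
            auto [v, depth] = stk.top();
            stk.pop();
            if (visited[v]) continue;
            visited[v] = true;
            order.push_back(v);
            if (depth == maxDepth) continue;  // bound exploration to stay local
            for (uint64_t e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e) {
                uint32_t nbr = g.colIdx[e];
                if (!visited[nbr]) stk.push({nbr, depth + 1});
            }
        }
    }
    return order;
}

An order produced this way could drive an ordinary vertex-centric kernel in place of the default sequential order; a plain vertex-ordered loop touches the same neighbor data but spreads the accesses far apart in time, which is why it sees little cache reuse.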

Cited by 108 publications (45 citation statements). References 49 publications.
“…Specifically, difference in speed-ups for DBG and Gorder is very small for datasets kr, tw, wl and mp. These datasets have relatively small clustering coefficient compared to other datasets [37], which makes it difficult for Gorder to approximate suitable vertex ordering. On other datasets, Gorder provides significantly higher speed-ups than any skew-aware techniques.…”
Section: A. Performance Excluding Reordering Time (mentioning, confidence: 99%)
“…Though locality is often present in these workloads [8], standard techniques to reduce data movement struggle. Irregular prefetchers [44,47,96] can hide data access latency, but they do not reduce overall data movement [62]. Moreover, irregular workloads are poorly suited to common accelerator designs [18,65].…”
Section: Data Movement Is a Growing Problem (mentioning, confidence: 99%)
“…Beyond irregular computations, we believe that Memory Services can accelerate a wide range of tasks, such as background systems (e.g., garbage collection [60], data dedup [86]), cache optimization (e.g., sophisticated cache organizations [77,80,81], specialized prefetchers [6,98,99]), as well as other functionality that is prohibitively expensive in software today (e.g., work scheduling [62], fine-grain memoization [28,102]). We leave these to future work.…”
Section: Introduction (mentioning, confidence: 99%)
“…Traversal scheduling: Mukkara et al proposed HATS [6], a hardware accelerator implementing locality-aware scheduling to exploit cache locality for graphs exhibiting community structure. While effective, it requires intrusive hardware changes, including a specialized hardware unit with each core and an ISA change on the host core.…”
Section: Related Work (mentioning, confidence: 99%)