Recently, Graph Convolutional Networks (GCNs) have become state-of-the-art algorithms for analyzing non-Euclidean graph data. However, it is challenging to realize efficient GCN training, especially on large graphs. The reasons are manifold: 1) GCN training incurs a substantial memory footprint; full-batch training on large graphs can require hundreds to thousands of gigabytes of memory to buffer the intermediate data for back-propagation. 2) GCN training involves both memory-intensive data reduction and computation-intensive feature/gradient update operations. Such a heterogeneous nature challenges current CPU/GPU platforms. 3) The irregularity of graphs and the complex training dataflow jointly increase the difficulty of improving a GCN training system's efficiency. This paper presents GCNear, a hybrid architecture to tackle these challenges. Specifically, GCNear adopts a DIMM-based memory system to provide easy-to-scale memory capacity. To match the heterogeneous nature of the workload, we categorize GCN training operations as memory-intensive Reduce and computation-intensive Update operations. We then offload Reduce operations to on-DIMM near-memory engines (NMEs), making full use of the high aggregate local bandwidth. We adopt a centralized acceleration engine (CAE) with sufficient computation capacity to process Update operations. We further propose several optimization strategies to deal with the irregularity of GCN tasks and improve GCNear's performance. Comprehensive evaluations on twelve GCN training tasks demonstrate that GCNear achieves 24.8× / 2.2× (geometric mean) speedup and 61.9× / 6.4× (geometric mean) higher energy efficiency compared to a Xeon E5-2698-v4 CPU and an NVIDIA V100 GPU, respectively. To handle deep GCN models and the ever-increasing scale of graphs, we also propose a Multi-GCNear system. Compared to the state-of-the-art Roc and DistGNN systems, Multi-GCNear achieves up to 2.1× and 3.1× higher training speed, respectively.
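The Reduce/Update categorization above can be illustrated with a minimal sketch of a single GCN layer's forward pass. This is a generic formulation under common assumptions, not GCNear's actual implementation; the names `adj`, `h`, `w`, and `gcn_layer_forward` are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def gcn_layer_forward(adj: sp.csr_matrix, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One GCN layer: a memory-intensive Reduce followed by a compute-intensive Update."""
    # Reduce: gather and sum neighbor features via the sparse adjacency matrix.
    # This phase is dominated by irregular memory accesses; per the abstract,
    # GCNear offloads it to the on-DIMM near-memory engines (NMEs).
    reduced = adj @ h
    # Update: dense feature transformation (a regular, compute-bound GEMM).
    # Per the abstract, GCNear maps this phase to the centralized acceleration engine (CAE).
    return np.maximum(reduced @ w, 0.0)  # ReLU activation

# Toy usage: a 4-node graph with self-loops, 8-dim input features, 16-dim output features.
adj = sp.csr_matrix(np.array([[1, 1, 0, 0],
                              [1, 1, 1, 0],
                              [0, 1, 1, 1],
                              [0, 0, 1, 1]], dtype=np.float32))
h = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 16).astype(np.float32)
out = gcn_layer_forward(adj, h, w)  # shape (4, 16)
```

The contrast between the two statements in the sketch (sparse, bandwidth-bound aggregation versus dense, compute-bound multiplication) is the heterogeneity the abstract refers to when motivating the hybrid NME/CAE design.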