2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/ccgrid.2019.00037

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks

Abstract: Graph Convolutional Networks (GCNs) have recently been attracting much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs perform convolutional operations along graph structures, and GPUs are used to process the enormous number of resulting operations, including sparse-dense matrix multiplication (SpMM), when the graph structure is expressed as an adjacency matrix in a sparse matrix format. However, the SpMM operation on a small graph, where the number of nodes is tens or…
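The batched SpMM the abstract refers to can be pictured as follows. This is only a minimal CPU sketch, assuming each small graph's adjacency matrix is stored in COO format next to a dense feature matrix; the names (CooGraph, batched_spmm) are illustrative rather than the paper's, and the actual implementation targets GPU kernels, not this sequential loop.

```cpp
#include <cstddef>
#include <vector>

// One small graph: adjacency matrix in COO format plus a dense feature matrix.
// (Hypothetical layout; the paper's actual data structures may differ.)
struct CooGraph {
    int n_nodes;                 // number of nodes (rows/columns of the adjacency)
    int n_feats;                 // number of feature columns
    std::vector<int> row;        // row index of each nonzero
    std::vector<int> col;        // column index of each nonzero
    std::vector<float> val;      // value of each nonzero
    std::vector<float> features; // dense n_nodes x n_feats matrix, row-major
};

// For every graph g in the batch, compute C_g = A_g * X_g.
// Grouping all of the small multiplications into one call is what amortizes the
// per-multiplication overhead (kernel launches on a GPU) when each graph has
// only tens of nodes.
std::vector<std::vector<float>> batched_spmm(const std::vector<CooGraph>& batch) {
    std::vector<std::vector<float>> out;
    out.reserve(batch.size());
    for (const CooGraph& g : batch) {
        std::vector<float> C(static_cast<std::size_t>(g.n_nodes) * g.n_feats, 0.0f);
        for (std::size_t k = 0; k < g.val.size(); ++k) {   // loop over nonzeros of A_g
            const int r = g.row[k], c = g.col[k];
            const float v = g.val[k];
            for (int j = 0; j < g.n_feats; ++j)            // scale one row of X_g
                C[static_cast<std::size_t>(r) * g.n_feats + j] +=
                    v * g.features[static_cast<std::size_t>(c) * g.n_feats + j];
        }
        out.push_back(std::move(C));
    }
    return out;
}
```

Batching matters because each of these products is tiny, so on a GPU the per-multiplication launch overhead would otherwise dominate; mapping the outer loop over graphs onto a single batched kernel is the core idea.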

Cited by 6 publications (3 citation statements) · References 20 publications
“…First, the partitioning of large graphs is performed via the Kernighan-Lin algorithm to make partitions denser and to minimize the transfers between partitions, which harm performance. Second, the scheduling of partitions onto the GPU is optimized by batching together small sparse partitions that can be computed together [172], and by profiling transfer and computation times in the first GNN layer so that the different chunks can later be pipelined perfectly. Third, NeuGraph also eliminates redundant computation by fusing multiple edges together.…”
Section: GCN, GIN, SGC (mentioning)
confidence: 99%
“…For example, Alg. 1 [34] shows an SpMM example with an input sparse matrix A in COO format and the other matrix B in dense format. The algorithm iterates over the nonzero elements (nnz) of matrix A and multiplies each one with the corresponding elements of matrix B.…”
Section: B. ACF Performance Analysis (mentioning)
confidence: 99%
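The cited Alg. 1 is not reproduced here; the routine below is a minimal sketch of the access pattern that the statement describes, assuming A is given as three COO arrays (row, col, val) and B, C are dense row-major matrices with p columns. Names are illustrative; on a GPU the loop over nonzeros would be parallelized (with atomic updates to C) rather than run sequentially.

```cpp
#include <cstddef>
#include <vector>

// C = A * B, with A in COO format (row[k], col[k], val[k] for k < nnz),
// B a dense n x p matrix and C a dense n_rows x p matrix, both row-major.
void spmm_coo(const std::vector<int>& row, const std::vector<int>& col,
              const std::vector<float>& val,
              const std::vector<float>& B,  // dense input, p columns
              std::vector<float>& C,        // dense output, n_rows x p
              int n_rows, int p) {
    C.assign(static_cast<std::size_t>(n_rows) * p, 0.0f);
    const std::size_t nnz = val.size();     // number of nonzeros in A
    for (std::size_t k = 0; k < nnz; ++k) { // iterate over the nonzeros of A
        const int r = row[k];
        const int c = col[k];
        const float v = val[k];
        for (int j = 0; j < p; ++j)         // one nonzero scales one row of B
            C[static_cast<std::size_t>(r) * p + j] +=
                v * B[static_cast<std::size_t>(c) * p + j];
    }
}
```

This per-matrix loop is the building block that the batched sketch after the abstract repeats over a whole batch of small graphs.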
“…Apart from optimizations of the convolution algorithm itself, other works focus on optimizing neural network processing with the hardware architecture in mind [25], [26]. Some works examine memory layout and optimize register/memory efficiency and the tiling strategies of GEMM-based convolution [19], [27], [28], as well as pooling and softmax layers [29], [30], [31]. In [32], the authors also use kernel fusion to eliminate data transfers to and from off-chip memory, but their fusion is across layers, which differs from our technique, which applies kernel fusion within a single layer.…”
Section: Related Work (mentioning)
confidence: 99%
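To make the within-layer fusion in the last sentence concrete: a simplified CPU analogue (nothing below is the cited authors' code; the real technique fuses GPU kernels) is to apply a layer's activation to each output row of the sparse-dense product while that row is still held locally, instead of writing an intermediate matrix and re-reading it in a second pass. The CSR-based sketch below assumes row_ptr/col/val arrays and a ReLU activation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fused SpMM + ReLU for one layer. Because CSR finishes a whole output row
// before moving on, the activation can be applied to that row immediately
// ("fused"), so the unactivated intermediate matrix is never materialized.
void spmm_relu_fused(const std::vector<int>& row_ptr, const std::vector<int>& col,
                     const std::vector<float>& val, const std::vector<float>& X,
                     std::vector<float>& Y, int p) {
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    Y.assign(static_cast<std::size_t>(n_rows) * p, 0.0f);
    std::vector<float> acc(p);
    for (int r = 0; r < n_rows; ++r) {
        std::fill(acc.begin(), acc.end(), 0.0f);
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)   // aggregation (SpMM)
            for (int j = 0; j < p; ++j)
                acc[j] += val[k] * X[static_cast<std::size_t>(col[k]) * p + j];
        for (int j = 0; j < p; ++j)                         // activation, fused here
            Y[static_cast<std::size_t>(r) * p + j] = std::max(acc[j], 0.0f);
    }
}
```

The unfused alternative would store acc into an intermediate matrix and launch a second element-wise pass for the activation, costing an extra round trip through memory.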