2010 IEEE 8th Symposium on Application Specific Processors (SASP)
DOI: 10.1109/sasp.2010.5521144
FPGA and GPU implementation of large scale SpMV

Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental operation for many applications. Many studies have implemented SpMV on different platforms, but few have focused on very large scale datasets with millions of dimensions. This paper addresses the challenges of implementing large scale SpMV on FPGA and GPU in the application of web link graph analysis. In the FPGA implementation, we designed the task partition and memory hierarchy according to the analysis of the dataset scale and t…
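As background for the abstract above, here is a minimal sketch of the SpMV operation y = A·x over the CSR (compressed sparse row) layout, the format most commonly assumed in this literature. All names are illustrative, not taken from the paper.

```python
# Minimal SpMV sketch over CSR: row_ptr gives each row's offset into the
# flat col_idx/vals arrays; y[i] accumulates the row-i dot product.
def spmv_csr(row_ptr, col_idx, vals, x):
    """Multiply a CSR-encoded sparse matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Row i's nonzeros occupy indices row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals    = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The irregular, x-dependent memory accesses in the inner loop are what make large scale SpMV hard to accelerate on both FPGAs and GPUs.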

Cited by 28 publications (11 citation statements)
References 11 publications (13 reference statements)
“…If this is not the case, zero padding is typically used to adapt the row size. Other approaches are based on statically assigning partial dot-products to multiple processing engines [23], [24]. A control unit is used to manage the communication and ensure proper execution.…”
Section: B. SpMV on FPGA (mentioning)
confidence: 99%
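The zero-padding approach referenced in the statement above can be sketched as follows: rows are padded to a uniform nonzero count (as in the ELL format) so that fixed-size processing engines receive equally sized work items. The function and value names are illustrative assumptions.

```python
# Hedged sketch of zero padding for SpMV hardware pipelines: every row is
# padded to the length of the longest row, so each processing engine can
# consume a fixed number of (column, value) pairs per row.
def pad_rows(rows, pad_col=0, pad_val=0.0):
    """rows: list of [(col, val), ...] per row. Pads each row to max length."""
    width = max(len(r) for r in rows)
    return [r + [(pad_col, pad_val)] * (width - len(r)) for r in rows]

rows = [[(0, 1.0), (2, 2.0)],   # row 0: two nonzeros
        [(1, 3.0)]]             # row 1: one nonzero, gets one pad entry
padded = pad_rows(rows)
# Padded entries contribute pad_val * x[pad_col] = 0 to each dot product,
# so results are unchanged while the work per row becomes uniform.
```

The trade-off, as the surveyed designs note, is wasted bandwidth and compute on the padding when row lengths are highly skewed.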
“…A substantial body of literature has explored the optimization of sparse formats and algorithms for CPUs [6,7,1] and GPGPUs [8,9,10,11,12]. In general, these optimizations aim to minimize the irregularity of the matrix structure by selecting a format best suited for the matrix kernel.…”
Section: A. Conventional Sparse Data Formats (mentioning)
confidence: 99%
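The format-selection step described above usually starts from raw (row, col, val) triplets and converts them into a structured layout. A hedged sketch of one such conversion, COO to CSR, follows; the function name and argument order are assumptions for illustration.

```python
# Illustrative COO-to-CSR conversion: count nonzeros per row, then turn the
# counts into row offsets with a prefix sum, as many of the surveyed
# CPU/GPU/FPGA designs do before running their SpMV kernels.
def coo_to_csr(n_rows, triplets):
    """triplets: iterable of (row, col, val) tuples."""
    row_ptr = [0] * (n_rows + 1)
    col_idx, vals = [], []
    for r, c, v in sorted(triplets):   # group entries by row
        row_ptr[r + 1] += 1            # count nonzeros in row r
        col_idx.append(c)
        vals.append(v)
    for i in range(n_rows):            # prefix-sum counts into offsets
        row_ptr[i + 1] += row_ptr[i]
    return row_ptr, col_idx, vals

# Same 2x3 matrix as before, given as unordered triplets:
print(coo_to_csr(2, [(0, 0, 1.0), (1, 1, 3.0), (0, 2, 2.0)]))
# ([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```

Which target format wins (CSR, ELL, COO, hybrids) depends on the nonzero distribution of the matrix, which is exactly the irregularity these works try to minimize.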
“…Depending on the implementation, the meta-data for CSR is either pre-loaded into the bitstream or dynamically accessed from external memory. While earlier designs were restricted to on-die memory capacities (e.g., [18]), more recent designs incorporate memory hierarchies that can handle large data sets exceeding the available on-chip memories [24,25,26,11,10,27,9,28,29,30,14,23].…”
Section: Related Work (mentioning)
confidence: 99%
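One common way such memory hierarchies cope with vectors larger than on-chip memory is tiling: the dense vector x is split into chunks small enough to hold on-chip, and each pass only consumes nonzeros whose columns fall in the resident chunk. The sketch below is a software analogue of that idea; the chunk size and all names are illustrative assumptions, not the cited designs' actual parameters.

```python
# Software analogue of vector tiling for large scale SpMV: process the CSR
# matrix in column-range passes so that only a CHUNK-sized slice of x needs
# to be "on chip" at a time. CHUNK = 2 is deliberately tiny for illustration.
CHUNK = 2

def spmv_tiled(row_ptr, col_idx, vals, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for lo in range(0, len(x), CHUNK):          # "load" one tile of x
        hi = lo + CHUNK
        for i in range(n_rows):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if lo <= col_idx[k] < hi:       # only columns in this tile
                    y[i] += vals[k] * x[col_idx[k]]
    return y
```

Real FPGA designs avoid the repeated sweep over all nonzeros by pre-sorting or partitioning the matrix per tile, but the accumulation pattern is the same.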
“…Similar activities exist on CPUs in a more generic setting [11]. Despite this, a substantial amount of work on sparse linear algebra is focused on saturating the available memory bandwidth by increasing the internal parallelism of an SpMV kernel, in either a general-purpose [4], [3], [9] or an application-specific [1], [2], [6] setting.…”
Section: Introduction (mentioning)
confidence: 99%