SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Giannoula, Christina; Fernandez, Ivan J.; Gómez-Luna, Juan; Koziris, Nectarios; Goumas, Georgios; Mutlu, Onur

doi:10.48550/arxiv.2201.05072

Cited by 3 publications (6 citation statements)

References 123 publications (176 reference statements)

“…CSR instead lists the number of non-zero elements in each row and their column position. We adopt a COO scheme in our work, as has been shown in [9] to lead to greater efficiency in distributed computing. Moreover, it results in a simpler hardware implementation for controlling the execution of SpMV multiplication.…”

Section: Add Mult (mentioning)

confidence: 99%
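
To make the COO/CSR distinction in the statement above concrete, here is a minimal pure-Python sketch of SpMV over both layouts. It is an illustration only, not the cited authors' implementation: the function names and the 3x3 example matrix are hypothetical, and the CSR variant uses the conventional row-pointer array (a prefix sum of the per-row non-zero counts).

```python
def spmv_coo(rows, cols, vals, x, n_rows):
    """y = A @ x with A stored as COO triplets (row, col, value)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]  # each non-zero is processed independently
    return y


def coo_to_csr(rows, cols, vals, n_rows):
    """Convert COO triplets to CSR arrays (row_ptr, col_idx, data)."""
    order = sorted(range(len(vals)), key=lambda k: (rows[k], cols[k]))
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1           # per-row non-zero counts
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives row start offsets
    col_idx = [cols[k] for k in order]
    data = [vals[k] for k in order]
    return row_ptr, col_idx, data


def spmv_csr(row_ptr, col_idx, data, x):
    """y = A @ x with A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += data[k] * x[col_idx[k]]
    return y


# Hypothetical example matrix:
# [[2, 0, 1],
#  [0, 0, 3],
#  [4, 5, 0]]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 2, 0, 1]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y_coo = spmv_coo(rows, cols, vals, x, 3)
y_csr = spmv_csr(*coo_to_csr(rows, cols, vals, 3), x)
```

The COO loop needs no row bookkeeping, which is one plausible reading of the "simpler hardware implementation" remark, while CSR trades that simplicity for a more compact row index.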

“…Most works in SpMV multiplication in NMC consider high-performance computing solutions. The authors of [9] and [6] propose to integrate SpMV computing units into DRAM banks on a 3D integration using Through Silicon Vias (TSV).…”

Section: B Near-memory Computing (mentioning)

confidence: 99%

“…Of the various tiling strategies proposed in the literature [9], fixed-row tiling is particularly appealing in our scenario. Such a mapping partitions the sparse matrix into 2D rectangular tiles, which all have the same height and a variable width (depending on the matrix sparsity).…”

Section: Data Partitioning and Mapping (mentioning)

confidence: 99%
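
The fixed-row tiling described in the statement above can be sketched as follows. The statement only says that tiles share a height and vary in width with sparsity. The rule used here to choose the widths (close a tile once it would exceed a non-zero budget `max_nnz`) is an assumption made for illustration, and `fixed_row_tiles`, `tile_height`, and `max_nnz` are hypothetical names.

```python
def fixed_row_tiles(rows, cols, n_rows, n_cols, tile_height, max_nnz):
    """Split a COO matrix into tiles of equal height and variable width.

    Returns tuples (row_start, row_end, col_start, col_end, nnz) with
    half-open ranges. Assumes max_nnz >= tile_height, so a single column
    of a band can never exceed the budget on its own.
    """
    tiles = []
    for r0 in range(0, n_rows, tile_height):
        r1 = min(r0 + tile_height, n_rows)
        # Count the non-zeros of each column inside this band of rows.
        col_counts = [0] * n_cols
        for r, c in zip(rows, cols):
            if r0 <= r < r1:
                col_counts[c] += 1
        c0, nnz = 0, 0
        for c in range(n_cols):
            # Close the current tile before a column that would overflow it.
            if nnz > 0 and nnz + col_counts[c] > max_nnz:
                tiles.append((r0, r1, c0, c, nnz))
                c0, nnz = c, 0
            nnz += col_counts[c]
        tiles.append((r0, r1, c0, n_cols, nnz))  # last tile of the band
    return tiles


# Hypothetical 4x6 matrix with 8 non-zeros, given by coordinates only.
rows = [0, 0, 0, 1, 1, 2, 3, 3]
cols = [0, 1, 4, 1, 5, 2, 0, 5]
tiles = fixed_row_tiles(rows, cols, n_rows=4, n_cols=6, tile_height=2, max_nnz=3)
```

With these inputs the denser upper band is cut into two tiles of widths 4 and 2, while the sparser lower band fits in a single tile of width 6, all with the same height of 2 rows.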

“…Our work takes a similar approach but differentiates from related efforts by proposing a dedicated near-memory processing unit for SpMV, operating on a 16-bit floating-point data representation. Being specialized for SpMV multiplication, our design is much more area-efficient than solutions for general-purpose near-memory processors [9]. Moreover, the floating-point capability of our architecture allows for larger dynamic ranges in data representation than fixed-point alternatives [8], which is a key requirement for GNNs.…”

Section: Introduction (mentioning)

confidence: 99%
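
The dynamic-range claim in the statement above can be checked with a short calculation. This sketch compares IEEE 754 binary16 with a signed Q8.8 fixed-point format. Both choices are assumptions made for comparison: the statement does not say which 16-bit floating-point layout the design uses or which fixed-point format [8] uses.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_q8_8(x):
    """Round to a hypothetical signed Q8.8 fixed-point grid (step 2**-8)."""
    return round(x / Q8_8_STEP) * Q8_8_STEP


def dynamic_range_db(largest, smallest):
    """Ratio of largest to smallest positive magnitude, in decibels."""
    return 20.0 * math.log10(largest / smallest)


FP16_MAX = 65504.0             # largest finite binary16 value
FP16_MIN_NORMAL = 2.0 ** -14   # smallest positive normal binary16 value

Q8_8_STEP = 2.0 ** -8                 # assumed fixed-point resolution
Q8_8_MAX = (2 ** 15 - 1) * Q8_8_STEP  # largest positive Q8.8 value

fp16_db = dynamic_range_db(FP16_MAX, FP16_MIN_NORMAL)
q8_8_db = dynamic_range_db(Q8_8_MAX, Q8_8_STEP)

# A small weight survives in binary16 but rounds to zero in Q8.8.
small = 1e-3
small_half = to_half(small)
small_fixed = to_q8_8(small)
```

Under these assumptions the floating-point format spans roughly 180 dB against roughly 90 dB for the fixed-point one, and a value such as 1e-3 is preserved in binary16 but flushed to zero in Q8.8.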

A 16-bit Floating-Point Near-SRAM Architecture for Low-power Sparse Matrix-Vector Multiplication

Eggermann, Rios, Ansaloni et al. 2023

2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC)

State-of-the-art Artificial Intelligence (AI) algorithms, such as graph neural networks and recommendation systems, require floating-point computation of very large matrix multiplications over sparse data. Their execution in resource-constrained scenarios, like edge AI systems, requires a) careful optimization of computing patterns, leveraging sparsity as an opportunity to lower computational requirements, and b) using dedicated hardware. In this paper, we introduce a novel near-memory floating-point computing architecture dedicated to the parallel processing of sparse matrix-vector multiplication (SpMV). This architecture can be integrated at the periphery of memory arrays to exploit the inherent parallelism of memory structures to speed up computation. In addition, it uses its proximity to memory to achieve high computational capability and very low latency. The illustrated implementation, operating at 1 GHz, can compute up to 370 MFLOPS (millions of floating-point operations per second) on SpMV multiplications, while incurring a modest 17% area overhead when interfaced with a 4 KB SRAM array.

“…Sparse linear algebra: A growing number of hardware solutions are being designed for sparse linear algebra, like Sparse-TPU [25], SpArch [58], SparseP [19], etc. Some specifically target sparsity in deep learning algebra, e.g., SNAP [57], Sticker [56], [12].…”

Section: FPGA (mentioning)

confidence: 99%

DPU-v2: Energy-efficient execution of irregular directed acyclic graphs

Shah, Meert, Verhelst 2022

2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)

A growing number of applications like probabilistic machine learning, sparse linear algebra, robotic navigation, etc., exhibit irregular data flow computation that can be modeled with directed acyclic graphs (DAGs). The irregularity arises from the seemingly random connections of nodes, which makes the DAG structure unsuitable for vectorization on CPU or GPU. Moreover, the nodes usually represent a small number of arithmetic operations that cannot amortize the overhead of launching tasks/kernels for each node, further posing challenges for parallel execution. To enable energy-efficient execution, this work proposes DAG processing unit (DPU) version 2, a specialized processor architecture optimized for irregular DAGs with static connectivity. It consists of a tree-structured datapath for efficient data reuse, a customized banked register file, and interconnects tuned to support irregular register accesses. DPU-v2 is utilized effectively through a targeted compiler that systematically maps operations to the datapath, minimizes register bank conflicts, and avoids pipeline hazards. Finally, a design space exploration identifies the optimal architecture configuration that minimizes the energy-delay product. This hardware-software co-optimization approach results in a speedup of 1.4×, 3.5×, and 14× over a state-of-the-art DAG processor ASIP, a CPU, and a GPU, respectively, while also achieving a lower energy-delay product. In this way, this work takes an important step towards enabling an embedded execution of emerging DAG workloads.

Energy Efficiency Impact of Processing in Memory: A Comprehensive Review of Workloads on the UPMEM Architecture

Falevoz, Legriel 2024

Lecture Notes in Computer Science

Processing-in-Memory (PIM) architectures have emerged as a promising solution for data-intensive applications, providing significant speedup by processing data directly within the memory. However, the impact of PIM on energy efficiency is not well characterized. In this paper, we provide a comprehensive review of workloads ported to the first PIM product available on the market, namely the UPMEM architecture, and quantify the impact on each workload in terms of energy efficiency. Fewer than half of the reviewed papers provide insights on the impact of PIM on energy efficiency, and the evaluation methods differ from one paper to another. To provide a comprehensive overview, we propose a methodology for estimating energy consumption and efficiency for both the PIM and baseline systems at data center level, enabling a direct comparison of the two systems. Our results show that PIM can provide significant energy savings for data-intensive workloads. We also identify key factors that impact the energy efficiency of UPMEM PIM, including the workload characteristics. Overall, this paper provides valuable insights for researchers and practitioners looking to optimize energy efficiency in data-intensive applications using the UPMEM PIM architecture.
