In-Memory Data Parallel Processor

Daichi Fujiki, Scott Mahlke, Reetuparna Das

doi:10.1145/3296957.3173171

Cited by 44 publications

(16 citation statements)

References 37 publications

“…• Fixed Point Dot Product (FiPDP; see footnote 13). FiPDP is a classical dot product $S = \sum_{i=1}^{N} A_i \times B_i$, where two vectors are multiplied element-wise, and the result vector is summed.…”

Section: Analysis Of Real-life Examples Using the Bitlet Model (mentioning)

confidence: 99%
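The dot product quoted above can be sketched in software as follows. This is only an illustration of a fixed-point dot product, not the Bitlet paper's PIM mapping; the helper names (`to_fixed`, `fixed_dot`, `to_float`) and the choice of 8 fractional bits are assumptions made for this example.

```python
# Minimal sketch of a fixed-point dot product S = sum_i A_i * B_i.
# Assumption: values are stored as signed integers with FRAC_BITS
# fractional bits (a Q-format chosen here purely for illustration).

FRAC_BITS = 8


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Quantize a real number to a fixed-point integer."""
    return round(x * (1 << frac_bits))


def to_float(x: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point integer back to a float."""
    return x / (1 << frac_bits)


def fixed_dot(a: list[int], b: list[int], frac_bits: int = FRAC_BITS) -> int:
    """Multiply element-wise, sum the products, then rescale once."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0  # each product carries 2 * frac_bits fractional bits
    for x, y in zip(a, b):
        acc += x * y
    return acc >> frac_bits  # back to frac_bits fractional bits (floors)


a = [to_fixed(v) for v in (1.5, 2.0, -0.25)]
b = [to_fixed(v) for v in (2.0, 0.5, 4.0)]
print(to_float(fixed_dot(a, b)))  # 1.5*2.0 + 2.0*0.5 - 0.25*4.0
```

Rescaling once after the accumulation, rather than after every multiply, keeps the intermediate sum at full precision; Python integers do not overflow, whereas a hardware accumulator would need enough bits to hold the widened products.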

“…Using a configuration of $XBs = 4096$ and $R = 1024$ increases the PIM Pure (and combined PIM+CPU) throughput to about 100 GOPS, which is higher than the CPU Pure throughput of 31 GOPS stated above. [Footnote 13: https://en.wikipedia.org/wiki/Dot_product]…”

Section: Analysis Of Real-life Examples Using the Bitlet Model (mentioning)

confidence: 99%

“…Processing vast amounts of data on traditional von Neumann architectures involves many data transfers between the central processing unit (CPU) and the memory. These transfers degrade performance and consume energy [10,13,30,32,35,36]. Enabled by emerging memory technologies, recent memristive processing-in-memory (PIM) solutions show great potential in reducing costly data transfers by performing computations using individual memory cells [8,24,27,33,43].…”

Section: Introduction (mentioning)

confidence: 99%

The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems

Ronen, Eliahu, Leitersdorf et al. 2021

Preprint

Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems: IMAGING and FloatPIM. Based on the demonstrations, insights about PIM and its combination with CPU are drawn. CCS Concepts: • Hardware → Emerging architectures; • Computing methodologies → Model development and analysis.

“…FlexFlow is another dataflow model dealing with parallel types mismatch between the computation and CNN workloads [31]. These works attempt to make the best advantages of computation parallelism, data reuse, and flexibility [5,16,34].…”

Section: Dataflow (mentioning)

confidence: 99%

Domino: A Tailored Network-on-Chip Architecture to Enable Highly Localized Inter- and Intra-Memory DNN Computing

Zhou, Yangshuo, Xiao et al. 2021

Preprint

The ever-increasing computation complexity of fast-growing Deep Neural Networks (DNNs) has requested new computing paradigms to overcome the memory wall in conventional Von Neumann computing architectures. The emerging Computing-In-Memory (CIM) architecture has been a promising candidate to accelerate neural network computing. However, the data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino to enable stream computing and local data access to significantly reduce the data movement energy. Meanwhile, Domino employs tailored distributed instruction scheduling within Network-on-Chip (NoC) to implement inter-memory-computing and attain mapping flexibility. The evaluation with prevailing CNN models shows that Domino achieves 1.15-to-9.49× power efficiency over several state-of-the-art CIM accelerators and improves the throughput by 1.57-to-12.96×.

“…Processing-in-Memory (PIM) is a promising paradigm for accelerating memory-bandwidth-bound workloads, which have low arithmetic intensity [34, 48-58]. The key idea of the PIM paradigm is to move computation close to (i.e., processing-near-memory) or even into the memory devices (i.e., processing-using-memory) where the data resides (i.e., caches [48, 59-65], DRAM [33, 34, 49-58], storage [109-117]), eliminating the need to move the data to the processor and resulting in higher performance and lower energy consumption. Stencil computations are a prime candidate for acceleration using the PIM paradigm.…”

Section: Introduction (mentioning)

confidence: 99%

Casper: Accelerating Stencil Computations Using Near-Cache Processing

Denzler, Oliveira, Hajinazar et al. 2023

IEEE Access

Stencil computations are commonly used in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three properties: (1) low arithmetic intensity, (2) limited temporal data reuse, and (3) regular and predictable data access pattern. As a result, stencil computations are typically bandwidth-bound workloads, which only experience limited benefits from the deep cache hierarchy of modern CPUs. In this work, we propose Casper, a near-cache accelerator consisting of specialized stencil computation units connected to the last-level cache (LLC) of a traditional CPU. Casper is based on two key ideas: (1) avoiding the cost of moving rarely reused data throughout the cache hierarchy, and (2) exploiting the regularity of the data accesses and the inherent parallelism of stencil computations to increase overall performance. With minimal changes in LLC address decoding logic and data placement, Casper performs stencil computations at the peak LLC bandwidth. We show that by tightly coupling lightweight stencil computation units near LLC, Casper improves performance of stencil kernels by 1.65× on average (up to 4.16×) compared to a commercial high-performance multi-core processor, while reducing system energy consumption by 35% on average (up to 65%). Casper provides 37× (up to 190×) improvement in performance-per-area compared to a state-of-the-art GPU.
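To make the three stencil properties in this abstract concrete, here is a minimal sketch of a 1D three-point stencil together with a rough arithmetic-intensity estimate. It is not one of Casper's kernels: the function names, the weights, and the traffic assumption (one 8-byte read and one 8-byte write per output point, with neighbor reads served from cache) are all illustrative assumptions.

```python
# Minimal sketch of a 1D 3-point stencil (illustrative, not Casper's kernel).
# Each interior output is a weighted sum of a point and its two neighbors.

def stencil_1d(u: list[float],
               w: tuple[float, float, float] = (0.25, 0.5, 0.25)) -> list[float]:
    """Apply one sweep of a 3-point stencil; boundary values are copied."""
    out = list(u)
    for i in range(1, len(u) - 1):
        # Regular, predictable access pattern: indices i-1, i, i+1.
        out[i] = w[0] * u[i - 1] + w[1] * u[i] + w[2] * u[i + 1]
    return out


def arithmetic_intensity(flops_per_point: int, bytes_per_point: int) -> float:
    """Floating-point operations per byte of memory traffic."""
    return flops_per_point / bytes_per_point


print(stencil_1d([0.0, 0.0, 4.0, 0.0, 0.0]))

# 3 multiplies + 2 adds per point; assumed traffic of 16 bytes per point
# (one double read, one double written).
print(arithmetic_intensity(5, 16))
```

Under these assumptions the kernel performs well under one floating-point operation per byte moved, which is the sense in which such loops tend to be limited by memory bandwidth rather than by compute.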
