2020
DOI: 10.1109/TC.2020.2972528

BLADE: An in-Cache Computing Architecture for Edge Devices

Abstract: Area- and power-constrained edge devices are increasingly utilized to perform compute-intensive workloads, necessitating increasingly area- and power-efficient accelerators. In this context, in-SRAM computing performs hundreds of parallel operations on spatially local data common in many emerging workloads, while reducing the power consumption due to data movement. However, in-SRAM computing faces many challenges, including integration into the existing architecture, arithmetic operation support, data corruption at …

Cited by 61 publications (36 citation statements)
References 50 publications
“…PrIM is open-source and publicly available at [168]. Unlike these prior works, DAMOV is applicable to and can be used to study PIM architectures other than processing-in/near-DRAM, including processing-in/near-cache [68, 93-95, 169-171], processing-in/near-storage [40, 172-181], and processing-in/near emerging NVMs [81, 82, 90, 91, 100, 182, 183]. This is possible since DAMOV's methodology and benchmarks are mainly concerned with broadly characterizing data movement bottlenecks in an application, independent of the underlying PIM architecture.…”
Section: Discussion
confidence: 99%
“…These signals can be further combined via a NOR gate to achieve an XOR operation. Finally, further processing allows complex operations such as addition and multiplication to be performed [55, 56]. The operation results are then written back to the cache.…”
Section: BLADE - In-Cache Computing
confidence: 99%
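To make the bitline logic in this excerpt concrete, below is a minimal behavioural sketch in C. It assumes the two-wordline activation scheme common to in-SRAM computing designs: sensing two rows simultaneously yields AND on the bitline and NOR on the complementary bitline, and NOR-combining those two signals produces XOR. All function names are illustrative and not taken from the BLADE paper.

#include <stdint.h>
#include <stdio.h>

/* One 32-bit word models 32 bitline columns operating in parallel. */
static uint32_t bitline_and(uint32_t a, uint32_t b) { return a & b; }    /* sensed on BL  */
static uint32_t bitline_nor(uint32_t a, uint32_t b) { return ~(a | b); } /* sensed on BLB */

/* NOR of the two sensed signals: NOR(A AND B, A NOR B) = A XOR B,
 * since (A AND B) | (A NOR B) is 1 exactly where the bits agree. */
static uint32_t bitline_xor(uint32_t a, uint32_t b) {
    return ~(bitline_and(a, b) | bitline_nor(a, b));
}

/* Addition built on the same primitives: XOR gives the partial sum,
 * AND gives the carry; iterating until the carry dies out mirrors the
 * multi-cycle operation of an in-cache adder. */
static uint32_t bitline_add(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t carry = bitline_and(a, b) << 1;
        a = bitline_xor(a, b);
        b = carry;
    }
    return a;
}

int main(void) {
    uint32_t a = 0xCAFE, b = 0x1234;
    printf("xor=%08X add=%08X\n", bitline_xor(a, b), bitline_add(a, b));
    return 0;
}

Each loop iteration of bitline_add corresponds roughly to one round of row activations and write-backs in the hardware; real designs pipeline or parallelize this to reach the hundreds of concurrent operations mentioned in the abstract.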
“…Many approaches to Logic-in-Memory can be found in the literature; however, two main approaches can be distinguished. The first can be classified as Near-Memory Computing (NMC) [2-18], since the memory inner array is not modified and logic circuits are added at the periphery of the array; the second can be denoted as Logic-in-Memory (LiM) [19-28], since the memory cell itself is modified by adding logic circuits to it.…”
Section: Introduction
confidence: 99%
“…In an NMC architecture, logic and arithmetic circuits are added at the periphery of the memory array, in some cases exploiting 3D structures; the distance between computational and memory circuits is therefore shortened, resulting in power savings and latency reduction for the data exchange between them. For instance: in [3], logic and arithmetic circuits are added at the bottom of an SRAM (Static Random Access Memory) array, where data are transferred from different memory blocks, processed, and then written back to the array; in [2], a DRAM (Dynamic Random Access Memory) is modified to perform bitwise logic operations on the bitlines, and the sense amplifiers are configured as programmable logic gates. Near-Memory Computing maximises memory density with minimal modifications to the memory array itself, which is the most critical part of memory design; this results in a limited performance improvement with respect to computing systems based on conventional memories.…”
Section: Introduction
confidence: 99%
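The row-wise flow this excerpt describes (read rows from the array, compute at the periphery, write the result back) can be sketched behaviourally as below. This is a hypothetical model, not an interface from any of the cited designs; ROW_WORDS, nmc_row_op, and the three-operation set are assumptions chosen for illustration.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ROW_WORDS 16 /* hypothetical row width: 16 x 32-bit words */

typedef enum { OP_AND, OP_OR, OP_XOR } nmc_op_t;

/* Peripheral compute unit: elaborates whole rows next to the array,
 * so only row addresses and an opcode, not the data itself, cross
 * the interconnect between the host and the memory. */
static void nmc_row_op(uint32_t *array, size_t dst, size_t src1,
                       size_t src2, nmc_op_t op) {
    uint32_t *d = array + dst * ROW_WORDS;
    const uint32_t *a = array + src1 * ROW_WORDS;
    const uint32_t *b = array + src2 * ROW_WORDS;
    for (size_t i = 0; i < ROW_WORDS; i++) {
        switch (op) {
        case OP_AND: d[i] = a[i] & b[i]; break;
        case OP_OR:  d[i] = a[i] | b[i]; break;
        case OP_XOR: d[i] = a[i] ^ b[i]; break;
        }
    }
}

int main(void) {
    /* Rows 0 and 1 hold operands; row 2 receives the result. */
    uint32_t mem[3 * ROW_WORDS] = {0};
    mem[0]         = 0xF0F0F0F0; /* row 0, word 0 */
    mem[ROW_WORDS] = 0x0FF00FF0; /* row 1, word 0 */
    nmc_row_op(mem, 2, 0, 1, OP_XOR);
    printf("row2[0] = %08X\n", mem[2 * ROW_WORDS]);
    return 0;
}

The key point the model captures is the data-movement saving: the host issues one row-granularity command instead of streaming ROW_WORDS words in each direction, which is the power and latency benefit the excerpt attributes to NMC.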