2019
DOI: 10.1109/mm.2019.2908101

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Abstract: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques for in-situ arithmetic in SRAM arrays, efficient data mapping, and reduced data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experim…
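
To make the paper's bit-serial, bit-sliced style of in-SRAM computation concrete, here is a minimal Python sketch. It is not the paper's hardware or any real API: the names bit_slice, bit_serial_add, and from_slices are illustrative, the model is purely software, and the actual design also covers multiplication, convolution mapping, and in-cache reduction, all omitted here. It only shows the core idea that each bit position is stored as a row and each element occupies its own column, so one pass over the bit positions operates on every element in parallel.

import numpy as np

def bit_slice(values, bits):
    # Transpose integers into a (bits, lanes) array of 0/1 rows, least-significant bit first.
    values = np.asarray(values, dtype=np.uint32)
    return np.stack([(values >> b) & 1 for b in range(bits)])

def bit_serial_add(a_slices, b_slices):
    # Add two bit-sliced operands; one step per bit position, all lanes in parallel.
    bits, lanes = a_slices.shape
    carry = np.zeros(lanes, dtype=np.uint32)
    out = np.zeros((bits + 1, lanes), dtype=np.uint32)
    for b in range(bits):                      # one "cycle" per bit position
        s = a_slices[b] + b_slices[b] + carry  # a full adder in every lane
        out[b] = s & 1
        carry = s >> 1
    out[bits] = carry                          # final carry-out
    return out

def from_slices(slices):
    # Collapse bit slices back into ordinary integers, for checking only.
    return sum(int(1 << b) * row for b, row in enumerate(slices))

a, b = [3, 7, 12, 15], [1, 2, 5, 15]
print(from_slices(bit_serial_add(bit_slice(a, 4), bit_slice(b, 4))))  # [ 4  9 17 30]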

Cited by 15 publications (8 citation statements). References 31 publications.
“…to 4 OoO cores. The area overhead of the L1 in-cache computing unit is 0.5% of the core area using the estimates by Eckert et al (2018) and Wikichip (2016), which is not significant. Hence, it will not be considered for the rest of the paper.…”
Section: Architectural Exploration
confidence: 78%
“…In-cache computing allows massive Single Instruction Multiple Data (SIMD)-like operations to be performed in the cache hierarchy as proposed by Jeloka et al (2016). In our work, we use an in-cache computing architecture similar to BLADE proposed by , targeted for the L1 cache of ARM-based many-core systems, as opposed to the Last Level Cache (LLC), as in NeuralCache proposed by Eckert et al (2018). Regarding HBM proposed by Lee et al (2014), emerging memory architectures have been explored, but mainly for GPUs, as discussed in Chatterjee et al (2017).…”
Section: Related Work
confidence: 99%
“…A fair comparison to [3] is, however, difficult as it considers a complete system; PPAC would need to be integrated into a system for a fair comparison. We note, however, that if the method in [3] is used to compute MVPs, an element-wise multiplication between two vectors whose entries are L-bit requires L² + 5L − 2 clock cycles [4], which is a total of 34 clock cycles for 4-bit numbers. Then, the reduction (via sum) of an N-dimensional vector with L bits per entry requires O(L · log₂(N)) clock cycles, which is at least 64 clock cycles for a 256-dimensional vector with 8-bit entries (as the product of two 4-bit numbers results in 8-bit).…”
Section: B. Comparison With Existing Accelerators
confidence: 99%
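
The cycle counts quoted in this excerpt are easy to reproduce. The short check below takes the L² + 5L − 2 multiplication cost and the L · log₂(N) reduction bound at face value from the excerpt itself rather than from an independent reading of [4], so treat it as a sanity check of the arithmetic only.

from math import log2

def mult_cycles(L):
    # Bit-serial element-wise multiply of two L-bit operands: L^2 + 5L - 2 cycles.
    return L * L + 5 * L - 2

def reduce_cycles(L, N):
    # Lower bound quoted for summing an N-entry vector of L-bit values: L * log2(N).
    return int(L * log2(N))

print(mult_cycles(4))         # 34 cycles for the 4-bit element-wise multiply
print(reduce_cycles(8, 256))  # 64 cycles to reduce 256 products of 8 bits each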
“…Hence, an inner product between two 4-bit vectors with 256 entries requires at least 98 clock cycles, whereas PPAC requires only 16 clock cycles for the same operation. This significant difference in the number of clock cycles is caused by the fact that the design in [4] is geared towards data-centric applications in which element-wise operations are performed between high-dimensional vectors to increase parallelism. PPAC aims at accelerating a wide range of MVP-like operations, which is why we included dedicated hardware (such as the row pop-count) to speed up element-wise vector multiplication and vector sum-reduction.…”
Section: B. Comparison With Existing Accelerators
confidence: 99%
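
The 98-cycle figure in this excerpt is simply the sum of the two costs quoted in the previous excerpt; as a quick tally under the same assumed formulas:

multiply_cycles = 4 * 4 + 5 * 4 - 2        # 34 cycles for the 4-bit element-wise multiply
reduction_cycles = 8 * 8                   # 64 cycles: 8-bit products, log2(256) = 8
print(multiply_cycles + reduction_cycles)  # 98 cycles in total, versus the 16 quoted for PPAC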
“…Near data processing is applicable not only in traditional memory, such as SRAM [4, 5] and DRAM [2, 6-10], but also in emerging memory, such as PCM [11], STT-MRAM [12], and ReRAM [13]. There are also various attempts to reduce the data movement overhead by computation offloading to storage devices [3, 14-16].…”
Section: Introduction
confidence: 99%