This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard 6T 16 KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to 5.6× (9.7× estimated for a multi-bank scenario) energy savings with negligible (≤1%) accuracy degradation in all four applications compared to the conventional architecture.
Emerging inference applications require processing of huge data volumes [1]. A conventional inference architecture (Fig. 1) implements memory access, data transfer from memory to processor, data aggregation, and slicing. In such architectures, memory access energy dominates: an 8-b SRAM read access consumes 5 pJ while an 8-b MAC consumes 1 pJ in 65 nm CMOS. Additionally, the memory-processor interface presents a severe throughput bottleneck. The deep in-memory signal processing concept was proposed in [2] to overcome these challenges by embedding mixed-signal processing in the periphery of the SRAM bit-cell array (BCA). However, an IC implementation needs to address a host of new challenges, including meeting the stringent row and column pitch-matching requirements imposed by the BCA without altering its storage density or its read/write functionality, and enabling multiple functions with mixed-signal circuitry. Recently [3], a single-function, 5×1-b in-memory classifier IC was demonstrated.

The proposed deep in-memory inference architecture has four stages (Fig. 1): 1) multi-row functional read (MR-FR), 2) bit-line (BL) processing (BLP), 3) cross-BL processing (CBLP), and 4) ADC and slicing. The MR-FR accesses multiple rows in one pre-charge cycle using pulse-width-modulated word-line (PWM-WL) signals to generate a BL voltage drop proportional to a weighted sum of multiple bits stored in multiple rows of the column, and also performs word-level add/subtract. The BLP implements reconfigurable, column pitch-matched mixed-signal circuits to execute computations such as multiply, absolute value, and comparison on the BL voltages in a massively column-parallel fashion. The CBLP aggregates the BLP outputs into a scalar, which is sliced to obtain the final decision. The BLP and CBLP can be reconfigured to operate the architecture in either a dot-product (DP) mode or a Manhattan-distance (MD) mode. Reconfigurable stages enable multiple functions (Fig. 1 table), including normal read/write.

The chip architecture (Fig. 2) includes a digital controller (CTRL) and a CORE. Chip summary:

  Technology:            65 nm CMOS
  Die size:              1.2 mm × 1.2 mm
  CTRL operating freq.:  1 GHz
  SRAM capacity:         16 KB (1 bank of 512 × 256-b)
  Bitcell dimension:     2.11 × 0.92 µm²
  Supply voltage:        CORE: 1.0 V, CTRL: 0.85 V

  Application          Energy per decision (pJ)   Decision throughput (decisions/s)
  SVM                  963.1                      1.7M
  Matched filter       481.5                      3.4M
  KNN                  33.6K                      54.3K
  Template matching    33.6K                      54.3K
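To make the four-stage pipeline concrete, the following is a minimal numerical sketch (not the paper's mixed-signal implementation) of how MR-FR, BLP, CBLP, and slicing compose in the DP and MD modes. The function names, the per-LSB bit-line voltage drop `lsb_drop`, and the noiseless arithmetic are illustrative assumptions; the actual stages operate on analog BL voltages with low-SNR behavior.

```python
import numpy as np

def multi_row_functional_read(bits, v_pre=1.0, lsb_drop=0.002):
    """MR-FR (behavioral): PWM word-line pulses discharge the bit-line by an
    amount proportional to the binary-weighted sum of the bits stored in one
    column (MSB first). Returns the resulting analog BL voltage.
    `lsb_drop` (volts per LSB) is an assumed, not measured, value."""
    weights = 2.0 ** np.arange(len(bits))[::-1]      # 2^(B-1), ..., 2^0
    return v_pre - lsb_drop * np.dot(weights, bits)

def in_memory_decision(stored_bits, query, mode="DP", threshold=0.0):
    """Behavioral model of the four-stage pipeline:
    MR-FR -> BLP (per-column multiply or |difference|)
          -> CBLP (aggregation into a scalar) -> slicer."""
    v_pre, lsb_drop = 1.0, 0.002
    # Stage 1 (MR-FR): one functional read per column, column-parallel.
    v_bl = np.array([multi_row_functional_read(col, v_pre, lsb_drop)
                     for col in stored_bits])
    words = (v_pre - v_bl) / lsb_drop                # weighted sum per column
    # Stage 2 (BLP): reconfigured per mode.
    if mode == "DP":                                 # dot-product mode
        blp_out = words * query
    else:                                            # Manhattan-distance mode
        blp_out = np.abs(words - query)
    # Stage 3 (CBLP): aggregate all column outputs into one scalar.
    scalar = blp_out.sum()
    # Stage 4: slice the scalar to obtain the binary decision.
    return int(scalar > threshold), scalar

# Example: 256 columns, each storing a 4-b word, queried in both modes.
rng = np.random.default_rng(0)
stored = rng.integers(0, 2, size=(256, 4))
query = rng.uniform(-1.0, 1.0, size=256)
print(in_memory_decision(stored, query, mode="DP"))
print(in_memory_decision(stored, query, mode="MD"))
```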
In this paper, an energy-efficient, memory-intensive, high-throughput VLSI architecture is proposed for convolutional networks (C-Net) by employing compute memory (CM) [1], where computation is deeply embedded into the memory (SRAM). Behavioral models incorporating CM's circuit non-idealities, along with energy models in 45 nm SOI CMOS, are presented. System-level simulations using these models demonstrate that a probability of correct handwritten digit recognition Pr > 0.99 can be achieved on the MNIST database [2], along with a 24.5× reduction in energy-delay product, a 5.0× reduction in energy, and a 4.9× increase in throughput compared to the conventional system.
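As a rough illustration of the behavioral-modeling approach, the sketch below lumps CM's analog non-idealities into additive Gaussian noise on the dot-product output and estimates, via Monte Carlo, how the noise level affects a binary decision. The noise standard deviation `sigma`, the random linear classifier, and the sign-based decision are assumptions made for illustration; they are not the paper's calibrated 45 nm SOI models or its MNIST pipeline.

```python
import numpy as np

def noisy_cm_dot(w, x, sigma, rng):
    """Compute-memory dot product with circuit non-idealities lumped into
    additive Gaussian noise of standard deviation `sigma` (output units)."""
    return np.dot(w, x) + rng.normal(0.0, sigma)

def decision_error_rate(sigma, n_trials=10_000, dim=256, seed=0):
    """Monte Carlo estimate of how often the noise flips the sign decision
    of a random linear classifier; a stand-in for an accuracy-vs-noise sweep."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_trials):
        w = rng.normal(size=dim) / np.sqrt(dim)   # unit-scale weights
        x = rng.normal(size=dim)                  # random input pattern
        ideal = np.dot(w, x)                      # noiseless reference
        noisy = noisy_cm_dot(w, x, sigma, rng)
        errors += (np.sign(noisy) != np.sign(ideal))
    return errors / n_trials

for s in (0.01, 0.1, 0.5):
    print(f"sigma={s}: decision error rate ~ {decision_error_rate(s):.4f}")
```

In this toy setting the error rate grows with `sigma`, which mirrors the qualitative trade-off such behavioral models are built to quantify: how much analog noise the end-to-end recognition accuracy can tolerate before dropping below a target such as Pr > 0.99.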