Today's server architectures are designed to serve the needs of a wide range of applications. For example, superscalar processors include complex control logic for out-of-order execution to extract instruction-level parallelism (ILP) from arbitrary programs. However, not all workloads use the features of a superscalar processor effectively. A workload with a regular execution pattern (e.g., a dense linear algebra kernel) may not need expensive ILP control logic to extract parallelism; it can instead run on a throughput-oriented architecture with thousands of simple cores, such as a GPU, with much better performance and power efficiency. On the other hand, only a limited class of data-parallel applications can exploit the high throughput such architectures provide. In practice, then, existing CPU and GPU platforms may not be the most efficient choices for the compute patterns of many applications.

For big data workloads, access to data is typically at least as important a bottleneck as computation. The memory subsystems of today's CPU architectures are optimized for workloads with reasonable data access locality. CPU cache hierarchies include caches of different sizes, which capture different degrees of access locality across applications. However, if an application exhibits little or no locality, data accesses become inefficient on these architectures.

As an example, consider graph applications that run on very large, unstructured datasets. Typically, the data of a vertex is computed or updated from the data of its neighbors. In an unstructured graph, the neighbors of a vertex are stored at memory locations that may be far apart, so traversing the neighbors of a vertex may involve one random memory access per neighbor. If the graph is large enough that it does not fit into the last-level cache (LLC), each access to a neighbor's data may require a random DRAM access, which typically costs hundreds of clock cycles. However, existing CPU architectures are not optimized for frequent random DRAM accesses. For example, each Intel Haswell Xeon core has 10 line fill buffers (LFBs), so each core can sustain at most 10 outstanding L1 cache misses at a time. Yet, by Little's law, an off-chip DRAM latency of hundreds of cycles requires hundreds of outstanding memory requests to utilize the full DRAM bandwidth available in the system [1]. It was reported that 10 or more Xeon cores were needed for various graph applications to fully utilize the available DRAM bandwidth [2]. Furthermore, because graph applications have low compute-to-memory-access ratios, these cores frequently stall while waiting for data from off-chip memory, so 10 or more superscalar cores consume high power while doing little useful work. It has been shown that custom architectures targeting such communication patterns have the potential to improve power efficiency by a factor of 50 or more compared to general-purpose CPUs.
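To make the traversal pattern above concrete, the following C sketch performs one neighbor-accumulation sweep over a graph stored in compressed sparse row (CSR) form. The types and names (csr_graph, sweep) are illustrative, not from the original text; the point is the inner read of data[col_idx[e]], which lands on an essentially arbitrary vertex and therefore, on a large unstructured graph, turns almost every neighbor access into a random DRAM access.

```c
#include <stddef.h>

/* Minimal CSR (compressed sparse row) graph; all names are illustrative. */
typedef struct {
    size_t  num_vertices;
    size_t *row_ptr;  /* offsets into col_idx, length num_vertices + 1 */
    size_t *col_idx;  /* neighbor vertex ids, length num_edges         */
    double *data;     /* one value per vertex                          */
} csr_graph;

/* One PageRank-style sweep: update each vertex from its neighbors.
 * The row_ptr/col_idx reads stream sequentially, but data[col_idx[e]]
 * jumps to an arbitrary vertex; on a graph larger than the LLC this
 * is effectively one random DRAM access per neighbor.                 */
void sweep(const csr_graph *g, double *out) {
    for (size_t v = 0; v < g->num_vertices; v++) {
        double acc = 0.0;
        for (size_t e = g->row_ptr[v]; e < g->row_ptr[v + 1]; e++) {
            acc += g->data[g->col_idx[e]];  /* random access per neighbor */
        }
        out[v] = acc;
    }
}
```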
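The bandwidth argument above follows from Little's law: the number of requests that must be in flight equals bandwidth times latency divided by request size. The sketch below works through the arithmetic with assumed figures (80 GB/s of DRAM bandwidth, 120 ns of access latency, 64-byte cache lines); the real values depend on the platform, but any realistic choice yields far more outstanding requests than the 10 LFBs a single core provides.

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures for illustration only; real values vary by platform. */
    double bandwidth_gbs = 80.0;   /* peak DRAM bandwidth in GB/s            */
    double latency_ns    = 120.0;  /* off-chip DRAM access latency in ns     */
    double line_bytes    = 64.0;   /* cache-line (memory request) size       */
    double lfbs_per_core = 10.0;   /* line fill buffers per Haswell core     */

    /* Little's law: concurrency = throughput x latency.
     * GB/s x ns = bytes, so dividing by the line size gives requests.       */
    double in_flight = bandwidth_gbs * latency_ns / line_bytes;

    printf("outstanding 64-byte requests needed: %.0f\n", in_flight);
    printf("cores needed at %.0f LFBs each:      %.0f\n",
           lfbs_per_core, in_flight / lfbs_per_core);
    return 0;
}
```

With these assumptions, roughly 150 requests must be outstanding at once, i.e., about 15 cores at 10 LFBs each, which is consistent with the report in [2] that 10 or more cores were needed to saturate DRAM bandwidth.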