Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques 2018
DOI: 10.1145/3243176.3243188
In-DRAM near-data approximate acceleration for GPUs

Cited by 23 publications (11 citation statements) · References 69 publications
“…Representative work on general-purpose near-data computing architectures includes: TOP-PIM [15] from AMD Research, TOM [16] from Carnegie Mellon University, DRAMA [28] and NDA [7] from the University of Wisconsin-Madison, PEI [13] from Seoul National University, AMC (active memory cube) [29] and a near-data computing system built on multi-core CPUs [30] from IBM Research, HRL [14] from Stanford University, concurrent data structures designed for near-data computing [31] from Brown University, AxRAM [32] from the Georgia Institute of Technology, and proPRAM [33] from the Chinese Academy of Sciences, detailed below. Figure 5: an example of an NDC system [15]. Figure 6: the architecture of TOM [16]. …limits, and thoroughly analyze the performance and energy characteristics of a large number of applications.…”
Section: General-purpose near-data computing architectures (unclassified)
“…In an NMP system with 3D memory cubes, the processing capability is placed in the base logic die under a stack of DRAM layers to utilize the ample internal bandwidth [5]. Later research also proposes near-bank processing, with logic placed near the memory banks in the same DRAM layer to exploit even higher bandwidth [20,21], such as the recently announced FIMDRAM [22] from Samsung. Recent proposals [23,24,25,26,27] have also explored augmenting traditional DIMMs with computation in the buffer die to provide low-cost but bandwidth-limited NMP solutions.…”
Section: Near-memory Processing (mentioning)
confidence: 99%
“…For workloads suffering from either limited DRAM bandwidth or long DRAM access latency on GPUs, near-bank computing is a promising architecture for alleviating these performance bottlenecks because of its abundant bank-level memory bandwidth and reduced memory access latency. However, prior near-bank computing accelerators [3], [23], [67], [76] are domain-customized: they have simple data paths, application-specific mapping strategies, and inefficient support for general-purpose programming languages. This lack of programmability confines these accelerators to a niche application market, adding non-recurring engineering costs in manufacturing.…”
Section: Motivation (mentioning)
confidence: 99%
“…This solution provides only a mediocre bandwidth improvement because intra-stack memory accesses are still bounded by the limited number of through-silicon vias (TSVs) between the memory dies and the base logic die. To overcome this TSV bandwidth bottleneck, recent near-bank accelerators [3], [23], [67], [76] move simple arithmetic units closer to the DRAM banks to harvest the abundant bank-internal bandwidth (around 10× that of the process-on-logic-die solution [23]).…”
Section: Introduction (mentioning)
confidence: 99%
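The roughly 10× bank-internal bandwidth advantage cited in the statement above can be illustrated with a back-of-envelope calculation. The stack and bank parameters below are hypothetical assumptions chosen only to make the arithmetic concrete; they are not figures from the paper:

```python
# Back-of-envelope model of the bandwidth hierarchy discussed above.
# All parameters are illustrative assumptions, not values from the paper.

def aggregate_bandwidth(units: int, per_unit_gbps: float) -> float:
    """Total bandwidth (GB/s) when `units` channels each deliver `per_unit_gbps`."""
    return units * per_unit_gbps

# Assumed 3D stack: base-logic-die accelerators are fed through a limited
# number of TSV channels.
tsv_bw = aggregate_bandwidth(units=32, per_unit_gbps=10.0)    # 320 GB/s

# Assumed near-bank design: many banks are tapped directly, so bandwidth
# scales with the bank count instead of the TSV count.
bank_bw = aggregate_bandwidth(units=256, per_unit_gbps=12.5)  # 3200 GB/s

ratio = bank_bw / tsv_bw
print(f"TSV-limited: {tsv_bw:.0f} GB/s, bank-internal: {bank_bw:.0f} GB/s, "
      f"ratio: {ratio:.1f}x")
```

Under these assumed parameters the near-bank organization reaches 10× the TSV-limited bandwidth; the key point is that bank-internal bandwidth scales with the number of banks, while the base-logic-die path is capped by the TSV channel count.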