A case for exploiting subarray-level parallelism (SALP) in DRAM

Kim, Yoongu; Seshadri, Vivek; Lee, Donghyuk; Liu, Jamie; Mutlu, Onur

doi:10.1145/2366231.2337202

Cited by 148 publications

(224 citation statements)

References 42 publications

Citation types: Supporting, Mentioning (223), Contrasting

“…We use PinPoints [58] to obtain the representative phases of each application. Our simulation executes at least 200 million instructions on each core [9,16,35,38]. Performance Metric.…”

Section: Evaluation Methodology (mentioning)

confidence: 99%

“…Several types of commodity DRAM (Micron's RLDRAM [52] and Fujitsu's FCRAM [62]) provide low latency at the cost of high area overhead [35,38]. Many prior works (e.g., [8,9,17,35,38,45,56,65,66,70,84]) propose various architectural changes within DRAM chips to reduce latency. In contrast, FLY-DRAM does not require any changes to a DRAM chip.…”

Section: Related Work (mentioning)

confidence: 99%

Understanding Latency Variation in Modern DRAM Chips

Chang, Kashyap, Hassan, et al. 2016

Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science

Self Cite

117

Long DRAM latency is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make several new observations about latency variation within DRAM. We find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations.
Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system. We conclude that the experimental characterization and analysis of latency variation within modern DRAM, provided by this work, can lead to new techniques that improve DRAM and system performance.
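The key idea in the abstract above (use reduced latencies for regions without slow cells, standard latencies elsewhere) can be illustrated with a small lookup sketch. This is only an illustration under stated assumptions, not the mechanism as implemented in the paper: the names `Timing`, `RegionTimingTable`, and `ROWS_PER_REGION`, the region granularity, and all latency values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Latencies (ns) for the three operations named in the abstract."""
    activation: float
    precharge: float
    restoration: float


# Hypothetical values for illustration only; not taken from the paper.
STANDARD = Timing(activation=13.0, precharge=13.0, restoration=35.0)
REDUCED = Timing(activation=10.0, precharge=10.0, restoration=30.0)

# Assumed granularity at which slow cells are recorded after profiling.
ROWS_PER_REGION = 512


class RegionTimingTable:
    """Maps a row address to the timing set a controller could apply."""

    def __init__(self, slow_regions):
        # Regions found (by profiling) to contain slower cells.
        self.slow_regions = frozenset(slow_regions)

    def timing_for_row(self, row):
        region = row // ROWS_PER_REGION
        # Slow regions keep standard timings; all others use reduced ones.
        return STANDARD if region in self.slow_regions else REDUCED


# Example: suppose profiling marked regions 2 and 3 as slow.
table = RegionTimingTable(slow_regions={2, 3})
fast_row_timing = table.timing_for_row(100)    # region 0
slow_row_timing = table.timing_for_row(1024)   # region 2
```

Because slower cells cluster spatially, a coarse per-region table like this one stays small; a per-cell table would not be practical.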

“…These techniques can be applied to the custom DRAM technology in SILO to increase vault capacities without compromising the access latency. Other techniques target reducing DRAM latency by overlapping accesses to different subarrays [60] and improving row-buffer locality by exploiting access patterns [61]-[64]. While these techniques allow overlapping access latencies of different requests, they do not reduce the actual access latency.…”

Section: DRAM Latency Optimization (mentioning)

confidence: 99%

Farewell My Shared LLC! A Case for Private Die-Stacked DRAM Caches for Servers

Shahab, Zhu, Margaritov, et al. 2018

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

The slowdown in technology scaling mandates rethinking of conventional CPU architectures in a quest for higher performance and new capabilities. This work takes a step in this direction by questioning the value of on-chip shared last-level caches (LLCs) in server processors and argues for a better alternative. Shared LLCs have a number of limitations, including on-chip area constraints that limit storage capacity, long planar interconnect spans that increase access latency, and contention for the shared cache capacity that hurts performance under workload colocation. To overcome these limitations, we propose a Die-Stacked Private LLC Organization (SILO), which combines conventional on-chip private L1 (and optionally, L2) caches with a per-core private LLC in die-stacked DRAM. By stacking LLC slices directly above each core, SILO avoids long planar wire spans. The use of private caches inherently avoids inter-core cache contention. Last but not least, engineering the DRAM for latency affords low access delays while still providing over 100MB of capacity per core in today's technology. Evaluation results show that SILO outperforms state-of-the-art conventional cache architectures on a range of scale-out and traditional workloads while delivering strong performance isolation under colocation.

“…The subarray controller consists of address latches, local decoders, and counters. The address latches are essential for multisubarray activation [54]. The counters are used for continuously updating addresses to local decoders for the bulk-style µ-operations.…”

Section: Microarchitecture For Controllers (mentioning)

confidence: 99%

“…To achieve this, each subarray and bank has their independent controllers with latches. Previous work [54] shows such modification incurs ignorable area overhead. The detailed controller design is shown in Section 4.3.…”

Section: Optimizing Bank Reorganization (mentioning)

confidence: 99%

DRISA

Niu, Malladi, et al. 2017

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture

262

Data movement between the processing units and the memory in traditional von Neumann architecture is creating the "memory wall" problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory) have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.
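The abstract's point that a functionally complete operation such as NOR is enough to compute other functions can be checked with a short software sketch. It models a DRAM row as an integer bit vector and builds NOT, OR, AND, and XOR from a bitwise NOR alone. The row width `ROW_BITS` and the helper names are assumptions made for illustration; this does not model DRISA's circuits or its data movement.

```python
ROW_BITS = 8                  # assumed row width, kept small for readability
MASK = (1 << ROW_BITS) - 1    # keeps results within one row


def nor(a, b):
    """Bitwise NOR across all bit positions of two rows."""
    return ~(a | b) & MASK


def not_(a):
    # NOT a == NOR(a, a)
    return nor(a, a)


def or_(a, b):
    # a OR b == NOT(NOR(a, b))
    return not_(nor(a, b))


def and_(a, b):
    # a AND b == NOR(NOT a, NOT b)  (De Morgan)
    return nor(not_(a), not_(b))


def xor_(a, b):
    # NOR(a AND b, NOR(a, b)) is true exactly where the bits differ.
    return nor(and_(a, b), nor(a, b))


row_a = 0b11001010
row_b = 0b10100110
results = {
    "not": not_(row_a),
    "or": or_(row_a, row_b),
    "and": and_(row_a, row_b),
    "xor": xor_(row_a, row_b),
}
```

Each derived operation costs several NOR steps, so composing functions this way trades extra steps for simpler per-bitline hardware.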
