Current and future challenges of DRAM metallization

Weber, Daniel; Thies, A.; Kahler, U.; Lepper, Markus; Schutz, R. J.

doi:10.1109/iitc.2005.1499974

Cited by 8 publications

(7 citation statements)

References 2 publications


“…Combining memory and processing components on the same chip imposes serious design challenges. For example, DRAM designs use only three metal layers [201,247], while conventional processor designs typically use more than ten [45,52,229,259]. While these challenges prevent the fabrication of fast logic transistors, UPMEM overcomes these challenges via DPU cores that are relatively deeply pipelined and fine-grained multithreaded [92,178,230,231,237] to run at several hundred megahertz.…”

Section: Introduction (mentioning)

confidence: 99%

Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

Gómez-Luna, Hajj, Fernández et al. 2021

Preprint


Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems. CCS Concepts: • Hardware → Dynamic memory; • Computing methodologies → Model development and analysis; • Computer systems organization → Architectures.



“…However, memory technology challenges prevented from its successful materialization in commercial products. For example, the limited number of metal layers in DRAM [155,156] makes conventional processor designs impractical in commodity DRAM chips [157][158][159][160].…”

Section: Introduction (mentioning)

confidence: 99%

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Gómez-Luna, Guo, Brocard et al. 2022

Preprint


Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., computing systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training. To do so, we (1) implement several representative classic machine learning algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our experimental evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound machine learning workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27× faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34× faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8× and 3.2× faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture. We conclude this paper with several key observations, takeaways, and recommendations that can inspire users of machine learning workloads, programmers of PIM architectures, and hardware designers and architects of future memory-centric computing systems.


“…However, the signal integrity of these devices is also critical when decreasing the dimensions of memory cells and interconnects. [1][2][3][4][5] In sub-50-nm random access memories (DRAMs) and flash memories, the word and bit-line pitches and the address and input/output (I/O) line pitches are reduced to less than 100 and 200 nm, respectively. This narrowing of the pitches leads to an increase in parasitic capacitance resulting in access-time delays for DRAMs, read/write-time delays for flash memories, and high power consumption for both devices.…”

Section: Introduction (mentioning)

confidence: 99%

“…High-density plasma chemical-vapor-deposited silicon dioxide (HDP-CVD SiO2), which has been widely used for gap filling, will no longer be used because of its gap filling and k-value limitations, i.e., the maximum gap-filling aspect ratio (AR) is about 4 or less and the k value is 4.1. [3] Even though air-gap structures utilizing an insufficient gap-filling capability have been proposed, [9][10][11][12][13] developments on the via-misalignment tolerance and reliability of the process are still required. One possible solution for CVD-ILDs is the use of low-k flowable carbon-doped oxide (k = 2.8–4.3), [3][14][15][16] which has better gap-filling and planarizing capabilities than conventional CVD-ILDs.…”

Section: Introduction (mentioning)

confidence: 99%

Low-Shrinkage Spin-On Glass for Low Parasitic Capacitance Gap-Filling Process in Advanced Memory Devices

Ryuzaki, Sakurai, Yoshikawa et al. 2012

Jpn. J. Appl. Phys.


A new low-dielectric-constant spin-on glass (SOG) with a k value of 2.4 has been developed for a gap-filling process in advanced memory devices. The low-shrinkage characteristic of the SOG during thermal curing provides capabilities of gap filling and planarizing as high as those of conventional reflowable SOGs. The low-shrinkage SOG has thermal stability up to 800 °C and chemical stability against diluted hydrofluoric acid, sulfuric acid-hydrogen peroxide, and amine-based solutions, which makes it possible to be used as an interlevel dielectric of memory devices. Tungsten and aluminum interconnects fabricated using the low-shrinkage SOG showed a parasitic capacitance 30% lower than those fabricated using silicon dioxide and a sufficiently long line-to-line dielectric breakdown lifetime. Taking advantage of the high chemical stability of the SOG, an all-wet damageless via-formation process using an amine-based photoresist stripper has been developed. By using the process, the low-shrinkage SOG can be applied to multilevel metallization.
