Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture 2017
DOI: 10.1145/3123939.3123954
Data movement aware computation partitioning

Cited by 44 publications (14 citation statements) · References 52 publications
“…Programmable logic: Programmable logic PEs can include general purpose processor cores such as CPUs [31,40,[70][71][72], GPUs [26,27,38,73,74], and accelerated processing units (APU) [75] that can execute complex workloads. These cores are usually trimmed down (fewer computation units, less complex cache hierarchies without L2/L3 caches, or lower operating frequencies) from their conventional counterparts due to power, area, and thermal constraints.…”
Section: J Low Power Electron Appl 2020 10 X For Peer Review 7
confidence: 99%
“…The most common optimization knobs in DCCs include selecting offloading workloads for memory, selecting the most suitable PE in/near memory, or the timing of executing selected offloads. To implement the policy, management techniques have employed code annotation [1,13,16,19,24,25,28,31,32,37,40,57,91,95], compiler-based code analysis [27,39,40,70,92,96], and online heuristics [27][28][29]38,71,72,74]. Table 1 classifies prominent works based on these attributes.…”
Section: Resource Management Of Data-centric Computing Systems
confidence: 99%
“…threadblock remapping across multiple kernel calls), threadblocks should be mapped to the module containing pages accessed by them. The module to which a page is mapped can be identified from its physical address, requiring an address translation per page or OS support as done in [51]. Using the page mapping and TAFs, appropriate threadblocks can be co-located with their data-pages.…”
Section: Threadblock Mapping
confidence: 99%
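The co-location idea quoted above can be sketched in a few lines: the memory module holding a page is derived from bits of its physical address (assuming simple page-granular address interleaving), and each threadblock is then assigned to the module that holds most of its pages. This is a hypothetical illustration, not the cited paper's implementation; all names and parameters (`PAGE_SIZE`, `NUM_MODULES`, `module_of_page`, `map_threadblocks`) are assumptions.

```python
from collections import Counter

PAGE_SIZE = 4096      # bytes per page (assumption)
NUM_MODULES = 4       # number of memory modules/stacks (assumption)

def module_of_page(phys_addr):
    """Module id from a physical address: page frame number modulo the
    module count (simple page-interleaved placement, assumed here)."""
    return (phys_addr // PAGE_SIZE) % NUM_MODULES

def map_threadblocks(tb_pages):
    """Map each threadblock to the module holding the majority of the
    pages it accesses. tb_pages: {tb_id: [physical page addresses]}."""
    mapping = {}
    for tb, pages in tb_pages.items():
        counts = Counter(module_of_page(p) for p in pages)
        mapping[tb] = counts.most_common(1)[0][0]
    return mapping

# Example: threadblock 0's pages all land on module 1, threadblock 1's on module 2
tbs = {0: [1 * PAGE_SIZE, 5 * PAGE_SIZE, 9 * PAGE_SIZE],
       1: [2 * PAGE_SIZE, 6 * PAGE_SIZE]}
print(map_threadblocks(tbs))   # → {0: 1, 1: 2}
```

In a real system the physical addresses would come from the page table (hence the per-page address translation or OS support the quote mentions), and the interleaving function would match the hardware's module mapping rather than a plain modulo.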
“…TAF enables deciding the optimal/near-optimal mapping just-in-time during data allocation, and is therefore applicable to systems with both unified and discrete memory. Similarly, compiler and runtime support to allow thread remapping at runtime has been proposed [10,32,51]. But like dynamic page re-mapping, re-mapping GPU threadblocks during kernel execution incurs overhead and will limit the scalability.…”
Section: Related Work
confidence: 99%
“…• Processing-in-Memory Architectures: On one hand, there is plenty of recent work on PIM [10,12,16,18,20,30,32,34,36,42,46,52,53,67,73,73,75,76,88,93,95] that built lightweight processors, reconfigurable or application-specific logic in the logic die of HMC [74] or HBM [58]. For example, Active Memory Cube [67] is a representative design with HMC.…”
Section: Related Work
confidence: 99%