Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Tsai, Po-An; Chen, Changping; Sánchez, Daniel

doi:10.1109/micro.2018.00058

Cited by 23 publications

(41 citation statements)

References 68 publications


“…Programmable logic: Programmable logic PEs can include general purpose processor cores such as CPUs [31,40,70,71,72], GPUs [26,27,38,73,74], and accelerated processing units (APU) [75] that can execute complex workloads. These cores are usually trimmed down (fewer computation units, less complex cache hierarchies without L2/L3 caches, or lower operating frequencies) from their conventional counterparts due to power, area, and thermal constraints.…”

Section: J Low Power Electron Appl 2020 10 X For Peer Review 7 O (mentioning)

confidence: 99%

“…For both PIM and NMP systems, it is important to determine what computation will be sent (offloaded) to the memory PE. Offloading can be performed at different granularities, e.g., instructions (including small groups of instructions) [1,13,16,19,24,25,28,32,37,39,40,42,57,91,92], threads [71], Nvidia's CUDA blocks/warps [27,29], kernels [26], and applications [38,41,73,74]. Instruction-level offloading is often used with a fixed-function accelerator and PIM systems [1,13,16,19,24,25,28,29,32,37,39,42,57,92].…”

Section: Data Offloading Granularity (mentioning)

confidence: 99%


A Survey of Resource Management for Processing-In-Memory and Near-Memory Processing Architectures

Khan

Pasricha

Kim

2020

JLPEA


Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become a bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers have proposed various memory architectures that enable DCC systems, such as logic layers in 3D-stacked memories or charge-sharing-based bitwise operations in dynamic random-access memory (DRAM). However, application-specific memory access patterns, power and thermal concerns, memory technology limitations, and inconsistent performance gains complicate the offloading of computation in DCC systems. Therefore, designing intelligent resource management techniques for computation offloading is vital for leveraging the potential offered by this new paradigm. In this article, we survey the major trends in managing PIM and NMP-based DCC systems and provide a review of the landscape of resource management techniques employed by system designers for such systems. Additionally, we discuss the future challenges and opportunities in DCC management.



“…Many prior works [21,40,42,44,52] use compiler annotations or hardware profiling to dynamically move pages in NUMA systems. Re-mapping pages at kernel execution time becomes infeasible as the system size scales.…”

Section: Related Work (mentioning)

confidence: 99%

“…There are numerous prior works aimed at addressing access-dependent bottlenecks, including access-pattern-aware prefetching [30], cache management [31] and page allocation [21,40,42,52]. However, all these solutions require knowledge about data/thread locality, which in turn depends on potentially data-dependent application access patterns that traditionally have to be learnt or predicted during kernel execution.…”

Section: Introduction (mentioning)

confidence: 99%

Tafe

Punniyamurthy

Gerstlauer

2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques


In multi-GPU and multi-chiplet GPU systems exhibiting NUMA behavior, information about addresses accessed by threads is crucial for various optimizations such as data/thread co-location and cache/scratchpad memory management. To make optimal decisions and avoid runtime overhead, knowledge about dynamic, potentially data-dependent access patterns should be available before kernel execution. Existing approaches require rewriting of applications or can only capture static, data-independent patterns. In this paper, we propose TAFE, a framework for accurate dynamic thread address footprint estimation of GPU applications. TAFE combines minimal static address pattern annotations with dynamic data dependency tracking to compute threadblock-specific address footprints prior to kernel launch. We propose a low-overhead software mechanism to track dynamic data dependencies and provide an optional lightweight hardware extension to support transparent tracking. We evaluate TAFE on different NUMA GPU system configurations. TAFE achieves 91% estimation accuracy across a wide range of access patterns while incurring less than 3% tracking and estimation overhead. We further demonstrate benefits of using TAFE for efficient data/compute co-location. A TAFE-optimized thread/page mapping can reduce off-chip traffic by 23% (up to 62%) while requiring only minimal, architecture-oblivious annotations from the programmer. Furthermore, a TAFE-optimized system achieves on average 45% and 32% (up to 2x) higher performance compared to an unoptimized baseline and 10% and 22% over existing static, data-independent schemes across multiple system configurations. CCS Concepts • Computer systems organization → Parallel architectures.


“…However, eschewing a cache hierarchy makes PIM far less efficient on applications with significant locality and complicates several other issues, such as synchronization and coherence. In fact, prior work shows that for many applications, conventional cache hierarchies are far superior to PIM [5,40,90,97].…”

Section: Introduction (mentioning)

confidence: 99%

Livia

Lockerman

Feldmann

Bakhshalipour

et al. 2020

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Syste

Self Cite


In order to scale, future systems will need to dramatically reduce data movement. Data movement is expensive in current designs because (i) traditional memory hierarchies force computation to happen unnecessarily far away from data and (ii) processing-in-memory approaches fail to exploit locality. We propose Memory Services, a flexible programming model that enables data-centric computing throughout the memory hierarchy. In Memory Services, applications express functionality as graphs of simple tasks, each task indicating the data it operates on. We design and evaluate Livia, a new system architecture for Memory Services that dynamically schedules tasks and data at the location in the memory hierarchy that minimizes overall data movement. Livia adds less than 3% area overhead to a tiled multicore and accelerates challenging irregular workloads by 1.3× to 2.4× while reducing dynamic energy by 1.2× to 4.7×. CCS Concepts • Computer systems organization → Processors and memory architectures.

