2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00058
Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Abstract: Conventional multicores rely on deep cache hierarchies to reduce data movement. Recent advances in die stacking have enabled near-data processing (NDP) systems that reduce data movement by placing cores close to memory. NDP cores enjoy cheaper memory accesses and are more area-constrained, so they use shallow cache hierarchies instead. Since neither shallow nor deep hierarchies work well for all applications, prior work has proposed systems that incorporate both. These asymmetric memory hierarchies can be high…

Cited by 23 publications (41 citation statements: 0 supporting, 41 mentioning, 0 contrasting)
References 68 publications

Citation statements (ordered by relevance):
“…Programmable logic: Programmable logic PEs can include general purpose processor cores such as CPUs [31,40,70–72], GPUs [26,27,38,73,74], and accelerated processing units (APU) [75] that can execute complex workloads. These cores are usually trimmed down (fewer computation units, less complex cache hierarchies without L2/L3 caches, or lower operating frequencies) from their conventional counterparts due to power, area, and thermal constraints.…”
Citation type: mentioning
confidence: 99%
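The "trimmed down" contrast in the quote can be made concrete with a small configuration sketch. This is a minimal illustration assuming hypothetical parameter values; none of the numbers come from the cited survey.

```cpp
// Illustrative only: parameter values below are assumptions chosen to show
// the host-vs-NDP contrast, not figures from the cited survey.
#include <cstdint>
#include <string>

struct CoreConfig {
    std::string kind;     // "host" or "ndp"
    int issueWidth;       // computation units issued per cycle
    int cacheLevels;      // depth of the cache hierarchy
    std::uint32_t l1KiB;  // L1 capacity
    double freqGHz;       // operating frequency
};

// A host core keeps a deep hierarchy; the NDP core is trimmed down
// (narrower issue, no L2/L3, lower clock) to fit power, area, and
// thermal budgets near memory.
const CoreConfig kHostCore = {"host", 4, 3, 32, 3.0};
const CoreConfig kNdpCore  = {"ndp",  2, 1, 16, 1.5};
```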
“…For both PIM and NMP systems, it is important to determine what computation will be sent (offloaded) to the memory PE. Offloading can be performed at different granularities, e.g., instructions (including small groups of instructions) [1,13,16,19,24,25,28,32,37,39,40,42,57,91,92], threads [71], Nvidia's CUDA blocks/warps [27,29], kernels [26], and applications [38,41,73,74]. Instruction-level offloading is often used with a fixed-function accelerator and PIM systems [1,13,16,19,24,25,28,29,32,37,39,42,57,92].…”
Section: Data Offloading Granularity
Citation type: mentioning
confidence: 99%
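To make the granularity taxonomy above concrete, here is a minimal C++ sketch of a profile-driven offload decision. The enum, the `RegionProfile` fields, and the bytes-per-op threshold are illustrative assumptions, not mechanisms from the cited works.

```cpp
// Minimal sketch of a profile-driven offload decision at one granularity.
#include <cstddef>

enum class OffloadGranularity { Instruction, Thread, BlockWarp, Kernel, Application };

struct RegionProfile {
    OffloadGranularity granularity;  // unit at which the decision is made
    std::size_t bytesMoved;          // data the region moves to/from memory
    std::size_t ops;                 // arithmetic operations in the region
};

// Memory-bound regions (many bytes moved per operation) benefit from
// executing near memory; compute-bound regions stay on the host cores.
bool shouldOffload(const RegionProfile& p, double bytesPerOpThreshold = 4.0) {
    if (p.ops == 0) return false;
    return static_cast<double>(p.bytesMoved) / static_cast<double>(p.ops)
           > bytesPerOpThreshold;
}
```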
“…Many prior works [21,40,42,44,52] use compiler annotations or hardware profiling to dynamically move pages in NUMA systems. Re-mapping pages at kernel execution time becomes infeasible as the system size scales.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
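A minimal sketch of the hardware-profiling flavor of page migration the quote refers to, assuming a hypothetical two-node NUMA system with per-page access counters; the counter layout and migration threshold are illustrative, not taken from the cited works.

```cpp
// Sketch of threshold-based dynamic page migration in a two-node system.
#include <array>
#include <cstdint>

constexpr int kNodes = 2;

struct PageStats {
    std::array<std::uint32_t, kNodes> accesses{};  // per-node access counters
    int home = 0;                                  // node currently holding the page
};

// At the end of each profiling epoch, move the page to the node that
// dominates its accesses, but only when the imbalance exceeds a threshold,
// so the cost of re-mapping is amortized over many future accesses.
int pickHome(const PageStats& s, std::uint32_t threshold = 1024) {
    const int hot = (s.accesses[1] > s.accesses[0]) ? 1 : 0;
    const std::uint32_t diff = s.accesses[hot] - s.accesses[1 - hot];
    return (hot != s.home && diff > threshold) ? hot : s.home;
}
```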
“…There are numerous prior works aimed at addressing accessdependent bottlenecks, including access-pattern-aware prefetching [30], cache management [31] and page allocation [21,40,42,52]. However, all these solutions require knowledge about data/thread locality, which in turn depends on potentially data-dependent application access patterns that traditionally have to be learnt or predicted during kernel execution.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
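As one concrete instance of locality being "learnt or predicted during kernel execution", here is a sketch of a simple per-PC stride detector; the table structure and confidence threshold are assumptions for illustration, not a mechanism from the cited works.

```cpp
// Sketch of online locality learning: a per-PC stride detector that infers
// a regular access pattern while the kernel runs.
#include <algorithm>
#include <cstdint>
#include <optional>
#include <unordered_map>

struct StrideEntry {
    std::uint64_t lastAddr = 0;
    std::int64_t  stride = 0;
    int confidence = 0;  // saturating counter, reset on a stride change
};

class StrideLearner {
    std::unordered_map<std::uint64_t, StrideEntry> table_;  // keyed by load PC
public:
    // Observe one access; once the same stride repeats, predict the next
    // address (e.g., to drive a prefetch or a page-placement hint).
    std::optional<std::uint64_t> observe(std::uint64_t pc, std::uint64_t addr) {
        StrideEntry& e = table_[pc];
        const std::int64_t s = static_cast<std::int64_t>(addr - e.lastAddr);
        e.confidence = (s == e.stride) ? std::min(e.confidence + 1, 3) : 0;
        e.stride = s;
        e.lastAddr = addr;
        if (e.confidence >= 2) return addr + e.stride;
        return std::nullopt;
    }
};
```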
“…However, eschewing a cache hierarchy makes PIM far less efficient on applications with significant locality and complicates several other issues, such as synchronization and coherence. In fact, prior work shows that for many applications, conventional cache hierarchies are far superior to PIM [5,40,90,97].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
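The trade-off this quote describes can be sketched as a locality-driven placement choice: keep phases with good reuse on the host's deep hierarchy, and send streaming phases to PIM. The working-set estimate, capacity, and threshold below are illustrative assumptions, not the paper's actual scheduling policy.

```cpp
// Sketch of a locality-driven host-vs-PIM placement decision.
#include <cstddef>

enum class Placement { HostDeepHierarchy, PimShallow };

struct PhaseStats {
    std::size_t workingSetKiB;  // estimated hot footprint of the phase
    double hitRateOnHost;       // profiled or predicted cache hit rate
};

Placement place(const PhaseStats& s,
                std::size_t hostLlcKiB = 8192,
                double hitThreshold = 0.6) {
    // Significant locality that fits the deep hierarchy favors host cores;
    // poor reuse or an oversized working set favors near-memory execution.
    if (s.workingSetKiB <= hostLlcKiB && s.hitRateOnHost >= hitThreshold)
        return Placement::HostDeepHierarchy;
    return Placement::PimShallow;
}
```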