Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture 2017
DOI: 10.1145/3123939.3123954
Data movement aware computation partitioning

Cited by 44 publications (14 citation statements) · References 52 publications
“…Programmable logic: Programmable logic PEs can include general purpose processor cores such as CPUs [31,40,[70][71][72], GPUs [26,27,38,73,74], and accelerated processing units (APU) [75] that can execute complex workloads. These cores are usually trimmed down (fewer computation units, less complex cache hierarchies without L2/L3 caches, or lower operating frequencies) from their conventional counterparts due to power, area, and thermal constraints.…”
Section: J Low Power Electron Appl 2020 10 X For Peer Review 7
confidence: 99%
“…The most common optimization knobs in DCCs include selecting offloading workloads for memory, selecting the most suitable PE in/near memory, or the timing of executing selected offloads. To implement the policy, management techniques have employed code annotation [1,13,16,19,24,25,28,31,32,37,40,57,91,95], compiler-based code analysis [27,39,40,70,92,96], and online heuristics [27][28][29]38,71,72,74]. Table 1 classifies prominent works based on these attributes.…”
Section: Resource Management Of Data-centric Computing Systems
confidence: 99%
“…threadblock remapping across multiple kernel calls), threadblocks should be mapped to the module containing pages accessed by them. The module to which a page is mapped can be identified from its physical address, requiring an address translation per page or OS support as done in [51]. Using the page mapping and TAFs, appropriate threadblocks can be co-located with their data-pages.…”
Section: Threadblock Mapping
confidence: 99%
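The co-location idea quoted above can be sketched in a few lines: the memory module holding a page is derived from bits of its physical address (assuming simple page-granular address interleaving), and each threadblock is then assigned to the module that holds most of its pages. This is a hypothetical illustration, not the cited paper's implementation; all names and parameters (`PAGE_SIZE`, `NUM_MODULES`, `module_of_page`, `map_threadblocks`) are assumptions.

```python
from collections import Counter

PAGE_SIZE = 4096      # bytes per page (assumption)
NUM_MODULES = 4       # number of memory modules/stacks (assumption)

def module_of_page(phys_addr):
    """Module id from a physical address: page frame number modulo the
    module count (simple page-interleaved placement, assumed here)."""
    return (phys_addr // PAGE_SIZE) % NUM_MODULES

def map_threadblocks(tb_pages):
    """Map each threadblock to the module holding the majority of the
    pages it accesses. tb_pages: {tb_id: [physical page addresses]}."""
    mapping = {}
    for tb, pages in tb_pages.items():
        counts = Counter(module_of_page(p) for p in pages)
        mapping[tb] = counts.most_common(1)[0][0]
    return mapping

# Example: threadblock 0's pages all land on module 1, threadblock 1's on module 2
tbs = {0: [1 * PAGE_SIZE, 5 * PAGE_SIZE, 9 * PAGE_SIZE],
       1: [2 * PAGE_SIZE, 6 * PAGE_SIZE]}
print(map_threadblocks(tbs))   # → {0: 1, 1: 2}
```

In a real system the physical addresses would come from the page table (hence the per-page address translation or OS support the quote mentions), and the interleaving function would match the hardware's module mapping rather than a plain modulo.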
“…TAF enables deciding the optimal/near-optimal mapping just-in-time during data allocation, and is therefore applicable to systems with both unified and discrete memory. Similarly, compiler and runtime support to allow thread remapping at runtime has been proposed [10,32,51]. But like dynamic page re-mapping, re-mapping GPU threadblocks during kernel execution incurs overhead and will limit the scalability.…”
Section: Related Work
confidence: 99%
“…• Processing-in-Memory Architectures: On one hand, there is plenty of recent work on PIM [10,12,16,18,20,30,32,34,36,42,46,52,53,67,73,73,75,76,88,93,95] that built lightweight processors, reconfigurable or application-specific logic in the logic die of HMC [74] or HBM [58]. For example, Active Memory Cube [67] is a representative design with HMC.…”
Section: Related Work
confidence: 99%