2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ispass.2014.6844487

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

Abstract: On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-the-art accelerators, such as the NVIDIA Fermi or Kepler GPUs and Intel's forthcoming MIC "Knights Landing" (KNL), support both software-managed caches, a.k.a. shared memory (GPUs) or near memory (KNL), and hardware-managed L1 data caches (D-caches). Furthermore, shared memory and the L1 D-cache on a GPU utilize the…

Cited by 24 publications (8 citation statements)
References 11 publications
“…We evaluate 14 workloads from the Rodinia benchmark suite [9] with their default inputs. We include two additional applications: Matrix Multiplication (MM), a highly efficient version using tiling [21]; and Fast Fourier Transform (FFT), an optimized version fully utilizing on-chip memory [33]. We also include two widely used workloads, Barnes-Hut N-body Simulation (BH) and Single-Source Shortest Paths (SSSP), from the Lonestar GPU suite [8], both of which exhibit irregular memory access patterns.…”
Section: L1 D-cache) Is Shown In
confidence: 99%
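The tiled MM version mentioned above relies on staging sub-blocks of the input matrices in on-chip shared memory. Below is a minimal sketch of that classic technique, not the exact code from [21]; the kernel name tiledMatMul and the tile width are illustrative assumptions, and n is assumed to be a multiple of the tile width to keep the sketch short.

```cuda
#define TILE_W 16  // tile width; an assumed value for illustration

// Minimal tiled matrix multiply: C = A * B for n x n row-major matrices.
// Assumes n is a multiple of TILE_W.
// Launch: tiledMatMul<<<dim3(n/TILE_W, n/TILE_W), dim3(TILE_W, TILE_W)>>>(A, B, C, n);
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE_W][TILE_W];
    __shared__ float Bs[TILE_W][TILE_W];

    int row = blockIdx.y * TILE_W + threadIdx.y;
    int col = blockIdx.x * TILE_W + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE_W; ++t) {
        // Each thread stages one element of the A tile and one of the B tile
        // into shared memory, so the block reuses them TILE_W times.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_W + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_W + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE_W; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }

    C[row * n + col] = acc;
}
```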
“…Second, we can choose to let only the first warp load the data into shared memory; the other warps then access the data from shared memory. However, this approach incurs overhead from the operations that move data between registers and shared memory [14]. Additional control flow is also needed to ensure that the global memory data are loaded only once, and a synchronization is necessary to eliminate potential data races.…”
Section: Pattern 3: Promote Variables From Shared Memory / Global Memory
confidence: 99%
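A minimal CUDA sketch of this single-warp loading pattern follows; the kernel name firstWarpLoad, the tile size, and the doubling computation are illustrative assumptions rather than details from the cited work.

```cuda
#define TILE 128  // per-block tile size; an assumed value for illustration

// Warp 0 stages a tile from global memory into shared memory; all warps
// then read the staged copy. Launch with blockDim.x >= TILE, e.g.:
// firstWarpLoad<<<(n + TILE - 1) / TILE, 256>>>(d_in, d_out, n);
__global__ void firstWarpLoad(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[TILE];

    int warpId = threadIdx.x / warpSize;
    int lane   = threadIdx.x % warpSize;
    int base   = blockIdx.x * TILE;

    // Extra control flow so each element is loaded from global memory
    // exactly once per block; the load passes through registers on its
    // way into shared memory, the overhead noted in [14].
    if (warpId == 0) {
        for (int i = lane; i < TILE && base + i < n; i += warpSize)
            tile[i] = g_in[base + i];
    }

    // Synchronization is required before the other warps may read the
    // tile; omitting it would create a data race.
    __syncthreads();

    int idx = base + threadIdx.x;
    if (threadIdx.x < TILE && idx < n)
        g_out[idx] = tile[threadIdx.x] * 2.0f;  // placeholder computation
}
```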
“…The trade-offs between software-managed shared memory and the hardware-managed D-cache on GPUs have been studied in [14]. Gebhart et al. [7] observed that different applications have different needs for various memory resources.…”
Section: Related Work
confidence: 99%
“…Shared memory requires explicit management and can benefit applications with predictable data access patterns, but it is not appropriate for applications with irregular access patterns. For such applications, hardware-managed caches still play an important role in hiding long off-chip memory access latencies [6]. To serve a broader range of application domains, recent GPUs, including the NVIDIA Fermi [1] and Kepler [7] architectures, provide a configurable L1D cache with shared memory on each SM.…”
Section: Introduction
confidence: 99%
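The configurable split mentioned here can be requested from the CUDA runtime with cudaFuncSetCacheConfig, which on Fermi/Kepler-class GPUs sets the preferred division of per-SM on-chip storage between L1 cache and shared memory. The sketch below shows minimal usage; dummyKernel is a hypothetical placeholder for a real workload.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for any real workload.
__global__ void dummyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    // Request a larger L1 D-cache, which suits irregular access patterns;
    // cudaFuncCachePreferShared would instead favor shared memory.
    cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferL1);

    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));
    cudaMemset(d_data, 0, 1024 * sizeof(float));

    dummyKernel<<<4, 256>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Note that the setting is a preference, not a guarantee: the driver may fall back to a different split if the kernel's shared-memory usage requires it.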