Proceedings of the 2016 International Conference on Supercomputing
DOI: 10.1145/2925426.2926253

Tag-Split Cache for Efficient GPGPU Cache Utilization

Abstract: Modern GPUs employ caches to improve memory system efficiency. However, a large amount of cache space is underutilized due to irregular memory accesses and the poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines can improve cache space utilization, but doing so also frequently incurs significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enable…
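The tradeoff the abstract describes (finer cache lines improve space utilization but can multiply the number of requests) can be sketched numerically. The toy Python model below is an illustration of that tradeoff only, not the paper's TSC design; the 128 B / 32 B granularities and the access patterns are assumptions chosen for the example.

```python
# Toy model (an illustration, not the paper's TSC mechanism): compare how much
# of the fetched data is actually used for coarse vs. fine cache lines.

def simulate(line_size, addresses, access_bytes=4):
    """Return (number of lines fetched, fraction of fetched bytes used)."""
    lines = {addr // line_size for addr in addresses}   # distinct lines touched
    used = len(addresses) * access_bytes                # bytes the warp needs
    fetched = len(lines) * line_size                    # bytes brought into cache
    return len(lines), used / fetched

# A 32-thread warp, each thread loading 4 bytes.
strided   = [tid * 128 for tid in range(32)]  # poor spatial locality
coalesced = [tid * 4   for tid in range(32)]  # perfect spatial locality

print(simulate(128, strided))    # (32, 0.03125): ~97% of each 128 B line wasted
print(simulate(32,  strided))    # (32, 0.125):   finer lines waste 4x less space
print(simulate(128, coalesced))  # (1, 1.0):      one coarse request suffices
print(simulate(32,  coalesced))  # (4, 1.0):      finer lines need 4x the requests
```

The strided case shows why finer storage granularity helps utilization; the coalesced case shows the extra requests that motivate keeping a coarse access granularity, which is the combination TSC aims for.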

Cited by 9 publications (3 citation statements)
References 35 publications
“…(B) Memory Divergence Frequency and Degree Significance. GPU memory divergence can significantly bottleneck performance, thus becomes a popular research topic in recent years [33,35,44,49,52]. It is also an important indicator on whether a program is well optimized for memory access.…”
Section: Case Studies
confidence: 99%
“…This is because modern GPUs have very limited L1 data cache and massively threaded GPU applications often exceed the L1 capacity, causing severe thrashing [24,43]. Additionally, cache-level resources (e.g., MSHR entries and load/store queues) are also very limited, often causing severe resource congestion (e.g., MSHR allocation failures) [30,32,33]. To tackle this problem, many architecture solutions are provided, e.g., enabling bypassing threshold in tag store [32] and proposing new bypassing policy [31].…”
Section: Case Studies
confidence: 99%
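The MSHR-congestion point in the quote above (misses stall when no MSHR entry is free, motivating L1 bypassing) can be sketched as a minimal policy. The class, the MSHR count, and the return labels below are hypothetical illustrations, not the exact mechanisms of the cited papers.

```python
# Hypothetical sketch of an MSHR-aware bypass policy (names and sizes are
# assumptions): when every MSHR entry is busy, route the miss around the L1
# instead of stalling the warp on MSHR allocation failure.

class L1Cache:
    def __init__(self, num_mshrs=4):
        self.num_mshrs = num_mshrs
        self.outstanding = set()   # line addresses of in-flight misses (MSHRs in use)
        self.lines = set()         # resident line addresses

    def access(self, line_addr):
        if line_addr in self.lines:
            return "hit"
        if line_addr in self.outstanding:
            return "merged"        # secondary miss merged into an existing MSHR
        if len(self.outstanding) < self.num_mshrs:
            self.outstanding.add(line_addr)
            return "miss"          # MSHR allocated, fill is in flight
        return "bypass"            # MSHRs exhausted: skip the L1, go to L2 directly

cache = L1Cache(num_mshrs=2)
results = [cache.access(a) for a in [0, 1, 2, 0]]
print(results)  # ['miss', 'miss', 'bypass', 'merged']
```

With only two MSHRs, the third distinct miss bypasses the L1 rather than failing allocation, while the repeated address merges into its pending entry; real policies add thresholds and selectivity on top of this basic idea.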
“…Rhu et al [2013] proposed a locality-aware memory hierarchy that adaptively adjusts the memory access granularity to prevent overfetching, providing better off-chip bandwidth utilization. Furthermore, with regard to adaptive memory access granularity, Li et al [2016] proposed a tag-split cache to enable fine storage granularity to improve cache utilization while keeping a coarse access granularity to avoid excessive cache requests. proposed a scheme to tolerate memory miss latencies for SIMD cores by masking out threads in a warp that are waiting on data and allowing other threads to continue execution, hence utilizing the idle execution slots.…”
Section: Cache Management
confidence: 99%