TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture

Lee, Jaekyu; Kim, Hyesoon

doi:10.1109/hpca.2012.6168947

Cited by 111 publications

(94 citation statements)

References 15 publications


“…PDP-G and PDP-S achieve average IPC improvement of 44.6% and 45.4% respectively, which are close to that of PDP-P. PDP-S performs very similarly to PDP-P for most of the benchmarks. This is because GPU programs usually have similar behavior for all the threads [27], and the optimal PD estimated by one of the SIMT cores is probably also the optimal PD for the rest. In the following sections, we adopt the PDP-S design since it is the cheapest one.…”

Section: Cache Bypassing on GPUs (mentioning)

confidence: 99%

Adaptive Cache Management for Energy-Efficient GPU Computing

Chen, Chang, Rodrigues, et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

148


Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPUs cause cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.


“…Although higher TLP provides better latency hiding capability, it has been observed that increased TLP sometimes may hurt the performance due to the cache contention problem [22]. To address this performance anomaly, we proposed to either use the dynamic SM-dueling approach [14] or a simple static threshold to limit the number of active warps. More details are discussed in Section 6.1.…”

Section: Figure 14 The Workload Buffer (mentioning)

confidence: 99%

Warp-level divergence in GPUs: Characterization, impact, and mitigation

Xiang, Yang, Zhou 2014

2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)


“…Prior efforts [8,10,15,16,17,26,29,30,31] demonstrate that horizontal partitioning on memory or LLC is effective in eliminating inter-program interference and improving performance. With vertical partitioning and, more generally, our partitioning policy space, one important question is whether the benefits from the horizontal memory and cache partitioning can be accumulated (i.e., should we go vertical in partitioning?…”

Section: Going Vertical? (mentioning)

confidence: 99%

“…In particular, Qureshi et al [32] design a utility-based cache partitioning scheme that allocates appropriate cache resources based on application miss rate monitored through dedicated hardware. More recently, cache partitioning is also adopted in heterogeneous GPU-CPU architectures to promote fair resource sharing among CPU and GPU applications [30], which exhibit drastically different memory access characteristics. Other efforts [3,9,12,15,25,28] classify workloads based on hardware profiling, and then choose appropriate scheduling policies for different classifications.…”

Section: Related Work (mentioning)

confidence: 99%

“…Several recent solutions attempt to segregate applications with different memory resources requirements by horizontally partitioning either main memory (DRAM banks) [10,16,17,29] or cache [15,24,30,31,32] into exclusive slices. These approaches avoid interference for programs with small memory footprints but might hamper performance of larger workloads by effectively reducing capacity.…”

Section: Introduction (mentioning)

confidence: 99%
