Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

Li, Chao; Yang, Yi; Dai, Hongwen; Yan, Shengen; Mueller, Frank; Zhou, Huiyang

doi:10.1109/ispass.2014.6844487

Cited by 24 publications

(8 citation statements)

References 11 publications

“…We evaluate 14 workloads from Rodinia Benchmark [9] with their default inputs. We include two additional applications: Matrix Multiplication (MM), a highly efficient version using tiled cache [21]; and Fast Fourier Transformation (FFT), an optimized version fully utilizing on-chip memory [33]. We also include two widely-used workloads: Barnes Hut N body Simulation (BH) and Single-Source Shortest Paths (SSSP) from Lonestar GPU suite [8], both of which exhibit irregular memory access patterns.…”

Section: L1 D-cache) Is Shown In (mentioning)

confidence: 99%

Locality-Driven Dynamic GPU Cache Bypassing

Song, Dai, et al. 2015

Proceedings of the 29th ACM on International Conference on Supercomputing

Self Cite

104

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.
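The abstract above describes filtering the access stream so that only data with high reuse and short reuse distances is inserted into the L1 D-cache. As a rough illustration of that idea only, the Python sketch below computes per-access reuse distances in software and applies a threshold to decide on insertion. The function names, the `max_distance` threshold, and the example stream are assumptions made here for illustration; the paper's mechanism is a hardware extension to the tag store and is not reproduced.

```python
from collections import OrderedDict


def reuse_distances(stream):
    """For each access, return the number of distinct other lines touched
    since the previous access to the same line (None on first touch)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    out = []
    for line in stream:
        if line in stack:
            keys = list(stack)
            # Every key after `line` in LRU order was touched more recently.
            out.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            out.append(None)
            stack[line] = True
    return out


def should_insert(distance, max_distance):
    """Hypothetical policy: cache a line only if it showed short reuse."""
    return distance is not None and distance <= max_distance


# Illustrative stream of cache-line identifiers (not from the paper).
stream = ["A", "B", "A", "C", "D", "E", "B", "A"]
dists = reuse_distances(stream)
decisions = [should_insert(d, max_distance=2) for d in dists]

print(dists)      # [None, None, 1, None, None, None, 4, 4]
print(decisions)  # only the third access qualifies for insertion
```

In this toy stream only the second access to `A` has a reuse distance within the threshold, so under the assumed policy every other access would bypass the cache.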

“…Second, we can choose to let only the first warp load the data into shared memory, and other warps then access the data from shared memory. However, this way incurs overhead due to operations moving data from/into register into/from shared memory [14]. Additional control flow is also needed to ensure that the global memory data are loaded only once and a synchronization is necessary to eliminate potential data races.…”

Section: Pattern 3: Promote Variables From Shared Memory / Global Mem (mentioning)

confidence: 99%

“…The trade-offs between software-managed shared memory and hardware-managed D-cache on GPUs have been studied in [14]. Gebhart et al [7] made the observation that different applications have different needs for various memory resources.…”

Section: Related Work (mentioning)

confidence: 99%

“…As expected, how to effectively utilize such on-chip memory resources has a significant impact on application performance. However, it is non-trivial for application developers to explicitly manage these on-chip memory resources as the trade-offs among these resources are complex and sometimes nonintuitive [14]. More importantly, as on-chip resources have been changing significantly for different generations of GPUs, an optimized kernel upon one generation becomes suboptimal on another one.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic data placement into GPU on-chip memory resources

Yang, Lin, et al. 2015

2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Self Cite

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Among 12 benchmarks in our study, our proposed compiler algorithm improves the performance by 1.76x on average on Nvidia GTX480, and by 1.61x on average on GTX680.

“…Shared memory requires explicit management and can benefit applications with predictable data access patterns but it's not appropriate for applications with irregular access patterns. For such applications, hardware managed caches still play an important role to hide long off-chip memory access latencies [6]. With a broader application domain range, recent GPUs including the NVIDIA Fermi [1] and Kepler [7] architectures, provide a configurable L1D cache with shared memory on each SM.…”

Section: Introduction (mentioning)

confidence: 99%

RACB: Resource Aware Cache Bypass on GPUs

Dai, Kartsaklis, et al. 2014

2014 International Symposium on Computer Architecture and High Performance Computing Workshop

Self Cite

Caches are universally used in computing systems to hide long off-chip memory access latencies. Unlike CPUs, massive threads running simultaneously on GPUs bring a tremendous pressure on memory hierarchy. As a result, the limitation of cache resources becomes a bottleneck for a GPU to exploit thread-level parallelism (TLP) and memory-level parallelism (MLP) and achieve high performance. In this paper, we propose a mechanism to bypass L1D and L2 cache based on the availability of cache resources. Our proposed mechanism is based on the observation that a huge number of stalls coming from limited cache resources prohibit GPUs from providing a higher throughput. So we propose Resource Aware Cache Bypass (RACB) with minor hardware changes to eliminate such stalls to improve performance. We examine the effectiveness of this approach when applied to L1D and L2 cache separately as well as together. Evaluation results with NVIDIA Computing SDK show that RACB generally improves performance the most when applied to both L1D and L2 cache, which is up to 88.05% and on an average of 16.73%; additionally, energy is saved up to 22.35% and on an average of 5.88% with minor hardware overheads.
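The abstract above attributes lost throughput to stalls caused by limited cache resources and proposes bypassing the cache when those resources are unavailable. The Python sketch below is a toy model of that contrast, not the RACB design: it assumes a single hypothetical resource type (a fixed number of miss-handling slots, each held for a fixed `latency`), and the function name, parameters, and arrival times are all invented for illustration.

```python
def count_outcomes(miss_cycles, slots, latency, bypass):
    """Replay misses arriving at `miss_cycles` against `slots` entries.

    Returns (stall_cycles, bypassed_requests). Toy model: each miss holds
    one slot for `latency` cycles; a stall delays all later arrivals.
    """
    busy_until = []  # release cycle of each occupied slot
    stalls = 0
    bypassed = 0
    delay = 0  # accumulated stall time pushing later arrivals back
    for t in miss_cycles:
        now = t + delay
        # Free every slot whose miss has already returned.
        busy_until = [r for r in busy_until if r > now]
        if len(busy_until) < slots:
            busy_until.append(now + latency)
        elif bypass:
            # No slot free: send the request around the cache.
            bypassed += 1
        else:
            # No slot free: stall until the earliest slot is released.
            wait = min(busy_until) - now
            stalls += wait
            delay += wait
            now += wait
            busy_until.remove(min(busy_until))
            busy_until.append(now + latency)
    return stalls, bypassed


arrivals = [0, 1, 2, 3]  # illustrative miss arrival cycles
stalling = count_outcomes(arrivals, slots=2, latency=10, bypass=False)
bypassing = count_outcomes(arrivals, slots=2, latency=10, bypass=True)
print(stalling)   # (8, 0)
print(bypassing)  # (0, 2)
```

With two slots and four back-to-back misses, the stalling policy in this model waits 8 cycles, while the bypassing policy waits none and routes two requests around the cache. The model ignores the cost of the bypassed requests at lower levels, which a real evaluation would need to account for.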
