Cache Coherence for GPU Architectures

Singh, Inderpreet; Shriraman, Arrvindh; Fung, Wilson Wai Lun; O'Connor, Mike; Aamodt, Tor M.

doi:10.1109/mm.2014.4

Cited by 39 publications

(56 citation statements)

References 40 publications

“…GPU L1 caches typically feature a write-through policy, with [1] or without [33], [35] write-allocation. This policy saves bandwidth compared to a write-back policy [41], [16], since GPU applications have very little reuse on written data. The L2 cache is write-back with write-allocation, which is the same design choice as a conventional CPU LLC.…”

Section: Baseline GPU Architecture (mentioning)

confidence: 99%
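
The bandwidth argument in the statement above can be illustrated with a toy traffic model. This is only a sketch: the function name, the 128-byte line, the 32-byte write and the unbounded L1 are assumptions made for illustration, not figures from the cited papers.

```python
def l1_to_l2_traffic(write_lines, policy, line_bytes=128, write_bytes=32):
    """Bytes moved between L1 and L2 for a sequence of written line addresses.

    Toy model (assumed): the L1 never runs out of space, so under write-back
    each distinct line is filled once and written back once.
    """
    if policy == "write_through_no_allocate":
        # Every write is forwarded to L2; nothing is fetched into L1.
        return len(write_lines) * write_bytes
    if policy == "write_back_allocate":
        # One line fill on the first write plus one dirty write-back later.
        return len(set(write_lines)) * 2 * line_bytes
    raise ValueError(f"unknown policy: {policy}")


# Streaming writes with no reuse: each write touches a new line.
streaming = list(range(1000))
# Heavy reuse: the same line is written repeatedly.
reused = [0] * 1000

for name, trace in (("streaming", streaming), ("reused", reused)):
    wt = l1_to_l2_traffic(trace, "write_through_no_allocate")
    wb = l1_to_l2_traffic(trace, "write_back_allocate")
    print(name, "write-through:", wt, "write-back:", wb)
```

Under these assumptions write-through moves fewer bytes when written data is not reused, and write-back wins only when the same line is written many times, which matches the reasoning in the quoted statement.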

“…The L2 cache is write-back with write-allocation, which is the same design choice as a conventional CPU LLC. Modern GPUs typically do not provide hardware support for L1 cache coherence to avoid the overhead that coherence messages add to NoC traffic and memory access latency [41], [16]. Current GPU L2 caches do not enforce inclusion.…”

Section: Baseline GPU Architecture (mentioning)

confidence: 99%

“…Current GPU L2 caches do not enforce inclusion. NVIDIA GPU caches are non-inclusive non-exclusive caches [2], [41], meaning cache lines that are brought into L1 caches are also brought into the L2 cache, but an L2 cache line is evicted silently (without recalling L1 caches) when replacement happens. This design reduces the number of redundant data copies compared with inclusive caches, but still allows L1 caches to locally provide shared input values.…”

Section: Baseline GPU Architecture (mentioning)

confidence: 99%
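
The non-inclusive non-exclusive behavior described above can be sketched with two sets standing in for the cache levels. The class and method names are hypothetical, capacity and replacement are ignored, and the `inclusive` flag exists only to contrast with an inclusive design that recalls L1 copies.

```python
class TwoLevelCache:
    """Toy L1/L2 pair tracking only which lines are present (assumed model)."""

    def __init__(self):
        self.l1 = set()
        self.l2 = set()

    def load(self, line):
        """Return where the line was found, then fill both levels."""
        if line in self.l1:
            return "L1 hit"
        result = "L2 hit" if line in self.l2 else "miss"
        # A fill brings the line into L2 and L1 alike.
        self.l2.add(line)
        self.l1.add(line)
        return result

    def evict_l2(self, line, inclusive=False):
        """Evict from L2; only an inclusive design also recalls the L1 copy."""
        self.l2.discard(line)
        if inclusive:
            self.l1.discard(line)


cache = TwoLevelCache()
cache.load(0xA)
cache.evict_l2(0xA)              # silent eviction: L1 keeps its copy
print(cache.load(0xA))

cache.evict_l2(0xA, inclusive=True)  # inclusive eviction: L1 copy recalled
print(cache.load(0xA))
```

In this sketch the silent eviction leaves the L1 copy usable, whereas the inclusive variant forces the next load to miss.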

Adaptive Cache Management for Energy-Efficient GPU Computing

Chen, Chang, Rodrigues, et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPUs cause cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems, can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.
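
As a rough illustration of what "reuse distance" means in the abstract above, the sketch below computes it offline from an access trace and flags accesses whose distance would not fit in a cache of a given size. This is a generic textbook-style illustration, not the policy proposed in the paper: the function names, the fully associative LRU assumption and the threshold rule are all assumptions.

```python
def reuse_distances(trace):
    """Distinct lines touched since the previous access to the same line.

    Returns None for the first access to a line.
    """
    last_seen = {}
    distances = []
    for i, line in enumerate(trace):
        if line in last_seen:
            between = trace[last_seen[line] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[line] = i
    return distances


def should_bypass(distance, capacity_lines):
    """Assumed rule: bypass when the reuse would miss in a fully
    associative LRU cache holding `capacity_lines` lines."""
    return distance is not None and distance >= capacity_lines


trace = ["A", "B", "C", "A", "B", "B"]
for line, d in zip(trace, reuse_distances(trace)):
    print(line, d, should_bypass(d, capacity_lines=2))
```

A hardware policy cannot see the future the way this offline pass does, so a real design would have to predict reuse distance at runtime.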

“…One could envision scoped transactions. Singh et al proposed temporal coherence, which is a time-based self-invalidation coherence protocol for GPUs [27]. Scopes could potentially be applied to temporal coherence to reduce self-invalidations.…”

Section: Related Work (mentioning)

confidence: 99%

Synchronization Using Remote-Scope Promotion

Orr, Che, Yilmazer, et al. 2015

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Heterogeneous system architecture (HSA) and OpenCL™ define scoped synchronization to facilitate low overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a priori. It works poorly for dynamic sharing patterns (e.g., work stealing) where programmers cannot use a faster small scope due to the rare possibility that the work is stolen by a thread in a distant slower scope. This puts programmers in a conundrum: optimize the common case by synchronizing at a faster small scope or use work stealing at a slower large scope. In this paper, we propose to extend scoped synchronization with remote-scope promotion. This allows the most frequent sharers to synchronize through a small scope. Infrequent sharers synchronize by promoting that remote small scope to a larger shared scope. Synchronization using remote-scope promotion provides performance robustness for dynamic workloads, where the benefits provided by scoped synchronization and work stealing are hard to anticipate. Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average. In contrast, synchronization using remote-scope promotion achieves a robust 1.25x speedup on average, across a diverse set of graph benchmarks and inputs.

“…This is a time-based coherence protocol for GPUs, namely Temporal Coherence (TC) [80], based on globally synchronized counters. With TC-Strong, these synchronized counters are maintained in the GPU cores and L2 controllers, allowing to self-invalidate cache blocks and maintain coherence, thus eliminating coherence traffic, and reducing area overhead and protocol complexity.…”

Section: Timestamps-based Coherence (mentioning)

confidence: 99%
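
The idea of time-based self-invalidation in the statement above can be sketched as a lease scheme against a shared clock. This is a heavily simplified illustration, not the published protocol: the class name, the fixed 10-cycle lease, and the rule that a write waits for the latest outstanding lease are assumptions of the sketch.

```python
class LeaseDirectory:
    """L2-side record of the latest lease expiry handed out per line
    (hypothetical structure, for illustration only)."""

    def __init__(self, lease_cycles=10):
        self.lease_cycles = lease_cycles
        self.expiry = {}

    def read(self, line, now):
        """Grant an L1 a lease on `line`; return the time the lease ends."""
        lease_end = now + self.lease_cycles
        self.expiry[line] = max(self.expiry.get(line, 0), lease_end)
        return lease_end

    def write_ready_time(self, line, now):
        """Earliest time a write may complete without any invalidation
        message: after every lease granted on the line has expired."""
        return max(now, self.expiry.get(line, 0))


def l1_copy_valid(lease_end, now):
    """An L1 copy invalidates itself once the shared clock reaches its lease end."""
    return now < lease_end


directory = LeaseDirectory(lease_cycles=10)
lease_end = directory.read("x", now=100)
print(l1_copy_valid(lease_end, now=105))
print(directory.write_ready_time("x", now=103))
```

Because every cache compares the same clock against the lease end, the writer needs no invalidation or acknowledgment messages, which is the traffic saving the quoted statement refers to.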

Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

García

Acknowledgment

Four years of effort go a long way. Looking back, many people have supported me and many have listened to me. Thank you all.

To Ramón and Lola. Without them, without their support and their love, I would not have made it this far. Without their inspiration I would not even have started down this path. They planted the seed of curiosity, so necessary for overcoming the frustration and difficulties that every researcher has to face along the way.

To Víctor. Brother and friend. The mirror in which I see myself. You have given me strength at every stage of my life. Thank you for everything. We will have to plan a good bike route to celebrate.

To Carla. For her patience, for her help, for backing me up when I needed it. You have always encouraged me to keep going and have stayed by my side through every stage of the doctorate. More than that, you are one of the main pillars of my life.

To Salomé, Eloy and Sergio. For loving me and accepting me as one more member of the family. I truly feel the same way. Thank you.

To my friends. Samuel, Eduardo, Manolo. Whether playing some game, organizing trips together, or going out to party or for a drink, you have always been there, sharing in my achievements and my frustrations, as much as I have in yours. And I hope it stays that way for many, many years.

Back in 2007 I began my journey at the university, ten years ago now. It has been my second home, and I owe it a great deal. I also owe many thanks to all the people I have met during this time, including this last phase in the Grupo de Arquitecturas Paralelas.

Of course, and above all, to my advisors, Alberto, Antonio and Maria Engracia. You have guided and inspired me, challenged and helped me at every stage of my doctorate. Your patience and effort have been key to the completion of this work. I have learned a great deal during these years, not only about computer architecture but also about scientific research: to love it, to observe the small details, to analyze the data and understand what is happening in order to keep moving forward. They have changed my life, professional and personal, forever.

To Núria, Salva and Eduardo. We do not see each other as much as before; our lives and their obligations pull us apart. But we will still have those afternoons over beers to catch up. Always.

To my labmates. José Vicente, José María, Migue, Vicent, Joan, Fran, Javi, Roberto, Carlos, Santi, and a long list of others. To those who left, Knut, Mario, Crispín. So very many. I owe something to all of you. What BoardGameArena has joined together, let our successful careers abroad not put asunder, like a runaway freight train with no brakes. And of course to Ricardo. The true pillar of the lab, both technically and personally. Your incredible patience and dedication make you a central piece of the work of all of us in the group.
