Scheduling Page Table Walks for Irregular GPU Applications

Shin, Seunghee; Cox, Guilherme; Oskin, Mark; Loh, Gabriel H.; Solihin, Yan; Bhattacharjee, Abhishek; Basu, Arkaprava

doi:10.1109/isca.2018.00025

Cited by 45 publications

(18 citation statements)

References 37 publications

“…Our approach is built upon the unique execution characteristics of GPU and effectively increases the TLB reach with minimal hardware overhead. Meanwhile, our approach is complementary to most prior works (e.g., page table walk optimization for irregular applications [56]) and can be combined with them to further improve the UVM performance. Address translation optimizations: There exists a substantial body of research works, both from the OS community and the architecture community, focusing on address translation optimizations [2,7,8,12,29,37,42].…”

Section: Related Work (mentioning)

confidence: 95%

“…Pham et al [43] proposed a Bloom filter-based hardware mechanism that can be used to reduce the overheads imposed by cache flushes due to virtual page remappings. Shin et al [56] explored various critical warp-aware page table walking strategies to accelerate irregular application address translations. Margaritov et al [32] proposed parallel translation prefetching to avoid multiple levels of sequential page table walks in CPUs.…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Address Translations in Throughput Processors via Compression

Tang, Zhang, et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Efficient memory sharing among multiple compute engines plays an important role in shaping the overall application performance on CPU-GPU heterogeneous platforms. Unified Virtual Memory (UVM) is a promising feature that allows globally-visible data structures and pointers such that the GPU can access the physical memory space on the CPU side, and take advantage of the host OS paging mechanism without explicit programmer effort. However, a key requirement for the guaranteed performance is effective hardware support of address translation. Particularly, we observe that GPU execution suffers from high TLB miss rates in a UVM environment, especially for irregular and/or memory-intensive applications. In this paper, we propose simple yet effective compression mechanisms for address translations to improve GPU TLB hit rates. Specifically, we explore and leverage the TLB compressibility during the execution of GPU applications to design efficient address translation compression with minimal runtime overhead. Experimental results across 22 applications indicate that our proposed approach significantly improves GPU TLB hit rates, which translate to 12% average performance improvement. Particularly, for 16 irregular and/or memory-intensive applications, the performance improvements achieved reach up to 69.2%, with an average of 16.3%.

“…Similar to the cache hierarchy, the TLB hierarchy on a GPU consists of multiple levels [41]. Each GPU core or compute unit (CU) is equipped with a private L1-TLB that is typically fully associative to eliminate conflict misses [9,41]. The L1-TLBs are typically backed by a larger L2-TLB, which is shared between all the available CUs in the GPU and is usually multi-ported to allow for concurrent lookups [9].…”

Section: Virtual Address Translation in GPUs (mentioning)

confidence: 99%

“…The process starts with the execution of a memory instruction by a CU, triggering a memory request to an L1 cache. (1) On a translation request from the CU, that misses in both the L1-TLB and L2-TLB, the request is forwarded to the page walk buffer on the IOMMU which is located on the CPU die [41,50]. Once a hardware page table walker is available, this request is picked up by the walker to perform a page table walk.…”

Section: Virtual Address Translation in GPUs (mentioning)

confidence: 99%
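
The translation path described in the two statements above can be sketched as a small model. This is only an illustrative sketch, not the design from any of the cited papers: the class and function names (`TLB`, `translate`, `service_one_walk`), the LRU replacement policy, the first-come-first-served walker, and the choice to fill both TLB levels after a walk are all assumptions made for the example.

```python
from collections import OrderedDict, deque


class TLB:
    """Tiny TLB model mapping virtual page numbers (VPNs) to physical frames.

    LRU replacement is an assumption of this sketch, not taken from the source.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def fill(self, vpn, pfn):
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def translate(vpn, l1_tlb, l2_tlb, walk_buffer):
    """Return (pfn, level) on a TLB hit, or (None, "queued") on a double miss."""
    pfn = l1_tlb.lookup(vpn)
    if pfn is not None:
        return pfn, "L1"
    pfn = l2_tlb.lookup(vpn)
    if pfn is not None:
        l1_tlb.fill(vpn, pfn)  # refill the private L1-TLB from the shared L2-TLB
        return pfn, "L2"
    # Missed in both levels: forward the request to the page walk buffer.
    walk_buffer.append(vpn)
    return None, "queued"


def service_one_walk(walk_buffer, page_table, l1_tlb, l2_tlb):
    """An available walker picks up the oldest pending request (FCFS assumed)."""
    if not walk_buffer:
        return None
    vpn = walk_buffer.popleft()
    pfn = page_table[vpn]  # stands in for the multi-level page table walk
    l2_tlb.fill(vpn, pfn)
    l1_tlb.fill(vpn, pfn)
    return vpn, pfn
```

In this model the order in which `service_one_walk` drains the buffer is the only scheduling decision; a first-come-first-served queue is used here purely as a placeholder.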

Valkyrie

Baruah, Mojumder, Abellán, et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Programming on a GPU has been made considerably easier with the introduction of Virtual Memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with some additional costs and overhead, with the largest being from the support for address translation. The fact that a massive number of threads run concurrently on a GPU means that the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB misses can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores, following similar memory access patterns. To support the inherent sharing present in GPU workloads, we propose Valkyrie, an integrated cooperative TLB prefetching mechanism and an inter L1-TLB probing scheme that can efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie is able to achieve an average speedup of 1.95×, while adding modest hardware overhead. CCS CONCEPTS • Computing methodologies → Graphics processors; • Software and its engineering → Virtual memory.