2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca47549.2020.00055

Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems

Cited by 33 publications (13 citation statements)
References 36 publications
“…All of our experiments are run with a 4KB page size, which is the common page size used in prior studies on address translation hardware design on GPUs [9,41,42]. While larger pages (e.g., 2MB) have the potential of reducing L1-TLB misses, they have large page migration latencies [8,11,19] and can also increase the average number of stalled wavefronts on TLB misses to 100% [8,9] and hence are not always optimal to use.…”
Section: Evaluation Methodology
confidence: 99%
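The page-size tradeoff quoted above can be made concrete with back-of-envelope arithmetic: raw migration time scales linearly with page size, so a 2MB page costs 512x the wire time of a 4KB page. A minimal sketch, assuming a hypothetical 16 GB/s inter-device link (the bandwidth figure is an illustrative assumption, not taken from the cited works):

```python
# Back-of-envelope page migration transfer time.
# LINK_BW is an assumed interconnect bandwidth, not a figure from the paper.
LINK_BW = 16e9  # bytes per second (hypothetical)

def transfer_time_us(page_bytes: int, bw: float = LINK_BW) -> float:
    """Raw wire time to move one page, in microseconds (ignores setup cost)."""
    return page_bytes / bw * 1e6

small = transfer_time_us(4 * 1024)         # 4KB page
large = transfer_time_us(2 * 1024 * 1024)  # 2MB page
print(f"4KB page: {small:.3f} us, 2MB page: {large:.3f} us")
```

This ignores fault handling, TLB invalidation, and setup latency, all of which add further fixed cost per migration; the point is only the linear scaling with page size.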
“…They eliminate the need to perform explicit memory copies, as the GPU driver and the runtime handle all page transfers to/from the GPU. They lower programmer burden by managing CPU-to-GPU and GPU-to-GPU data transfers [1,11] and support oversubscription of memory [29,34]. To support all these capabilities, GPU vendors have added virtual memory support, providing the required hardware and software.…”
Section: Introduction
confidence: 99%
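The quote describes the driver/runtime handling page transfers and memory oversubscription transparently. A toy sketch of the bookkeeping such a runtime might do, with LRU eviction of device-resident pages when capacity is exceeded (all names and the policy are hypothetical simplifications; real drivers are far more involved):

```python
# Toy sketch of unified-memory-style page management with oversubscription:
# on access, a missing page is migrated onto the device; if device memory is
# full, the least-recently-used page is evicted back to the host.
# Purely illustrative; not any vendor's actual algorithm.
from collections import OrderedDict

class ToyUMDevice:
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.resident = OrderedDict()  # page id -> True, kept in LRU order
        self.migrations = 0
        self.evictions = 0

    def access(self, page: int) -> None:
        if page in self.resident:
            self.resident.move_to_end(page)    # hit: refresh LRU position
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict LRU page to host
            self.evictions += 1
        self.resident[page] = True             # migrate page onto device
        self.migrations += 1

gpu = ToyUMDevice(capacity_pages=2)
for p in [0, 1, 0, 2, 1]:  # working set of 3 pages, capacity 2
    gpu.access(p)
print(gpu.migrations, gpu.evictions)  # -> 4 2
```

Oversubscription shows up directly in the counters: the 3-page working set on a 2-page device forces repeated migrations and evictions that explicit-copy programming would surface to the programmer instead.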
“…Critical mechanisms for UM are prefetching, page eviction due to memory oversubscription, and page migration between GPUs. The works of [Agarwal et al. 2015; Baruah et al. 2020; Ganguly et al. 2019, 2020; Young et al. 2018] proposed new algorithms to improve UM performance in the case of transparent memory management. In contrast, our approach controls page placement and replication manually, based on analysis of memory access patterns.…
Section: CUDA Unified Memory for Multi-GPU Systems
confidence: 99%
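A minimal sketch of the kind of access-pattern analysis this contrast alludes to: scan a trace of page accesses and flag pages that are almost exclusively read as candidates for replication across GPUs (the 0.95 read-fraction threshold and all names are invented for illustration):

```python
# Classify pages as read-mostly (replication candidates) from an access trace.
# The 0.95 threshold and every name here are illustrative assumptions only.
from collections import defaultdict

def replication_candidates(trace, read_threshold=0.95):
    """trace: iterable of (page_id, op) pairs with op in {'R', 'W'}."""
    reads = defaultdict(int)
    total = defaultdict(int)
    for page, op in trace:
        total[page] += 1
        if op == 'R':
            reads[page] += 1
    return sorted(p for p in total if reads[p] / total[p] >= read_threshold)

trace = [(0, 'R')] * 20 + [(1, 'R')] * 10 + [(1, 'W')] * 5 + [(2, 'W')]
print(replication_candidates(trace))  # -> [0]: only page 0 is read-only
```

Replicating a read-mostly page lets every GPU hit it locally; a written page cannot be cheaply replicated because every write would have to be propagated to all copies.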
“…Hardware-based TLB shootdown. There have been a number of approaches to handle the problem of TLB cache coherence at the hardware layer [7,10,12,42,43,48,49,51,60,62]. Several of these hardware-based approaches attempt to squeeze performance using non-traditional TLB designs, such as multi-level TLB hierarchies.…”
Section: Related Work
confidence: 99%
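To make the TLB-coherence problem behind shootdowns concrete, a toy model: each core caches virtual-to-physical translations, so remapping a page on one core must invalidate the stale copies cached by every other core (structure and names are illustrative only; real shootdowns are delivered via inter-processor interrupts):

```python
# Toy model of TLB shootdown: remapping a page on one core must invalidate
# stale cached translations on all other cores. Illustrative only.

class ToyCore:
    def __init__(self):
        self.tlb = {}  # virtual page number -> physical frame number

def remap(cores, initiator, vpage, new_frame):
    """Update the mapping on `initiator`, invalidating stale remote entries."""
    shootdowns = 0
    for core in cores:
        if core is not initiator and vpage in core.tlb:
            del core.tlb[vpage]  # invalidate the stale entry (the "shootdown")
            shootdowns += 1
    initiator.tlb[vpage] = new_frame
    return shootdowns  # remote invalidations needed for this remap

cores = [ToyCore() for _ in range(4)]
for c in cores:
    c.tlb[0x10] = 0xAA  # all four cores cache the same translation
print(remap(cores, cores[0], 0x10, 0xBB))  # -> 3 remote cores invalidated
```

The cost grows with the number of cores sharing a mapping, which is why the hardware-based proposals cited above try to track or limit which TLBs actually hold a given translation.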