2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2014.6835965
Supporting x86-64 address translation for 100s of GPU lanes

Cited by 112 publications (98 citation statements) · References 28 publications
“…This is achieved by removing the private TLBs from the SMs and instead using virtual L1s and a single, shared TLB close to the GPU L2 (with a configuration similar to the GPU L2-TLB of VI Hammer) for the translations of the entire cluster. A conservative area analysis using Cacti shows that our MMU requires at least 49% less area compared to an approach that has private TLBs, such as the one proposed by [Power et al 2014] and used by [Power et al 2015]. For VI Hammer we model 16 private, fully associative TLBs with 32 entries each and a shared 1024-entry, 32-way set-associative TLB, while VIPS-G uses only a single 1024-entry, 32-way set-associative extended TLB; that is, in the latter we also account for the extra area of the classification, owner, and V/I bits.…”
Section: Area Reduction Analysis (mentioning)
Confidence: 99%
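As a rough illustration of the organizational difference behind that comparison, the sketch below simply counts translation entries in the two configurations described in the excerpt. This is only back-of-envelope arithmetic under the stated entry and associativity parameters; the cited 49% figure comes from a Cacti area model (which also accounts for the extra per-entry state), not from this count.

```python
# Illustrative sketch only: total TLB entries in the two MMU organizations
# described above. The cited 49% area saving comes from a Cacti model,
# which this back-of-envelope count does not reproduce.

def total_entries(private_tlbs, private_entries, shared_entries):
    """Total translation entries across private per-SM TLBs plus a shared TLB."""
    return private_tlbs * private_entries + shared_entries

# VI Hammer-style organization: 16 private 32-entry fully associative TLBs
# plus a shared 1024-entry, 32-way set-associative TLB.
vi_hammer = total_entries(private_tlbs=16, private_entries=32, shared_entries=1024)

# VIPS-G-style organization: a single 1024-entry, 32-way extended TLB
# (extra per-entry classification/owner/V-I state is ignored here).
vips_g = total_entries(private_tlbs=0, private_entries=0, shared_entries=1024)

print(f"VI Hammer entries: {vi_hammer}")                 # 16*32 + 1024 = 1536
print(f"VIPS-G entries:    {vips_g}")                    # 1024
print(f"Entry reduction:   {1 - vips_g / vi_hammer:.0%}")  # ~33% fewer entries
```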
“…In contrast with the [Power et al 2014] proposal, which relies on private TLBs at every GPU SM in coordination with a highly multi-threaded page walker for the translations, we use a single shared TLB for the whole GPU, attached to the L2, and virtual (VIVT) addressing for the GPU L1s. This is possible with the use of a coherence protocol such as VIPS-G, which is based on self-invalidation and therefore does not involve upward traffic as in the case of SC protocols.…”
Section: Related Work (mentioning)
Confidence: 99%
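To make the two translation paths contrasted above concrete, here is a minimal Python toy model. It is not taken from either paper: the page size, the FIFO eviction, and the page_walk() stub are assumptions for illustration. Path A translates at the SM before a physically addressed L1 access; Path B uses a virtually addressed L1 and defers translation to a single TLB shared at the GPU L2.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages

class SetAssocTLB:
    """Toy set-associative TLB: maps virtual page numbers to physical frame numbers."""
    def __init__(self, entries, ways):
        self.ways = ways
        self.num_sets = entries // ways
        self.sets = [dict() for _ in range(self.num_sets)]

    def lookup(self, vpn):
        return self.sets[vpn % self.num_sets].get(vpn)

    def fill(self, vpn, pfn):
        s = self.sets[vpn % self.num_sets]
        if len(s) >= self.ways:            # naive FIFO-style eviction
            s.pop(next(iter(s)))
        s[vpn] = pfn

def page_walk(vpn):
    return vpn ^ 0x80000                   # stand-in for a real x86-64 page-table walk

def translate(tlb, vpn):
    pfn = tlb.lookup(vpn)
    if pfn is None:
        pfn = page_walk(vpn)
        tlb.fill(vpn, pfn)
    return pfn

# Path A (per-SM private TLBs, as described for Power et al. 2014): every SM
# translates before its L1 access; misses go to a shared TLB, then the walker.
def access_with_private_tlb(private_tlb, shared_tlb, vaddr):
    vpn, off = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
    pfn = private_tlb.lookup(vpn)
    if pfn is None:
        pfn = translate(shared_tlb, vpn)
        private_tlb.fill(vpn, pfn)
    return (pfn << PAGE_SHIFT) | off

# Path B (VIPS-G-style): the L1 is virtual (VIVT), so translation is needed
# only when a request leaves the L1 toward the L2, at one GPU-wide shared TLB.
def access_with_virtual_l1(shared_tlb, vaddr, l1_hit):
    if l1_hit:
        return None                        # virtual-L1 hit: no translation at all
    vpn, off = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
    return (translate(shared_tlb, vpn) << PAGE_SHIFT) | off

# Example configuration matching the sizes quoted in the excerpts above.
private = SetAssocTLB(entries=32, ways=32)    # fully associative, one per SM
shared  = SetAssocTLB(entries=1024, ways=32)  # shared at the GPU L2
print(hex(access_with_private_tlb(private, shared, 0x7f1234567abc)))
print(hex(access_with_virtual_l1(shared, 0x7f1234567abc, l1_hit=False)))
```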