Hardware Translation Coherence for Virtualized Systems

Zi Yan; Ján Veselý; Guilherme Cox; Abhishek Bhattacharjee

doi:10.1145/3140659.3080211

Cited by 8 publications

(19 citation statements)

References 63 publications

“…As a result, in addition to the list of victim vCPUs approximated by the guest OS, the list of victim cores is approximated by the VMM. Combined with the flushing of translation structures upon a nested page table update, these approximations result in frequent needless evictions of unrelated translations [44]. TLB shootdown activity has been observed to be a significant bottleneck in prior studies [4,6,12,25,33,40,44] and is also confirmed by our own experiments.…”

Section: Introduction (supporting)

confidence: 83%

“…Current VMMs do not track the gVA of pages used by the guest. Since modern processors only permit invalidations of individual TLB entries when the gVA (for guest pages) is known, when the VMM updates the nested page table, translation structures are completely flushed [44]. Repopulating the flushed 2-dimensional page tables are very expensive.…”

Section: Introduction (mentioning)

confidence: 99%
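The full flush described in the statement above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the behavior of any particular processor or VMM: the class `ToyTLB`, the function `vmm_remap`, and the page numbers are hypothetical. The TLB is assumed to cache combined gVA-to-hPA translations and to be searchable by gVA only, so a nested page table change (keyed by gPA) cannot be matched to individual entries.

```python
class ToyTLB:
    """Toy TLB caching combined gVA -> hPA translations, indexed by gVA only."""

    def __init__(self):
        self.entries = {}  # gVA page number -> hPA page number

    def fill(self, gva, guest_pt, nested_pt):
        # Two-dimensional walk: guest table gives the gPA, nested table gives the hPA.
        gpa = guest_pt[gva]
        self.entries[gva] = nested_pt[gpa]

    def invalidate_gva(self, gva):
        # Selective invalidation is possible only when the gVA is known.
        self.entries.pop(gva, None)

    def flush_all(self):
        self.entries.clear()


def vmm_remap(tlb, nested_pt, gpa, new_hpa):
    """Hypothetical VMM update of one nested mapping (gPA -> hPA)."""
    nested_pt[gpa] = new_hpa
    # The VMM does not track which gVAs map to this gPA, so it cannot call
    # invalidate_gva(); the only safe option in this model is a full flush.
    tlb.flush_all()


guest_pt = {0x10: 0xA, 0x11: 0xB, 0x12: 0xC}      # gVA -> gPA
nested_pt = {0xA: 0x100, 0xB: 0x101, 0xC: 0x102}  # gPA -> hPA

tlb = ToyTLB()
for gva in guest_pt:
    tlb.fill(gva, guest_pt, nested_pt)
cached_before = len(tlb.entries)

vmm_remap(tlb, nested_pt, gpa=0xB, new_hpa=0x200)
cached_after = len(tlb.entries)
```

In this model a single remapped page evicts all three cached translations, two of which were unrelated to the update and must be refilled by fresh two-dimensional walks.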

Attc (@C)

Gugale, Gulur, Marathe, et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Heterogeneous memory systems are becoming popular, but they face significant translation coherence overheads caused by page remappings. Translation coherence, which is typically implemented in software, can consume up to 50% of the runtime for some applications in virtualized platforms. In this paper, we propose ATTC (Addressable TLB-based Translation Coherence), a hardware translation coherence scheme that eliminates almost all of the overheads associated with software-based coherence mechanisms and overcomes the challenges in existing hardware schemes. Unlike other proposals (HATRIC, UNITD) that require on-chip TLB tags to enforce coherence and can track only the last-level page table entries of either the guest or host page tables, ATTC tracks changes to both guest and host page tables without requiring any additional metadata in the L1 and L2 TLBs. ATTC enforces a "point of coherence" uniformly for both guest and host page table updates using an addressable TLB (ATLB) in DRAM akin to the one in [41]. An inverse mapping table (present in DRAM) that maps host physical pages to ATLB locations helps to precisely track translations. We study the proposed ATTC scheme in detail for an emerging hybrid memory organization (a mix of DRAM and NVM) and show that ATTC practically eliminates all translation coherence overheads, yielding an average improvement of 35.7% over a baseline software coherence scheme in a virtualized environment and 7.4% over the hardware HATRIC scheme.

CCS Concepts: Computer systems organization → Heterogeneous (hybrid) systems; Software and its engineering → Virtual memory.
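The role of the inverse mapping table mentioned in this abstract can be sketched as follows. This is an illustrative toy, not the ATTC design: the class `ToyATLB`, its direct-mapped slot choice, and the page numbers are assumptions made for the example. The only idea taken from the abstract is that a table from host physical pages to ATLB locations allows stale translations to be found and removed precisely instead of by a full flush.

```python
from collections import defaultdict


class ToyATLB:
    """Toy addressable TLB with an inverse map from host pages to slots."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # slot index -> (gva, hpa) or None
        self.inverse = defaultdict(set)  # hpa -> set of slot indices holding it

    def _evict(self, slot):
        old = self.slots[slot]
        if old is not None:
            _, old_hpa = old
            self.inverse[old_hpa].discard(slot)
            if not self.inverse[old_hpa]:
                del self.inverse[old_hpa]
            self.slots[slot] = None

    def insert(self, gva, hpa):
        slot = gva % len(self.slots)  # direct-mapped placement (assumption)
        self._evict(slot)
        self.slots[slot] = (gva, hpa)
        self.inverse[hpa].add(slot)
        return slot

    def invalidate_hpa(self, hpa):
        """Invalidate only the slots that translate to the given host page."""
        stale = sorted(self.inverse.get(hpa, ()))
        for slot in stale:
            self._evict(slot)
        return stale


atlb = ToyATLB(num_slots=8)
atlb.insert(0x10, 0x100)
atlb.insert(0x11, 0x100)  # a second virtual page aliasing the same host page
atlb.insert(0x12, 0x102)

invalidated = atlb.invalidate_hpa(0x100)
```

Here remapping host page `0x100` touches only the two slots that reference it, and the unrelated translation for `0x12` stays cached.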

“…Hardware-based TLB shootdown. There have been a number of approaches to handle the problem of TLB cache coherence at the hardware layer [7,10,12,42,43,48,49,51,60,62]. Several of these hardware-based approaches attempt to squeeze performance using non-traditional TLB designs, such as multi-level TLB hierarchies.…”

Section: Related Work (mentioning)

confidence: 99%

“…Furthermore, UNITD adds a costly content-addressable memory (CAM) to each TLB to perform reverse address translations when checking whether a page translation is present in a specific TLB, thereby greatly increasing the TLB's power consumption. HATRIC [62] is a hardware mechanism similar to UNITD and piggybacks translation coherence information using the existing cache coherence protocols.…”

Section: Related Work (mentioning)

confidence: 99%

ECO TLB

Maass, Kumar, Kim, et al. 2020

ACM Trans. Archit. Code Optim.

We propose ecoTLB, software-based eventual translation lookaside buffer (TLB) coherence, which eliminates the overhead of the synchronous TLB shootdown mechanism in operating systems that use address space identifiers (ASIDs). With eventual TLB coherence, ecoTLB improves the performance of free and page swap operations by removing the inter-processor interrupt (IPI) overheads incurred to invalidate TLB entries. We show that the TLB shootdown has implications for page swapping, particularly in emerging disaggregated data centers, and demonstrate that ecoTLB can improve both the performance and the specific swapping policy decisions using ecoTLB's asynchronous mechanism. We demonstrate that ecoTLB improves the performance of real-world applications, such as Memcached and Make, that perform page swapping using Infiniswap, a solution for next-generation data centers that use disaggregated memory, by up to 17.2%. Moreover, ecoTLB improves the 99th percentile tail latency of Memcached by up to 70.8% due to its asynchronous scheme and improved policy decisions. Furthermore, we show that recent features to improve security in the Linux kernel, like kernel page table isolation (KPTI), can result in significant performance overheads on architectures without support for specific instructions to clear single entries in tagged TLBs, falling back to full TLB flushes. In this scenario, ecoTLB is able to recover the performance lost for supporting KPTI due to its asynchronous shootdown scheme and its support for tagged TLBs. Finally, we demonstrate that ecoTLB improves the performance of free operations by up to 59.1% on a 120-core machine and improves the performance of Apache on a 16-core machine by up to 13.7% compared to baseline Linux, and by up to 48.2% compared to ABIS, a recent state-of-the-art research prototype that reduces the number of IPIs.

Latr

et al. 2018

We propose Latr, lazy TLB coherence, a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling the TLB coherence in a lazy fashion, Latr can avoid expensive IPIs which are required for delivering a shootdown signal to remote cores, and the performance overhead of associated interrupt handlers. Therefore, virtual memory operations, such as free and page migration operations, can benefit significantly from Latr's mechanism. For example, Latr improves the latency of munmap() by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from Latr: without any application-level changes, Latr improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
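The general idea of lazy shootdown can be sketched with a small model. The abstract does not spell out the mechanism, so everything below is an assumption made for illustration: the class `LazyShootdown`, the per-core pending lists, and the `tick` method standing in for a periodic point (such as a scheduler tick) at which a core processes deferred invalidations without receiving an IPI.

```python
class LazyShootdown:
    """Toy model: remote TLB invalidations are deferred instead of sent by IPI."""

    def __init__(self, num_cores):
        self.tlbs = [set() for _ in range(num_cores)]  # pages cached per core
        self.pending = [[] for _ in range(num_cores)]  # deferred invalidations

    def unmap(self, initiator, page):
        # The initiating core invalidates its own entry immediately.
        self.tlbs[initiator].discard(page)
        # Remote cores are not interrupted; the request is only recorded.
        for core in range(len(self.tlbs)):
            if core != initiator:
                self.pending[core].append(page)

    def tick(self, core):
        # Hypothetical periodic hook where a core drains its deferred work.
        for page in self.pending[core]:
            self.tlbs[core].discard(page)
        self.pending[core].clear()

    def safe_to_reuse(self, page):
        # The physical page may be reused only once no core can still hold
        # a stale translation for it.
        return all(page not in queue for queue in self.pending)


model = LazyShootdown(num_cores=3)
for core_tlb in model.tlbs:
    core_tlb.add(0x42)

model.unmap(initiator=0, page=0x42)
stale_on_remote = 0x42 in model.tlbs[1]
reusable_before = model.safe_to_reuse(0x42)

model.tick(1)
model.tick(2)
reusable_after = model.safe_to_reuse(0x42)
```

The trade-off shown is that the unmapping core returns without waiting for remote cores, at the cost of delaying reuse of the freed page until every core has drained its pending list.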
