2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca52012.2021.00016

Exploiting Page Table Locality for Agile TLB Prefetching

Abstract: Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation performance bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, TLB prefetching is a costly technique that may undermine performance when the prefetches are no…
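
To make the overhead the abstract refers to concrete, below is a minimal sketch of a 4-level x86-64-style radix page walk in C: every translation that misses in the TLBs, whether a demand access or a prefetch, costs up to four additional memory accesses. The memory layout and the read_phys() helper are hypothetical illustrations, not the paper's mechanism, and huge pages and page-walk caches are ignored.

```c
/* Minimal sketch of a 4-level x86-64-style software page walk.
 * Illustrative only: the read_phys() helper and memory layout are
 * hypothetical, not the paper's implementation. Huge pages ignored. */
#include <stdint.h>
#include <stdbool.h>

#define LEVELS     4      /* PML4, PDPT, PD, PT            */
#define IDX_BITS   9      /* 512 entries per 4 KB table    */
#define PAGE_SHIFT 12

/* Hypothetical helper: reads one 8-byte page-table entry from memory. */
extern uint64_t read_phys(uint64_t paddr);

/* Walks the radix page table for 'vaddr'. Each call performs up to
 * LEVELS memory accesses, which is the per-prefetch cost the abstract
 * describes. */
static bool page_walk(uint64_t root, uint64_t vaddr, uint64_t *leaf_pte)
{
    uint64_t table = root;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (vaddr >> (PAGE_SHIFT + level * IDX_BITS)) & 0x1FF;
        uint64_t entry = read_phys(table + idx * sizeof(uint64_t)); /* one access */
        if (!(entry & 1))            /* present bit clear: translation fault */
            return false;
        if (level == 0) {            /* leaf level: this is the PTE itself */
            *leaf_pte = entry;
            return true;
        }
        table = entry & ~0xFFFULL;   /* base of the next-level table */
    }
    return false;                    /* unreachable */
}
```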

Cited by 15 publications (13 citation statements)
References 52 publications

Citation statements:

“…In x86-64 architectures, the cache line size is 64 bytes and each PTE occupies precisely 8 bytes. As a result, a single 64-byte cache line can accommodate up to 8 contiguously stored PTEs [37,69,76,79]. When a requested PTE is read from memory, it is grouped with 7 neighboring PTEs and they are stored into a 64-byte cache line.…”
Section: Virtual Memory Subsystem
confidence: 99%
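
A minimal sketch of the locality this excerpt describes, assuming 64-byte cache lines and 8-byte PTEs: the PTEs of eight virtually contiguous pages fall into the same cache line, so a demand PTE fill implicitly brings seven neighbors along. The helper below is illustrative, not code from the cited papers.

```c
/* Sketch of PTE cache-line locality: with 64-byte lines and 8-byte PTEs,
 * the PTEs of 8 virtually contiguous pages share one cache line.
 * Illustrative only, not from the cited papers. */
#include <stdint.h>
#include <stdio.h>

#define PTES_PER_LINE 8   /* 64-byte line / 8-byte PTE */

/* Fills 'neighbors' with the VPNs whose PTEs land in the same
 * 64-byte cache line as the demanded VPN. */
static void same_line_vpns(uint64_t vpn, uint64_t neighbors[PTES_PER_LINE])
{
    uint64_t line_base = vpn & ~(uint64_t)(PTES_PER_LINE - 1); /* align down to 8 */
    for (int i = 0; i < PTES_PER_LINE; i++)
        neighbors[i] = line_base + i;
}

int main(void)
{
    uint64_t n[PTES_PER_LINE];
    same_line_vpns(0x12345, n);   /* demand miss on VPN 0x12345 */
    for (int i = 0; i < PTES_PER_LINE; i++)
        printf("VPN 0x%llx shares the PTE cache line\n",
               (unsigned long long)n[i]);
    return 0;
}
```
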
“…Figure 1 depicts the operation of a system with STLB prefetching, considering the most common scenario whereby a Prefetch Buffer (PB) is used to store the prefetched PTEs and the prefetch logic is engaged on STLB misses [26,53,79]. When an instruction or data memory access occurs, the corresponding first-level TLB is looked up and, on a miss, the STLB is probed.…”
Section: Translation Prefetching
confidence: 99%
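
The lookup flow described in this excerpt can be sketched as follows, assuming a first-level TLB, an STLB, and a Prefetch Buffer (PB) probed on STLB misses. All structure and function names are hypothetical placeholders, not the paper's design.

```c
/* Sketch of the lookup flow in the excerpt above: first-level TLB, then
 * STLB, with a Prefetch Buffer (PB) consulted and the prefetcher engaged
 * on STLB misses. Names are illustrative placeholders. */
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical lookup/insert helpers for each hardware structure. */
extern bool l1_tlb_lookup(uint64_t vpn, uint64_t *pte);
extern bool stlb_lookup(uint64_t vpn, uint64_t *pte);
extern bool pb_lookup(uint64_t vpn, uint64_t *pte);
extern void l1_tlb_insert(uint64_t vpn, uint64_t pte);
extern void stlb_insert(uint64_t vpn, uint64_t pte);
extern uint64_t demand_page_walk(uint64_t vpn);   /* demand walk            */
extern void issue_prefetches(uint64_t miss_vpn);  /* prefetcher entry point */

uint64_t translate(uint64_t vpn)
{
    uint64_t pte;

    if (l1_tlb_lookup(vpn, &pte))      /* first-level TLB hit */
        return pte;

    if (stlb_lookup(vpn, &pte)) {      /* STLB hit */
        l1_tlb_insert(vpn, pte);
        return pte;
    }

    issue_prefetches(vpn);             /* prefetch logic engaged on STLB miss */

    if (pb_lookup(vpn, &pte)) {        /* prefetched PTE already in the PB */
        stlb_insert(vpn, pte);
    } else {
        pte = demand_page_walk(vpn);   /* demand page walk */
        stlb_insert(vpn, pte);
    }
    l1_tlb_insert(vpn, pte);
    return pte;
}
```

One common rationale for holding prefetched PTEs in a separate PB rather than inserting them directly into the STLB is that inaccurate prefetches then do not evict useful STLB entries, which matches the abstract's concern about prefetches that hurt performance.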