2014 IEEE International Symposium on Workload Characterization (IISWC)
DOI: 10.1109/iiswc.2014.6983034
Performance analysis of the memory management unit under scale-out workloads

Abstract: Much attention has been given to the efficient execution of the scale-out applications that dominate datacenter computing. However, the effects of hardware support in the Memory Management Unit (MMU), in combination with the distinct characteristics of scale-out applications, have been largely ignored until recently. In this paper, we comprehensively quantify the MMU overhead on a real machine, leveraging performance counters, on a collection of emerging scale-out applications. We show that t…


Cited by 36 publications (40 citation statements)
References 43 publications
“…Previous work shows that limited TLB reach results in costly page walks that degrade application performance, often substantially [10,13,14,23,29,31]. Section 2 described the qualitative differences between RMM and the most closely related work on multipage mappings (sub-blocked TLBs [47], CoLT [39], Clustered TLBs [38]), huge pages [1,6,36], and direct segments [10,23], and Section 8 showed quantitatively that RMM substantially improves over them.…”
Section: Related Work
confidence: 99%
“…This overhead comes from the increased latency of the write operation in the NEMsCAM cell. However, the write operation: (i) takes place only after TLB misses which occur rarely compared to TLB hits, and (ii) adds latency to an already slow operation, i.e., L2-TLB access (∼7 cycles [17]) including potentially the penalty of L2-TLB miss (several tens of cycles [20]). Consequently, the NEMsCAM TLBs have negligible impact on the execution time for most workloads (0.32% on average) while reducing significantly the energy spent on the TLB hierarchy.…”
Section: B Results
confidence: 99%
“…In case of a hit, the TLB returns the physical address, and the memory operation proceeds. In case of a miss, the operation stalls until the translation is retrieved from memory, which might take tens of cycles [20].…”
Section: Introduction
confidence: 99%
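The hit/miss behavior the passage above describes can be sketched as a toy model. This is an illustrative sketch, not code from the paper: the latency constants (1-cycle hit, 40-cycle walk penalty) and the dictionary-based TLB are assumptions chosen to match the "tens of cycles" figure quoted above.

```python
TLB_HIT_CYCLES = 1      # assumed TLB hit latency (illustrative)
PAGE_WALK_CYCLES = 40   # assumed miss penalty ("tens of cycles")
PAGE_SIZE = 4096        # 4 KiB base pages

def translate(tlb, vaddr, page_table):
    """Return (physical_address, cycles_spent) for one memory access."""
    vpn = vaddr // PAGE_SIZE
    offset = vaddr % PAGE_SIZE
    if vpn in tlb:                   # TLB hit: translation is cached
        return tlb[vpn] * PAGE_SIZE + offset, TLB_HIT_CYCLES
    pfn = page_table[vpn]            # TLB miss: fetch translation from memory
    tlb[vpn] = pfn                   # install the entry for later reuse
    return pfn * PAGE_SIZE + offset, TLB_HIT_CYCLES + PAGE_WALK_CYCLES

page_table = {0: 7, 1: 9}            # toy VPN -> PFN mapping
tlb = {}
_, first = translate(tlb, 0x123, page_table)   # cold access: miss + walk
_, second = translate(tlb, 0x456, page_table)  # same page: hit
print(first, second)  # 41 1
```

The asymmetry between the two costs is the point of the quoted passage: a single miss is dozens of times more expensive than a hit, so even a small miss ratio can dominate translation cost.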
“…In case of a TLB miss, a hardware state machine walks the page table, a process named page walk, and fetches the corresponding page table entry from memory. Thus, the TLB is the most crucial component for accelerating virtual memory, and its miss ratio significantly affects the performance of the processor [13,15,30,36].…”
Section: Address Translation Hardware Support
confidence: 99%
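The page walk described above can be made concrete with a sketch of how a walker decomposes a virtual address. The four-level split and 9-bit indices below follow the standard x86-64 radix page-table format; the example address is arbitrary, and the sketch only computes indices rather than modeling memory latency.

```python
LEVELS = 4          # PML4, PDPT, PD, PT (x86-64 four-level paging)
INDEX_BITS = 9      # 512 entries per table level
PAGE_SHIFT = 12     # 4 KiB pages

def walk_indices(vaddr):
    """Split a 48-bit virtual address into the per-level table indices
    a hardware page walker would use, top level first."""
    indices = []
    for level in range(LEVELS):
        shift = PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level)
        indices.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    return indices

# Each index selects one entry at its level, so a full walk costs
# LEVELS dependent memory accesses in the worst case.
print(walk_indices(0x7F1234567000))
```

Because each level's lookup depends on the previous one, the four accesses serialize, which is why a miss that walks the full table costs tens of cycles even when the table entries hit in cache.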
“…Only the recent work on TLB Pred [41] considers huge pages for improving the dynamic energy efficiency in TLBs. The performance of TLB Pred depends on huge pages successfully reducing misses, but prior work shows that huge pages can still incur high performance overheads due to TLB misses [13,15,36]. In response, researchers proposed techniques that further increase the TLB reach [13,22,35,42,43,50] to overcome the limitations of huge pages.…”
Section: Introduction
confidence: 97%
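The "TLB reach" that huge pages extend, as discussed above, is simple arithmetic: entries times page size. The sketch below illustrates it; the 1536-entry L2 TLB is an assumed figure for illustration, not a number taken from the paper.

```python
def tlb_reach_bytes(entries, page_size):
    """Memory mappable by a TLB at once: entries x page size."""
    return entries * page_size

KIB, MIB = 1024, 1024 * 1024
l2_entries = 1536                                  # assumed L2-TLB size

reach_4k = tlb_reach_bytes(l2_entries, 4 * KIB)    # 4 KiB base pages
reach_2m = tlb_reach_bytes(l2_entries, 2 * MIB)    # 2 MiB huge pages

print(reach_4k // MIB, reach_2m // MIB)  # 6 3072
```

With base pages the assumed TLB covers only 6 MiB, while 2 MiB huge pages raise reach to 3 GiB; yet scale-out workloads with working sets beyond a few gigabytes can still exceed even that, which is the limitation of huge pages the quoted passage points to.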