Hyperscalers run services across a large fleet of servers, serving billions of users worldwide. However, these services exhibit different behaviors compared to commonly available benchmark suites, leading to server architectures that are suboptimal for cloud workloads. As datacenters emerge as the primary server processor market, optimizing server processors for cloud workloads by better understanding their behavior is an area of interest. To address this, we present MemProf, a memory profiler that profiles the three major reasons for stalls in cloud workloads: code-fetch, memory bandwidth, and memory latency. We use MemProf to understand the behavior of cloud workloads at Meta and propose and evaluate micro-architectural and memory system design improvements that help cloud workloads' performance. MemProf's code analysis shows that cloud workloads at Meta execute the same code across CPU cores. Using this, we propose shared micro-architectural structures: a shared L2 I-TLB and a shared L2 cache. Next, to help with memory bandwidth stalls, using workloads' memory bandwidth distribution, we find that only a few pages contribute to most of the system bandwidth. We use this finding to evaluate a new high-bandwidth, small-capacity memory tier and show that it performs 1.46× better than the current baseline configuration. Finally, we look into ways to improve memory latency for cloud workloads. Profiling using MemProf reveals that L2 hardware prefetchers, which are commonly used to reduce memory latency, have very low coverage and consume a significant amount of memory bandwidth. To help improve future hardware prefetcher performance, we built an efficient memory tracing tool to collect and validate production memory access traces. Our memory tracing tool adds significantly less overhead than DynamoRIO, enabling tracing of production workloads.
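The bandwidth-skew observation above (a few pages account for most system bandwidth) can be illustrated with a small sketch. This is not MemProf's implementation; it assumes only a hypothetical list of per-page access counts, such as one obtained from a sampled memory profile, and computes how many of the hottest pages cover a given share of traffic:

```python
# Hypothetical sketch: given per-page access counts from a sampled
# memory profile, find how few hot pages cover most of the bandwidth.
def pages_for_bandwidth_share(page_counts, share=0.8):
    """Return the number of hottest pages covering `share` of accesses."""
    counts = sorted(page_counts, reverse=True)   # hottest pages first
    total = sum(counts)
    running, n = 0, 0
    for c in counts:
        running += c
        n += 1
        if running >= share * total:
            break
    return n

# Synthetic, heavily skewed distribution: a few hot pages dominate.
counts = [1000, 900, 800, 50, 40, 30, 20, 10, 5, 5]
print(pages_for_bandwidth_share(counts, 0.8))  # 3 hottest pages suffice
```

A distribution like this is what motivates a small-capacity, high-bandwidth tier: pinning only the hottest pages there captures most of the bandwidth demand.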
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and efficiently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new efficient wear-leveling mechanism that remaps write accesses to random physical locations on the fly, and 2) a new efficient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10^8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1%, respectively.
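The wear-leveling idea (remapping write accesses to random physical locations on the fly) can be sketched in software. This toy model is not WoLFRaM's PRAD hardware; the class name, swap policy, and parameters are illustrative assumptions, meant only to show how randomized remapping spreads repeated writes to one logical block across many physical blocks:

```python
import random

# Hedged sketch (not WoLFRaM's actual PRAD): a toy wear-leveler that
# swaps a written block's physical location with a randomly chosen one,
# so repeated writes to one logical address spread across the device.
class RandomRemapper:
    def __init__(self, num_blocks, seed=0):
        self.l2p = list(range(num_blocks))   # logical -> physical map
        self.writes = [0] * num_blocks       # per-physical-block wear
        self.rng = random.Random(seed)

    def write(self, logical):
        # Swap the target's physical block with a random peer, then
        # charge the write to the new physical location.
        other = self.rng.randrange(len(self.l2p))
        self.l2p[logical], self.l2p[other] = self.l2p[other], self.l2p[logical]
        phys = self.l2p[logical]
        self.writes[phys] += 1
        return phys

mem = RandomRemapper(64)
for _ in range(10_000):
    mem.write(0)            # hammer a single logical block
print(max(mem.writes))      # max per-block wear is far below 10,000
```

Without remapping, one physical cell would absorb all 10,000 writes; with the random swap, wear per physical block stays near the 10,000 / 64 average, which is the effect a wear-leveling mechanism targets.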
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.