Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors

Ferdman, Michael; Adileh, Almutaz; Kocberber, Onur; Volos, Stavros; Alisafaee, Mohammad; Jevdjic, Djordje; Kaynak, Cansu; Popescu, Adrian; Ailamaki, Anastasia; Falsafi, Babak

doi:10.1145/2382553.2382557

Cited by 28 publications

(7 citation statements)

References 24 publications


“…Ferdman et al. have demonstrated the mismatch between cloud workloads and modern out-of-order cores [11,12]. Through their detailed analysis of scale-out workloads on modern cores, they discovered several important characteristics of these workloads: 1) scale-out workloads suffer from high instruction cache miss rates, and large instruction caches and prefetchers are inadequate; 2) instruction- and memory-level parallelism are low, thus leaving the advanced out-of-order core underutilized; 3) the working set sizes exceed the capacity of the on-chip caches; 4) bandwidth utilization of scale-out workloads is low.…”

Section: Characterizing Cloud Workloads (mentioning)

confidence: 99%

Integrated 3D-stacked server designs for increasing physical density of key-value stores

Gutierrez, Cieslak, Giridhar, et al. 2014

Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems


Key-value stores, such as Memcached, have been used to scale web services since the beginning of the Web 2.0 era. Data center real estate is expensive, and several industry experts we have spoken to have suggested that a significant portion of their data center space is devoted to key-value stores. Despite its widespread use, there is little in the way of hardware specialization for increasing the efficiency and density of Memcached; it is currently deployed on commodity servers that contain high-end CPUs designed to extract as much instruction-level parallelism as possible. Out-of-order CPUs, however, have been shown to be inefficient when running Memcached. To address Memcached efficiency issues, we propose two architectures using 3D stacking to increase data storage efficiency. Our first 3D architecture, Mercury, consists of stacks of ARM Cortex-A7 cores with 4GB of DRAM, as well as NICs. Our second architecture, Iridium, replaces DRAM with NAND Flash to improve density. We explore, through simulation, the potential efficiency benefits of running Memcached on servers that use 3D-stacking to closely integrate low-power CPUs with NICs and memory. With Mercury we demonstrate that density may be improved by 2.9×, power efficiency by 4.9×, throughput by 10×, and throughput per GB by 3.5× over a state-of-the-art server running optimized Memcached. With Iridium we show that density may be increased by 14×, power efficiency by 2.4×, and throughput by 5.2×, while still meeting latency requirements for a majority of requests.


“…Replacing server memory with lower-bandwidth mobile DRAM results in between zero and 1.55x performance degradation of workloads such as SPEC-CPU, PARSEC, and SPEC-OMP [33]. However, most cloud workloads severely underutilize the available memory bandwidth [17,33], even during peak times. Ferdman et al. show that the per-core off-chip bandwidth utilization of map-reduce, media streaming, web front end, and web search is at most 25% of the available bandwidth.…”

Section: Cost Of Non-interleaved Address Mapping (mentioning)

confidence: 99%

DIMMer

Zhang, Ehsan, Ferdman, et al. 2014

Proceedings of the ACM Symposium on Cloud Computing

Self Cite


Lack of energy proportionality in server systems results in significant waste of energy when operating at low utilization, a common scenario in today's data centers. We propose DIMMer, an approach to eliminate the idle power consumption of unused system components, motivated by two key observations. First, even in their lowest-power states, the power consumption of server components remains significant. Second, unused components can be powered off entirely without sacrificing server availability. We demonstrate that unused memory capacity can be powered off, eliminating the energy waste of self-refresh for unallocated memory, while still allowing for all capacity to be available on a moment's notice. Similarly, only one CPU socket must remain powered on, allowing unused CPUs and attached memory to be powered off entirely. The DIMMer vision can improve energy proportionality and achieve energy savings. Using a Google cluster trace as well as in-house experiments, we estimate up to 50% savings on DRAM and 18.8% on CPU background energy. At $0.10/kWh, this corresponds to 0.6% of total data center cost.
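To make the $0.10/kWh figure in the abstract above concrete, the following minimal sketch converts an average power saving into a yearly electricity cost. The function name, the 100 W input, and the assumption of a constant saving over a 365-day year are illustrative only; none of them come from the DIMMer paper, and the sketch does not attempt to reproduce the paper's 0.6% estimate.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def annual_energy_cost(avg_power_saved_watts, price_per_kwh=0.10):
    """Return the yearly cost (in dollars) of a constant power draw.

    avg_power_saved_watts: hypothetical average power saving in watts.
    price_per_kwh: electricity price; 0.10 matches the abstract's $0.10/kWh.
    """
    # Convert watts to kilowatts, then to kilowatt-hours over one year.
    kwh_per_year = avg_power_saved_watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh


if __name__ == "__main__":
    # Hypothetical example: a server that saves 100 W on average.
    print(f"${annual_energy_cost(100):.2f} per server per year")
```

Turning such a per-server figure into a share of total data center cost would additionally require cost data that this page does not provide.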


“…As server workloads operate on a large volume of data, they produce active memory working sets that dwarf the capacity-limited on-chip caches of server processors and reside in the off-chip memory; hence, these applications frequently miss the data in the on-chip caches and access the long-latency memory to retrieve it. Such frequent data misses preclude server processors from reaching their peak performance because cores are idle waiting for the data to arrive [1,4,12-24].…”

Section: Introduction (mentioning)

confidence: 99%

A Survey on Recent Hardware Data Prefetching Approaches with An Emphasis on Servers

Bakhshalipour, Shakerinava, Golshan, et al. 2020

Preprint


Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we discuss the fundamental concepts in data prefetching and study state-of-the-art hardware data prefetching approaches.
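As a rough illustration of the idea described in the abstract above (predict a future access and fetch it before it is needed), the sketch below simulates a simple stride prefetcher in front of a tiny LRU cache. The stride scheme is one classic textbook approach chosen here for brevity; it is not claimed to be any particular design from the survey, and the cache model, capacity, and all names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache of block addresses with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def contains(self, block):
        return block in self.blocks

    def touch(self, block):
        # Insert the block (or refresh it) as most recently used,
        # evicting the least recently used block if over capacity.
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)


def simulate(accesses, capacity=8, prefetch=True):
    """Return the number of demand misses for a sequence of block addresses."""
    cache = LRUCache(capacity)
    last_block = None
    last_stride = None
    misses = 0
    for block in accesses:
        if not cache.contains(block):
            misses += 1
        cache.touch(block)
        if last_block is not None:
            stride = block - last_block
            # Prefetch only when the same non-zero stride is seen twice in a row.
            if prefetch and stride != 0 and stride == last_stride:
                cache.touch(block + stride)
            last_stride = stride
        last_block = block
    return misses


if __name__ == "__main__":
    stream = list(range(64))  # hypothetical sequential access stream
    print("misses without prefetching:", simulate(stream, prefetch=False))
    print("misses with stride prefetching:", simulate(stream, prefetch=True))
```

On a purely sequential stream this toy model misses only while the stride is being learned, whereas with prefetch=False every access misses. Real server workloads have far less regular access patterns, which is why more elaborate prefetchers are studied.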
