Another Trip to the Wall

Radulovic, Milan; Zivanovic, Darko; Daniel, Ruiz; Supinski, Bronis R. de; McKee, Sally A.; Radojković, Petar; Ayguadé, Eduard

doi:10.1145/2818950.2818955

Cited by 39 publications

(9 citation statements)

References 10 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our results (and results from previous studies) suggest that a future ARCHER service could even benefit from architectures where HBM-like technologies with limited capacity replace main memory, rather than using a hybrid solution (such as the MC-DRAM+DRAM seen on the Intel Xeon Phi). The reasoning here is that using HBM technologies as a main memory replacement allows applications to access the best performance without application code modifications whereas in the hybrid approach the only way to use the HBM without code modification is as an additional, large cache level, which can limit the performance gains available [6]. Another option would be to use a combination of processors with high memory bandwidth alongside processors with high memory capacity.…”

Section: Discussionmentioning

confidence: 99%

A Survey of Application Memory Usage on a National Supercomputer: An Analysis of Memory Requirements on ARCHER

Turner

McIntosh–Smith

2017

Lecture Notes in Computer Science

View full text Add to dashboard Cite

In this short paper we set out to provide a set of modern data on the actual memory per core and memory per node requirements of the most heavily used applications on a contemporary, national-scale supercomputer. This report is based on data from all jobs run on the UK national supercomputing service, ARCHER, a 118,000 core Cray XC30, in the 1 year period from 1 st July 2016 to 30 th June 2017 inclusive. Our analysis shows that 80% of all usage on ARCHER has a maximum memory use of 1 GiB/core or less (24 GiB/node or less) and that there is a trend to larger memory use as job size increases. Analysis of memory use by software application type reveals differences in memory use between periodic electronic structure, atomistic N-body, grid-based climate modelling, and grid-based CFD applications. We present an analysis of these differences, and suggest further analysis and work in this area. Finally, we discuss the implications of these results for the design of future HPC systems, in particular the applicability of high bandwidth memory type technologies.

show abstract

Section: Discussionmentioning

confidence: 99%

A Survey of Application Memory Usage on a National Supercomputer: An Analysis of Memory Requirements on ARCHER

Turner

McIntosh–Smith

2017

Lecture Notes in Computer Science

View full text Add to dashboard Cite

show abstract

“…To understand cloud workloads' memory bandwidth utilization trend, we show the fleet-wide memory bandwidth consumption increase over the three generations of servers in Figure 4. With the most recent generation (Gen 5), the 1-minute average memory bandwidth utilization shows that the majority of the fleet is While mirco-benchmarks can generally drive memory bandwidth utilization to >80%, we observe that production workloads rarely exceed 60% memory bandwidth utilization, as any further increase results in an exponential increase in memory latency [32]. Thus, we classify workloads with higher than 60% bandwidth utilization as memory bandwidth bound (shaded red in Figure 4).…”

Section: Memory Bandwidth Scaling Challengementioning

confidence: 95%

Workload Behavior Driven Memory Subsystem Design for Hyperscale

Mahar¹,

Wang²,

Shu³

et al. 2023

Preprint

View full text Add to dashboard Cite

Hyperscalars run services across a large fleet of servers, serving billions of users worldwide. However, these services exhibit different behaviors compared to commonly available benchmark suites, leading to server architectures that are suboptimal for cloud workloads. As datacenters emerge as the primary server processor market, optimizing server processors for cloud workloads by better understanding their behavior is an area of interest. To address this, we present MemProf, a memory profiler that profiles the three major reasons for stalls in cloud workloads: code-fetch, memory bandwidth, and memory latency. We use MemProf to understand the behavior of cloud workloads at Meta and propose and evaluate micro-architectural and memory system design improvements that help cloud workloads' performance.MemProf's code analysis shows that cloud workloads at Meta execute the same code across CPU cores. Using this, we propose shared micro-architectural structures-a shared L2 I-TLB and a shared L2 cache. Next, to help with memory bandwidth stalls, using workloads' memory bandwidth distribution, we find that only a few pages contribute to most of the system bandwidth. We use this finding to evaluate a new high-bandwidth, small-capacity memory tier and show that it performs 1.46× better than the current baseline configuration. Finally, we look into ways to improve memory latency for cloud workloads. Profiling using MemProf reveals that L2 hardware prefetchers, which are commonly used to reduce memory latency, have very low coverage and consume a significant amount of memory bandwidth. To help improve future hardware prefetcher performance, we built an efficient memory tracing tool to collect and validate production memory access traces. Our memory tracing tool adds significantly less overhead than DynamoRIO, enabling tracing production workloads.

show abstract

“…The main drawback of this technique is the decreased computing speed that leads to increased application run times. This issue is partially mitigated because several HPC applications and benchmarks are not CPU-bound but present a memory and I/O bottleneck (Marjanović et al, 2014;Radulovic et al, 2015;Zivanovic et al, 2017); reducing the frequency of the computing units used by these jobs does not impact severely their time-to-solution (TtS) (Auweter et al, 2014). While in the rest of the article, we will refer explicitly to frequency scaling, our conclusions can also be applied to systems using Intel's Running Average Power Limit (RAPL) (David et al, 2010), that does not directly change the computing nodes clock frequency but enforces a socket-level power cap.…”

Section: Introductionmentioning

confidence: 99%

Pricing schemes for energy-efficient HPC systems: Design and exploration

Borghesi

Bartolini

Milano

et al. 2018

The International Journal of High Performance Computing Applica

View full text Add to dashboard Cite

Energy efficiency is of paramount importance for the sustainability of HPC systems. Energy consumption limits the peak performance of supercomputers and accounts for a large share of total cost of ownership. Consequently, system owners and final users have started exploring mechanisms to trade off performance for power consumption, for example through frequency and voltage scaling.However, only a limited number of studies have been devoted to explore the economic viability of performance scaling solutions and to devise pricing mechanisms fostering a more energy-conscious usage of resources, without adversely impacting return-of-investment on the HPC facility. We present a parametrized model to analyze the impact of frequency scaling on energy and to assess the potential total cost benefits for the HPC facility and the user. We evaluate four pricing schemes, considering both facility manager and the user perspectives. We then perform a design space exploration considering current and near-future HPC systems and technologies.

show abstract

Another Trip to the Wall

Cited by 39 publications

References 10 publications

A Survey of Application Memory Usage on a National Supercomputer: An Analysis of Memory Requirements on ARCHER

A Survey of Application Memory Usage on a National Supercomputer: An Analysis of Memory Requirements on ARCHER

Workload Behavior Driven Memory Subsystem Design for Hyperscale

Pricing schemes for energy-efficient HPC systems: Design and exploration

Contact Info

Product

Resources

About