GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime

Jahre, Magnus; Eeckhout, Lieven

doi:10.1109/hpca.2018.00034

Cited by 16 publications

(28 citation statements)

References 63 publications

“…To predict the LLC misses for all possible replication degrees we devise a light-weight mechanism we call the Replication Degree Directory (RDD). The RDD is inspired by the Auxiliary Tag Directory (ATD) [55] which is an independent tag directory commonly used to predict per-application cache misses as a function of allocated ways in shared LLCs (see e.g., [27], [76]). Unlike an ATD, the RDD is (i) located within an MCrouter rather than an LLC slice, and (ii) predicts misses across replication degrees and not miss curves.…”

Section: B. Predicting LLC Misses (mentioning)

confidence: 99%
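The ATD idea referenced in the statement above can be illustrated with a minimal sketch. It assumes a single cache set with true LRU replacement, and it relies on the stack property of LRU: an access that hits at stack depth d would hit in any cache with more than d ways. The class and method names are hypothetical and chosen for illustration only; this is not the RDD from the citing paper nor the mechanism used in GDP.

```python
class AuxiliaryTagDirectory:
    """Tag-only LRU model of one cache set (hypothetical, illustrative names)."""

    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []                      # tags, most recently used first
        self.hits_at_depth = [0] * max_ways  # hit counter per LRU stack depth
        self.accesses = 0

    def access(self, tag):
        """Record one access and update the LRU stack."""
        self.accesses += 1
        if tag in self.stack:
            depth = self.stack.index(tag)
            self.hits_at_depth[depth] += 1
            self.stack.pop(depth)
        elif len(self.stack) == self.max_ways:
            self.stack.pop()                 # evict the least recently used tag
        self.stack.insert(0, tag)

    def misses_with_ways(self, ways):
        """Predicted misses if the application were given `ways` ways."""
        return self.accesses - sum(self.hits_at_depth[:ways])


atd = AuxiliaryTagDirectory(max_ways=4)
for tag in ["A", "B", "A", "C", "B", "A"]:
    atd.access(tag)

# Miss curve: predicted misses for 1, 2, 3 and 4 allocated ways.
print([atd.misses_with_ways(w) for w in range(1, 5)])
```

A real ATD would typically sample only a subset of sets to limit storage; that refinement is omitted here.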

“…Herrero et al [24] use distributed cache partitioning to optimize cache use, and MorphCache [60] dynamically alters the cache topology to enable sharing multiple cache slices between cores. GDP [27] allocates LLC capacity to processes based on slowdown predictions, while Rolan et al [57] propose adaptive set-granular cooperative caching. These works are not directly applicable to GPUs as they exploit that different threads (processes) in multi-threaded (multiprogrammed) workloads have different memory requirements.…”

Section: Related Work (mentioning)

confidence: 99%

Selective Replication in Memory-Side GPU Caches

Zhao

Jahre

Eeckhout

2020

2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Self Cite


Data-intensive applications put immense strain on the memory systems of Graphics Processing Units (GPUs). To cater to this need, GPU memory systems distribute requests across independent units to provide high bandwidth by servicing requests (mostly) in parallel. We find that this strategy breaks down for shared data structures because the shared Last-Level Cache (LLC) organization used by contemporary GPUs stores shared data in a single LLC slice. Shared data requests are hence serialized, resulting in data-intensive applications not being provided with the bandwidth they require. A private LLC organization can provide high bandwidth, but it is often undesirable since it significantly reduces the effective LLC capacity. In this work, we propose the Selective Replication (SelRep) LLC which selectively replicates shared read-only data across LLC slices to improve bandwidth supply while ensuring that the LLC retains sufficient capacity to keep shared data cached. The compile-time component of SelRep LLC uses dataflow analysis to identify read-only shared data structures and uses a special-purpose load instruction for these accesses. The runtime component of SelRep LLC then monitors the caching behavior of these loads. Leveraging an analytical model, SelRep LLC chooses a replication degree that carefully balances the effective LLC bandwidth benefits of replication against its capacity cost. SelRep LLC consistently provides high performance to replication-sensitive applications across different data set sizes. More specifically, SelRep LLC improves performance by 19.7% and 11.1% on average (and up to 61.6% and 31.0%) compared to the shared LLC baseline and the state-of-the-art Adaptive LLC, respectively.

“…Abstract system-level simulators have long been used in the architecture and design automation communities for performance estimation and analysis [11,19,27,38,39,42,44,49,52]. In particular, [12] used system simulation to evaluate the interaction between the OS and a 10 Gbit/s Ethernet NIC.…”

Section: Related Work (mentioning)

confidence: 99%

FirePerf

Karandikar

Amid

et al. 2020

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Syste


Achieving high performance when developing specialized hardware/software systems requires understanding and improving not only core compute kernels, but also intricate and elusive system-level bottlenecks. Profiling these bottlenecks requires both high-fidelity introspection and the ability to run sufficiently many cycles to execute complex software stacks, a challenging combination. In this work, we enable agile full-system performance optimization for hardware/software systems with FirePerf, a set of novel out-of-band system-level performance profiling capabilities integrated into the open-source FireSim FPGA-accelerated hardware simulation platform. Using out-of-band call stack reconstruction and automatic performance counter insertion, FirePerf enables introspecting into hardware and software at appropriate abstraction levels to rapidly identify opportunities for software optimization and hardware specialization, without disrupting end-to-end system behavior like traditional profiling tools. We demonstrate the capabilities of FirePerf with a case study that optimizes the hardware/software stack of an open-source RISC-V SoC with an Ethernet NIC to achieve 8× end-to-end improvement in achievable bandwidth for networking applications running on Linux. We also deploy a RISC-V Linux kernel optimization discovered with FirePerf on commercial RISC-V silicon, resulting in up to 1.72× improvement in network performance.

“…Enforcing fairness/QoS requires understanding how interference affects the performance of co-running applications. More specifically, we need to predict the performance reduction (slowdown) during multitasking (shared mode) compared to an ideal configuration (private mode) where the application runs alone with exclusive access to all compute and memory system resources [10]. Using shared mode quantities (e.g., shared mode bandwidth utilization) as proxies for private mode quantities (e.g., private mode bandwidth utilization) is typically inaccurate since interference can change application resource consumption significantly.…”

Section: Introduction (mentioning)

confidence: 99%
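The slowdown notion in the statement above can be made concrete with a small sketch. It assumes performance is measured as instructions per cycle (IPC) and that slowdown is the ratio of private-mode to shared-mode performance; the function name and the numbers are hypothetical and not taken from the citing paper.

```python
def slowdown(private_ipc, shared_ipc):
    """Slowdown of an application in shared mode relative to private mode.

    A value of 1.0 means no interference; 2.0 means the application runs
    at half of its interference-free performance.
    """
    if private_ipc <= 0 or shared_ipc <= 0:
        raise ValueError("IPC values must be positive")
    return private_ipc / shared_ipc


# Hypothetical numbers: the application reaches an IPC of 2.0 when running
# alone and an IPC of 1.25 when co-running with other applications.
print(slowdown(private_ipc=2.0, shared_ipc=1.25))
```

The difficulty the statement points to is that only the shared-mode IPC can be measured directly at runtime; the private-mode value has to be estimated.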

“…Broadly speaking, slowdown prediction models can be classified as white-box [10,11,14] versus black-box [15,16]. White-box models are derived from fundamental architectural insights which enables them to, in theory, precisely capture key performance-related behavior.…”

Section: Introduction (mentioning)

confidence: 99%

HSM

Zhao

Jahre

Eeckhout

2020

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Syste

Self Cite


Graphics Processing Units (GPUs) are increasingly widely used in the cloud to accelerate compute-heavy tasks. However, GPU-compute applications stress the GPU architecture in different ways, leading to suboptimal resource utilization when a single GPU is used to run a single application. One solution is to use the GPU in a multitasking fashion to improve utilization. Unfortunately, multitasking leads to destructive interference between co-running applications which causes fairness issues and Quality-of-Service (QoS) violations. We propose the Hybrid Slowdown Model (HSM) to dynamically and accurately predict application slowdown due to interference. HSM overcomes the low accuracy of prior white-box models, and training and implementation overheads of pure black-box models, with a hybrid approach. More specifically, the white-box component of HSM builds upon the fundamental insight that effective bandwidth utilization is proportional to DRAM row buffer hit rate, and the black-box component of HSM uses linear regression to relate row buffer hit rate to performance. HSM accurately predicts application slowdown with an average error of 6.8%, a significant improvement over the current state-of-the-art. In addition, we use HSM to guide various resource management schemes in multitasking GPUs: HSM-Fair significantly improves fairness (by 1.59× on average) compared to even partitioning, whereas HSM-QoS improves system throughput (by 18.9% on average) compared to proportional SM partitioning while maintaining the QoS target for the high-priority application in challenging mixed memory/compute-bound multi-program workloads.
