We present a number of power-aware instruction front-end (fetch/decode) throttling methods for high-performance dynamically-scheduled superscalar processors. Our methods reduce power dissipation by selectively turning on and off instruction fetch and decode. Moreover, they have a negligible impact on performance as they deliver instructions just in time for exploiting the available parallelism. Previously proposed front-end throttling methods rely on branch prediction confidence estimation. We introduce a new class of methods that exploit information about instruction flow (rate of instructions passing through stages). We show that our methods can boost power savings over previously proposed methods. In particular, for an 8-way processor a combined method reduces traffic by 14%, 20%, 6% and 6% for the fetch, decode, issue and complete stages respectively while performance remains mostly unaffected. The best previously proposed method reduces traffic by 10%, 15%, 4% and 4% respectively.
We investigate instruction distribution methods for quadclustec dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity both to inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions. We find that a relatively simple method that changes clusters every other three instructions offers only a 17% performance slowdown compared to a nonclustered conjguration operating at the same frequency. Moreover; we show that by utilizing adaptive methods it is possible to further reduce this gap down to about 14%. Furthermore, performance appears to be more sensitive to inter-cluster communication latencies rather than to pipeline depth. The best performing method offers a slowdown of about 24% when inter-cluster communication latency is two cycle.This gap is only 20% when two additional stages are introduced in the front-end pipeline.
We investigate instruction distribution methods for quadcluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity both to inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions. We find that a relatively simple method that changes clusters every other three instructions offers only a 17% performance slowdown compared to a nonclustered configuration operating at the same frequency. Moreover, we show that by utilizing adaptive methods it is possible to further reduce this gap down to about 14%. Furthermore, performance appears to be more sensitive to inter-cluster communication latencies rather than to pipeline depth. The best performing method offers a slowdown of about 24% when inter-cluster communication latency is two cycle. This gap is only 20% when two additional stages are introduced in the front-end pipeline.
We introduce Branch Predictor Prediction (BPP) as a power-aware branch prediction technique for high performance processors. Our predictor reduces branch prediction power dissipation by selectively turning on and off two of the three tables used in the combined branch predictor. BPP relies on a small buffer that stores the addresses and the sub-predictors used by the most recent branches executed. Later we refer to this buffer to decide if any of the sub-predictors and the selector could be gated without harming performance. In this work we study power and performance trade-offs for a subset of SPEC 2k benchmarks. We show that on the average and for an 8-way processor, BPP can reduce branch prediction power dissipation by 28% and 14% compared to non-banked and banked 32k predictors respectively. This comes with a negligible impact on performance (1% max). We show that BPP always reduces power even for smaller predictors and that it offers better overall power and performance compared to simpler predictors.
In this work we introduce a history-aware, resourcebased dynamic (or simply HARD) scheduler for heterogeneous CMPs. HARD relies on recording application resource utilization and throughput to adaptively change cores for applications during runtime. We show that HARD can be configured to achieve both performance and power improvements. We compare HARD to a complexity-based static scheduler and show that HARD outperforms this alternative.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.