The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason about how traditional microarchitecture features such as out-of-order issue and state-of-the-art microarchitecture techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom in improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction and better value prediction in addition to runahead execution.
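The epoch model summarized above can be illustrated with a small calculation. The sketch below is an assumption-laden toy, not the paper's simulator: it treats execution as a sequence of epochs separated by serializing events (e.g. a mispredicted branch or a full instruction window), counts the cache misses that overlap within each epoch, and reports the average as MLP.

```python
# Hypothetical sketch of the epoch model of MLP. An "epoch" is a run of
# instructions whose long-latency cache misses can overlap; its cost is
# roughly one memory latency regardless of how many misses it contains,
# so MLP is the average number of misses per miss-bearing epoch.
# The flag encoding and the toy trace are illustrative assumptions.

def epoch_mlp(trace):
    """trace: list of (is_miss, ends_epoch) flags per retired instruction."""
    epochs = []            # miss count of each completed, miss-bearing epoch
    misses_in_epoch = 0
    for is_miss, ends_epoch in trace:
        if is_miss:
            misses_in_epoch += 1
        if ends_epoch:     # serializing event: overlap window closes here
            if misses_in_epoch:
                epochs.append(misses_in_epoch)
            misses_in_epoch = 0
    if misses_in_epoch:    # count a trailing, unterminated epoch too
        epochs.append(misses_in_epoch)
    return sum(epochs) / len(epochs) if epochs else 0.0

# Toy trace: one epoch with 3 overlapping misses, then one with a single miss.
trace = [(True, False), (True, False), (True, True),
         (False, False), (True, True)]
print(epoch_mlp(trace))  # 2.0
```

Under this model, techniques such as runahead execution raise performance by packing more useful misses into each epoch rather than by shortening any individual miss.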
The performance of many important commercial workloads, such as on-line transaction processing, is limited by frequent stalls due to off-chip instruction and data accesses. These applications are characterized by irregular control flow and complex data access patterns that render many low-cost prefetching schemes, such as stream-based and stride-based prefetching, ineffective. For such applications, correlation-based prefetching, which is capable of capturing complex data access patterns, has been shown to be a more promising approach. However, the large instruction and data working sets of these applications require extremely large correlation tables, making them impractical to implement on-chip. This paper proposes the epoch-based correlation prefetcher, which cost-effectively stores its correlation table in main memory, exploits the concept of epochs to hide the long latency of its correlation table accesses, and attempts to eliminate entire epochs instead of individual instruction and data misses. Experimental results demonstrate that the epoch-based correlation prefetcher, which requires minimal on-chip real estate to implement, improves the performance of a suite of important commercial benchmarks by 13% to 31% and significantly outperforms previously proposed correlation prefetchers. (40th IEEE/ACM International Symposium on Microarchitecture)
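The general idea this design builds on, pair-based correlation prefetching, can be sketched in a few lines. This is a hedged illustration of the base technique only; the paper's actual contribution (the in-memory table and epoch-granularity lookups) is more involved, and the class name, table sizing, and toy miss stream below are all assumptions.

```python
# Minimal sketch of pair-based correlation prefetching: remember which
# miss addresses tend to follow which, and on each miss predict the
# recorded successors of the current address. Names are illustrative.
from collections import defaultdict, deque

class CorrelationPrefetcher:
    def __init__(self, successors_per_entry=2):
        self.table = defaultdict(deque)   # miss addr -> recent successors
        self.successors_per_entry = successors_per_entry
        self.last_miss = None

    def on_miss(self, addr):
        """Record the (previous miss -> current miss) correlation and
        return the addresses predicted to miss next."""
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            if addr in succ:
                succ.remove(addr)
            succ.appendleft(addr)          # most recent successor first
            while len(succ) > self.successors_per_entry:
                succ.pop()
        self.last_miss = addr
        return list(self.table[addr])      # prefetch candidates

pf = CorrelationPrefetcher()
for a in [0x10, 0x40, 0x10, 0x40]:
    preds = pf.on_miss(a)
print(preds)  # [16]: 0x10 was learned as the successor of 0x40
```

The scaling problem the abstract describes is visible even here: one table entry per distinct miss address makes the structure grow with the working set, which is why the paper moves the table to main memory.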
In this paper, we study the instruction cache miss behavior of four modern commercial applications (a database workload, TPC-W, SPECjAppServer2002 and SPECweb99). These applications exhibit high instruction cache miss rates for both the L1 and L2 caches, and a sizable performance improvement can be achieved by eliminating these misses. We show that it is important to address not only sequential misses but also misses due to branches and function calls. As a result, we propose an efficient discontinuity prefetching scheme that can be effectively combined with traditional sequential prefetching to address all forms of instruction cache misses. Additionally, with the emergence of chip multiprocessors (CMPs), instruction prefetching schemes must take into account their effect on the shared L2 cache. Specifically, aggressive instruction cache prefetching can result in an increase in the number of L2 cache data misses. As a solution, we propose a scheme that does not install prefetches into the L2 cache unless they are proven to be useful. Overall, we demonstrate that the combination of our proposed schemes is successful in reducing the instruction miss rate to only 10%-16% of the original miss rate and results in a 1.08X-1.37X performance improvement for the applications studied.
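The discontinuity idea can be sketched as follows, as a toy software model under stated assumptions (a 64-byte line, a single-entry-per-line table, and illustrative names): sequential prefetching covers fall-through misses, while a small table that maps a cache line to the non-sequential line the fetch stream last jumped to covers misses caused by branches and calls.

```python
# Hypothetical sketch of discontinuity prefetching for instruction caches.
# On every fetch we prefetch the next-sequential line; when the fetch
# stream leaves the sequential path, we record the (source line -> target
# line) discontinuity so it can be prefetched the next time around.

LINE = 64  # bytes per instruction cache line (assumed)

class DiscontinuityPrefetcher:
    def __init__(self):
        self.table = {}        # fetch line -> discontinuous target line
        self.prev_line = None

    def on_fetch(self, pc):
        line = pc // LINE
        prefetches = {line + 1}               # sequential next line
        if line in self.table:
            prefetches.add(self.table[line])  # recorded branch/call target
        if self.prev_line is not None and line not in (self.prev_line,
                                                       self.prev_line + 1):
            # control transfer left the sequential stream: remember it
            self.table[self.prev_line] = line
        self.prev_line = line
        return prefetches

pf = DiscontinuityPrefetcher()
pf.on_fetch(0x000)             # line 0
pf.on_fetch(0x400)             # jump to line 16: records 0 -> 16
result = pf.on_fetch(0x000)    # back at line 0
print(sorted(result))          # [1, 16]: sequential line plus jump target
```

The L2-pollution filter the abstract proposes would sit below this layer, deciding whether a fetched prefetch line is installed into the shared L2 at all.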
With processor speeds continuing to outpace the memory subsystem, cache-missing memory operations are increasingly important to application performance. In response to this continuing trend, most modern processors now support hardware (HW) prefetchers, which act to reduce the missing loads observed by an application. This paper analyzes the behavior of cache-missing loads in SPEC CPU2000 and highlights the inability of unit and single non-unit stride prefetchers to correctly prefetch some commonly occurring streams. In response to this analysis, a novel multi-stride prefetcher, which supports streams with up to four distinct strides, is proposed. Performance analysis for SPEC CPU2000 illustrates that the proposed multi-stride prefetcher can outperform current stride prefetchers on several benchmarks; most notably on mcf, lucas and facerec, where it achieves an additional performance gain of up to 57%. Performance of the strided HW prefetchers is also contrasted with another recently proposed prefetch scheme, runahead execution (RAE), and the synergy between the schemes is investigated.
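A multi-stride pattern of the kind described above can be detected with a short sketch. This is an illustrative software model, not the paper's hardware design: the matching rule, window handling, and example stream are assumptions, with `max_period=4` standing in for the "up to four distinct strides" limit.

```python
# Toy multi-stride detector: look for a repeating cycle of up to four
# strides in the recent address stream and extrapolate the next address.
# A single-stride prefetcher sees the alternating +8/+24 stream below as
# noise; a multi-stride detector recovers the pattern.

def predict_next(addrs, max_period=4):
    """Return the predicted next address, or None if no repeating stride
    pattern of period <= max_period fits the observed stream."""
    strides = [b - a for a, b in zip(addrs, addrs[1:])]
    for period in range(1, max_period + 1):
        # require the pattern to be seen at least twice before trusting it
        if len(strides) >= 2 * period and all(
            strides[i] == strides[i - period]
            for i in range(period, len(strides))
        ):
            return addrs[-1] + strides[-period]  # cycle continues
    return None

# Stream alternating strides +8 and +24, e.g. two interleaved arrays:
stream = [0, 8, 32, 40, 64, 72]
print(predict_next(stream))  # 96
```

A unit- or single-stride prefetcher corresponds to `max_period=1` here, which returns `None` on the same stream; that gap is exactly what the proposed prefetcher targets.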
This paper presents the concept of dynamic control independence (DCI) and shows how it can be detected and exploited in an out-of-order superscalar processor to reduce the performance penalties of branch mispredictions. We show how DCI can be leveraged during branch misprediction recovery to reduce the number of instructions squashed on a misprediction, as well as how it can be used to avoid predicting unpredictable branches by fetching instructions out-of-order. A realistic implementation is described and evaluated using six SPECint95 benchmarks. We show that exploiting DCI during branch misprediction recovery improves performance by 0.9-9.9% on a 4-wide processor, by 1.0-11.2% on an 8-wide processor and by 1.9-15.3% on a 12-wide processor. We also show that using DCI information to fetch instructions out-of-order when an unpredictable branch is encountered potentially improves performance by 0.9-15.2% on a 4-wide processor, by 2.0-14.8% on an 8-wide processor and by 2.6-16.2% on a 12-wide processor. Some of the largest performance gains are observed on go and gcc, which have traditionally posed the most difficult challenge to aggressive branch prediction techniques.
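The enabling observation behind DCI is that a mispredicted branch only invalidates instructions up to its reconvergence point (its immediate post-dominator); everything beyond it is control-independent and need not be squashed. As a hedged illustration of that concept only, the sketch below computes post-dominators on a toy control-flow graph with a classic iterative data-flow algorithm; the paper's hardware detects reconvergence dynamically, not this way, and the CFG and names are assumptions.

```python
# Post-dominator computation on a toy CFG, to illustrate which
# instructions are control-independent of a branch. A node P
# post-dominates N if every path from N to the exit passes through P.

def postdominators(cfg, exit_node):
    """cfg: node -> list of successor nodes. Returns node -> set of
    post-dominators, via iteration to a fixed point."""
    nodes = set(cfg) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}   # start from the full set
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # a node post-dominated by P on every successor path is
            # post-dominated by P itself
            new = {n} | set.intersection(*(pdom[s] for s in cfg[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

# Toy CFG: branch at B splits into T/F paths that reconverge at R.
cfg = {"B": ["T", "F"], "T": ["R"], "F": ["R"], "R": ["X"]}
pdom = postdominators(cfg, "X")
# Instructions at R and beyond survive a misprediction of the branch at B:
print(sorted(pdom["B"] - {"B"}))  # ['R', 'X']
```

On a misprediction at B, only the wrong-path work between B and R must be re-fetched; the data-independent instructions from R onward can be retained, which is the source of the squash reduction the abstract reports.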