Brian Fahs scite author profile

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason how traditional microarchitecture features such as out-oforder issue and state-of-the-art microarchitecture techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom in improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction and better value prediction in addition to runahead execution.

show abstract

Continuous optimization

Fahs

Rafacz

Patel

et al. 2005

View full text Add to dashboard Cite

show abstract

Improving quasi-dynamic schedules through region slip

Spadini

Fahs

Patel

et al.

View full text Add to dashboard Cite

Modern processors perform dynamic scheduling to achieve better utilization of execution resources. A schedule created at run-time is often better than one created at compile-time as it can dynamically adapt to specific events encountered at execution-time. In this paper, we examine some fundamental impediments to effective static scheduling. More specifically, we examine the question of why schedules generated quasi-dynamically by a low-level runtime optimizer and executed on a statically scheduled machine perform worse than using a dynamically-scheduled approach. We observe that such schedules suffer because of region boundaries and a skewed distribution of parallelism towards the beginning of a region. To overcome these limitations, we investigate a new concept, region slip, in which the schedules of different statically-scheduled regions can be interleaved in the processor issue queue to reduce the region boundary effects that cause empty issue slots.

show abstract

Microarchitecture optimizations for exploiting memory-level parallelism

Chou

Fahs

Abraham

View full text Add to dashboard Cite

Continuous Optimization

Fahs¹,

Rafacz²,

Patel³

et al. 2005

SIGARCH Comput. Archit. News

View full text Add to dashboard Cite

This paper presents a hardware-based dynamic optimizer that continuously optimizes an application's instruction stream. In continuous optimization, dataflow optimizations are performed using simple, table-based hardware placed in the rename stage of the processor pipeline. The continuous optimizer reduces dataflow height by performing constant propagation, reassociation, redundant load elimination, store forwarding, and silent store removal. To enhance the impact of the optimizations, the optimizer integrates values generated by the execution units back into the optimization process. Continuous optimization allows instructions with input values known at optimization time to be executed in the optimizer, leaving less work for the out-of-order portion of the pipeline. Continuous optimization can detect branch mispredictions earlier and thus reduce the misprediction penalty. In this paper, we present a detailed description of a hardware optimizer and evaluate it in the context of a contemporary microarchitecture running current workloads. Our analysis of SPECint, SPECfp, and mediabench workloads reveals that a hardware optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations. These positive effects combine to provide speed ups in the range 0.99 to 1.27.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Brian Fahs

Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Continuous optimization

Improving quasi-dynamic schedules through region slip

Microarchitecture optimizations for exploiting memory-level parallelism

Continuous Optimization

Contact Info

Product

Resources

About