Precise Runahead Execution

Naithani, Ajeya; Feliu, Josué; Adileh, Almutaz; Eeckhout, Lieven

doi:10.1109/hpca47549.2020.00040

Cited by 16 publications

(14 citation statements)

References 85 publications


“…When the long-latency operation completes, the instructions are reinserted from the WIB into the issue queue. Runahead execution [14, 25, 26, 27] removes the blocking cache miss from the instruction window and continues to speculatively prefetch future memory addresses till the blocking miss returns. Continuous Flow Pipelines (CFP) [37] build on top of the CheckPoint and Renaming (CPR) proposal [1], releasing scheduler and register file resources for off-chip load-dependent instruction slices.…”

Section: Related Work (mentioning)

confidence: 99%

The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture

Lakshminarasimhan

Naithani

Feliu

et al. 2022

ACM Trans. Archit. Code Optim.

Self Cite


Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ideally provide high performance in a power- and cost-efficient manner. Recently proposed slice-out-of-order (sOoO) cores identify backward slices of memory operations which they execute out-of-order with respect to the rest of the dynamic instruction stream for increased instruction-level and memory-hierarchy parallelism. Unfortunately, constructing backward slices is imprecise and hardware-inefficient, leaving performance on the table. In this article, we propose Forward Slice Core (FSC), a novel core microarchitecture that builds on a stall-on-use in-order core and extracts more instruction-level and memory-hierarchy parallelism than slice-out-of-order cores. FSC does so by identifying and steering forward slices (rather than backward slices) to dedicated in-order FIFO queues. Moreover, FSC puts load-consumers that depend on L1 D-cache misses on the side to enable younger independent load-consumers to execute faster. Finally, FSC eliminates the need for dynamic memory disambiguation by replicating store-address instructions across queues. Considering 3-wide pipeline configurations, we find that FSC improves performance by 27.1%, 21.1%, and 14.6% on average compared to Freeway, the state-of-the-art sOoO core, across SPEC CPU2017, GAP, and DaCapo, respectively, while at the same time incurring reduced hardware complexity. Compared to an OoO core, FSC reduces power consumption by 61.3% and chip area by 47%, providing a microarchitecture with high performance at low complexity.
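To make the notion of a forward slice concrete, the following is a minimal software sketch, assuming a toy register-based instruction format. It is not the FSC hardware mechanism described in the abstract; the `Instr` type and the `forward_slice` function are hypothetical names introduced only for illustration. The sketch marks the destination register of a seed instruction (e.g., a load) as tainted and collects every younger instruction that reads a tainted register.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Toy instruction (hypothetical format): one optional destination, any sources."""
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]


def forward_slice(program: List[Instr], seed_index: int) -> List[int]:
    """Return indices of younger instructions that transitively depend on the seed."""
    tainted: Set[str] = set()
    seed = program[seed_index]
    if seed.dst is not None:
        tainted.add(seed.dst)

    slice_indices: List[int] = []
    for i in range(seed_index + 1, len(program)):
        ins = program[i]
        if any(src in tainted for src in ins.srcs):
            # Reads a value produced inside the slice: the instruction joins it.
            slice_indices.append(i)
            if ins.dst is not None:
                tainted.add(ins.dst)
        elif ins.dst in tainted:
            # Register overwritten with an independent value: taint is cleared.
            tainted.discard(ins.dst)
    return slice_indices


program = [
    Instr("load", "r1", ("r0",)),      # seed load
    Instr("add", "r2", ("r1", "r3")),  # depends on the seed
    Instr("add", "r4", ("r5", "r6")),  # independent
    Instr("load", "r7", ("r2",)),      # depends on the seed through r2
    Instr("mov", "r1", ("r5",)),       # overwrites r1 with an independent value
    Instr("add", "r8", ("r1", "r4")),  # independent after the overwrite
]
print(forward_slice(program, 0))
```

In this toy program only the instructions at indices 1 and 3 belong to the forward slice of the first load; the remaining instructions are independent and could, in principle, proceed without waiting for it.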


“…Precise Runahead Execution (PRE) [63,64], the state-of-the-art in runahead execution, improves upon standard runahead through three key mechanisms. (1) PRE leverages the available back-end (issue queue and physical register file) resources to speculatively execute instructions in runahead mode, thereby eliminating the need to release and flush processor state when entering and exiting runahead mode.…”

Section: B Limitations Of Runahead Techniques (mentioning)

confidence: 99%

“…To achieve the instruction-level visibility necessary to calculate the addresses of complex access patterns seen in today's workloads [3], we conclude that this ideal technique must operate within the core, instead of within the cache. Runahead execution [25,32,34,57,58,64] is the most promising technique to date, where upon a memory stall at the head of the reorder buffer (ROB), execution enters a speculative 'runahead' mode designed to prefetch future memory accesses. In runahead mode, the addresses of future memory accesses are calculated and the memory accesses are speculatively issued.…”

Section: Introduction (mentioning)

confidence: 99%
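The runahead mode described in the statement above can be illustrated with a small toy model. This is a sketch under simplifying assumptions, not the mechanism of any cited paper: the trace contains only loads, each address is given up front, and the names `Load` and `runahead_prefetches` are hypothetical. A load whose base register depends on a pending miss cannot compute its address and is skipped; an independent load that misses is issued as a prefetch, and its destination is treated as invalid because the data has not yet returned.

```python
from typing import List, NamedTuple, Set


class Load(NamedTuple):
    """Toy load (hypothetical format): dst <- memory[addr], address derived from base."""
    dst: str
    base: str
    addr: int


def runahead_prefetches(trace: List[Load], stall_index: int,
                        cached: Set[int], depth: int) -> List[int]:
    """Addresses prefetched while running ahead of the load at stall_index."""
    invalid = {trace[stall_index].dst}  # result of the stalling load is unknown
    prefetched: List[int] = []
    for load in trace[stall_index + 1: stall_index + 1 + depth]:
        if load.base in invalid:
            # Address depends on a pending miss: it cannot be computed.
            invalid.add(load.dst)
        elif load.addr in cached or load.addr in prefetched:
            # Cache hit or already requested: nothing new to fetch.
            invalid.discard(load.dst)
        else:
            # Independent miss: issue a speculative prefetch.
            prefetched.append(load.addr)
            invalid.add(load.dst)
    return prefetched


trace = [
    Load("r1", "r0", 0x100),  # stalling miss at the head of the ROB
    Load("r2", "r1", 0x200),  # depends on r1: skipped
    Load("r3", "r0", 0x140),  # independent miss: prefetched
    Load("r4", "r3", 0x300),  # depends on r3, which is still in flight: skipped
    Load("r5", "r0", 0x040),  # independent hit
]
print([hex(a) for a in runahead_prefetches(trace, 0, cached={0x040}, depth=4)])
```

In this example only address 0x140 is prefetched; the two loads that depend on in-flight misses are skipped, which mirrors the limitation on dependent load chains discussed in the citation statements.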

“…Second, runahead execution is limited by the processor's front-end (fetch/decode/rename) width: the rate at which runahead execution can generate MLP is slow if there is a large number of instructions between the independent loads in the future instruction stream. Third, the speculation depth of runahead is limited by the amount of available back-end resources (issue queue slots and physical registers) [64].…”

Section: Introduction (mentioning)

confidence: 99%

“…Third, Vector Runahead issues multiple rounds of these vectorized instructions through vector unrolling and pipelining to speculate even deeper and increase the effective runahead fetch/decode bandwidth even further, e.g., 8 rounds of vector runahead with 8 vector loads each, lead to 64 speculative prefetches that are issued in parallel. We evaluate Vector Runahead through detailed simulation using a variety of graph, database and HPC workloads, and we report that Vector Runahead improves performance by 1.79× compared to a baseline out-of-order processor, a significant improvement over the state-of-the-art Precise Runahead Execution (PRE) technique [64] which achieves a speedup of 1.20×. The performance speedup results from much higher memory-level parallelism by prefetching loads within dependent load sequences in an accurate and timely manner.…”

Section: Introduction (mentioning)

confidence: 99%
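The figures quoted in the statement above can be checked with a few lines of arithmetic. The sketch below assumes that both speedups are measured against the same baseline out-of-order processor, as the quoted text suggests; the function names are hypothetical and used only for illustration.

```python
def total_prefetches(rounds: int, loads_per_round: int) -> int:
    """Speculative prefetches issued across all rounds."""
    return rounds * loads_per_round


def relative_speedup(speedup_a: float, speedup_b: float) -> float:
    """Speedup of A over B when both are reported against a common baseline."""
    return speedup_a / speedup_b


# 8 rounds with 8 loads each, as in the quoted example.
print(total_prefetches(8, 8))

# Vector Runahead (1.79x) relative to PRE (1.20x), assuming a shared baseline.
print(round(relative_speedup(1.79, 1.20), 2))
```

Under that assumption, the reported numbers correspond to 64 prefetches and to Vector Runahead running roughly 1.49× faster than PRE.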


Vector Runahead

Naithani

Ainsworth

Jones

et al. 2021

2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)

Self Cite


The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive. Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent chain loads, then it could regain performance if it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, to achieve high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.


Reliability-Aware Runahead

Naithani

Eeckhout

2022

2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
