Precise Runahead Execution

Naithani, Ajeya; Feliu, Josué; Adileh, Almutaz; Eeckhout, Lieven

doi:10.1109/lca.2019.2910518

Cited by 5 publications

(8 citation statements)

References 14 publications


“…Hardware prefetchers can pick up a variety of memory-access patterns, but to achieve the instruction-level visibility necessary to calculate the addresses of complex access patterns in today's workloads [1], one must operate within the core, instead of within the cache. Runahead execution [8,9] is the most promising technique to achieve this.…”

Section: Existing Runahead Techniques (mentioning)

confidence: 99%

“…First, by skipping over loads for which the data source is not yet ready, it is unsuitable for today's complex indirection patterns that consist of chains of dependent load misses. Second, conventional runahead is limited by both the processor's front-end (fetch/decode/rename) width and available back-end resources (issue queue slots and physical registers) [9]. What is needed is a technique that can overcome the limitations of a processor's resources to generate massive amounts of memory-level parallelism and follow chains of dependent loads to completion, prefetching all data required for many memory accesses in the future.…”

Section: Existing Runahead Techniques (mentioning)

confidence: 99%

“…What is needed is a technique that can overcome the limitations of a processor's resources to generate massive amounts of memory-level parallelism and follow chains of dependent loads to completion, prefetching all data required for many memory accesses in the future. Vector (b) Precise Runahead Execution (PRE) [9] is able to prefetch array elements from A. In contrast, the array elements to B cannot be prefetched during runahead mode as they depend on A.…”

Section: Existing Runahead Techniques (mentioning)

confidence: 99%

“…Fig. 2: Vector Runahead versus Precise Runahead Execution (PRE) [9] on an illustrative code example. The loads highlighted in green can only be triggered by stalling on loads highlighted in gray, and those in blue by stalling on gray and green.…”

Section: Existing Runahead Techniques (mentioning)

confidence: 99%
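The citation statements above contrast loads of A, which runahead can prefetch, with dependent loads of B, which it cannot. A minimal Python sketch of that distinction follows, assuming a loop of the form `sum += B[A[i]]`. The function name, the `cached_a` set, and the tuple format are hypothetical and chosen only for illustration; this is a toy model, not the mechanism of any cited design.

```python
# Toy model (all names hypothetical): which loads of `sum += B[A[i]]`
# can a load-skipping runahead prefetch?

def skipping_runahead(A, cached_a, start, depth):
    """Return the prefetches issued for iterations start..start+depth-1.

    cached_a: set of indices i whose A[i] is already in the cache.
    A load whose address depends on a missing value is skipped,
    because its source is not available in runahead mode.
    """
    prefetched = []
    for i in range(start, min(start + depth, len(A))):
        if i in cached_a:
            # A[i] hits, so its value is known and B[A[i]] has a valid address.
            prefetched.append(("B", A[i]))
        else:
            # A[i] misses: it is prefetched, but B[A[i]] cannot be addressed.
            prefetched.append(("A", i))
    return prefetched


A = [7, 3, 9, 1]
result = skipping_runahead(A, cached_a={0}, start=0, depth=4)
print(result)
```

In this model only iteration 0, whose index is already cached, yields a prefetch into B. The remaining iterations prefetch A alone, so their B accesses still miss when normal execution resumes.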

“…On a variety of graph, database and high-performance computing workloads, Vector Runahead improves performance by 1.79× compared to a baseline out-of-order processor with a stride prefetcher. Relative to the state-of-the-art Indirect Memory Prefetcher (IMP) [12] and Precise Runahead Execution (PRE) [9], Vector Runahead improves performance by 1.49× on average. The fundamental reason for this significant performance improvement is illustrated in Figure 1: PRE is unable to accurately prefetch the majority of indirect memory accesses, unlike Vector Runahead.…”

Section: Introduction (mentioning)

confidence: 99%


Vector Runahead for Indirect Memory Accesses

et al. 2022

Self Cite


Vector Runahead delivers extremely high memory-level parallelism even for chains of dependent memory accesses with complex intermediate address computation, which conventional runahead techniques fundamentally cannot handle and therefore have ignored. It does this by rearchitecting runahead to use speculative data-level parallelism, rather than work-skipping, as its primary form of extracting more memory-level parallelism in runahead mode than a true execution can, which we hope will bring about an entirely new dimension for high-performance processors.


Precise Runahead Execution

Naithani, Feliu, Adileh, et al. 2020

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Self Cite


Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency because they release processor state when entering runahead mode and then need to re-fill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load.

We propose precise runahead execution (PRE), which builds on the key observation that when entering runahead mode, the processor has enough issue queue and physical register file resources to speculatively execute instructions. This mitigates the need to release and re-fill processor state in the ROB, issue queue, and physical register file. In addition, PRE pre-executes only those instructions in runahead mode that lead to full-window stalls, using a novel register renaming mechanism to quickly free physical registers in runahead mode, further improving efficiency and effectiveness. Finally, PRE optionally buffers decoded runahead micro-ops in the front-end to save energy. Our experimental evaluation using a set of memory-intensive applications shows that PRE achieves an additional 18.2% performance improvement over the recent runahead proposals while at the same time reducing energy consumption by 6.8%.
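The abstract describes executing only the instructions that lead to a stalling load. One way to picture such a chain is as a backward dependence slice over register names. The Python sketch below is illustrative only: the instruction encoding as `(dest, [srcs])` tuples and the function name `backward_slice` are assumptions made here, and the abstract does not say that PRE identifies its chains this way in hardware.

```python
# Illustrative sketch (hypothetical encoding): find the instructions that
# a given load transitively depends on through registers.

def backward_slice(instructions, target_index):
    """instructions: list of (dest, [srcs]) tuples in program order.

    Returns the sorted indices of the target instruction and every earlier
    instruction that produces a value the target transitively needs.
    """
    needed = set(instructions[target_index][1])
    slice_indices = [target_index]
    # Walk backward so the nearest producer of each register is found first.
    for idx in range(target_index - 1, -1, -1):
        dest, srcs = instructions[idx]
        if dest in needed:
            slice_indices.append(idx)
            needed.discard(dest)   # older writers of dest no longer matter...
            needed.update(srcs)    # ...unless dest is also one of the sources
    return sorted(slice_indices)


program = [
    ("r1", ["r1"]),        # 0: r1 = r1 + 8      (advance index pointer)
    ("r5", ["r6", "r7"]),  # 1: r5 = r6 * r7     (unrelated work)
    ("r2", ["r1"]),        # 2: r2 = load [r1]
    ("r8", ["r5", "r8"]),  # 3: r8 = r5 + r8     (unrelated work)
    ("r3", ["r2"]),        # 4: r3 = r2 << 2
    ("r4", ["r3"]),        # 5: r4 = load [r3]   (the stalling load)
]
stall_slice = backward_slice(program, 5)
print(stall_slice)
```

Instructions 1 and 3 fall outside the slice, so a runahead mode restricted to the slice would spend no issue slots or registers on them.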


Vector Runahead

Naithani, Ainsworth, Jones, et al. 2021

2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)

Self Cite


The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive.

Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent chain loads, then it could regain performance if it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, to achieve high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
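The idea of stalling on many misses at once can be illustrated with a rough Python model. It assumes every level of a dependent load chain costs one memory round trip and that the loads of `lanes` loop iterations at the same level are in flight together. Both function names and the cost model are hypothetical simplifications made here, not the paper's mechanism or its timing.

```python
import math

# Rough cost model (hypothetical): serialized memory round trips needed to
# follow a dependent load chain to completion for a number of iterations.

def serialized_round_trips(iterations, chain_depth, lanes=1):
    """Each chain level waits for the previous level's data to return;
    `lanes` iterations share one round trip per level."""
    groups = math.ceil(iterations / lanes)
    return groups * chain_depth


def vector_gather_chain(A, B, start, lanes):
    """Follow the two-level chain B[A[i]] for `lanes` consecutive iterations."""
    indices = list(range(start, min(start + lanes, len(A))))
    a_vals = [A[i] for i in indices]   # level 1: all A[i] requested together
    b_vals = [B[a] for a in a_vals]    # level 2: issued once level 1 returns
    return a_vals, b_vals


scalar_trips = serialized_round_trips(iterations=64, chain_depth=2, lanes=1)
vector_trips = serialized_round_trips(iterations=64, chain_depth=2, lanes=8)
gathered = vector_gather_chain([2, 0, 1], [10, 20, 30], start=0, lanes=2)
print(scalar_trips, vector_trips, gathered)
```

Under these assumptions, grouping eight iterations per level reduces the serialized round trips from 128 to 16 while every dependent load in the chain is still reached.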
