2015
DOI: 10.1145/2872887.2750407
The load slice core microarchitecture

Abstract: Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism…

Cited by 20 publications (44 citation statements)
References 26 publications
“…This mode switching happens early enough, one cycle before the relevant queue (ROB, IQ, LSQ) is full, that a CPU stall is avoided.
- We unexpectedly found that a focus on specific out-of-order commit conditions could be an important future direction for high-performance, efficient out-of-order processors;
- the potential benefits of out-of-order commit increase with memory latency (relatively more so for unsafe commit), while the benefits of the prefetching strategy we picked are orthogonal to out-of-order commit. This raises the enticing possibility of reducing system-wide silicon cost without compromising performance by coupling dense but higher-latency (slow-but-efficient) DRAM with out-of-order commit cores;
- out-of-order commit increases memory hierarchy parallelism [7];
- while it is generally accepted that, by releasing pipeline resources as early as possible, out-of-order commit improves performance relatively more in small cores than in large ones, in this work we show that this holds only for reluctant out-of-order commit. In fact, the performance improvement in large out-of-order cores can exceed that of smaller cores if aggressive out-of-order commit is employed;
- our results show the potential of future systems that implement out-of-order commit, and indicate the most promising directions for future designs (safe vs. unsafe commit, and which of Bell and Lipasti's conditions [5] are most important to support).…”
Section: Results
confidence: 99%
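The mode-switching policy described above can be sketched as a small model: commit switches to out-of-order mode preemptively, one cycle before any of the relevant queues would fill. This is an illustrative Python sketch; the `Queue` class, `nearly_full` test, and all capacities are hypothetical and not taken from the cited paper's hardware.

```python
# Hypothetical sketch: engage out-of-order commit one cycle before a
# queue (ROB, IQ, or LSQ) fills, so the pipeline never actually stalls.
# Names, thresholds, and fill rates are illustrative assumptions.

class Queue:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.occupancy = 0

    def nearly_full(self, fill_rate=1):
        # True if one more cycle of allocation at `fill_rate` entries
        # per cycle would fill the queue.
        return self.occupancy + fill_rate >= self.capacity

def commit_mode(queues):
    # Switch modes preemptively, before any queue actually stalls.
    if any(q.nearly_full() for q in queues):
        return "out-of-order"
    return "in-order"

rob, iq, lsq = Queue("ROB", 128), Queue("IQ", 32), Queue("LSQ", 48)
iq.occupancy = 31  # one free IQ entry left
print(commit_mode([rob, iq, lsq]))  # -> out-of-order
```

The point of checking occupancy a cycle early is that the decision is made while the front end can still allocate, rather than reacting after a stall has already occurred.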
“…In this section, we compare in-order commit and out-of-order commit in terms of memory parallelism (both to DRAM (MLP) and within the cache hierarchy (MHP) [7]) by changing the number of L1 MSHRs and observing the effect on performance. To explore these effects, we select three applications that are highly memory-bound [18] (mcf), moderately memory-bound (lbm), and largely not memory-bound (gcc) in Fig.…”
Section: Memory Parallelism
confidence: 99%
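The MSHR experiment above rests on a simple relationship: with M L1 MSHRs, at most M misses can be outstanding at once, so memory-bound code with many independent misses is throttled when M is small. A back-of-the-envelope model, with an assumed 200-cycle miss latency (not a figure from the cited study):

```python
import math

# Illustrative model: N independent misses, each taking `latency`
# cycles, serviced with at most `n_mshrs` outstanding at a time,
# complete in roughly ceil(N / n_mshrs) * latency cycles.
# The 200-cycle latency is an assumed value for illustration.

def miss_service_cycles(n_misses, n_mshrs, latency=200):
    return math.ceil(n_misses / n_mshrs) * latency

# A memory-bound kernel (many independent misses) gains the most
# from additional MSHRs; a compute-bound one barely notices.
print(miss_service_cycles(16, 1))  # fully serialized: 3200 cycles
print(miss_service_cycles(16, 8))  # 8-way overlapped:  400 cycles
```

This is why varying the MSHR count separates applications by memory-boundedness: mcf-like codes track the available miss parallelism closely, while gcc-like codes are largely insensitive to it.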
“…Other than explicitly launching a helper thread, many proposals have dealt with reducing the chance that a conventional microarchitecture is blocked [2], [13], [14], [19], [30], [38], [39], [41], [51], [55], [69], [73], [100]. Many designs share the theme of checkpointing important state and cleaning up some structures to allow further (speculative) execution.…”
Section: Background and Related Work
confidence: 99%
“…In h264dec, on the other hand, the input frames are larger and processing them requires copying them, which yields sufficient opportunities to benefit from prefetching data into the L1 cache through SAHP. To determine the area and power requirements of WearCore, we follow the strategy adopted in [17]: we use the publicly available area and power numbers of a baseline in-order core (ARM Cortex-A7 in our case) and calculate the overhead of the components added on top of it to implement WearCore. According to ARM [2], the core area and average power consumption of the Cortex-A7 are 0.45 mm² and 100 mW, respectively, in 28nm.…”
Section: SB+SAHP-L2
confidence: 99%
“…These are thus not well suited to low-power devices. Recently, Carlson et al [17] have proposed the Load Slice Core microarchitecture, which extracts memory hierarchy parallelism (MHP) by enabling memory accesses, along with their address-generating instructions, to execute while the pipeline is stalled on a long-latency miss. They propose a separate pipeline for independent memory accesses (and their address-generating instructions), and additional hardware that identifies the address-generating instructions leading up to those independent memory accesses.…”
Section: Related Work
confidence: 99%
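The idea of identifying address-generating instructions can be sketched as a backward walk over register dependences starting from loads. Note this single-pass software sketch is a simplification for illustration: the Load Slice Core builds these slices incrementally in hardware across loop iterations, and the instruction encoding and helper names here are hypothetical.

```python
# Sketch: mark every instruction in the backward register-dependence
# slice of each load (the load plus its address-generating producers).
# Instruction format: (dst_reg, [src_regs], is_load), in program order.
# This is an illustrative software model, not the paper's mechanism.

def load_slice(instructions):
    marked = set()   # indices of slice instructions
    wanted = set()   # registers whose producers belong to the slice
    for i in range(len(instructions) - 1, -1, -1):
        dst, srcs, is_load = instructions[i]
        if is_load or dst in wanted:
            marked.add(i)
            wanted.discard(dst)   # producer found for this register
            wanted.update(srcs)   # now chase its own inputs
    return marked

# r1 = base; r2 = r1 + 8 (address gen); load r3 <- [r2]; r4 = r3 * 2
prog = [("r1", [], False), ("r2", ["r1"], False),
        ("r3", ["r2"], True), ("r4", ["r3"], False)]
print(sorted(load_slice(prog)))  # -> [0, 1, 2]
```

The consumer of the loaded value (r4) stays out of the slice: only the load and the instructions that compute its address run ahead in the separate pipeline.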