2015
DOI: 10.1145/2872887.2750407
The load slice core microarchitecture

Abstract: Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism…

Cited by 20 publications (44 citation statements)
References 26 publications
“…This mode switching happens early enough, one cycle before the relevant queue (ROB, IQ, LSQ) is full, that a CPU stall is avoided.
- We unexpectedly found that a focus on specific out-of-order commit conditions could be an important future direction for high-performance, efficient out-of-order processors;
- the potential benefits of out-of-order commit increase with memory latency (relatively more so for unsafe commit), while the benefits of the prefetching strategy we picked are orthogonal to out-of-order commit. This raises the enticing possibility of reducing system-wide silicon cost without compromising performance by coupling dense but higher-latency (slow-but-efficient) DRAM with out-of-order commit cores;
- out-of-order commit increases memory hierarchy parallelism [7];
- while it is generally accepted that, by releasing pipeline resources as early as possible, out-of-order commit improves performance relatively more in small cores than in large ones, in this work we show that this holds only for reluctant out-of-order commit. In fact, the performance improvement in large out-of-order cores can exceed that of smaller cores if aggressive out-of-order commit is employed;
- our results show the potential of future systems that implement out-of-order commit, and indicate the most promising directions for future designs (safe vs. unsafe commit, and which of Bell and Lipasti's conditions [5] are most important to support).…”
Section: Results
confidence: 99%
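The mode-switching policy described above can be sketched as a small model: commit switches to out-of-order mode preemptively, one cycle before any of the relevant queues would fill. This is an illustrative Python sketch; the `Queue` class, `nearly_full` test, and all capacities are hypothetical and not taken from the cited paper's hardware.

```python
# Hypothetical sketch: engage out-of-order commit one cycle before a
# queue (ROB, IQ, or LSQ) fills, so the pipeline never actually stalls.
# Names, thresholds, and fill rates are illustrative assumptions.

class Queue:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.occupancy = 0

    def nearly_full(self, fill_rate=1):
        # True if one more cycle of allocation at `fill_rate` entries
        # per cycle would fill the queue.
        return self.occupancy + fill_rate >= self.capacity

def commit_mode(queues):
    # Switch modes preemptively, before any queue actually stalls.
    if any(q.nearly_full() for q in queues):
        return "out-of-order"
    return "in-order"

rob, iq, lsq = Queue("ROB", 128), Queue("IQ", 32), Queue("LSQ", 48)
iq.occupancy = 31  # one free IQ entry left
print(commit_mode([rob, iq, lsq]))  # -> out-of-order
```

The point of checking occupancy a cycle early is that the decision is made while the front end can still allocate, rather than reacting after a stall has already occurred.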
“…In this section, we compare in-order commit and out-of-order commit in terms of memory parallelism (both to DRAM (MLP) and within the cache hierarchy (MHP) [7]) by changing the number of L1 MSHRs and observing the effect on performance. To explore these effects, we select three applications that are highly memory-bound [18] (mcf), moderately memory-bound (lbm), and largely not memory-bound (gcc) in Fig.…”
Section: Memory Parallelism
confidence: 99%
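The MSHR experiment above rests on a simple relationship: with M L1 MSHRs, at most M misses can be outstanding at once, so memory-bound code with many independent misses is throttled when M is small. A back-of-the-envelope model, with an assumed 200-cycle miss latency (not a figure from the cited study):

```python
import math

# Illustrative model: N independent misses, each taking `latency`
# cycles, serviced with at most `n_mshrs` outstanding at a time,
# complete in roughly ceil(N / n_mshrs) * latency cycles.
# The 200-cycle latency is an assumed value for illustration.

def miss_service_cycles(n_misses, n_mshrs, latency=200):
    return math.ceil(n_misses / n_mshrs) * latency

# A memory-bound kernel (many independent misses) gains the most
# from additional MSHRs; a compute-bound one barely notices.
print(miss_service_cycles(16, 1))  # fully serialized: 3200 cycles
print(miss_service_cycles(16, 8))  # 8-way overlapped:  400 cycles
```

This is why varying the MSHR count separates applications by memory-boundedness: mcf-like codes track the available miss parallelism closely, while gcc-like codes are largely insensitive to it.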
“…Other than explicitly launching a helper thread, many proposals have dealt with reducing the chance that a conventional microarchitecture is blocked [2], [13], [14], [19], [30], [38], [39], [41], [51], [55], [69], [73], [100]. Many designs share the theme of checkpointing important state and cleaning up some structures to allow further (speculative) execution.…”
Section: Background and Related Work
confidence: 99%
“…In h264dec, on the other hand, the input frames are larger and processing them requires copying them, which yields sufficient opportunities to benefit from prefetching data into the L1 cache through SAHP. To determine the area and power requirements of WearCore, we follow the strategy adopted in [17]: we use the publicly available area and power numbers of a baseline in-order core (ARM Cortex-A7 in our case) and calculate the overhead of the components added on top of it to implement WearCore. According to ARM [2], the core area and average power consumption of the Cortex-A7 are 0.45 mm² and 100 mW, respectively, in 28nm.…”
Section: SB+SAHP-L2
confidence: 99%
“…These are thus not well suited to low-power devices. Recently, Carlson et al [17] have proposed the Load Slice Core microarchitecture, which extracts memory hierarchy parallelism (MHP) by enabling memory accesses, along with their address-generating instructions, to execute while the pipeline is stalled on a long-latency miss. They propose a separate pipeline for independent memory accesses (and their address-generating instructions), and additional hardware that identifies the address-generating instructions leading up to those independent memory accesses.…”
Section: Related Work
confidence: 99%
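The idea of identifying address-generating instructions can be sketched as a backward walk over register dependences starting from loads. Note this single-pass software sketch is a simplification for illustration: the Load Slice Core builds these slices incrementally in hardware across loop iterations, and the instruction encoding and helper names here are hypothetical.

```python
# Sketch: mark every instruction in the backward register-dependence
# slice of each load (the load plus its address-generating producers).
# Instruction format: (dst_reg, [src_regs], is_load), in program order.
# This is an illustrative software model, not the paper's mechanism.

def load_slice(instructions):
    marked = set()   # indices of slice instructions
    wanted = set()   # registers whose producers belong to the slice
    for i in range(len(instructions) - 1, -1, -1):
        dst, srcs, is_load = instructions[i]
        if is_load or dst in wanted:
            marked.add(i)
            wanted.discard(dst)   # producer found for this register
            wanted.update(srcs)   # now chase its own inputs
    return marked

# r1 = base; r2 = r1 + 8 (address gen); load r3 <- [r2]; r4 = r3 * 2
prog = [("r1", [], False), ("r2", ["r1"], False),
        ("r3", ["r2"], True), ("r4", ["r3"], False)]
print(sorted(load_slice(prog)))  # -> [0, 1, 2]
```

The consumer of the loaded value (r4) stays out of the slice: only the load and the instructions that compute its address run ahead in the separate pipeline.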