CASINO Core Microarchitecture: Generating Out-of-Order Schedules Using Cascaded In-Order Scheduling Windows

Jeong, Ipoom; Park, Seihoon; Lee, Changmin; Ro, Won Woo

doi:10.1109/hpca47549.2020.00039

Cited by 10 publications

(9 citation statements)

References 63 publications

“…We simulate the graph algorithms in two iterations where the first iteration is used to warm up the caches, and we report performance for the second iteration obtained through detailed simulation. Also, we skip the initialization and preprocessing steps during the formation of the graph using an in-built graph generator with a size of 2^18 nodes, formed according to the Kronecker distribution satisfying the Graph500 specifications.…”

Section: Methods (mentioning)

confidence: 99%

“…Shioya et al [35] propose the front-end execution architecture which executes instructions that have their operands ready in the front-end of the pipeline; other non-ready instructions are dispatched to the out-of-order back-end. CASINO [18] pursues a similar goal by augmenting an in-order core with an additional speculative queue from which ready instructions are executed ahead of a traditional in-order instruction queue. CASINO adds significant complexity over an in-order core because of the CAM-based selection logic in the speculative queue and dynamic memory disambiguation.…”

Section: Related Work (mentioning)

confidence: 99%

The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture

Lakshminarasimhan, Naithani, Feliu, et al. 2022

ACM Trans. Archit. Code Optim.

Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ideally provide high performance in a power- and cost-efficient manner. Recently proposed slice-out-of-order (sOoO) cores identify backward slices of memory operations which they execute out-of-order with respect to the rest of the dynamic instruction stream for increased instruction-level and memory-hierarchy parallelism. Unfortunately, constructing backward slices is imprecise and hardware-inefficient, leaving performance on the table. In this article, we propose Forward Slice Core (FSC), a novel core microarchitecture that builds on a stall-on-use in-order core and extracts more instruction-level and memory-hierarchy parallelism than slice-out-of-order cores. FSC does so by identifying and steering forward slices (rather than backward slices) to dedicated in-order FIFO queues. Moreover, FSC puts load-consumers that depend on L1 D-cache misses on the side to enable younger independent load-consumers to execute faster. Finally, FSC eliminates the need for dynamic memory disambiguation by replicating store-address instructions across queues. Considering 3-wide pipeline configurations, we find that FSC improves performance by 27.1%, 21.1%, and 14.6% on average compared to Freeway, the state-of-the-art sOoO core, across SPEC CPU2017, GAP, and DaCapo, respectively, while at the same time incurring reduced hardware complexity. Compared to an OoO core, FSC reduces power consumption by 61.3% and chip area by 47%, providing a microarchitecture with high performance at low complexity.
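The Related Work statement quoted above describes CASINO as an in-order core augmented with a speculative queue from which ready instructions are executed ahead of a traditional in-order instruction queue. The toy Python model below is a minimal sketch of that idea only, not the authors' design. It makes several assumptions that are not in the source: a non-ready instruction at the head of the speculative queue is passed to the in-order queue, each queue examines one head per cycle, dependences are given as indices of earlier instructions (so register renaming is not modeled), and memory disambiguation is ignored. The names `Instr`, `simulate`, and `PROGRAM` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple = ()   # indices of earlier instructions this one waits on
    latency: int = 1   # cycles from issue until the result is available


def simulate(program, cascaded=True):
    """Return (issue_order, finish_cycle) for a toy two-queue core.

    cascaded=True:  a non-ready head of the speculative queue is moved to
                    the in-order queue (assumption of this sketch).
    cascaded=False: the speculative queue stalls on a non-ready head,
                    which behaves like a plain stall-on-use in-order core.
    """
    done_at = {}                         # instr index -> cycle result is ready
    spec_q = deque(range(len(program)))  # holds instructions in program order
    inorder_q = deque()
    order = []
    cycle = 0

    def ready(i):
        return all(d in done_at and done_at[d] <= cycle
                   for d in program[i].deps)

    def issue(i):
        done_at[i] = cycle + program[i].latency
        order.append(program[i].name)

    while spec_q or inorder_q:
        # Older instructions waiting in the in-order queue get the first look.
        if inorder_q and ready(inorder_q[0]):
            issue(inorder_q.popleft())
        if spec_q:
            if ready(spec_q[0]):
                issue(spec_q.popleft())               # executed ahead of the in-order queue
            elif cascaded:
                inorder_q.append(spec_q.popleft())    # pass along, keep scanning
        cycle += 1

    return order, max(done_at.values())


# Two independent load/use pairs; load latency of 4 cycles is an arbitrary choice.
PROGRAM = [
    Instr("ld A", latency=4),
    Instr("add A", deps=(0,)),
    Instr("ld B", latency=4),
    Instr("add B", deps=(2,)),
]

if __name__ == "__main__":
    print(simulate(PROGRAM, cascaded=True))
    print(simulate(PROGRAM, cascaded=False))
```

In this toy setup the cascaded variant issues the second load before the first add, so the two load latencies overlap, while the stall-on-use variant issues strictly in program order. The sketch says nothing about the CAM-based selection logic or dynamic memory disambiguation that the quoted statement attributes to CASINO.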

“…We conducted simulation using gem5 [6] which is configured with 2-issue, 2.5GHz dual-core processor with 32KB/64KB 2-way set-associative L1 instruction/data caches (2 cycles hit) and a unified 128KB 16-way set-associative L2 cache (20 cycles hit) to model an ARM Cortex-A53 processor [2]. The store buffer size is set to 4 as with the recent work that simulates the Cortex-A53 core [28], and the default CLQ size is 2. According to prior works [67][68][69][70], 300-30 deployed acoustic sensors can achieve 10-30 cycles of the worst-case detection latency (WCDL) with the area cost of less than 1% of die size, and therefore we set the default WCDL to 10 cycles.…”

Section: Implementation and Evaluation, 6.1 Methodology (mentioning)

confidence: 99%

Turnpike: Lightweight Soft Error Resilience for In-Order Cores

Zeng, Kim, Lee, et al. 2021

MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Acoustic-sensor-based soft error resilience is particularly promising, since it can verify the absence of soft errors and eliminate silent data corruptions at a low hardware cost. However, the state-of-the-art work incurs a significant performance overhead for in-order cores due to frequent structural/data hazards during the verification. To address the problem, this paper presents Turnpike, a compiler/architecture co-design scheme that can achieve lightweight yet guaranteed soft error resilience for in-order cores. The key idea is that many of the data computed in the core can bypass the soft error verification without compromising the resilience. Along with simple microarchitectural support for realizing the idea, Turnpike leverages compiler optimizations to further reduce the performance overhead. Experimental results with 36 benchmarks demonstrate that Turnpike only incurs a 0-14% run-time overhead on average while the state-of-the-art incurs a 29-84% overhead when the worst-case latency of the sensor based error detection is 10-50 cycles.

“…As the OoO queue handles fewer instructions, FIFOrder reduces its depth and width, thus reducing the scheduling energy cost. Another recent architecture, CASINO core [16], also targets ready instructions to simplify instruction scheduling.…”

Section: Energy-efficient Core Design (mentioning)

confidence: 99%

Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores

Kumar, Alipour, Black-Schaffer 2022

ACM Trans. Archit. Code Optim.

Exploiting memory-level parallelism (MLP) is crucial to hide long memory and last-level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex and energy-hungry hardware. This work revisits slice-out-of-order (sOoO) cores as an energy-efficient alternative for MLP exploitation. sOoO cores achieve energy efficiency by constructing and executing slices of MLP-generating instructions out-of-order only with respect to the rest of instructions; the slices and the remaining instructions, by themselves, execute in-order. However, we observe that existing sOoO cores miss significant MLP opportunities due to their dependence-oblivious in-order slice execution, which causes dependent slices to frequently block MLP generation. To boost MLP generation, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them from blocking subsequent independent slices and MLP extraction. The proposed core incurs minimal area and power overheads, yet approaches the MLP benefits of fully OoO cores. Our evaluation shows that Freeway delivers 12% better performance than the state-of-the-art sOoO core and is within 7% of the MLP limits of full OoO execution.
