Abstract. In this paper, we examine the trade-offs in performance and area due to customizing the datapath and instruction set architecture of a soft VLIW processor implemented in a high-density FPGA. In addition to describing our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed to customize, generate, and program our processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that using on-chip resources and implementing microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.
In this paper we examine the performance and area trade-offs resulting from customizing the datapath and instruction set architecture of a soft VLIW processor. In addition to describing the datapath and instruction set architecture of our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed and used to customize, generate, implement, and program the processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.
Abstract. Modern processors use speculative execution to improve performance. However, speculative execution requires a checkpoint/restore mechanism to repair the machine's state whenever speculation fails. Existing checkpoint/restore mechanisms do not scale well for processors with relatively large windows (i.e., 128 entries or more). This work presents Turbo-ROB (TROB), a checkpoint/restore recovery accelerator that can complement or replace existing checkpoint/restore mechanisms. We show that the TROB improves performance and reduces resource requirements compared to a conventional reorder buffer (ROB) mechanism. For example, on average, a 64-entry TROB matches the performance of a 512-entry ROB, while a 128-entry and a 512-entry TROB outperform the 512-entry ROB by 6.8% and 9.1%, respectively. We also demonstrate that the TROB improves performance when used with register alias table checkpoints, effectively reducing the need for more checkpoints and avoiding the latency and energy increase these would imply.
We present two full-custom implementations of the Register Alias Table (RAT) for a 4-way superscalar dynamically-scheduled processor in a commercial 130 nm CMOS technology. The implementations differ in the way they organize the embedded global checkpoints (GCs) that support speculative execution. In the first implementation, representative of early designs, the GCs are organized as shift registers. In the second implementation, representative of more recent proposals, the GCs are organized as random access buffers. We measure the impact of increasing the number of GCs on the latency, energy, and area of the RAT. The results support the importance of recent techniques that reduce the number of GCs while maintaining performance.
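To make the role of global checkpoints concrete, the following is a minimal software sketch of a rename table with a small, randomly addressable pool of checkpoint slots. It is a toy model written for illustration, not the circuit described in the abstract: the class name `RenameTable`, its methods, and the unbounded physical register counter (there is no free list) are all assumptions, and a real design would also discard any younger checkpoints on a restore.

```python
class RenameTable:
    """Toy register alias table (hypothetical model, not the paper's design)."""

    def __init__(self, num_arch_regs, num_checkpoints):
        # Initially, architectural register i maps to physical register i.
        self.mapping = list(range(num_arch_regs))
        # Assumption: physical registers are never reclaimed in this sketch.
        self.next_phys = num_arch_regs
        # Random-access buffer: each slot holds a full copy of the map, or None.
        self.slots = [None] * num_checkpoints

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for a destination operand."""
        phys = self.next_phys
        self.next_phys += 1
        self.mapping[arch_reg] = phys
        return phys

    def take_checkpoint(self):
        """Copy the current map into a free slot; return its index or None."""
        for index, slot in enumerate(self.slots):
            if slot is None:
                self.slots[index] = list(self.mapping)
                return index
        return None  # All slots busy: the caller must stall or go without.

    def release(self, index):
        """Free a slot once the guarded speculation is known to be correct."""
        self.slots[index] = None

    def restore(self, index):
        """Repair the map after failed speculation, then free the slot."""
        self.mapping = list(self.slots[index])
        self.slots[index] = None


rat = RenameTable(num_arch_regs=4, num_checkpoints=2)
cp = rat.take_checkpoint()   # e.g. taken at a predicted branch
rat.rename_dest(1)           # speculative rename past the branch
rat.restore(cp)              # misprediction: the map returns to its old state
```

Every slot stores a complete copy of the map, so in this model the storage grows linearly with the number of checkpoints, which is one intuition for why the number of GCs matters.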
Checkpoint prediction and intelligent management have recently been proposed for reducing the number of coarse-grain checkpoints needed to achieve high performance through speculative execution. In this work, we take a closer look at various checkpoint prediction and management alternatives, comparing their performance and requirements as the scheduler window size increases. We also study a few additional design choices. The key contribution of this work is BranchTap, a novel checkpoint-aware speculation strategy that temporarily throttles speculation to reduce recovery cost while allowing speculation to proceed when it is likely to boost performance. BranchTap dynamically adapts to application behavior. We demonstrate that for a 1K-entry window processor with a FIFO of just four checkpoints, our adaptive speculation control mechanism leads to an average performance degradation of just 1.49% compared to a processor that has an infinite number of checkpoints. Average performance degradation without BranchTap, that is, using just prediction-based checkpoint allocation, is 2.08%, so BranchTap represents an improvement of 28.3%. For the same configuration, BranchTap decreases the worst-case deterioration from 8.99% to 5.64%.
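As a quick sanity check on how these figures relate, the relative improvement can be recomputed from the rounded averages quoted above. The helper name `relative_improvement` is ours, and the result from rounded inputs is only approximate: it comes out at roughly 28%, in line with the reported 28.3%.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of a cost metric relative to its baseline value."""
    return (baseline - improved) / baseline


# Average degradation versus infinite checkpoints, in percent (from the abstract).
average_gain = relative_improvement(baseline=2.08, improved=1.49)

# Worst-case degradation, in percent (from the abstract).
worst_case_gain = relative_improvement(baseline=8.99, improved=5.64)

print(f"average: {average_gain:.1%}, worst case: {worst_case_gain:.1%}")
```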