Patrick Akl scite author profile

2007

Abstract. In this paper, we examine the trade-offs in performance and area due to customizing the datapath and instruction set architecture of a soft VLIW processor implemented in a high-density FPGA. In addition to describing our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed to customize, generate, and program our processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that using onchip resources and implementing microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.

Datapath and ISA Customization for Soft VLIW Processors

Saghir

El-Majzoub

2006

In this paper we examine the performance and area trade-offs resulting from customizing the datapath and instruction set architecture of a soft VLIW processor. In addition to describing the datapath and instruction set architecture of our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed and used to customize, generate, implement, and program the processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.

Turbo-ROB: A Low Cost Checkpoint/Restore Accelerator

Moshovos

Abstract. Modern processors use speculative execution to improve performance. However, speculative execution requires a checkpoint/restore mechanism to repair the machine's state whenever speculation fails. Existing checkpoint/restore mechanisms do not scale well for processors with relatively large windows (i.e., 128 or more). This work presents Turbo-ROB, a checkpoint/restore recovery accelerator that can complement or replace existing checkpoint/restore mechanisms. We show that the Turbo-ROB improves performance and reduces resource requirements compared to a conventional Re-order Buffer mechanism. For example, on the average, a 64-entry TROB matches the performance of a 512-entry ROB, while a 128-and a 512-entry TROB outperform the 512-entry ROB by 6.8% and 9.1% respectively. We also demonstrate that the TROB improves performance with register alias table checkpoints effectively reducing the need from more checkpoints and the latency and energy increase these would imply.

On the latency, energy and area of checkpointed, superscalar register alias tables

Safi

Moshovos

et al. 2007

We present two full-custom implementations of the Register Alias Table (RAT) for a 4-way superscalar dynamically-scheduled processor in a commercial 130nm CMOS technology. The implementations differ in the way they organize the embedded global checkpoints (GCs) which support speculative execution. In the first implementation, representative of early designs, the GCs are organized as shift registers. In the second implementation, representative of more recent proposals, the GCs are organized as random access buffers. We measure the impact of increasing the number of GCs on the latency, energy, and area of the RAT. The results support the importance of recent techniques that reduce the number of GCs while maintaining performance.

BranchTap

Moshovos

2006

Checkpoint prediction and intelligent management have been recently proposed for reducing the number of coarse-grain checkpoints needed to achieve high performance through speculative execution. In this work, we take a closer look at various checkpoint prediction and management alternatives, comparing their performance and requirements as the scheduler window size increases. We also study a few additional design choices. The key contribution of this work is BranchTap, a novel checkpoint-aware speculation strategy that temporarily throttles speculation to reduce recovery cost while allowing speculation to proceed when it is likely to boost performance. BranchTap dynamically adapts to application behavior. We demonstrate that for a 1K-entry window processor with a FIFO of just four checkpoints, our adaptive speculation control mechanism leads to an average performance degradation of just 1.49% compared to a processor that has an infinite number of checkpoints. This represents an improvement of 28.3% over using just predictionbased checkpoint allocation. Average performance degradation without BranchTap is 2.08%. For the same configuration, BranchTap decreases the worst case deterioration from 8.99% to 5.64%.