Zhen Luo scite author profile

Martonosi

2000

IEEE Trans. Comput.

The speed of arithmetic calculations in configurable hardware is limited by carry propagation, even with the dedicated hardware found in recent FPGAs. This paper proposes and evaluates an approach called delayed addition that reduces the carry-propagation bottleneck and improves the performance of arithmetic calculations. Our approach employs the idea used in Wallace trees to store the results in an intermediate form and delay addition until the end of a repeated calculation such as accumulation or dot-product; this effectively removes carry propagation overhead from the calculation's critical path. We present both integer and floating-point designs that use our technique. Our pipelined integer multiply-accumulate (MAC) design is based on a fairly traditional multiplier design, but with delayed addition as well. This design achieves a 72 MHz clock rate on an XC4036xla-9 FPGA and a 170 MHz clock rate on an XCV300epq240-8 FPGA. Next, we present a 32-bit floating-point accumulator based on delayed addition. Here, delayed addition requires a novel alignment technique that decouples the incoming operands from the accumulated result. A conservative version of this design achieves a 40 MHz clock rate on an XC4036xla-9 FPGA and a 97 MHz clock rate on an XCV100epq240-8 FPGA. We also present a 32-bit floating-point accumulator design with compiler-managed overflow avoidance that achieves an 80 MHz clock rate on an XC4036xla-9 FPGA and a 150 MHz clock rate on an XCV100epq240-8 FPGA.
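The delayed-addition idea in the abstract above can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: it assumes a fixed 32-bit unsigned word, and the function names `carry_save_add` and `delayed_accumulate` are hypothetical. Each step folds a new operand into a (sum, carry) pair using only bitwise logic, so no carry ripples across the word, and a single ordinary addition is performed at the very end.

```python
def carry_save_add(s, c, x, width=32):
    """Fold x into the (sum, carry) pair with a 3:2 compressor step.

    Every output bit depends only on the same bit position of the
    inputs, so there is no carry chain inside this step.
    """
    mask = (1 << width) - 1
    new_s = (s ^ c ^ x) & mask                            # per-bit sum
    new_c = (((s & c) | (s & x) | (c & x)) << 1) & mask   # per-bit carry, moved up one place
    return new_s, new_c


def delayed_accumulate(values, width=32):
    """Accumulate values modulo 2**width, delaying the real addition to the end."""
    mask = (1 << width) - 1
    s, c = 0, 0
    for x in values:
        s, c = carry_save_add(s, c, x & mask, width)
    # The only carry-propagating addition happens here, once.
    return (s + c) & mask


if __name__ == "__main__":
    print(delayed_accumulate([1, 2, 3, 4]))
```

In hardware, the benefit comes from the per-step logic depth being independent of the word width; the Python version only demonstrates that the intermediate (sum, carry) form preserves the result.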

Cost-effective multiplication with enhanced adders for multimedia applications

Lee²

Consumer multimedia devices such as DVD players and cameras are very cost-sensitive. The MPEG and JPEG type algorithms they use tend to have multiplications by constants, rather than by variables. In this paper, we focus on cost-effective architectural support for such multiplications. We propose using adders enhanced with pre-shifters to perform efficient constant multiplies. This reduces the cost: we show in the paper that an integer multiplier takes about three times the latency and three to four times the area of our delay/area-efficient preshift_adder design, which performs the preshift_add instructions. However, it is not easy to find the shortest instruction sequence for each constant multiply. We show our methodology for achieving this using a Directed Acyclic Graph (DAG) approach to generate the shortest or nearly shortest sequence of instructions for every constant multiplier up to 15 bits. These optimal instruction sequences can be substituted by a programmer or compiler when a multiply by a constant is needed. Our performance results show that we can improve the performance while reducing the cost of constant multiplications. We use four fixed-point cases to evaluate the instruction sequences that our algorithm generates. We use CI.F to denote that the positive constant multiplier has I bits of integer and F bits of fraction. The four cases are C8.0, C12.0, C2.10 and C3.12. While we are more interested in cases like C2.10 and C3.12 for our multimedia applications, where constants tend to be fractions with few integer bits, we generate C8.0 and C12.0 sequences to compare our results with earlier work in [4] on constant multiplication by integers. Sections 2 and 3 describe our DAG-based search algorithm for finding the shortest instruction sequence for the C8.0 case and the nearly shortest instruction sequences for the C12.0, C2.10 and C3.12 cases. Section 4 presents our performance results and comparisons to earlier work [4]. Section 5 presents our design of a preshift_adder. Based on the results from Sections 4 and 5, we discuss the performance/area gain we achieve on an optimized DCT/IDCT algorithm in Section 6.
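To make the constant-multiply idea concrete, here is a small sketch of searching for a short shift-and-add sequence. It is not the paper's DAG-based algorithm: it is a plain iterative-deepening brute-force search, and it assumes a hypothetical instruction form `(a << k) + b`, `(a << k) - b`, or `b - (a << k)`, where `a` and `b` are earlier results, the original operand, or zero. The names `shortest_shift_add` and `apply_steps` are illustrative only.

```python
def shortest_shift_add(target, max_shift=8, max_steps=4):
    """Find a shortest step list producing `target` times the operand.

    Each step is (a, k, op, b, r): r = (a << k) op b, where a and b are
    multiples of the operand that are already available (1 = operand itself).
    Returns [] for target 1 and None if nothing is found within max_steps.
    """
    limit = target << 1  # prune intermediate multiples far above the target

    def search(values, steps, depth):
        if depth == 0:
            return None
        for a in values:
            for k in range(max_shift + 1):
                shifted = a << k
                if shifted > limit:
                    break
                for b in values + [0]:  # zero operand allows a pure shift
                    for op, r in (("+", shifted + b),
                                  ("-", shifted - b),
                                  ("r-", b - shifted)):
                        if r <= 0 or r > limit or r in values:
                            continue
                        new_steps = steps + [(a, k, op, b, r)]
                        if r == target:
                            return new_steps
                        found = search(values + [r], new_steps, depth - 1)
                        if found is not None:
                            return found
        return None

    if target == 1:
        return []
    # Iterative deepening: the first depth that succeeds is the shortest.
    for depth in range(1, max_steps + 1):
        result = search([1], [], depth)
        if result is not None:
            return result
    return None


def apply_steps(steps, x):
    """Replay a step list on a concrete operand x and return the product."""
    vals = {0: 0, 1: x}
    result = x
    for a, k, op, b, r in steps:
        shifted = vals[a] << k
        if op == "+":
            result = shifted + vals[b]
        elif op == "-":
            result = shifted - vals[b]
        else:  # "r-": reversed subtraction
            result = vals[b] - shifted
        vals[r] = result
    return result


if __name__ == "__main__":
    for step in shortest_shift_add(45):
        print(step)
```

Under these assumptions, multiplying by 45 takes two steps (first 5 = (1 << 2) + 1, then 45 = (5 << 3) + 5). An exhaustive search like this grows quickly with the number of steps, which is consistent with the abstract's remark that only nearly shortest sequences are produced for the wider cases.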

Use of delayed addition techniques to accelerate integer and floating-point calculations in configurable hardware

Martonosi

1998

This paper proposes and evaluates an approach for improving the performance of arithmetic calculations via delayed addition. Our approach employs the idea used in Wallace trees to delay addition until the end of a repeated calculation such as accumulation or dot-product; this effectively removes carry propagation overhead from the calculation's critical path. We present integer and floating-point designs that use this technique. Our pipelined integer multiply-accumulate (MAC) design is based on a fairly traditional multiplier design, but with delayed addition as well. This design achieves a 37 MHz clock rate on an XC4036XL-2 FPGA. Next, we present a 32-bit floating-point accumulator based on delayed addition. Here, delayed addition requires a novel alignment technique that decouples the incoming operands from the accumulated result. A conservative version of this design achieves a 33 MHz clock rate. Finally, we also present a more aggressive 32-bit floating-point accumulator design that achieves a 66 MHz clock rate. These designs demonstrate the utility of delayed addition for accelerating FPGA calculations in both the integer and floating-point domains.

An edge-endpoint-based configurable hardware architecture for VLSI CAD layout design rule checking

Martonosi²

Ashar³

An Edge‐endpoint‐based Configurable Hardware Architecture for VLSI Layout Design Rule Checking

Martonosi

Ashar³

1999

VLSI Design

Previous efforts to build hardware accelerators for VLSI layout Design Rule Checking (DRC) were hobbled by the fact that it is often impractical to build a different rule-checking ASIC each time design rules or fabrication processes change. In this paper, we propose a configurable hardware approach to DRC. It can garner impressive speedups over software approaches, while retaining the flexibility needed to change the rule checker as rules or processes change. Our work proposes an edge-endpoints-based method for performing Manhattan geometry checking and a general scalable architecture for DRC. We then demonstrate our approach by applying this architecture to a set of design rules for the MOSIS SCN4N_SUB process. We have implemented several design rule checks within a single Xilinx XC4013 FPGA and demonstrated overall speedups in excess of 25X over software methods. We used a Compaq Pamette board for the hardware prototyping and achieved a clock rate of 33 MHz.