Accelerating pipelined integer and floating-point accumulations in configurable hardware with delayed addition techniques

Luo, Zhen; Martonosi, Margaret

doi:10.1109/12.841125

Cited by 68 publications

(38 citation statements)

References 15 publications


“…Previous attempts to pipeline floating-point accumulation, such as [6], do so only at the expense of assuming associativity and thereby producing non-compliant results. In contrast, the techniques introduced here show how to exploit associativity while obtaining results identical to the sequential ordering specified in the program.…”

Section: Background A. Non-associativity of Floating-point Accumu… (mentioning)

confidence: 99%

Optimistic Parallelization of Floating-Point Accumulation

Kapre

DeHon

2007

18th IEEE Symposium on Computer Arithmetic (ARITH '07)


Abstract: Floating-point arithmetic is notoriously nonassociative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks.
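The non-associativity described in this abstract can be seen in a small experiment. The sketch below is illustrative only and is not the authors' hardware scheme: `sequential_sum` and `tree_sum` are hypothetical helper names, and the input values are chosen by hand so that rounding differs between the program-order sum and a pairwise (reassociated) sum in IEEE 754 double precision.

```python
import math


def sequential_sum(values):
    """Accumulate left to right, the order a sequential program specifies."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Accumulate pairwise, as a reassociated (parallel) reduction would."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        # Add neighbouring pairs; carry an unpaired last element forward.
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]


# Hand-picked data: adding 1.0 to 1e16 is lost to rounding in double precision.
data = [1e16, 1.0, -1e16, 1.0]

if __name__ == "__main__":
    print("sequential:", sequential_sum(data))
    print("pairwise:  ", tree_sum(data))
    print("exact:     ", math.fsum(data))
```

The two orderings disagree with each other, and both differ from the exactly rounded sum returned by `math.fsum`, which is why a parallel accumulator that must match the sequential result needs some correction step.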


“…Thus, power consumption is reduced, but large precision formats require more cycles to complete the operation. The proposal described in Vangal et al [2006] considers delayed addition, as in Luo and Martonosi [2000], for optimizing the delay of large sums. Moreover, authors optimize the SP FP-MAC by adding in 32 base.…”

Section: Related Work (mentioning)

confidence: 99%
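For readers unfamiliar with the idea of delaying addition in large sums, a generic carry-save accumulator gives the flavor. This is a minimal software sketch under stated assumptions, not the circuit from Luo and Martonosi [2000] or Vangal et al [2006]: it assumes unsigned integers of a fixed `width`, and the names `carry_save_step` and `delayed_accumulate` are hypothetical. Each step combines three values into a sum word and a carry word using only bitwise logic, and the single carry-propagating addition is deferred to the end.

```python
def carry_save_step(sum_bits, carry_bits, operand, width=32):
    """One 3:2 compression step: no carry propagates across bit positions."""
    mask = (1 << width) - 1
    # Per-bit sum of the three inputs (XOR), without carries.
    new_sum = sum_bits ^ carry_bits ^ operand
    # Per-bit majority gives the carries, shifted to the next bit position.
    majority = (sum_bits & carry_bits) | (sum_bits & operand) | (carry_bits & operand)
    new_carry = majority << 1
    return new_sum & mask, new_carry & mask


def delayed_accumulate(operands, width=32):
    """Accumulate modulo 2**width, doing the full addition only once."""
    mask = (1 << width) - 1
    sum_bits, carry_bits = 0, 0
    for x in operands:
        sum_bits, carry_bits = carry_save_step(sum_bits, carry_bits, x & mask, width)
    # The only carry-propagating addition happens here, after the loop.
    return (sum_bits + carry_bits) & mask


if __name__ == "__main__":
    values = [7, 11, 13, 2**31, 2**31 + 5]
    print(delayed_accumulate(values), sum(values) % 2**32)
```

In hardware the benefit is that each loop iteration has a short, width-independent critical path; in this Python model the result simply matches ordinary modular addition.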

Ultra-low-power adder stage design for exascale floating point units

Barrio

Bagherzadeh

Hermida

2014

ACM Trans. Embed. Comput. Syst.


Currently, the most powerful supercomputers can provide tens of petaflops. Future many-core systems are estimated to provide an exaflop. However, the power budget limitation makes these machines still unfeasible and unaffordable. Floating Point Units (FPUs) are critical from both the power consumption and performance points of view of today's microprocessors and supercomputers. Literature offers very different designs. Some of them are focused on increasing performance no matter the penalty, and others on decreasing power at the expense of lower performance. In this article, we propose a novel approach for reducing the power of the FPU without degrading the rest of parameters. Concretely, this power reduction is also accompanied by an area reduction and a performance improvement. Hence, an overall energy gain will be produced. According to our experiments, our proposed unit consumes 17.5%, 23% and 16.5% less energy for single, double and quadruple precision, with an additional 15%, 21.5% and 14.5% delay reduction, respectively. Furthermore, area is also diminished by 4%, 4.5% and 5%.


“…They can be used to implement complex applications, including floating-point ones [Scrofano et al 2008; Underwood 2004; Woods and VanCourt 2008; Zhuo and Prasanna 2008]. This has sparked the development of floating-point units targeting FPGAs, first for the basic operators mimicking those found in microprocessors [Belanovic and Leeser 2002; Louca et al 1996; Shirazi et al 1995], then more recently for operators which are more FPGA-specific, for instance accumulators [Bodnar et al 2006; de Dinechin et al 2008; Luo and Martonosi 2000; Wang et al 2006] or elementary functions Doss and Riley 2004; Pineiro et al 2004].…”

Section: Introduction 1. Floating Point Acceleration Using FPGAs (mentioning)

confidence: 99%

Floating-Point Exponentiation Units for Reconfigurable Computing

Dinechin

Echeverría

López‐Vallejo

et al.

2013

ACM Trans. Reconfigurable Technol. Syst.


The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various subcomponents: a floating-point logarithm, a modified floating-point exponential, and a truncated floating-point multiplier. A parameterized architecture generator in the open-source FloPoCo project is presented in detail and evaluated.

