12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
DOI: 10.1109/fccm.2004.21
Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance

Cited by 113 publications (90 citation statements)
References 14 publications
“…The small matrix multiplies are implemented with an array of multiply-accumulates (MACCs), as described for large matrix multiplies in [Underwood and Hemmert 2004]. In principle, the DGEMM operation is compute bound, since it performs 2N³ operations over only 3N² data, with 4N² memory operations.…”
Section: Dense Matrix Multiply
confidence: 99%
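The arithmetic the statement above relies on can be checked directly: an N×N DGEMM performs 2N³ floating-point operations against 4N² memory operations, so its arithmetic intensity grows linearly with N. A minimal sketch (the function name is illustrative, not from the cited work):

```python
# Sketch of the compute-bound argument quoted above:
# N x N DGEMM does 2*N^3 flops (N^3 multiplies + N^3 adds)
# over 3*N^2 input data, with 4*N^2 memory operations
# (read A, B, C; write C).
def dgemm_arithmetic_intensity(n):
    flops = 2 * n ** 3      # total floating-point operations
    mem_ops = 4 * n ** 2    # total memory operations
    return flops / mem_ops  # simplifies to n / 2


# Intensity grows with n, so for large matrices compute,
# not memory traffic, is the bottleneck.
print(dgemm_arithmetic_intensity(1000))  # → 500.0
```

Because the ratio is N/2, doubling the matrix dimension doubles the work done per memory operation, which is why large dense matrix multiply is the canonical compute-bound kernel.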
“…This led researchers to begin by focusing on kernel operations that are used in HPC and can be provided through a standard library interface. Operations from BLAS [Underwood and Hemmert 2004; Zhuo and Prasanna 2004; Dou et al. 2005; Zhuo and Prasanna 2005a; Zhuo and Prasanna 2005b] to FFTs [Hemmert and Underwood 2005] to the sparse matrix operations at the core of an iterative solver [deLorimier and DeHon 2005; Zhuo and Prasanna 2005c] and even a full CG solver [Morris et al. 2006] have been studied. The fundamental challenge for each of these efforts is the communications with the host.…”
Section: Introduction
confidence: 99%
“…In previous related work, Underwood has performed a study which compared the performance of dot-products in FPGAs and CPUs [6]. In this 2004 paper it was predicted that FPGA-based floating-point operations would overtake CPUs by at least an order of magnitude by 2009.…”
Section: Introduction
confidence: 99%
“…FPGAs are now able to provide high computational parallelism as well as I/O parallelism. They have become an attractive option to accelerate scientific applications [18,20].…”
Section: Introduction
confidence: 99%