Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/1046192.1046203

Floating-point sparse matrix-vector multiply for FPGAs

Abstract: We also analyze the asymptotic efficiency of our architecture as parallelism scales using a constant-Rent-parameter matrix model. This demonstrates that our data placement techniques provide an asymptotic scaling benefit. While FPGA performance is attractive, higher performance is possible if we re-balance the hardware resources in FPGAs with embedded memories. We show that sacrificing half the logic area for memory area rarely degrades performance and improves performance for large matrices by up to 5 times.
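The kernel in question is sparse matrix-vector multiply (SMVM), y = A·x with A stored in a compressed sparse format. As a point of reference only (the paper's actual architecture and data placement scheme are hardware-specific and not reproduced here), the sketch below shows a CSR-format SMVM with rows statically partitioned across processing elements; the csr_matrix type and smvm_pe function are illustrative names, not taken from the paper.

#include <stddef.h>

typedef struct {
    size_t        n_rows;
    const size_t *row_ptr;   /* length n_rows + 1; row i spans [row_ptr[i], row_ptr[i+1]) */
    const size_t *col_idx;   /* column index of each stored nonzero */
    const double *val;       /* nonzero values */
} csr_matrix;

/* y = A * x, with PE `pe` of `n_pes` owning one contiguous block of rows. */
void smvm_pe(const csr_matrix *A, const double *x, double *y,
             size_t pe, size_t n_pes)
{
    size_t rows_per_pe = (A->n_rows + n_pes - 1) / n_pes;   /* ceiling division */
    size_t r0 = pe * rows_per_pe;
    size_t r1 = r0 + rows_per_pe < A->n_rows ? r0 + rows_per_pe : A->n_rows;

    for (size_t i = r0; i < r1; i++) {
        double acc = 0.0;
        for (size_t j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j++)
            acc += A->val[j] * x[A->col_idx[j]];             /* irregular read of x */
        y[i] = acc;
    }
}

How rows and the corresponding entries of x are assigned to PEs determines how much of x each PE must hold locally and how much must be communicated, which is the data placement question the abstract's scaling analysis addresses.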

Cited by 117 publications (97 citation statements); references 13 publications.
“…Using 32 leaf processing FPGAs (512 PEs), we are able to sustain a per-leaf processing rate of 3 Gflops. More details on our first-generation FPGA-based SMVM implementation are reported in [deLorimier05].…”
Section: Bellman-Ford
confidence: 99%
“…For example, on Sparse Matrix-Vector Multiplication (SMVM), processor-based machines typically achieve only 1-15% of their potential performance [deLorimier05]. While caching, banking, DMA block transfer, and strided prefetch allow these machines to efficiently process dense matrix operations or regular graphs, large data structures coupled with irregular data access defeat these simple optimizations.…”
Section: Introduction
confidence: 99%
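To make the quoted point concrete, here is a hypothetical side-by-side of a dense row walk and a CSR row walk (function names are illustrative): the dense loop streams memory at unit stride, which caches and strided prefetchers exploit, while the sparse loop reads x through a column-index array, so consecutive loads are data-dependent and land at irregular addresses.

#include <stddef.h>

/* Dense row: unit-stride, predictable access that caching and prefetch handle well. */
double dense_row_dot(const double *row, const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t j = 0; j < n; j++)
        acc += row[j] * x[j];
    return acc;
}

/* CSR row: x is read through col_idx, so consecutive loads can land
 * anywhere in x -- the irregular access pattern the quote refers to. */
double sparse_row_dot(const double *val, const size_t *col_idx,
                      size_t nnz, const double *x)
{
    double acc = 0.0;
    for (size_t j = 0; j < nnz; j++)
        acc += val[j] * x[col_idx[j]];
    return acc;
}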
“…This led researchers to begin by focusing on kernel operations that are used in HPC and can be provided through a standard library interface. Operations from BLAS [Underwood and Hemmert 2004; Zhuo and Prasanna 2004; Dou et al. 2005; Zhuo and Prasanna 2005a; Zhuo and Prasanna 2005b] to FFTs [Hemmert and Underwood 2005] to the sparse matrix operations at the core of an iterative solver [deLorimier and DeHon 2005; Zhuo and Prasanna 2005c] and even a full CG solver [Morris et al. 2006] have been studied. The fundamental challenge for each of these efforts is the communication with the host.…”
Section: Introduction
confidence: 99%
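For context on why SMVM sits "at the core of an iterative solver", here is a minimal sketch of a standard unpreconditioned conjugate-gradient loop (assumed textbook CG, not any specific cited implementation): each iteration performs one SMVM plus a handful of inner products and vector updates, so SMVM throughput and host-accelerator transfer cost dominate. The smvm routine is left as an external placeholder standing in for the offloaded kernel.

#include <stddef.h>

/* Placeholder for the offloaded kernel: y = A * x for the sparse matrix A. */
extern void smvm(const void *A, const double *x, double *y, size_t n);

static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Unpreconditioned CG for A*x = b; r, p, Ap are caller-provided work vectors.
 * Each iteration: one SMVM, two inner products, three vector updates. */
void cg(const void *A, const double *b, double *x,
        double *r, double *p, double *Ap,
        size_t n, size_t max_iter, double tol)
{
    smvm(A, x, Ap, n);                                /* r = b - A*x, p = r */
    for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }

    double rr = dot(r, r, n);
    for (size_t k = 0; k < max_iter && rr > tol * tol; k++) {
        smvm(A, p, Ap, n);                            /* dominant kernel: SMVM */
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}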
“…To explore this acceleration, a number of different hardware architectures have been investigated. These architectures include Connection Machines [11], Cell Processors [12], Graphical Processing Units (GPUs) [13] and FPGAs [14]. A widely implemented comparative benchmark for floating-point computations is the General Matrix Multiply (GEMM) subroutine, part of the Basic Linear Algebra Subprograms (BLAS) library [15].…”
Section: Architectures for Scientific Computation
confidence: 99%
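For readers unfamiliar with the benchmark named above, GEMM computes C = alpha·A·B + beta·C over dense operands. A naive reference version is shown below for clarity only; tuned BLAS libraries block, vectorize, and parallelize this loop nest.

#include <stddef.h>

/* Reference GEMM for row-major n x n operands: C = alpha*A*B + beta*C. */
void gemm_ref(size_t n, double alpha, const double *A, const double *B,
              double beta, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}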
“…Due to the domination of the algorithm by inner-products, known to map well to FPGAs [14][25], CG is well suited, even for small dense systems. The FPGA allows the construction of a data-path specialised not only to the CG algorithm, but to the order of the matrix.…”
Section: Previous FPGA Implementations
confidence: 99%
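A loose software analogue of specialising the data-path to the order of the matrix (purely illustrative, not the cited design): when the system size is a compile-time constant, the inner products can be fully unrolled into a fixed pipeline of multiply-adds.

#include <stddef.h>

#define ORDER 4   /* hypothetical fixed system size known at compile/synthesis time */

/* Fixed-trip-count inner product: fully unrollable into a chain of
 * multiply-adds, the software analogue of an FPGA data-path sized to ORDER. */
static double dot_fixed(const double a[ORDER], const double b[ORDER])
{
    double s = 0.0;
    for (int i = 0; i < ORDER; i++)
        s += a[i] * b[i];
    return s;
}

/* Dense matrix-vector product of fixed order: one dot_fixed per row. */
static void matvec_fixed(const double A[ORDER][ORDER],
                         const double x[ORDER], double y[ORDER])
{
    for (int i = 0; i < ORDER; i++)
        y[i] = dot_fixed(A[i], x);
}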