ASAP 2011 - 22nd IEEE International Conference on Application-Specific Systems, Architectures and Processors
DOI: 10.1109/asap.2011.6043234
A high-performance, low-power linear algebra core

Abstract: Achieving high performance while reducing power consumption is the key question as technology scaling reaches its limits. It is well accepted that application-specific custom hardware can achieve orders-of-magnitude improvements in efficiency. The question is whether such efficiency can be maintained while providing enough flexibility to implement a broad class of operations. In this paper, we aim to answer this question for the domain of matrix computations. We propose a design of a novel linear algebra p…

Cited by 22 publications (26 citation statements). References: 35 publications.
“…For the vector norm, we use the original algorithm as the baseline, which requires 257, 769, or 1025 operations for the corresponding vector norms of size k = 64, 128, and 256. Since our implementation effectively reduces the number of actually required computations, the extensions achieve higher GOPS/W than the peak GFLOPS/W reported for the LAC in [5].…”
Section: B. Performance and Efficiency Analysis
confidence: 81%
“…The microarchitecture of the Linear Algebra Core (LAC) is illustrated in Figure 1. The LAC achieves orders-of-magnitude better efficiency in power and area consumption than conventional general-purpose architectures [5]. It is specifically optimized to perform the rank-1 updates that form the inner kernels of parallel matrix multiplication.…”
Section: Architecture
confidence: 99%
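As a concrete illustration of that inner kernel, the sketch below (plain C, with assumed row-major layouts and hypothetical function names; it is not the LAC's actual datapath or microcode) expresses a blocked matrix multiplication C += A*B as a sequence of k rank-1 updates, each adding the outer product of one column of A and one row of B to C.

```c
#include <stddef.h>

/* Sketch only: C (n x n) += A (n x k) * B (k x n), row-major storage.
 * The multiplication is realized as k rank-1 updates: for each p,
 * C += A(:,p) * B(p,:). A hardware core would keep C resident in its
 * PE array and stream one column of A and one row of B per update. */
static void gemm_by_rank1_updates(size_t n, size_t k,
                                  double *C,
                                  const double *A,
                                  const double *B)
{
    for (size_t p = 0; p < k; ++p)          /* one rank-1 update per step */
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```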
“…FFT algorithms typically perform poorly on general-purpose platforms because the power-of-two strides of the FFT algorithm interact poorly with set-associative caches, set-associative address-translation mechanisms, and power-of-two-banked memory subsystems [1]. FFTW, developed by M. Frigo et al., is known as the fastest software implementation of the FFT algorithm.…”
Section: Introduction
confidence: 99%
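To see why power-of-two strides are problematic, the small program below (cache geometry is an illustrative assumption, not taken from [1]) maps a sequence of strided addresses to cache sets: with a 32 KiB stride, 64-byte lines, and 512 sets, every access lands in the same set, so a set-associative cache starts conflicting long before its capacity is exhausted.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed (illustrative) cache geometry, not from the cited paper. */
    const unsigned line_bytes = 64;       /* cache line size              */
    const unsigned num_sets   = 512;      /* number of cache sets         */
    const unsigned stride     = 32768;    /* power-of-two stride in bytes */

    for (unsigned i = 0; i < 8; ++i) {
        unsigned addr = i * stride;
        unsigned set  = (addr / line_bytes) % num_sets;
        /* Every strided access maps to set 0, so even an 8-way
         * associative cache conflicts after only 8 accesses. */
        printf("access %u: address %u -> set %u\n", i, addr, set);
    }
    return 0;
}
```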