Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

Merchant, Farhad; Chattopadhyay, Anupam; Raha, Soumyendu; Nandy, S. K.; Narayan, Ranjani

doi:10.1142/s0129626417500062

Cited by 10 publications

(4 citation statements)

References 22 publications

“…FPS has several resources to perform computations. In this exposition, we use carefully designed DOT4, a square root, and a divider for realization of DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT routines [21] [22]. Logical place of arithmetic units is shown in the figure 12 and structure of DOT4 is shown in figure 13.…”

Section: Custom Realization Of Householder Transform and Results

mentioning

confidence: 99%

“…Realization of MHT outperforms realization of DGEMM as shown in figure 14(d). We also show scalability of our solution by attaching PE as a CFU in REDEFINE. Due to availability of double precision floating point arithmetic units like adder, multiplier, square root, and divider, we emphasize the realization of DGEQR2 and DGEQRF using HT and MHT [21] [22]. Organization of the paper is as follows: In section 2, we briefly discuss REDEFINE and some of the recent realizations of QR factorization.…”

mentioning

confidence: 99%

“…We show that sequential realization in PE and parallel realization of GGR based QR factorization in REDEFINE are scalable. Furthermore, it is shown that the speed-up in parallel realization in REDEFINE over sequential realization in PE is commensurate with the hardware resources employed in REDEFINE and the speed-up asymptotically approaches theoretical peak of REDEFINE CGRA. For our implementations in PE and REDEFINE, we have used the double precision Floating Point Unit (FPU) presented in [14] with recommendations presented in [15]. Organization of the paper is as follows: In section 2, we discuss CGR, REDEFINE and some of the FPGA, multicore, and GPGPU based realizations of QR factorization.…”

mentioning

confidence: 99%

Efficient Realization of Householder Transform Through Algorithm-Architecture Co-Design for Acceleration of QR Factorization

Merchant

Vatwani

Chattopadhyay

et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite

We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR takes 33% fewer multiplications compared to GR. For custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.

Index Terms: Parallel computing, orthogonal transforms, dense linear algebra, multiprocessor system-on-chip, instruction level parallelism

Ranjani Narayan has over 15 years of experience at IISc and 9 years at Hewlett Packard. She has vast work experience in a variety of fields: computer architecture, operating systems, and special purpose systems. She has also worked at the Technical University of Delft, The Netherlands, and Massachusetts Institute of Technology, Cambridge, USA. During her tenure at HP, she worked on various areas in operating systems and hardware monitoring and diagnostics systems. She has numerous research publications. She is currently Chief Technology Officer at Morphing Machines.

[...] he was the chief engineer at the Embedded Systems chair at TU Dortmund. In 2002, he joined RWTH Aachen University as a professor for Software for Systems on Silicon. His research comprises software development tools, processor architectures, and system-level electronic design automation, with focus on application-specific multicore systems. He published numerous books and technical papers and served in committees of the leading international EDA conferences. He received various scientific awards, including Best Paper Awards at DAC and twice at DATE, as well as several industrial awards. Dr. Leupers is also engaged as an entrepreneur and in turning novel technologies into innovations. He holds several patents on system-on-chip design technologies and has been a co-founder of LISATek (now with Synopsys), Silexica, and Secure Elements. He has served as consultant for various companies, as an expert for the European Commission, and in the management boards of large-scale projects like HiPEAC and UMIC. He is the coordinator of EU projects TETRACOM and TETRAMAX on academia/industry technology transfer.

“…For our experiments, we use Processing Element (PE) design presented in [9]. We optimize Floating Point Unit (FPU) design presented in [13] with recommendations presented in [12] for optimum Instructions Per Cycle (IPC). The paper is organized as follows: In section 2, MFA, KF and REDEFINE are discussed.…”

Section: Introduction

mentioning

confidence: 99%

Achieving Efficient Realization of Kalman Filter on CGRA Through Algorithm-Architecture Co-design

Merchant

Vatwani

Chattopadhyay

et al. 2018

Applied Reconfigurable Computing. Architectures, Tools, and Applications

Self Cite

In this paper, we present an efficient realization of Kalman Filter (KF) that can achieve up to 65% of the theoretical peak performance of the underlying architecture platform. KF is realized using the Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility, and the REDEFINE Coarse Grained Reconfigurable Architecture (CGRA) is used as a platform for experiments since REDEFINE is capable of supporting realization of a set of algorithmic compute structures at run-time on a Reconfigurable Data-path (RDP). We perform several hardware and software based optimizations in the realization of KF to achieve 116% improvement in terms of Gflops over the first realization of KF. Overall, with the presented approach for KF, 4-105x performance improvement in terms of Gflops/watt over several academically and commercially available realizations of KF is attained. In REDEFINE, we show that our implementation is scalable and the performance attained is commensurate with the underlying hardware resources.

Models for Calculating Pipeline Performance with Data Hazards

Khusainov

2021

Current Problems and Ways of Industry Development: Equipment and Technologies
