Efficient Realization of Table Look-Up Based Double Precision Floating Point Arithmetic

Vatwani

IEEE Trans. Parallel Distrib. Syst.

et al. 2018

Self Cite

We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore and General Purpose Graphics Processing Units (GPGPUs). GGR improves on the classical Givens Rotation (GR) operation and can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR takes 33% fewer multiplications than GR. For custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In the PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.

Index Terms: Parallel computing, orthogonal transforms, dense linear algebra, multiprocessor system-on-chip, instruction level parallelism

Ranjani Narayan has over 15 years of experience at IISc and 9 years at Hewlett Packard. She has vast work experience in a variety of fields, including computer architecture, operating systems, and special purpose systems. She has also worked at the Technical University of Delft, The Netherlands, and Massachusetts Institute of Technology, Cambridge, USA. During her tenure at HP, she worked on various areas in operating systems and hardware monitoring and diagnostics systems. She has numerous research publications. She is currently Chief Technology Officer at Morphing Machines.

… he was the chief engineer at the Embedded Systems chair at TU Dortmund. In 2002, he joined RWTH Aachen University as a professor for Software for Systems on Silicon. His research comprises software development tools, processor architectures, and system-level electronic design automation, with a focus on application-specific multicore systems. He has published numerous books and technical papers and served on committees of the leading international EDA conferences. He has received various scientific awards, including Best Paper Awards at DAC and twice at DATE, as well as several industrial awards. Dr. Leupers is also engaged as an entrepreneur and in turning novel technologies into innovations. He holds several patents on system-on-chip design technologies and has been a co-founder of LISATek (now with Synopsys), Silexica, and Secure Elements. He has served as a consultant for various companies, as an expert for the European Commission, and on the management boards of large-scale projects like HiPEAC and UMIC. He is the coordinator of the EU projects TETRACOM and TETRAMAX on academia/industry technology transfer.
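
For reference, the classical Givens Rotation (GR) that the abstract above uses as its baseline can be sketched as follows. This is a minimal pure-Python illustration of textbook GR-based QR factorization, which zeroes one element per rotation; it is not the GGR scheme of the paper, whose details are not given here. The function names `givens_qr` and `matmul` are hypothetical, chosen only for this sketch.

```python
import math


def matmul(x, y):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def givens_qr(a):
    """Classical Givens QR: return (q, r) with q orthogonal,
    r upper triangular, and q * r equal to a up to rounding."""
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    # qt accumulates the product of all rotations, i.e. Q transposed.
    qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Work upward from the bottom of column j, zeroing one entry
        # per rotation by mixing rows i-1 and i.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            h = math.hypot(x, y)
            c, s = x / h, y / h
            for mat in (r, qt):
                for k in range(len(mat[0])):
                    u, v = mat[i - 1][k], mat[i][k]
                    mat[i - 1][k] = c * u + s * v
                    mat[i][k] = -s * u + c * v
            r[i][j] = 0.0  # remove rounding residue in the zeroed entry
    q = [list(col) for col in zip(*qt)]  # transpose Q^T to get Q
    return q, r


a = [[6.0, 5.0, 0.0], [5.0, 1.0, 4.0], [0.0, 4.0, 3.0]]
q, r = givens_qr(a)
```

Each rotation touches only two rows, which is why an m x n matrix needs one rotation per subdiagonal element in classical GR; the abstract's claim is that GGR reduces this cost by annihilating several elements at once.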

Section: Custom Realization Of Householder Transform and Results

mentioning

confidence: 99%

Section: Introduction

mentioning

confidence: 99%

mentioning

confidence: 99%


Efficient Realization of Householder Transform Through Algorithm-Architecture Co-Design for Acceleration of QR Factorization

Vatwani

IEEE Trans. Parallel Distrib. Syst.

et al. 2018

Self Cite

“…For our experiments, we use Processing Element (PE) design presented in [9]. We optimize Floating Point Unit (FPU) design presented in [13] with recommendations presented in [12] for optimum Instructions Per Cycle (IPC). The paper is organized as follows: In section 2, MFA, KF and REDEFINE are discussed.…”

Section: Introduction

mentioning

confidence: 99%

Achieving Efficient Realization of Kalman Filter on CGRA Through Algorithm-Architecture Co-design

Vatwani

Applied Reconfigurable Computing. Architectures, Tools, and Applications

et al. 2018

Self Cite

In this paper, we present an efficient realization of Kalman Filter (KF) that can achieve up to 65% of the theoretical peak performance of the underlying architecture platform. KF is realized using Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility, and REDEFINE Coarse Grained Reconfigurable Architecture (CGRA) is used as a platform for experiments since REDEFINE is capable of supporting realization of a set of algorithmic compute structures at run-time on a Reconfigurable Data-path (RDP). We perform several hardware and software based optimizations in the realization of KF to achieve 116% improvement in terms of Gflops over the first realization of KF. Overall, with the presented approach for KF, 4-105x performance improvement in terms of Gflops/watt over several academically and commercially available realizations of KF is attained. In REDEFINE, we show that our implementation is scalable and the performance attained is commensurate with the underlying hardware resources.
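
To illustrate why a Faddeeva-style scheme is a versatile building block, here is a minimal sketch of the classical (unmodified) algorithm, which evaluates D + C A^-1 B by Gaussian elimination on the compound matrix [[A, B], [-C, D]]. It is an assumption-laden illustration, not the MFA variant used in the paper, whose details are not given here; the function name `faddeev` and the pivoting strategy are choices made only for this sketch.

```python
def faddeev(a, b, c, d):
    """Return D + C * A^-1 * B by eliminating the -C block of the
    compound matrix [[A, B], [-C, D]] using rows of [A, B]."""
    n, p = len(a), len(b[0])
    top = [[float(v) for v in a[i]] + [float(v) for v in b[i]]
           for i in range(n)]
    bot = [[-float(v) for v in c[i]] + [float(v) for v in d[i]]
           for i in range(len(c))]
    for k in range(n):
        # Partial pivoting restricted to the top block; swapping rows of
        # [A, B] together leaves A^-1 * B unchanged.
        piv = max(range(k, n), key=lambda i: abs(top[i][k]))
        top[k], top[piv] = top[piv], top[k]
        if top[k][k] == 0.0:
            raise ValueError("A is singular")
        # Eliminate column k below the pivot in the top block and in
        # every row of the bottom block.
        for rows, start in ((top, k + 1), (bot, 0)):
            for i in range(start, len(rows)):
                f = rows[i][k] / top[k][k]
                for j in range(k, n + p):
                    rows[i][j] -= f * top[k][j]
    # The lower-right block now holds D + C * A^-1 * B.
    return [row[n:] for row in bot]


# With C = I and D = 0 the result is A^-1 * B, the solution of A X = B.
x = faddeev([[2.0, 0.0], [0.0, 4.0]], [[1.0], [2.0]],
            [[1.0, 0.0], [0.0, 1.0]], [[0.0], [0.0]])
```

Choosing the blocks differently yields other operations from the same routine, for example a matrix product with A = I and D = 0, or an inverse with B = C = I and D = 0. This single-kernel reuse is the property the abstract refers to as versatility.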

“…Major reason for centralization of efforts toward software optimizations and efficient exploitation of memory hierarchy is mainly due to several architectural parameters that are not in the control of programmer [16]. For example, the depth of the pipeline (pipeline stages) in the underlying platform [17]. In this paper, we present a theoretical framework that assists in establishing a relation between pipeline depth of different floating point operations with size and type of the workload.…”

Section: Introduction

mentioning

confidence: 99%

Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

Parallel Process. Lett.

Raha

et al. 2017

Self Cite

Abstract: Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the bandwidth of the memory, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating point operations (multiplication, addition, square root, and division), followed by a characterization of BLAS and LAPACK to determine the parameters required in the theoretical framework for deciding the optimum pipeline depth of the floating point operations. A simple design of a Processing Element (PE) is presented, and the PE is shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in Gflops/W and 1.9x to 2.1x in Gflops/mm².