IntroductionMatrix-decomposition and -factorization techniques are powerful tools of linear algebra and have applications in topics that span sciences. Specifically, QR, Eigen Value (EVD) and Singular Value (SVD) decomposition are applied in sophisticated digital signal processing for image compression, reconstruction and restoration as well as in decoding, space and noise filtering, parameter estimation and source separation in communications, to name just a few. Interesting and very successful research on systolic architectures for real-time implementations of matrix-decomposition algorithms has been conducted in the 80s and 90s of the last century [1,2]. It was found that in comparison to e.g., Householder-based approaches, Givens-and Jacobi-type algorithms feature a high degree of inherent parallelism and can be implemented by mapping the required rotations onto hardware-efficient Coordinate Rotate Digital Computer (CORDIC) operations [3]. However, during those times VLSI-CMOS technology allowed only for implementation of a single CORDIC processor per chip. Today's very-deep submicron CMOS technologies allow for the realization of high-throughput, low-energy CORDIC macros on a fraction of a square millimeter of silicon area and microwatts of power dissipation [4]. By this, the implementation of high-performance matrix-decomposition modules to be used as number crunching SoC processor sub-macros has become feasible.In this work, new architectures and a new methodology for the quantitative optimization of matrix-decomposition-processor SoCsub-macro implementations for the use in challenging application domains featuring a wide variety of requirements and specifications are elaborated. An attractive target architecture consists of a flexible software-controlled Application Specific Instruction Processor (ASIP) executing the matrix-decomposition algorithms and being accelerated by a large farm of dedicated CORDIC blocks. The architecture template should allow for the decomposition of real-as well as complex valued matrices. Floating-point capability can be achieved by operating the individual CORDIC blocks in block exponent arithmetic. In order to achieve maximum area and especially energy efficiency, the optimization has to be performed concurrently on all levels of CMOS design, from the algorithmic level down to the physical implementation level. Only by that, the interactions between decisions on the different design levels being imposed by the features of today's very-deep submicron CMOS technologies can be properly considered. A key element of this approach is the elaboration of quite accurate, parameterized algebraic cost models for area, throughput, latency and energy of CORDIC macros.
QRD Decomposition AcceleratorsQR decomposition today for instance is applied in multiple-input/ multiple-output (MIMO) channel detection [5][6][7]. Suitable CORDIC implementations were investigated and optimized in a 40 nm CMOS technology [4]. Based on the costs of the elementary arithmetic CORDIC sub-functions, a MATLAB-ba...