Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

Pedram, Ardavan; Geijn, Robert A.; Gerstlauer, Andreas

doi:10.1109/tc.2012.132

Cited by 51 publications

(26 citation statements)

References 64 publications

“…This question is at least partially answered in [Pedram et al 2012b; Pedram et al 2012a], which examines how to design specialized hardware (both compute core and entire processor) for linear algebra computation. The models used for such purpose have much in common with our model for determining the parameter values for the micro-kernel.…”

Section: Discussion (mentioning)

confidence: 99%

Analytical Modeling Is Enough for High-Performance BLIS

Low, Igual, Smith, et al. (2016). ACM Trans. Math. Softw.



We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine optimal tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high performance software. This allows the community to move on from valuable engineering solutions (empirically autotuning) to scientific understanding (analytical insight).



“…Local storage in each PE consists of a bigger single-ported and a smaller dual-ported memory. An extensive study of memory size trade-offs for the core was presented in our previous work [35]. Typically in dense linear algebra problems, access patterns are predictable and in most cases sequential, and there is no need for complex caching schemes.…”

Section: A LAC Architecture (mentioning)

confidence: 99%

On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators

Pedram, Gerstlauer, Geijn (2012). 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing (self-citation)


Abstract: Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power-hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast bus based architecture with conventional SIMD, 2D-SIMD and flat register file for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75x better power and up to 10x better area efficiency compared to traditional SIMD architectures.


“…Our starting point is a Linear Algebra Core (LAC) that we developed in previous work [5]. The core design and its efficiency were originally derived for GEMM operations.…”

Section: Introduction (mentioning)

confidence: 99%

Transforming a linear algebra core to an FFT accelerator

Pedram, McCalpin, Gerstlauer (2013). 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (self-citation)


Abstract: This paper considers the modifications required to transform a highly-efficient, specialized linear algebra core into an efficient engine for computing Fast Fourier Transforms (FFTs). We review the minimal changes required to support Radix-4 FFT computations and propose extensions to the micro-architecture of the baseline linear algebra core. Along the way, we study the critical differences between the two classes of algorithms. Special attention is paid to the configuration of the on-chip memory system to support high utilization. We examine design trade-offs between efficiency, specialization and flexibility, and their effects both on the core and memory hierarchy for a unified design as compared to dedicated accelerators for each application. The final design is a flexible architecture that can perform both classes of applications. Results show that the proposed hybrid FFT/Linear Algebra core can achieve 26.6 GFLOPS/S with a power efficiency of 40 GFLOPS/W, which is up to 100× and 40× more energy efficient than cutting-edge CPUs and GPUs, respectively.

