As technology reaches physical limits, reducing power consumption is the key issue on the path to sustained performance growth. In this paper, we study fundamental tradeoffs and limits in efficiency (measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subroutines (BLAS). It is well accepted that specialization is the key to efficiency. This paper establishes a baseline by studying general matrix-matrix multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders-of-magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to support other representative linear algebra subroutines. In addition to identifying the sources of inefficiency in current CPUs and GPUs, our results show that our prototype linear algebra processor (LAP) can achieve 600 GFLOPS for double-precision GEMM (DGEMM) while consuming less than 25 Watts in standard 45nm technology, which is up to 50× better than CPUs in terms of energy efficiency.
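For context, the reported peak figures correspond to an energy efficiency of roughly 24 GFLOPS/W, or about 42 pJ per double-precision floating-point operation. This is simply a back-of-the-envelope conversion of the numbers stated above (600 GFLOPS at under 25 W), not an additional measurement:
\[
\frac{600\ \text{GFLOPS}}{25\ \text{W}} = 24\ \frac{\text{GFLOPS}}{\text{W}},
\qquad
\frac{25\ \text{W}}{600 \times 10^{9}\ \text{flop/s}} \approx 41.7\ \text{pJ per flop}.
\]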