General-purpose systolic arrays

Johnson, K.T.; Hurson, Ali R.; Shirazi, Behrooz

doi:10.1109/2.241423

Cited by 73 publications

(27 citation statements)

References 18 publications

“…Systolic arrays were popular in the 80s [5]. With increasing memory walls, recent approaches have brought the computation units closer to memory, including hierarchical clustering of shared memory tiles [6] or network-on-chip architectures [7].…”

Section: Related Work (mentioning)

confidence: 99%

A Linear Algebra Core Design for Efficient Level-3 BLAS

Pedram, Gilani, Kim, et al. 2012

2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors

Abstract: Reducing power consumption and increasing efficiency is a key concern for many applications. It is well-accepted that specialization and heterogeneity are crucial strategies to improve both power and performance. Yet, how to design highly efficient processing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present the design of a specialized Linear Algebra Core (LAC) for an important class of computational kernels, the level-3 Basic Linear Algebra Subprograms (BLAS). We demonstrate a detailed algorithm/architecture co-design for mapping a number of level-3 BLAS operations onto the LAC. Results show that our prototype LAC achieves a performance of around 64 GFLOPS (double precision) for these operations, while consuming less than 1.3 Watts in standard 45nm CMOS technology. This is on par with a full-custom design and up to 50× and 10× better in terms of power efficiency than CPUs and GPUs.

“…Different optimizations and algorithms for matrix multiplication and more complicated matrix computations are compared and implemented on both 1D [42], [21] and 2D systolic arrays [42], [16], [30]. In [18], the concept of a general systolic array and a taxonomy of systolic array designs is discussed. Systolic arrays pipeline data and have one-sided interfaces.…”

Section: B Related Work (mentioning)

confidence: 99%

On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators

Pedram, Gerstlauer, Geijn 2012

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

Abstract: Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast bus based architecture with conventional SIMD, 2D-SIMD and flat register file for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75x better power and up to 10x better area efficiency compared to traditional SIMD architectures.

“…Typically, concurrency optimization is performed for a given system architecture [MMV98]. A good example of this is the research on performance analysis and design optimization for systolic processors [JHS93,Kun98]. Commonly considered optimization criteria are the computation time, latency, throughput and the number of processors.…”

Section: Related Work (mentioning)

confidence: 99%