Algorithm/Architecture Codesign of Low Power and High Performance Linear Algebra Compute Fabrics

Pedram, Ardavan

doi:10.1109/ipdpsw.2013.166

Cited by 4 publications

(2 citation statements)

References 107 publications

(154 reference statements)

“…The workload consists of N fine-grained concurrent execution segments, requiring T_Ref = WL * 1 cycles to execute on a baseline reference floating-point engine capable of performing 1 FLOP/cycle (single-precision floating-point operation per cycle). Such a floating-point engine consumes an area of 0.01 mm² in 45 nm [Pedram 2013] and dissipates 10 mW [Keckler et al. 2011]. With 22 nm process technology, the same floating-point engine would consume an area of 0.003 mm² and dissipate roughly 5 mW [Cassidy and Andreou 2012; Keckler et al. 2011].…”

Section: Analytic Model and Comparative Analysis (mentioning)

confidence: 99%

GP-SIMD Processing-in-Memory

Morad

Yavits

Ginosar

2015

ACM Trans. Archit. Code Optim.

AMIR MORAD, LEONID YAVITS, and RAN GINOSAR, Technion

GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization by in-memory computing through combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it to the associative processor and to conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
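The area and power figures quoted in the citation statement above can be checked with simple arithmetic. The following is a minimal sketch, not part of either paper: the constant names and the `ideal_area_scale` helper are hypothetical, and the quadratic area-scaling rule is a textbook assumption used only for comparison with the quoted numbers.

```python
# Figures quoted in the citation statement for a single-precision
# floating-point engine performing 1 FLOP/cycle.
AREA_45NM_MM2 = 0.01    # mm^2 at 45 nm
POWER_45NM_MW = 10.0    # mW at 45 nm
AREA_22NM_MM2 = 0.003   # mm^2 at 22 nm
POWER_22NM_MW = 5.0     # mW at 22 nm


def ideal_area_scale(old_nm: float, new_nm: float) -> float:
    """Assumed ideal scaling: area shrinks with the square of feature size."""
    return (new_nm / old_nm) ** 2


# Ratio of the quoted areas versus the idealized quadratic shrink.
quoted_area_ratio = AREA_22NM_MM2 / AREA_45NM_MM2
ideal_ratio = ideal_area_scale(45.0, 22.0)

# Ratio of the quoted power figures.
quoted_power_ratio = POWER_22NM_MW / POWER_45NM_MW

# Power density (mW per mm^2) implied by the quoted figures at each node.
density_45nm = POWER_45NM_MW / AREA_45NM_MM2
density_22nm = POWER_22NM_MW / AREA_22NM_MM2
density_growth = density_22nm / density_45nm

print(f"quoted area ratio:  {quoted_area_ratio:.3f}")
print(f"ideal area ratio:   {ideal_ratio:.3f}")
print(f"quoted power ratio: {quoted_power_ratio:.3f}")
print(f"power density growth: {density_growth:.3f}x")
```

Under these assumptions, the quoted 22 nm area is somewhat larger than an ideal quadratic shrink would give, and the implied power density rises because area shrinks faster than power.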

“…Using the silicon area figures for CAM [10], RAM and floating point unit [1], and assuming the number of acceleration modules = 15 and CAM/RAM array height ℎ = 2 , we estimate the area of the CMOS SpMSpV accelerator at 90 in 22nm technology node. As CMOS feature scaling slows down, conventional memory technology experiences scalability problems.…”

Section: Resistive Implementation (mentioning)

confidence: 99%

Sparse Matrix Multiplication On An Associative Processor

Yavits

Morad

Ginosar

2015

IEEE Trans. Parallel Distrib. Syst.

Abstract: Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the execution time of a vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms are explored in this paper, combining AP and baseline CPU processing to various levels. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on the AP is shown to be O(nnz), where nnz is the number of nonzero elements. The AP is found to be especially efficient in binary sparse matrix multiplication. The AP outperforms conventional solutions in power efficiency.
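To make the nnz quantity in the abstract concrete, here is a minimal sketch of a conventional, sequential sparse matrix-vector product in compressed sparse row (CSR) form. It is not the associative-processor algorithm from the paper; it only illustrates that a kernel which skips zeros performs one multiply per nonzero element. The function names `dense_to_csr` and `csr_matvec` and the example matrix are hypothetical.

```python
def dense_to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in the values array.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x; also count the multiplies performed."""
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    multiplies = 0
    for i in range(n_rows):
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
            multiplies += 1
    return y, multiplies


# Small illustrative matrix with 4 nonzero elements.
dense = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 0, 0],
]
values, col_idx, row_ptr = dense_to_csr(dense)
y, multiplies = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
nnz = len(values)
```

In this sketch the multiply count equals nnz regardless of the matrix dimensions, which is the sense in which work scales with the number of nonzero elements.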
