A Framework for Adding Low-Overhead, Fine-Grained Power Domains to CGRAs

Nayak, Ankita; Zhang, Keyi; Setaluri, Raj; Carsello, Alex; Mann, Makai; Richardson, Stephen; Bahr, Rick; Hanrahan, Pat; Horowitz, Mark; Raina, Priyanka

doi:10.23919/date48585.2020.9116477

Cited by 6 publications

(1 citation statement)

References 6 publications

“…Since the extent of computations performed by MAC operations of the PEs considering ML accelerators is well above 99%, it makes sense to target the MAC block inside the PE for various optimizations [20]. UPF enables power-intent specifications distinctly as compared to the logic-level description of a digital design, thus enabling convenient power optimization by offering effective portability of power-intent description for a wide scope of commercial products throughout the whole cycle of the electronic system design [21]. The logic-level optimizations involve decomposing the larger PE multiplier into multiple smaller sub-multipliers in addition to the multiplier carry propagation adders (CPAs) being replaced by carry save adders (CSAs) in each sub-multiplier as well as the PE.…”

Section: Introduction (mentioning)

confidence: 99%

Power-Intent Systolic Array Using Modified Parallel Multiplier for Machine Learning Acceleration

Inayat, Muslim, Iqbal, et al. (2023), Sensors

Systolic arrays are an integral part of many modern machine learning (ML) accelerators because they perform matrix multiplication, a key primitive in modern ML models, efficiently. Current state-of-the-art systolic array-based accelerators mainly target area and delay optimizations, with power optimization treated as a secondary target. Very few accelerator designs target power directly, and those that do rely on very complex algorithmic modifications that in turn compromise area or delay. We present a novel Power-Intent Systolic Array (PI-SA) based on fine-grained power gating of the multiplier in the multiplication and accumulation (MAC) block inside the processing element of the systolic array, which reduces the design's power consumption significantly, but at an additional delay cost. To offset the delay cost, we introduce a modified decomposition multiplier to obtain a smaller reduction tree; to further improve area and delay, we also replace the carry propagation adder with a carry save adder inside each sub-multiplier. Comparison of the proposed design with the baseline Gemmini naive systolic array design and its variant, i.e., a conventional systolic array design, exhibits a delay reduction of up to 6%, an area improvement of up to 32%, and a power reduction of up to 57% for varying accumulator bit-widths.
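
The two logic-level ideas named in the abstract, decomposing a multiplier into smaller sub-multipliers and summing partial results with carry save adders, can be illustrated with a small behavioural sketch. This is not the PI-SA design itself: the unsigned operands, the 8-bit default width, the two-way split, and the function names `carry_save_add` and `decomposed_multiply` are all assumptions made for illustration.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: reduce three operands to a (sum, carry) pair.

    No carry ripples between bit positions here; a + b + c == sum + carry.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def decomposed_multiply(x: int, y: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit values using four half-width products.

    Illustrative only: assumes non-negative inputs and an even `width`.
    """
    half = width // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half

    # Four sub-multiplier outputs, shifted to their bit positions.
    p_ll = x_lo * y_lo
    p_lh = (x_lo * y_hi) << half
    p_hl = (x_hi * y_lo) << half
    p_hh = (x_hi * y_hi) << width

    # Reduce four terms to two with carry save adders.
    s, c = carry_save_add(p_ll, p_lh, p_hl)
    s, c = carry_save_add(s, c, p_hh)

    # A single carry-propagating addition resolves the final result.
    return s + c
```

In this sketch the only carry-propagating addition is the last line; every earlier reduction step works bitwise, which is the property that makes carry save adders attractive inside a reduction tree.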
