Ultra-Elastic CGRAs for Irregular Loop Specialization

Torng, Christopher; Pan, Peitian; Ou, Yanghui; Tan, Cheng; Batten, Christopher

doi:10.1109/hpca51647.2021.00042

Cited by 35 publications

(5 citation statements)

References 47 publications


“…Like recent near-data computing architectures [6,83,105,142,150], täkō adds programmable engines near caches to execute callbacks efficiently. In täkō, engines contain scheduling logic and a spatial dataflow fabric to run callbacks [43,59,103,132,138,143]. With this microarchitectural support, täkō gets close to the performance of fully specialized hardware: software programmability adds little overhead because data movement costs dominate and callbacks are short.…”

Section: Cache (mentioning)

confidence: 99%

täkō

Schwedock

Yoovidhya

Seibert

et al. 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture


Current systems hide data movement from software behind the load-store interface. Software's inability to observe and respond to data movement is the root cause of many inefficiencies, including the growing fraction of execution time and energy devoted to data movement itself. Recent specialized memory-hierarchy designs prove that large data-movement savings are possible. However, these designs require custom hardware, raising a large barrier to their practical adoption. This paper argues that the hardware-software interface is the problem, and custom hardware is often unnecessary with an expanded interface. The täkō architecture lets software observe data movement and interpose when desired. Specifically, caches in täkō can trigger software callbacks in response to misses, evictions, and writebacks. Callbacks run on reconfigurable dataflow engines placed near caches. Five case studies show that this interface covers a wide range of data-movement features and optimizations. Microarchitecturally, täkō is similar to recent near-data computing designs, adding ≈5% area to a baseline multicore. täkō improves performance by 1.4×-4.2×, similar to prior custom hardware designs, and comes within 1.8% of an idealized implementation. CCS Concepts: Computer systems organization → Processors and memory architectures.
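The callback idea in this abstract can be illustrated with a small software analogy. The sketch below is not täkō's actual interface: the class `CallbackCache`, its method names, and the LRU policy are all assumptions made for illustration. It only shows what it means for a cache to invoke user code on misses, evictions, and writebacks.

```python
from collections import OrderedDict


class CallbackCache:
    """Tiny LRU cache that calls user hooks on data-movement events.

    Hypothetical software model only; names and policy are illustrative.
    """

    def __init__(self, capacity, on_miss, on_eviction=None, on_writeback=None):
        self.capacity = capacity
        self.on_miss = on_miss            # called to fetch a missing line
        self.on_eviction = on_eviction    # called for every evicted line
        self.on_writeback = on_writeback  # called only for dirty evicted lines
        self.lines = OrderedDict()        # addr -> [value, dirty]

    def _evict(self):
        # Remove the least recently used line and notify software.
        victim, (value, dirty) = self.lines.popitem(last=False)
        if self.on_eviction:
            self.on_eviction(victim, value)
        if dirty and self.on_writeback:
            self.on_writeback(victim, value)

    def _touch(self, addr):
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self._evict()
            self.lines[addr] = [self.on_miss(addr), False]
        self.lines.move_to_end(addr)      # mark as most recently used
        return self.lines[addr]

    def load(self, addr):
        return self._touch(addr)[0]

    def store(self, addr, value):
        line = self._touch(addr)
        line[0] = value
        line[1] = True                    # line is now dirty


# Usage: record every event the cache reports.
events = []
backing = {0: 10, 1: 11, 2: 12}           # stand-in for the next memory level


def fill(addr):
    events.append(("miss", addr))
    return backing[addr]


def evicted(addr, value):
    events.append(("evict", addr))


def written_back(addr, value):
    events.append(("writeback", addr))
    backing[addr] = value


cache = CallbackCache(2, fill, evicted, written_back)
cache.store(0, 99)  # miss on 0, then line 0 becomes dirty
cache.load(1)       # miss on 1
cache.load(2)       # cache is full: line 0 is evicted and written back
```

In this model the hooks run inline in the requesting thread; the abstract describes callbacks running on dataflow engines near the caches, which this sketch does not attempt to model.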


“…Another notable work with architectural support for optimizing loop execution is the ultra-elastic CGRAs (UE-CGRAs) [24] that can efficiently execute loops with irregular control flow and memory accesses and inter-iteration loop dependencies. The solution co-designed across compiler, architecture, and VLSI accelerates true-dependency bottlenecks and reduces energy consumption by supporting fine-grain dynamic voltage and frequency scaling (DVFS) on individual PEs.…”

Section: Related Work (mentioning)

confidence: 99%
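The statement above mentions per-PE DVFS applied to loops with inter-iteration dependencies. A toy model can make the intuition concrete. Everything in this sketch is an assumption for illustration: the three operating modes, their relative periods and energies, the PE names, and the simple cost model are hypothetical and are not taken from the UE-CGRA paper.

```python
# Hypothetical operating points: mode -> (relative clock period, relative energy per op).
MODES = {"rest": (2.0, 0.5), "nominal": (1.0, 1.0), "sprint": (0.5, 2.0)}


def iteration_interval(assignment, recurrence):
    """Time per loop iteration under a given per-PE mode assignment.

    The interval is bounded below by the summed latency of the PEs on the
    inter-iteration dependency cycle, and by the slowest PE overall, since
    every PE fires once per iteration in this simplified model.
    """
    periods = {pe: MODES[mode][0] for pe, mode in assignment.items()}
    cycle_latency = sum(periods[pe] for pe in recurrence)
    return max(cycle_latency, max(periods.values()))


def energy_per_iteration(assignment):
    """Sum of per-op energies, one op per PE per iteration."""
    return sum(MODES[mode][1] for mode in assignment.values())


# Hypothetical loop body: three PEs form the recurrence, three do not.
recurrence = ["phi", "add", "cmp"]
others = ["load", "mul", "store"]


def assign(cycle_mode, other_mode):
    modes = {pe: cycle_mode for pe in recurrence}
    modes.update({pe: other_mode for pe in others})
    return modes


baseline = assign("nominal", "nominal")
rested = assign("nominal", "rest")    # slow down PEs off the critical cycle
sprinted = assign("sprint", "rest")   # additionally speed up the cycle

for name, a in [("baseline", baseline), ("rested", rested), ("sprinted", sprinted)]:
    print(name, iteration_interval(a, recurrence), energy_per_iteration(a))
```

With these made-up numbers, resting the off-cycle PEs keeps the iteration interval at 3.0 while energy per iteration drops from 6.0 to 4.5, and sprinting the recurrence shortens the interval to 2.0 at an energy of 7.5. The model only illustrates why the dependency cycle is the natural place to spend voltage and frequency headroom.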

“…Case studies on processor architectures [1,12,15,25] reveal that improved performance and energy efficiency can be attained when loop-specific hardware optimizations are applied. Few recent CGRA architectures have come up with such architectural modifications to better support loop execution and reported good results [2,22,24,26].…”

Section: Introduction (mentioning)

confidence: 99%

Energy Efficient Hardware Loop Based Optimization for CGRAs

Sunny

Das

Martin

et al. 2022

J Sign Process Syst


anced performance, energy efficiency and flexibility bestowed surging popularity on Coarse Grained Reconfigurable Array (CGRA) architectures. To further improve the performance and energy efficiency, several hardware and software-based loop optimizations are adopted for CGRAs. In this paper, we propose a centralized hardware-based loop optimization technique to achieve better area and energy results compared to the previously implemented distributed version. Without incurring any performance degradation, area overhead against the reference architecture is reduced down to 1.5% for a 4×2 CGRA configuration. A maximum of 47.3% and an arithmetic mean of 27.2% reduction in energy consumption is attained by the centralized version of hardware loop compared to the baseline model employing software loop. Furthermore, the paper explores the co-existence of CGRA-specific hardware and software optimizations and their impact on loop efficiencies. Enhanced results are obtained by coupling loop unrolling with centralized hardware loop support. The combination allows achieving up to 68.7% reduction in energy consumption and 5.46× speed-up against the baseline model with no optimizations applied.


“…With improvements to general purpose processors slowing, reconfigurable accelerators (aka. dataflow accelerators [30, 41, 43, 45, 61-63], or CGRAs [16,17,26,34,35,57,60]) have become an increasingly favorable option for meeting the needs of data-processing workloads. Recently, multicore versions of these designs have seen commercial traction, particularly for use in datacenters (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

TaskStream: accelerating task-parallel workloads by recovering program structure

Dadu

Nowatzki

2022

Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


Reconfigurable accelerators, like CGRAs and dataflow architectures, have come to prominence for addressing data-processing problems. However, they are largely limited to workloads with regular parallelism, precluding their applicability to prevalent task-parallel workloads. Reconfigurable architectures and task parallelism seem to be at odds, as the former requires repetitive and simple program structure, and the latter breaks program structure to create small, individually scheduled program units. Our insight is that if tasks and their potential for communication structure are first-class primitives in the hardware, it is possible to recover program structure with extremely low overhead. We propose a task execution model for accelerators called TaskStream, which annotates task dependences with information sufficient to recover inter-task structure. TaskStream enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting. We apply TaskStream to a reconfigurable dataflow architecture, creating a seamless hierarchical dataflow model for task-parallel workloads. We compare our accelerator, Delta, with an equivalent static-parallel design. Overall, we find that our execution model can improve performance by 2.2× with only 3.6% area overhead, while alleviating the programming burden of managing task distribution. CCS Concepts: Computer systems organization → Reconfigurable computing; Data flow architectures; Heterogeneous (hybrid) systems.
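To make the phrase "work-aware load balancing" concrete, the following sketch compares a work-oblivious round-robin assignment with a greedy assignment that always gives the next task to the least-loaded worker. It is a generic software illustration under assumed inputs, not TaskStream's mechanism: the function names, the per-task work estimates, and the greedy policy are hypothetical.

```python
import heapq


def round_robin(task_work, n_workers):
    """Assign tasks in arrival order, ignoring how much work each carries."""
    loads = [0] * n_workers
    for i, work in enumerate(task_work):
        loads[i % n_workers] += work
    return loads


def work_aware(task_work, n_workers):
    """Assign each task to the currently least-loaded worker.

    Assumes each task arrives annotated with a work estimate.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    loads = [0] * n_workers
    for work in task_work:
        load, worker = heapq.heappop(heap)
        loads[worker] = load + work
        heapq.heappush(heap, (loads[worker], worker))
    return loads


# Hypothetical task stream alternating heavy and light tasks.
tasks = [8, 1, 8, 1, 8, 1]
print("round robin:", round_robin(tasks, 2))
print("work aware: ", work_aware(tasks, 2))
```

For this input, round-robin places every heavy task on the same worker, while the work-aware policy spreads them, so the most loaded worker finishes earlier.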

