Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays

Kung, H. T.; McDanel, Bradley; Zhang, Sai Qian; Dong, Xin; Chen, Chih Chiang

doi:10.1109/asap.2019.00-31

Cited by 20 publications

(23 citation statements)

References 11 publications

“…It has been shown many small systolic arrays may increase the utilization (thus efficiency) at the cost of performance [23]. Maestro [24,34] showed as much but only for short inputs only on BERT-style models. However, even when scaled to 7 nm, Maestro does not compete with modern accelerators like A100 or TPUs.…”

Section: Related Work (mentioning)

confidence: 99%

ProSE: the architecture and design of a protein discovery engine

Robson

Wills

2022

Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Protein language models have enabled breakthrough approaches to protein structure prediction, function annotation, and drug discovery. A primary limitation to the widespread adoption of these powerful models is the high computational cost associated with the training and inference of these models, especially at longer sequence lengths. We present the architecture, microarchitecture, and hardware implementation of a protein design and discovery accelerator, ProSE (Protein Systolic Engine). ProSE has a collection of custom heterogeneous systolic arrays and special functions that process transfer learning model inferences efficiently. The architecture marries SIMD-style computations with systolic array architectures, optimizing coarse-grained operation sequences across model layers to achieve efficiency without sacrificing generality. ProSE performs Protein BERT inference at up to 6.9× speedup and 48× power efficiency (performance/Watt) compared to one NVIDIA A100 GPU. ProSE achieves up to 5.5× (12.7×) speedup and 173× (249×) power efficiency compared to TPUv3 (TPUv2).

“…As such, systolic arrays achieve higher power efficiency and peak throughput compared to dataflow architectures. Moreover, recent proposals to couple multiple systolic arrays in a single die (i.e., multi-pod designs [4,29,33]) allows benefiting data and task-level parallelism, further improving the gain from provisioned silicon.…”

Section: Why Scale-out Systolic Arrays? (mentioning)

confidence: 99%

“…While these multi-pod accelerators achieve much better utilization over their monolithic counterparts with multi-tenancy, variability in array size requirements in workloads remains a fundamental limitation to utilization in a few coarse-grain pods. In contrast, multi-pod designs with minimally sized arrays [33] target maximum utilization. Unfortunately, these designs compromise the inference accelerator's power efficiency by over-provisioning overall on-chip memory [17] (e.g., 8×8 arrays incur 5–10× more memory accesses than 128×128 arrays).…”

Section: Introduction (mentioning)

confidence: 99%

“…Much prior work has focused on the use of interconnects in inference accelerators. While many advocate Mesh [9,10,22,51] or H-tree [33,54], these topologies lack sufficient bisection bandwidth to support a large number of pods. Others advocate Benes [45], which requires a long round-trip latency on requests and may adversely impact the overall execution time.…”

Section: Introduction (mentioning)

confidence: 99%

Scale-out Systolic Arrays

Yüzügüler

Sönmez

Drumond et al.

2022

Preprint

Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted when accounting for array utilization) poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We then evaluate the bandwidth/latency trade-offs in interconnects and show that Butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose Scale-out Systolic Arrays, a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. We show that SOSA exhibits scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN inference workloads, and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5×.

“…Systolic arrays are being explored extensively for improvements in matrix operations that directly relate them to deep neural network implementations. These architectures have been found useful, especially during the inference phase of network processing (Kung et al., 2019).…”

Section: Compression Of Deep Neural Network (mentioning)

confidence: 99%

Exploring compression and parallelization techniques for distribution of deep neural networks over Edge–Fog continuum – a review

Nazir

Mir

Qureshi

2020

IJICC

Purpose: The trend of “Deep Learning for Internet of Things (IoT)” has gained fresh momentum with enormous upcoming applications employing these models as their processing engine and Cloud as their resource giant. But this picture leads to underutilization of ever-increasing device pool of IoT that has already passed 15 billion mark in 2015. Thus, it is high time to explore a different approach to tackle this issue, keeping in view the characteristics and needs of the two fields. Processing at the Edge can boost applications with real-time deadlines while complementing security.

Design/methodology/approach: This review paper contributes towards three cardinal directions of research in the field of DL for IoT. The first section covers the categories of IoT devices and how Fog can aid in overcoming the underutilization of millions of devices, forming the realm of the things for IoT. The second direction handles the issue of immense computational requirements of DL models by uncovering specific compression techniques. An appropriate combination of these techniques, including regularization, quantization, and pruning, can aid in building an effective compression pipeline for establishing DL models for IoT use-cases. The third direction incorporates both these views and introduces a novel approach of parallelization for setting up a distributed systems view of DL for IoT.

Findings: DL models are growing deeper with every passing year. Well-coordinated distributed execution of such models using Fog displays a promising future for the IoT application realm. It is realized that a vertically partitioned compressed deep model can handle the trade-off between size, accuracy, communication overhead, bandwidth utilization, and latency but at the expense of an additionally considerable memory footprint. To reduce the memory budget, we propose to exploit Hashed Nets as potentially favorable candidates for distributed frameworks. However, the critical point between accuracy and size for such models needs further investigation.

Originality/value: To the best of our knowledge, no study has explored the inherent parallelism in deep neural network architectures for their efficient distribution over the Edge-Fog continuum. Besides covering techniques and frameworks that have tried to bring inference to the Edge, the review uncovers significant issues and possible future directions for endorsing deep models as processing engines for real-time IoT. The study is directed to both researchers and industrialists to take on various applications to the Edge for better user experience.
