Deep Learning for Computer Architects

Reagen, Brandon; Adolf, Robert; Whatmough, Paul N.; Wei, Gu-Yeon; Brooks, David

doi:10.2200/s00783ed1v01y201706cac041

Cited by 12 publications

(11 citation statements)

References 79 publications


“…CNN Hardware Accelerators. There is currently huge research interest in the design of high-performance and energy-efficient neural network hardware accelerators, both in academia and industry (Barry et al, 2015; Arm; Nvidia; Reagen et al, 2017a). Some of the key topics that have been studied to date include dataflows (Chen et al, 2016b; Samajdar et al, 2018), optimized data precision (Reagen et al, 2016), systolic arrays (Jouppi et al, 2017), sparse data compression and compute (Han et al, 2016; Albericio et al, 2016; Parashar et al, 2017; Yu et al, 2017; Ding et al, 2017; Whatmough et al, 2018), bit-serial arithmetic (Judd et al, 2016), and analog/mixed-signal hardware (Chen et al, 2016a; LiKamWa et al, 2016; Shafiee et al, 2016; Chi et al, 2016; Kim et al, 2016; Song et al, 2017).…”

Section: Related Work (mentioning)

confidence: 99%

FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning

Whatmough, Zhou, Hansen et al. 2019

Preprint


The computational demands of computer vision tasks based on state-of-the-art Convolutional Neural Network (CNN) image classification far exceed the energy budgets of mobile devices. This paper proposes FixyNN, which consists of a fixed-weight feature extractor that generates ubiquitous CNN features, and a conventional programmable CNN accelerator which processes a dataset-specific CNN. Image classification models for FixyNN are trained end-to-end via transfer learning, with the common feature extractor representing the transferred part, and the programmable part being learnt on the target dataset. Experimental results demonstrate that FixyNN hardware can achieve very high energy efficiencies of up to 26.6 TOPS/W (4.81× better than an iso-area programmable accelerator). Over a suite of six datasets, we trained models via transfer learning with an accuracy loss of < 1%, resulting in up to 11.2 TOPS/W, nearly 2× more efficient than a conventional programmable CNN accelerator of the same area.



“…Neural Network Accelerator. We develop a systolic array-based CNN accelerator and integrate it into our evaluation infrastructure. The design is reminiscent of the Google Tensor Processing Unit (TPU) [78], but is much smaller, as befits the mobile budget [97].…”

Section: Hardware Setup (mentioning)

confidence: 99%

Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision

Zhu, Samajdar, Mattina et al. 2018

2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)

Self Cite


Continuous computer vision (CV) tasks increasingly rely on convolutional neural networks (CNN). However, CNNs have massive compute demands that far exceed the performance and energy constraints of mobile devices. In this paper, we propose and develop an algorithm-architecture co-designed system, Euphrates, that simultaneously improves the energy efficiency and performance of continuous vision tasks. Our key observation is that changes in pixel data between consecutive frames represent visual motion. We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications. We co-design a mobile System-on-a-Chip (SoC) architecture to maximize the efficiency of the new algorithm. The key to our architectural augmentation is to co-optimize different SoC IP blocks in the vision pipeline collectively. Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) early in the vision pipeline to the CNN engine. Measurement and synthesis results show that Euphrates achieves up to 66% SoC-level energy savings (4× for the vision computations), with only 1% accuracy loss.


“…TCUs come under the guise of different marketing terms, be it NVIDIA's Tensor Cores [18], Google's Tensor Processing Unit [10], Intel KNL's AVX extensions [76], Apple A11's Neural Engine [2], or ARM's Machine Learning Processor [3]. TCUs are designed to accelerate Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Deep Neural Networks (DNN) in general. TCUs vary in implementation [18,36,40,43,48,54,71,74,75,76,79,87], and are prevalent [1,4,8,9,10,11,24,70] in edge devices, mobile, and the cloud.…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating reduction and scan using tensor core units

Dakkak, Xiong et al. 2019

Proceedings of the ACM International Conference on Supercomputing


Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89% to 98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes, which are common in machine learning and scientific applications. Our algorithm achieves this while decreasing the power consumption by up to 22% for reduction and 16% for scan.
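As a rough illustration of what expressing reduction and scan with matrix multiplication can look like, the sketch below uses NumPy on the CPU rather than NVIDIA's tensor core interface, and it is not the authors' algorithm. The segment size of 16 and the function names `segmented_reduce` and `segmented_scan` are assumptions chosen for illustration.

```python
import numpy as np

SEG = 16  # assumed segment size, echoing the 16 x 16 tiles mentioned in the abstract


def segmented_reduce(x: np.ndarray) -> np.ndarray:
    """Sum each length-SEG segment of x using a matrix product."""
    a = x.reshape(-1, SEG)                    # one segment per row
    ones = np.ones((SEG, 1), dtype=x.dtype)   # column of ones
    return (a @ ones).ravel()                 # each row sum comes out of the matmul


def segmented_scan(x: np.ndarray) -> np.ndarray:
    """Inclusive prefix sum within each length-SEG segment using a matrix product."""
    a = x.reshape(-1, SEG)
    # upper[i, j] == 1 when i <= j, so column j of the product sums elements 0..j
    upper = np.triu(np.ones((SEG, SEG), dtype=x.dtype))
    return (a @ upper).ravel()


if __name__ == "__main__":
    x = np.arange(32, dtype=np.float32)       # two segments of 16 values
    print(segmented_reduce(x))
    print(segmented_scan(x)[:4])
```

Multiplying by a column of ones yields per-segment sums, and multiplying by an upper-triangular matrix of ones yields per-segment prefix sums. On hardware with matrix units, products of this shape are the kind of work those units are built to execute.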

