2018
DOI: 10.48550/arxiv.1810.00307
Preprint

Mini-batch Serialization: CNN Training with Inter-layer Data Reuse

Abstract: Training convolutional neural networks (CNNs) requires intense computations and high memory bandwidth. We find that bandwidth today is over-provisioned because most memory accesses in CNN training can be eliminated by rearranging computation to better utilize on-chip buffers and avoid traffic resulting from large per-layer memory footprints. We introduce the MBS CNN training approach that significantly reduces memory traffic by partially serializing mini-batch processing across groups of layers. This optimizes…
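
The sketch below illustrates the core idea the abstract describes: rather than pushing the full mini-batch through every layer, each group of layers processes the mini-batch in smaller sub-batches, so that the group's intermediate activations stay within a small, buffer-sized working set. This is not the paper's actual MBS scheduler; the layer-group boundaries, sub-batch size, and network are illustrative assumptions.

```python
# Minimal sketch of partially serialized mini-batch processing across
# groups of layers (the MBS idea, not the authors' implementation).
import torch
import torch.nn as nn

def forward_serialized(layer_groups, x, sub_batch_size):
    """Run x through each group of layers sub-batch by sub-batch."""
    for group in layer_groups:
        outputs = []
        # Serialize the mini-batch within this group so intermediate
        # activations are produced for only sub_batch_size samples at a time.
        for chunk in torch.split(x, sub_batch_size, dim=0):
            outputs.append(group(chunk))
        # Re-assemble the full mini-batch before moving to the next group.
        x = torch.cat(outputs, dim=0)
    return x

if __name__ == "__main__":
    # Hypothetical small CNN split into two layer groups.
    groups = [
        nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),
    ]
    batch = torch.randn(32, 3, 64, 64)                 # full mini-batch of 32
    out = forward_serialized(groups, batch, sub_batch_size=8)
    print(out.shape)                                   # torch.Size([32, 32, 64, 64])
```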

Cited by 2 publications (2 citation statements)
References 25 publications
“…Eyeriss [50], DaDiannao [162], Tetris [131], and Minerva [163]). WaveCore [164] and Google's TPUv2 [97] support CNN training, but suffer from challenges highlighted in Section 3. EcoFlow solves these issues, while introducing minimal changes to the CNN inference accelerator architecture.…”
Section: Related Work (mentioning)
confidence: 99%
“…This is mainly because the inputs of each layer at forward propagation should be kept in memory and reused to compute the local gradients in back-propagation. In particular, the total size of all layer inputs linearly increases with mini-batch size [27]. Therefore, small off-chip memory capacity or a large feature size of a CNN can constrain the mini-batch size per accelerator, and hence also the data parallelism of each layer.…”
Section: CNN Model Training (mentioning)
confidence: 99%
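
The citation above notes that the activations saved during forward propagation for reuse in back-propagation grow linearly with mini-batch size. The back-of-the-envelope calculation below makes that scaling concrete; the layer input shapes are hypothetical, but the linear relationship is the same for any network.

```python
# Activation footprint vs. mini-batch size (illustrative shapes, fp32).
BYTES_PER_ELEMENT = 4  # fp32

# (channels, height, width) of each saved layer input, per sample.
layer_input_shapes = [
    (3, 224, 224),
    (64, 112, 112),
    (128, 56, 56),
    (256, 28, 28),
]

def activation_bytes(mini_batch_size):
    """Total bytes of layer inputs kept for back-propagation."""
    per_sample = sum(c * h * w for c, h, w in layer_input_shapes)
    return mini_batch_size * per_sample * BYTES_PER_ELEMENT

for n in (8, 32, 128):
    print(f"mini-batch {n:4d}: {activation_bytes(n) / 2**20:8.1f} MiB")
# Doubling the mini-batch size doubles the activation footprint, which is
# what constrains the per-accelerator mini-batch size when off-chip memory
# is small or per-layer feature maps are large.
```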