2019
DOI: 10.1109/jetcas.2019.2912353
Realizing Petabyte Scale Acoustic Modeling

Abstract: Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly …

Cited by 10 publications (9 citation statements)
References: 39 publications
“…(i) Reducing time and cost to train DNN models: We believe that communication times will bottleneck training times of distributed systems and this will become even more severe with recent significant improvements in the computational capability of deep learning training hardware. To address this bottleneck, in the past few years, compression techniques have been eagerly researched and implemented in some practical training systems [43]. Meanwhile, we would like to point out, although our compression scheme guarantees theoretical convergence and shows no accuracy loss compared to baseline training over the tested models and applications, there could still be concerns about the impact of lossy gradient compression on neural network convergence performance.…”
Section: Broader Impact
Mentioning; confidence: 99%
“…Our research results on compression in large-scale distributed training have two broad benefits:(i) Reducing time and cost to train DNN models: We believe that communication times will bottleneck training times of distributed systems and this will become even more severe with recent significant improvements in the computational capability of deep learning training hardware. To address this bottleneck, in the past few years, compression techniques have been eagerly researched and implemented in some practical training systems[43]. Our research results on scalability of gradient compression aim to push this to larger scale distributed training systems, which is needed for the training of expensive and powerful gigantic models.…”
Mentioning; confidence: 99%
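The citing work's exact compression scheme is not reproduced in these statements; the sketch below is only a minimal illustration of the general idea behind lossy gradient compression, using top-k sparsification with error feedback on a single simulated worker. The function names, the 1% compression ratio, and the gradient shape are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def topk_compress(grad, ratio=0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns sparse (index, value) pairs to transmit; everything else is
    dropped and, in this sketch, carried over as local residual error.
    """
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
    return idx, flat[idx]

def decompress(idx, values, shape):
    """Rebuild a dense gradient from the sparse (index, value) pairs."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

# One simulated synchronous step with error feedback on a single worker.
rng = np.random.default_rng(0)
grad = rng.normal(size=(256, 128))       # stand-in for a local gradient
residual = np.zeros_like(grad)           # error accumulated from past rounds

corrected = grad + residual              # add back what was dropped before
idx, vals = topk_compress(corrected, ratio=0.01)
sent = decompress(idx, vals, grad.shape) # the dense view of what is transmitted
residual = corrected - sent              # remember what was dropped this round

print("transmitted entries:", vals.size, "of", grad.size)
```

Error feedback is one common way such schemes preserve convergence in practice: the dropped mass is not discarded but re-added to the next gradient before compression.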
“…Representative work includes [5][6] from Microsoft, [4] from Amazon, [23] from Baidu. The global model updates can be conducted either through gradient aggregation [5][23] or model averaging [6].…”
Section: B. Centralized Distributed Training
Mentioning; confidence: 99%
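To make the distinction in this statement concrete, here is a minimal sketch contrasting the two global-update styles it names, gradient aggregation versus model averaging, on synthetic data. The worker count, learning rate, and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers, dim, lr = 4, 8, 0.1

# Shared starting point and per-worker local gradients (stand-ins for real data).
w = rng.normal(size=dim)
local_grads = [rng.normal(size=dim) for _ in range(n_workers)]

# (a) Gradient aggregation: average the workers' gradients, take one global step.
g_avg = np.mean(local_grads, axis=0)
w_grad_agg = w - lr * g_avg

# (b) Model averaging: each worker steps locally, then the models are averaged.
local_models = [w - lr * g for g in local_grads]
w_model_avg = np.mean(local_models, axis=0)

# With a single local step the two coincide; they diverge once workers take
# multiple local steps between synchronizations.
print(np.allclose(w_grad_agg, w_model_avg))  # True for one local step
```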
“…Recently, student-teacher distillation techniques for hybrid HMM-LSTM models have been shown to scale to very large data sets (1 million hours) for models with high capacity [28,27]. The efficacy of model compression using student-teacher distillation is well established [23,35,34].…”
Section: Introduction
Mentioning; confidence: 99%
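As an illustration of the student-teacher distillation objective this statement refers to, the sketch below computes the cross-entropy of student posteriors against teacher posteriors on synthetic logits. The temperature parameter, class count, and function names are assumptions; the cited papers' exact objectives may differ.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of student posteriors against teacher posteriors.

    The teacher's (optionally temperature-smoothed) class posteriors act as
    soft targets for the student, e.g. on untranscribed audio frames.
    """
    teacher_post = softmax(teacher_logits / temperature)
    student_logp = np.log(softmax(student_logits / temperature) + 1e-12)
    return -np.mean(np.sum(teacher_post * student_logp, axis=-1))

# Toy example: 32 frames, 500 output classes (real acoustic models use thousands).
rng = np.random.default_rng(2)
teacher = rng.normal(size=(32, 500))
student = rng.normal(size=(32, 500))
print("distillation loss:", distillation_loss(student, teacher))
```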