ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414560
Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

Abstract: This paper proposes an efficient memory transformer, Emformer, for low latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the computation for the key and value in self-attention for the left context. Emformer applies parallelized block processing in training to support low latency models. We carry out experiments on the benchmark LibriSpeech data. Under average latency…
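For intuition, below is a minimal, self-contained sketch (plain NumPy, single attention head, identity projections) of the chunk-wise attention pattern the abstract describes: each incoming chunk attends over a bounded window made of a pooled memory bank summarizing older chunks, the cached keys/values of the previous chunk (the left context), and the chunk itself. The names (`streaming_chunk_attention`, `n_mem_max`) are illustrative rather than from the paper, and mean pooling here is only a stand-in for the paper's memory computation; look-ahead (right) context and learned projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def streaming_chunk_attention(chunks, d, n_mem_max=4):
    """Toy single-head attention over a stream of chunks.

    Each chunk's queries attend over:
      * a small memory bank distilled from all older chunks,
      * cached keys/values of the immediately preceding chunk, and
      * the current chunk itself,
    so the per-chunk attention cost stays bounded instead of growing
    with the full history, mimicking Emformer's augmented memory bank
    and left-context key/value cache.
    """
    memory_k, memory_v = [], []   # distilled history (memory bank)
    cache_k = cache_v = None      # cached left-context keys/values
    outputs = []
    for chunk in chunks:          # chunk: (chunk_len, d)
        q = chunk                 # toy projections: identity
        k_parts, v_parts = [], []
        if memory_k:
            k_parts.append(np.stack(memory_k))
            v_parts.append(np.stack(memory_v))
        if cache_k is not None:
            k_parts.append(cache_k)
            v_parts.append(cache_v)
        k_parts.append(chunk)
        v_parts.append(chunk)
        k = np.concatenate(k_parts)
        v = np.concatenate(v_parts)
        att = softmax(q @ k.T / np.sqrt(d))
        outputs.append(att @ v)
        # cache this chunk's keys/values as the next chunk's left context
        cache_k, cache_v = chunk, chunk
        # distill the finished chunk into one memory vector (mean pooling)
        memory_k.append(chunk.mean(0))
        memory_v.append(chunk.mean(0))
        memory_k, memory_v = memory_k[-n_mem_max:], memory_v[-n_mem_max:]
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
stream = [rng.normal(size=(8, 16)) for _ in range(5)]  # five 8-frame chunks
out = streaming_chunk_attention(stream, d=16)
print(out.shape)  # (40, 16): per-frame outputs with bounded attention cost
```

Readers who want the real model rather than this toy can look at the Emformer implementation shipped in recent torchaudio releases (`torchaudio.models.Emformer`).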

Cited by 105 publications (44 citation statements)
References 34 publications
“…We focus on building models for low latency streaming on-device speech recognition using the Emformer [2] transducer. Emformer is an efficient extension of the Augmented Memory Transformer (AM-TRF) [1].…”
Section: Low Latency Emformer Transducer
confidence: 99%
“…Since we perform experiments for low latency conditions, the center chunk size and look-ahead context size in Emformer are set to 160 ms and 40 ms, respectively. The algorithmic latency [2] of the acoustic encoders is 120 ms.…”
Section: Datasets and Setup
confidence: 99%
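As a quick sanity check on the figures quoted above: under the average-latency convention used in the Emformer paper, a frame waits on average half the center chunk before its chunk is processed, plus the fixed look-ahead. The quoted numbers are consistent with that convention:

```python
# Average algorithmic latency of chunk-based streaming attention,
# assuming the half-chunk-plus-look-ahead convention from the paper.
center_chunk_ms = 160
look_ahead_ms = 40
avg_latency_ms = center_chunk_ms / 2 + look_ahead_ms
print(avg_latency_ms)  # 120.0 ms, matching the figure quoted above
```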