2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004027
Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus

Abstract: In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around a 90% word recognition rate for the general domain mainly by using joint training of connectionist temporal classification (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pretraining and data augmentation methods. In addition, we compressed our models by more than 3.4 times small…


Cited by 60 publications (56 citation statements)
References 18 publications
“…As shown in Fig. 2, we apply the "global" mean and variance normalization as in [17], since utterance-by-utterance mean and variance normalization is not easily realizable for streaming speech recognition [18]. Note that mean subtraction must be applied before masking; otherwise, the non-zero values in the masked region will distort the model during training.…”
Section: Small Energy Masking Algorithm
confidence: 99%
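The ordering point in the quote above — normalize first, then mask — can be sketched as follows. This is a minimal NumPy sketch with illustrative names (`normalize_then_mask`, toy statistics), not the cited paper's actual pipeline.

```python
import numpy as np

def normalize_then_mask(features, global_mean, global_std, mask):
    # Global mean/variance normalization is applied BEFORE masking, so
    # masked frames end up exactly zero. Masking first would leave the
    # masked zeros shifted to -mean/std after normalization, distorting
    # the model during training.
    normalized = (features - global_mean) / global_std
    return normalized * mask

# Toy example: 4 frames x 3 features, frame 2 masked out.
feats = np.array([[7., 5., 3.],
                  [5., 5., 5.],
                  [9., 1., 5.],
                  [5., 7., 5.]])
mean = np.full(3, 5.0)   # stand-in for corpus-level "global" mean
std = np.full(3, 2.0)    # stand-in for corpus-level "global" std
mask = np.array([[1.], [1.], [0.], [1.]])
out = normalize_then_mask(feats, mean, std, mask)
```

With the order reversed (mask, then subtract the mean), the masked frame would hold `-mean/std` instead of zeros, which is exactly the distortion the quote warns about.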
“…We have tried various types of training strategies for better performance [53,54]. Our MoChA implementation and optimization are described in detail in another paper of ours [50]. The structure of our entire end-to-end speech recognition system is shown in Fig.…”
Section: Structure of the End-to-End Speech Recognition System
confidence: 99%
“…We use an end-to-end attention-based ASR model [11,12] with an architecture similar to the one proposed in [10], as depicted in Fig. 1.…”
Section: ASR Model
confidence: 99%
“…# train cohort models
for utterance in validation sets:
    correlation(wer_utt, wer_avg_all)
# filter utterances < min_correl and min_length
for utterance in validation sets:
    if correl[utterance] > correl_min:
        new_set.append(sample)
Listing 1: Heuristic to find "condensed" datasets in Fig.…”
Section: Small Dataset Creation
confidence: 99%
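The quoted listing can be read as the following runnable sketch. The function and threshold names (`condense`, `correl_min`) are assumptions, and the quoted minimum-length filter is omitted here since its definition is not visible in the excerpt.

```python
import numpy as np

def condense(wer, correl_min=0.8):
    # wer: dict mapping utterance_id -> array of WERs, one entry per
    # cohort model. Keep utterances whose per-model WER profile
    # correlates with the cohort-average profile above correl_min.
    wer_avg_all = np.mean(np.stack(list(wer.values())), axis=0)
    new_set = []
    for utterance, wer_utt in wer.items():
        correl = np.corrcoef(wer_utt, wer_avg_all)[0, 1]
        if correl > correl_min:
            new_set.append(utterance)
    return new_set
```

The idea is that an utterance whose difficulty ranking across models tracks the full validation set is representative of it, so a small "condensed" subset of such utterances can stand in for the whole set.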