Improving Proper Noun Recognition in End-To-End ASR by Customization of the MWER Loss Criterion

Peyser, Cal; Sainath, Tara N.; Pundak, Golan

doi:10.1109/icassp40776.2020.9054235

Cited by 11 publications

(6 citation statements)

References 25 publications

“…To this end, we compute unigram statistics for both corpora and construct a list of unigrams that occur at most five times in the AM data (about three quarters of all words) and at least 150 times in the LM data (about 99% of all To measure tail performance, we target words that have pronunciations that are surprising given the spelling. Unusual pronunciations have been shown to be difficult for ASR systems [25,26,27]. To select examples with surprising utterances, we manually assemble a map from grapheme sequences to corresponding phoneme sequences.…”

Section: Evaluation Sets (mentioning)

confidence: 99%

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

et al. 2020

Self Cite

End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine-tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.

“…Classic ASR models leverage unpaired text data with a separately trained language model (LM) and second-pass rescoring model [7], but unpaired text data cannot be easily utilized when training E2E models. Although E2E models have overall shown strong results, they have been shown to have difficulty accurately modeling tail phenomena such as proper nouns, numerics, and accented speech [8,9,10,11], due to the requirement that they be trained on paired (speech-transcript) data.…”

Section: Introduction (mentioning)

confidence: 99%

Language model fusion for streaming end to end speech recognition

Cabrera, Liu, Ghodsi et al. 2021

Preprint

Streaming processing of speech audio is required for many contemporary practical speech recognition tasks. Even with the large corpora of manually transcribed speech data available today, it is impossible for such corpora to adequately cover the long tail of linguistic content that is important for tasks such as open-ended dictation and voice search. We seek to address both the streaming and the tail recognition challenges by using a language model (LM) trained on unpaired text data to enhance the end-to-end (E2E) model. We extend shallow fusion and cold fusion approaches to streaming Recurrent Neural Network Transducer (RNNT), and also propose two new competitive fusion approaches that further enhance the RNNT architecture. Our results on multiple languages with varying training set sizes show that these fusion methods improve streaming RNNT performance through introducing extra linguistic features. Cold fusion works consistently better on streaming RNNT with up to an 8.5% WER improvement.

“…Proper nouns have been identified as a challenging problem in ASR for a while now [4]. Recently some approaches have arisen to tackle this challenge with E2E ASR using a specialised architecture and losses [5] or using specific data and training procedures to better represent contextual information [6]. Our approach is meant for rare words in general, however in this work we choose rare proper nouns as exemplary data and use few-shot learning to improve performance on them.…”

Section: Introduction (mentioning)

confidence: 99%

Meta-Learning for Improving Rare Word Recognition in End-to-End ASR

Lux 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work we take on the challenge of rare word recognition in end-to-end (E2E) automatic speech recognition (ASR) by integrating a meta learning mechanism into an E2E ASR system, enabling few-shot adaptation. We propose a novel method of generating embeddings for speech, changes to four meta learning approaches, enabling them to perform keyword spotting and an approach to using their outcomes in an E2E ASR system. We verify the functionality of each of our three contributions in two experiments exploring their performance for different amounts of classes (N-way) and examples per class (k-shot) in a few-shot setting. We find that the information encoded in the speech embeddings suffices to allow the modified meta learning approaches to perform continuous signal spotting. Despite the simplicity of the interface between keyword spotting and speech recognition, we are able to consistently improve word error rate by up to 5%.
