ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054235

Improving Proper Noun Recognition in End-to-End ASR by Customization of the MWER Loss Criterion

Abstract: Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems in that a particular name may appear only rarely during training, and may have a pronunciation similar to that of a more common word. Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations and a language model that can be trained on a large text-only corpus. Past work has addressed this issue by incorporating additional tra…

Cited by 11 publications
(6 citation statements)
References 25 publications
“…To this end, we compute unigram statistics for both corpora and construct a list of unigrams that occur at most five times in the AM data (about three quarters of all words) and at least 150 times in the LM data (about 99% of all […] To measure tail performance, we target words that have pronunciations that are surprising given the spelling. Unusual pronunciations have been shown to be difficult for ASR systems [25,26,27]. To select examples with surprising utterances, we manually assemble a map from grapheme sequences to corresponding phoneme sequences.…”
Section: Evaluation Sets
Type: mentioning
confidence: 99%
“…Classic ASR models leverage unpaired text data with a separately trained language model (LM) and second-pass rescoring model [7], but unpaired text data cannot be easily utilized when training E2E models. Although E2E models have overall shown strong results, they have been shown to have difficulty accurately modeling tail phenomena such as proper nouns, numerics, and accented speech [8,9,10,11], due to the requirement that they be trained on paired (speech-transcript) data.…”
Section: Introduction
Type: mentioning
confidence: 99%
“…Proper nouns have been identified as a challenging problem in ASR for a while now [4]. Recently some approaches have arisen to tackle this challenge with E2E ASR using a specialised architecture and losses [5] or using specific data and training procedures to better represent contextual information [6]. Our approach is meant for rare words in general, however in this work we choose rare proper nouns as exemplary data and use few-shot learning to improve performance on them.…”
Section: Introduction
Type: mentioning
confidence: 99%