Interspeech 2019
DOI: 10.21437/interspeech.2019-1345

Improving Performance of End-to-End ASR on Numeric Sequences

Abstract: Recognizing written domain numeric utterances (e.g., I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g., I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR mod…
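The abstract hinges on mapping a spoken-domain hypothesis ("one dollar and twenty five cents") back to its written form ("$1.25"). The paper does this denormalization with an FST verbalizer; the toy Python sketch below only illustrates the direction of that mapping for the currency example, and the function and table names are hypothetical, not the paper's implementation.

```python
# Toy spoken-to-written denormalization for the currency example in the
# abstract. The paper performs this step with an FST verbalizer; this
# dict-based function is only an illustration of the mapping direction,
# and SPOKEN_NUMBERS / denormalize_currency are hypothetical names.

SPOKEN_NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "twenty five": 25,
}

def denormalize_currency(spoken: str) -> str:
    """Map 'one dollar and twenty five cents' -> '$1.25' (toy coverage only)."""
    words = spoken.lower().rstrip(".").split()
    split = words.index("and") if "and" in words else len(words)
    dollar_words = " ".join(w for w in words[:split] if w not in ("dollar", "dollars"))
    cent_words = " ".join(w for w in words[split + 1:] if w not in ("cent", "cents"))
    return f"${SPOKEN_NUMBERS.get(dollar_words, 0)}.{SPOKEN_NUMBERS.get(cent_words, 0):02d}"

print(denormalize_currency("one dollar and twenty five cents."))  # -> $1.25
```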

Cited by 32 publications (16 citation statements)
References 16 publications
“…Therefore, we employ FST-based text normalization methods to automatically normalize written form of text. This is similar to synthetic data generation employed successfully in the past [10]. However, the data prepared in such a way poses a number of problems for modeling ITN:…”
Section: Text Processing Pipeline
confidence: 92%
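The quoted passage uses FST-based text normalization in the opposite direction: written text is expanded into spoken form so that synthetic spoken-domain training data can be generated. As a rough sketch of that expansion step (real pipelines compile such rules as FST grammars; the regex rule and helper names below are assumptions for illustration, not the cited system):

```python
# Illustrative written-to-spoken expansion, the normalization direction the
# quoted passage uses to manufacture spoken-domain training text from written
# text. Real pipelines compile such rules as FST grammars; this regex-based
# version is a simplified stand-in.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TENS = ["", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-99 (toy coverage only; teens are not handled)."""
    if n < 10:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def normalize_currency(written: str) -> str:
    """Expand '$1.25' into 'one dollar and twenty five cents'."""
    def expand(m: re.Match) -> str:
        dollars, cents = int(m.group(1)), int(m.group(2))
        unit = "dollar" if dollars == 1 else "dollars"
        return f"{number_to_words(dollars)} {unit} and {number_to_words(cents)} cents"
    return re.sub(r"\$(\d+)\.(\d{2})", expand, written)

print(normalize_currency("I need $1.25."))  # -> I need one dollar and twenty five cents.
```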
“…Because LM fusion methods require interpolating with an external LM, both the computational cost and footprint are increased, which is not applicable to ASR on devices. With the advance of TTS technologies, a new trend is to adapt E2E models with the synthesized speech generated from the new-domain text [12,156,177,178]. This is especially useful for adapting RNN-T, in which the prediction network works similarly to an LM.…”
Section: B) Domain Adaptation
confidence: 99%
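The "interpolating with an external LM" that this passage says increases compute and footprint typically means shallow fusion: every decoding step adds a weighted external-LM log-probability to the ASR score. A minimal sketch under that assumption (the fusion weight and the example scores are invented for illustration):

```python
# Minimal shallow-fusion sketch: at each beam-search step the ASR token score
# is interpolated with an external LM score, which is the extra computation
# and model footprint the quoted passage refers to. The weight and scores
# below are illustrative, not from the cited work.
from typing import Dict

def fused_scores(asr_log_probs: Dict[str, float],
                 lm_log_probs: Dict[str, float],
                 lm_weight: float = 0.3) -> Dict[str, float]:
    """Combine per-token scores as log P_ASR + lambda * log P_LM."""
    return {tok: lp + lm_weight * lm_log_probs.get(tok, float("-inf"))
            for tok, lp in asr_log_probs.items()}

# Without fusion the generic token "one" wins; the external LM, trained on
# written-domain text, tips the decision toward the numeric form "$1.25".
asr = {"$1.25": -1.6, "one": -1.5}
lm = {"$1.25": -0.5, "one": -3.0}
scores = fused_scores(asr, lm)
print(max(scores, key=scores.get))  # -> $1.25
```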
“…For example, if a system that is intended for use in a voicemail transcription setting achieves 3% overall WER, but it mistranscribes every phone number, that system would almost certainly not be preferred over a system that achieves 3.5% overall WER, but that makes virtually no mistakes on phone numbers. As Peyser et al. (2019) show, such examples are far from theoretical; fortunately, as they show, it is also possible to create synthetic test sets using text-to-speech systems to get a sense of WER in a specific context. Standard tools like NIST SCLITE can be used to calculate WER and various additional statistics.…”
Section: WER
confidence: 99%
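WER, as SCLITE-style tools compute it, is the word-level Levenshtein distance divided by the number of reference words, which is how a system can post a low overall WER while still failing badly on short numeric-heavy utterances. A hedged sketch of that computation (not the NIST SCLITE tool itself):

```python
# Word error rate as SCLITE-style scoring computes it: word-level Levenshtein
# distance divided by the number of reference words. This sketch is not the
# NIST SCLITE tool; it just makes the overall-vs-numeric WER point concrete.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted digit in a short phone-number utterance is already a 12.5%
# WER for that utterance, even if the corpus-level WER stays near 3%.
print(wer("call 5 5 5 1 2 3 4", "call 5 5 5 1 2 3 0"))  # -> 0.125
```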