Improving Performance of End-to-End ASR on Numeric Sequences

Peyser, Cal; Zhang, Hao; Sainath, Tara N.; Wu, Zelin

doi:10.21437/interspeech.2019-1345

Cited by 32 publications

(16 citation statements)

References 16 publications


“…Therefore, we employ FST-based text normalization methods to automatically normalize written form of text. This is similar to synthetic data generation employed successfully in the past [10]. However, the data prepared in such a way poses a number of problems for modeling ITN:…”

Section: Text Processing Pipeline (mentioning)

confidence: 92%

Neural Inverse Text Normalization

Sunkara, Shivade, Bodapati et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


While there have been several contributions exploring state of the art techniques for text normalization, the problem of inverse text normalization (ITN) remains relatively unexplored. The best known approaches leverage finite state transducer (FST) based models which rely on manually curated rules and are hence not scalable. We propose an efficient and robust neural solution for ITN leveraging transformer based seq2seq models and FST-based text normalization techniques for data preparation. We show that this can be easily extended to other languages without the need for a linguistic expert to manually curate them. We then present a hybrid framework for integrating Neural ITN with an FST to overcome common recoverable errors in production environments. Our empirical evaluations show that the proposed solution minimizes incorrect perturbations (insertions, deletions and substitutions) to ASR output and maintains high quality even on out of domain data. A transformer based model infused with pretraining consistently achieves a lower WER across several datasets and is able to outperform baselines on English, Spanish, German and Italian datasets.


“…Because LM fusion methods require interpolating with an external LM, both the computational cost and footprint are increased, which is not applicable to ASR on devices. With the advance of TTS technologies, a new trend is to adapt E2E models with the synthesized speech generated from the new-domain text [12,156,177,178]. This is especially useful for adapting RNN-T, in which the prediction network works similarly to an LM.…”

Section: B) Domain Adaptation (mentioning)

confidence: 99%

Recent Advances in End-to-End Automatic Speech Recognition

Li 2021

Preprint


Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.


“…For example, if a system that is intended for use in a voicemail transcription setting achieves 3% overall WER, but it mistranscribes every phone number, that system would almost certainly not be preferred over a system that achieves 3.5% overall WER, but that makes virtually no mistakes on phone numbers. As Peyser et al (2019) show, such examples are far from theoretical; fortunately, as they show, it is also possible to create synthetic test sets using text-to-speech systems to get a sense of WER in a specific context. Standard tools like NIST SCLITE 3 can be used to calculate WER and various additional statistics.…”

Section: WER (mentioning)

confidence: 99%

Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

2021


Where have we been, and where are we going? It is easier to talk about the past than the future. These days, benchmarks evolve more bottom up (such as papers with code). There used to be more top-down leadership from government (and industry, in the case of systems, with benchmarks such as SPEC). Going forward, there may be more top-down leadership from organizations like MLPerf and/or influencers like David Ferrucci, who was responsible for IBM's success with Jeopardy, and has recently written a paper suggesting how the community should think about benchmarking for machine comprehension. Tasks such as reading comprehension become even more interesting as we move beyond English. Multilinguality introduces many challenges, and even more opportunities.
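
The citation statement quoted in this entry argues that overall WER can hide failures on a specific category such as phone numbers. A minimal Python sketch of that comparison follows, assuming whitespace tokenization and a plain word-level Levenshtein distance. The function names and the two small test sets are hypothetical illustrations, not data from the cited papers, and this is not the scoring procedure used by NIST SCLITE.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("reference text contains no words")
    return errors / words


# Hypothetical (reference, hypothesis) pairs, for illustration only.
general_set = [
    ("please call me back tomorrow", "please call me back tomorrow"),
    ("i will be late for the meeting", "i will be late for the meeting"),
]
numeric_set = [
    ("my number is 555 0100", "my number is five five five zero one hundred"),
]

if __name__ == "__main__":
    print("overall WER:", wer(general_set + numeric_set))
    print("numeric-only WER:", wer(numeric_set))
```

Scoring the numeric subset separately, rather than only the pooled set, is what exposes the kind of gap the statement describes. Whether written-form digits count as the correct reference is a choice that depends on the intended output format.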
