Interspeech 2020
DOI: 10.21437/interspeech.2020-1184

RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications

Abstract: Deep learning enables the development of efficient end-to-end speech processing applications while bypassing the need for expert linguistic and signal processing features. Yet, recent studies show that good quality speech resources and phonetic transcription of the training data can enhance the results of these applications. In this paper, the RECOApy tool is introduced. RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications. The tool implements an e…

Cited by 4 publications (4 citation statements)
References 15 publications
“…An encoder-decoder model with attention (Toshniwal and Livescu 2016) and a convolutional architecture combined with n-grams (Rao et al 2015) achieve similar results when applied to the same dataset. Transformer-based architectures are proposed in Sun et al (2019); Yolchuyeva, Németh, and Gyires-Tóth (2019b); Stan (2020) and slightly improve the error rates. Sun et al (2019) report a WER around 20% obtained with a model enriched through knowledge distillation using unlabelled source words.…”
Section: Related Work
confidence: 99%
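The word error rate (WER) reported above is, in the grapheme-to-phoneme setting, the fraction of words whose predicted pronunciation differs from the reference, while phoneme-level error is usually a Levenshtein edit distance. The sketch below is a generic illustration of both metrics, not the evaluation code used in the cited works:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def g2p_wer(refs, hyps):
    """G2P word error rate: a word counts as wrong if any phoneme differs."""
    errors = sum(r != h for r, h in zip(refs, hyps))
    return errors / len(refs)
```

For example, `edit_distance(list("kitten"), list("sitting"))` is 3, and a predicted pronunciation that differs in a single phoneme makes that whole word count as an error under `g2p_wer`.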
“…The Transformer’s hyperparameter selection is based on the strategy introduced in (Stan 2020). The set of hyperparameters which were optimised are shown in Table 8.…”
Section: Concurrent Lexical Information Prediction
confidence: 99%
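A hyperparameter selection strategy of the kind referenced here can be sketched as an exhaustive grid search over a small search space. The parameter names and value ranges below are purely illustrative and are not the hyperparameters optimised in Stan (2020):

```python
from itertools import product

# Illustrative search space only; the actual hyperparameters tuned
# in the cited work are not reproduced here.
GRID = {
    "layers": [2, 4, 6],
    "heads": [4, 8],
    "dropout": [0.1, 0.3],
}

def grid_search(score_fn, grid):
    """Evaluate score_fn (lower is better) on every configuration in the grid
    and return the best configuration with its score."""
    best_cfg, best_score = None, float("inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = score_fn(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice `score_fn` would train a model with the given configuration and return a validation error rate; the exhaustive loop is only feasible for small grids, which is why random or Bayesian search is often preferred for larger spaces.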
“…The average number of utterances pertaining to each speaker is 1747, and the total duration of the recordings for all speakers amounts to 59 hours and 39 minutes. Due to the current global pandemic situation, the recordings were performed in the speakers' home environments using the RECOApy tool [21], and were lightly checked for errors by the authors. Some of the issues noticed in the recordings relate to reverberation and background noise, as well as some utterances that are chopped either at the beginning or the end, yielding incorrect text-to-audio alignments.…”
Section: Speech Corpus
confidence: 99%