ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413375

EAT: Enhanced ASR-TTS for Self-Supervised Speech Recognition

Abstract: Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) The ASR→TTS direction is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS. 2) In the TTS→ASR direction, a hyper-parameter is introduced to scale the attention context from synthesized speech before sending it to ASR to handle out-of-domain data. Training strategies and the effectiveness of the EAT model are…
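The two modifications named in the abstract can be made concrete with a toy sketch. The following is only an illustration, not the authors' implementation: ToyASR, ToyTTS, ToyLM, the greedy decoding, the exponential LM-reward shaping, and the way the context scale enters the encoder are all assumptions made for this example; the real EAT model uses attention-based sequence-to-sequence networks.

```python
# Minimal sketch of the two EAT ideas from the abstract:
# (1) an LM-based reward weighting the ASR->TTS reconstruction loss, and
# (2) a hyper-parameter scaling the attention context computed from
#     synthesized speech in the TTS->ASR direction.
# All module names and the exact reward shaping are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyASR(nn.Module):
    """Stand-in recognizer: speech features -> token logits via a context."""
    def __init__(self, feat_dim=80, vocab=32, hidden=64):
        super().__init__()
        self.enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, speech, context_scale=1.0):
        ctx, _ = self.enc(speech)   # (B, T, H) stands in for the attention context
        ctx = context_scale * ctx   # EAT idea (2): scale context for synthesized input
        return self.out(ctx)        # (B, T, vocab)

class ToyTTS(nn.Module):
    """Stand-in synthesizer: token ids -> speech features."""
    def __init__(self, vocab=32, feat_dim=80, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.dec = nn.GRU(hidden, feat_dim, batch_first=True)

    def forward(self, tokens):
        feats, _ = self.dec(self.emb(tokens))
        return feats                # (B, T, feat_dim)

class ToyLM(nn.Module):
    """Stand-in language model scoring a token sequence."""
    def __init__(self, vocab=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        # Mean next-token log-probability per utterance, shape (B,).
        h, _ = self.rnn(self.emb(tokens[:, :-1]))
        logp = self.out(h).log_softmax(-1)
        return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).mean(dim=1)

def asr_to_tts_loss(asr, tts, lm, speech, lm_weight=0.3):
    """Unpaired speech: an LM reward (assumed exponential shaping) weights
    how strongly each hypothesis' TTS reconstruction error counts."""
    hyp = asr(speech).argmax(-1)                      # greedy hypothesis (B, T)
    reward = torch.exp(lm_weight * lm(hyp)).detach()  # penalizes LM-implausible hypotheses
    per_utt = ((tts(hyp) - speech) ** 2).mean(dim=(1, 2))
    return (reward * per_utt).mean()                  # gradient trains the TTS here

def tts_to_asr_loss(asr, tts, text, context_scale=0.5):
    """Unpaired text: synthesize speech, down-scale its context, train ASR."""
    with torch.no_grad():
        synth = tts(text)
    logits = asr(synth, context_scale=context_scale)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), text.reshape(-1))
```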

Cited by 16 publications (9 citation statements) · References 19 publications
“…To utilize the speech feature from the frame-by-frame method, researchers try to combine speech synthesis and speech recognition models [25] to extend the data set and correct the inconsistency between text and speech. In the combined model, the speech recognition result is adopted to train the speech synthesis model, and the speech synthesis result is adopted to train the speech recognition model, or intermediate variables in the speech recognition model are adopted to guide the speech synthesis model.…”
Section: Related Work
confidence: 99%
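The alternation this statement describes, with the recognition output supervising the synthesizer and the synthesis output supervising the recognizer, can be sketched as a training loop. The loop below reuses the toy modules and loss functions from the sketch above and is likewise only illustrative, not a reference implementation of any cited system.

```python
# Illustrative alternating cycle on unpaired data, reusing the ToyASR/ToyTTS/
# ToyLM modules and the two loss sketches defined earlier (all assumptions).
asr, tts, lm = ToyASR(), ToyTTS(), ToyLM()
opt_tts = torch.optim.Adam(tts.parameters(), lr=1e-3)
opt_asr = torch.optim.Adam(asr.parameters(), lr=1e-3)

for step in range(100):
    speech = torch.randn(4, 10, 80)        # batch of unpaired speech features
    text = torch.randint(0, 32, (4, 10))   # batch of unpaired token ids

    # ASR -> TTS: the recognition result trains the synthesis model.
    opt_tts.zero_grad()
    asr_to_tts_loss(asr, tts, lm, speech).backward()
    opt_tts.step()

    # TTS -> ASR: the synthesis result trains the recognition model.
    opt_asr.zero_grad()
    tts_to_asr_loss(asr, tts, text).backward()
    opt_asr.step()
```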
“…However, collecting human-annotated speech data is challenging and expensive. Data augmentation using either labelled or unlabelled data [1,2] has been used to alleviate this data scarcity problem. One promising approach is speech data synthesis, which recently contributed to significant progress in domain adaptation [3], medication name recognition [4], accurate transcription of numeric sequences [5], low-resource languages [6], etc.…”
Section: Introduction
confidence: 99%
“…Semi-supervised learning utilizes labeled data as well as unlabeled (or unpaired) data during model training, where the amount of labeled data is in general much smaller than that of unlabeled data. Some early works for semi-supervised end-to-end ASR are based on a reconstruction framework, including approaches based on a text-to-speech model [18]–[20] or a sequential autoencoder [21]–[23]. Others adopted self-supervised pre-training techniques, such as BERT-like mask prediction [24]–[26], contrastive learning [27]–[29], and feature clustering [30], [31], to boost the performance of downstream ASR tasks.…”
Section: Introduction
confidence: 99%
“…The online model ξ is returned for final evaluation and is used to initialize the online and offline models for MPL (lines 13–22). The MPL training lasts E_mpl epochs.…”
confidence: 99%