Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.236

Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

Abstract: Unifying acoustic and linguistic representation learning has become increasingly crucial for transferring the knowledge learned on abundant high-resource language data to low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to resolve the representation discrepancy between speech and text remains unexplored, which hinders the full utilization of acoustic and linguistic information. Moreover, previous works simply…

Cited by 16 publications (7 citation statements); references 32 publications.

“…Because of the success, previous studies have investigated the pre-trained language model to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performances.…”
Section: Related Work
confidence: 99%
“…The most straightforward method is to employ them as an acoustic feature encoder and then stack a simple layer of neural network on top of the encoder to do speech recognition [9]. After that, some studies present various cascade methods to concatenate pre-trained language and speech representation learning models for ASR [14,15,17,18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large-scaled model parameters have usually made them hard to be used in practice.…”
Section: Related Work
confidence: 99%
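As a concrete illustration of the encoder-plus-simple-layer design described in the statement above, the sketch below stacks a single linear layer on a pre-trained wav2vec 2.0 encoder and trains it with CTC. This is a minimal sketch, not the cited papers' exact setups; the Hugging Face model name, vocabulary size, and dummy inputs are illustrative assumptions.

```python
# Minimal sketch: pre-trained acoustic encoder + one linear layer, trained with CTC.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class EncoderWithLinearHead(nn.Module):
    def __init__(self, vocab_size: int = 32, model_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(model_name)  # pre-trained acoustic encoder
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size)  # the "simple layer" on top

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, hidden)
        return self.head(hidden).log_softmax(dim=-1)            # per-frame log-probs for CTC

# Usage: CTC loss over the frame-level log-probabilities (all inputs are dummies).
model = EncoderWithLinearHead()
ctc = nn.CTCLoss(blank=0)
audio = torch.randn(1, 16000)                 # one second of 16 kHz audio
log_probs = model(audio).transpose(0, 1)      # CTCLoss expects (frames, batch, vocab)
targets = torch.tensor([[5, 8, 2]])           # dummy token ids
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([log_probs.size(0)]),
           target_lengths=torch.tensor([3]))
```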
“…For text-only data, text is mainly used to train an external language model (LM) for joint decoding [11,12,13,14,15]. In order to make use of both unpaired speech and text, many methods have recently been proposed, e.g., integration of a pre-trained acoustic model and LM [16,17,18,19], cycle-consistency based dual-training [20,21,22,23], and shared representation learning [24,25,26,27], which rely on hybrid models with multitask training and some of which become less effective in cases with a very limited amount of labeled data. The current mainstream methods that achieve state-of-the-art (SOTA) results in low-resource ASR use unpaired speech and text for pre-training and training a LM for joint decoding, respectively [7,8], and adopt an additional iterative self-training [28].…”
Section: Introduction
confidence: 99%
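The "joint decoding" with an external LM mentioned in the statement above is often realized as shallow fusion: during beam search, each hypothesis's acoustic score is interpolated with an LM score. The sketch below shows only that scoring step; the function names, LM weight, and length bonus are illustrative assumptions, not values taken from the cited systems.

```python
# Minimal sketch of shallow fusion scoring for joint decoding with an external LM.
import math
from typing import Callable, List, Tuple

def joint_score(hypothesis: List[int],
                am_logprob: float,
                lm_logprob_fn: Callable[[List[int]], float],
                lm_weight: float = 0.3,
                length_bonus: float = 0.5) -> float:
    """Combine acoustic and language model scores for one hypothesis."""
    lm_logprob = lm_logprob_fn(hypothesis)  # external LM trained on text-only data
    return am_logprob + lm_weight * lm_logprob + length_bonus * len(hypothesis)

def rerank_beam(beam: List[Tuple[List[int], float]],
                lm_logprob_fn: Callable[[List[int]], float]) -> List[Tuple[List[int], float]]:
    """Re-order beam hypotheses (token ids, AM log-prob) by the joint score."""
    return sorted(beam, key=lambda h: joint_score(h[0], h[1], lm_logprob_fn), reverse=True)

# Usage with a toy uniform LM over a 100-token vocabulary.
toy_lm = lambda tokens: len(tokens) * math.log(1.0 / 100)
beam = [([5, 8, 2], -12.3), ([5, 9, 2, 7], -12.9)]
print(rerank_beam(beam, toy_lm))
```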
“…Several attempts have been made to use pre-trained LMs indirectly for improving E2E-ASR, such as N-best hypothesis rescoring (Shin et al., 2019; Salazar et al., 2020; Chiu and Chen, 2021; Futami et al., 2021; Udagawa et al., 2022) and knowledge distillation (Futami et al., 2020; Bai et al., 2021; Kubo et al., 2022). Others have investigated directly unifying an E2E-ASR model with a pre-trained LM, where the LM is fine-tuned to optimize ASR in an end-to-end trainable framework (Huang et al., 2021; Zheng et al., 2021; Deng et al., 2021; Yu et al., 2022).…”
Section: Introduction
confidence: 99%
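N-best hypothesis rescoring, one of the indirect uses of pre-trained LMs mentioned in the statement above, can be sketched as follows: each first-pass hypothesis is re-scored by a pre-trained LM and the best combined score wins. GPT-2 and the interpolation weight here are illustrative stand-ins; the cited works use various LMs (including masked-LM pseudo-log-likelihood scoring) and tuned weights.

```python
# Minimal sketch of N-best rescoring with a pre-trained causal LM (GPT-2 as a stand-in).
import torch
from typing import List, Tuple
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_logprob(text: str) -> float:
    """Approximate total log-probability of a hypothesis under the pre-trained LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels=ids the model returns the mean negative log-likelihood
    # over the ids.size(1) - 1 predicted tokens as .loss.
    loss = lm(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def rescore(nbest: List[Tuple[str, float]], lm_weight: float = 0.5) -> str:
    """Pick the hypothesis maximizing ASR score + lm_weight * LM score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]

# Usage: two competing first-pass hypotheses with dummy ASR log-scores.
nbest = [("the cat sat on the mat", -4.1), ("the cat sat on the matt", -3.9)]
print(rescore(nbest))
```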