2020
DOI: 10.48550/arxiv.2009.09704
Preprint

"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Abstract: An end-to-end speech-to-text translation (ST) model takes audio in a source language and outputs text in a target language. Inspired by neuroscience, where humans use separate perception and cognition systems to process different information, we propose TED, Transducer-Encoder-Decoder, a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the target sentence translation loss, TED includes two auxiliary supervising signals to guide the acoustic transducer …
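The triple-supervision setup described in the abstract, a main translation loss combined with two auxiliary supervising signals, can be sketched as a weighted sum of loss terms. The weight names and values below are illustrative assumptions, not taken from the paper; the exact loss definitions are in the full text.

```python
def triple_supervision_loss(st_loss: float,
                            aux_acoustic_loss: float,
                            aux_semantic_loss: float,
                            w_acoustic: float = 0.3,
                            w_semantic: float = 0.3) -> float:
    """Combine the main ST translation loss with two auxiliary losses.

    st_loss            -- primary speech-to-text translation loss
    aux_acoustic_loss  -- auxiliary signal guiding the acoustic transducer
    aux_semantic_loss  -- auxiliary signal guiding the semantic encoder
    The weighting scheme here is a hypothetical sketch.
    """
    return st_loss + w_acoustic * aux_acoustic_loss + w_semantic * aux_semantic_loss


# Example: with weights 0.5 and 0.25, the combined loss is 1.0 + 0.5*2.0 + 0.25*4.0
combined = triple_supervision_loss(1.0, 2.0, 4.0, w_acoustic=0.5, w_semantic=0.25)
```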

Cited by 14 publications (6 citation statements)
References 44 publications
“…To make a fair comparison with previous approaches on the Augmented LibriSpeech corpus, we list both tokenized and detokenized BLEU scores, as illustrated in Table 2. Our transformer-based model achieves superior results versus recent works on knowledge distillation , curriculum pre-training (Wang et al, 2020b), and TED (Dong et al, 2020). Besides, compared with its counterpart Espnet-ST, our basic setting (Transformer ST + asrPT) also outperforms it by 1 BLEU.…”
Section: Results
confidence: 80%
“…Many ST studies conduct experiments on different datasets. evaluate the method on TED English-Chinese; and Dong et al (2020) use the Augmented LibriSpeech English-French and IWSLT2018 English-German datasets; and show the results on the CoVoST dataset and the FR/RO portions of the MuST-C dataset. The use of different datasets makes it difficult to compare the performance of their approaches.…”
Section: Introduction
confidence: 99%
“…They presented a comprehensive study of the impact of existing semi-supervised learning techniques on ST and showed that they greatly reduce the need for additional supervision in the form of labeled ASR or MT parallel data. Moreover, Dong et al [1277] proposed a listen-understand-translate model, in which the proposed framework utilizes a pre-trained BERT model to enforce the upper encoder to produce as much semantic information as possible, without extra data. Le et al [1278] have presented a study of adapters for multilingual ST and shown that language-specific adapters can enable a fully trained multilingual ST model to be further specialized in each language pair.…”
Section: Pre-training With Unlabeled Speech/text Data
confidence: 99%
“…Despite multiple advantages, cascade systems suffer from a major drawback: they propagate erroneous early decisions into MT models, which then degrade translation performance. To mitigate this degradation, rather than passing a single ASR output sequence to the MT model, other forms such as lattices, n-best hypotheses, and continuous representations have been explored in (Anastasopoulos and Chiang, 2018; Zhang et al, 2019; Sperber et al, 2019; Vydana et al, 2021; Dong et al, 2020).…”
Section: Introduction
confidence: 99%