Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.533

Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation

Abstract: Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language. Previous works show that multitask learning improves ST performance, in which the recognition decoder generates the text of the source language, and the translation decoder obtains the final translations based on the output of the recognition decoder. Because whether the output of the recognition decoder has the correct semantics is more critical than its accuracy, we propose to improve the multitask ST model by utilizing word embedding as the intermediate.
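The abstract describes a multitask architecture: a shared speech encoder, a recognition branch whose output is used by a translation branch, with pre-trained word embeddings serving as the intermediate representation. The sketch below is not the authors' implementation; the layer types, dimensions, names, and the cosine-distance embedding loss are illustrative assumptions (PyTorch).

import torch
import torch.nn as nn

class MultitaskST(nn.Module):
    """Shared speech encoder; recognition branch regresses source-language
    word embeddings; translation branch decodes target tokens from them."""

    def __init__(self, n_mels=80, hidden_dim=256, emb_dim=300, tgt_vocab=8000):
        super().__init__()
        # Shared encoder over log-Mel speech features.
        self.encoder = nn.LSTM(n_mels, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Recognition branch: projects into a pre-trained word-embedding space
        # (the "intermediate" of the title). Frame-to-word alignment and
        # autoregressive decoding are omitted for brevity.
        self.recog_rnn = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, emb_dim)
        # Translation branch: consumes the predicted embeddings.
        self.trans_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.to_tgt_vocab = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, speech):
        enc, _ = self.encoder(speech)              # (B, T, 2*hidden_dim)
        recog, _ = self.recog_rnn(enc)             # (B, T, hidden_dim)
        pred_emb = self.to_embedding(recog)        # (B, T, emb_dim)
        trans, _ = self.trans_rnn(pred_emb)        # (B, T, hidden_dim)
        return pred_emb, self.to_tgt_vocab(trans)  # embeddings, target logits

def multitask_loss(pred_emb, ref_emb, tgt_logits, tgt_ids, alpha=1.0):
    # Cosine distance to the reference (pre-trained) source word embeddings,
    # plus cross-entropy on the target-language translation; sequence lengths
    # are assumed to be pre-aligned in this sketch.
    emb_loss = 1.0 - nn.functional.cosine_similarity(pred_emb, ref_emb, dim=-1).mean()
    ce_loss = nn.functional.cross_entropy(tgt_logits.transpose(1, 2), tgt_ids)
    return alpha * emb_loss + ce_loss

# Shape check with random tensors: batch of 2, 50 frames, 80 Mel bins.
model = MultitaskST()
pred_emb, logits = model(torch.randn(2, 50, 80))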

Cited by 16 publications (7 citation statements) · References 21 publications (23 reference statements)

Citation statements:
“…The advantage of this method is that it leverages text and audio resources to the greatest possible extent [3,4]. In addition, the cascade model gives appropriate initial parameters for fine-tuning the ST task in the next step [5][6][7][8]. The end-to-end structure [9] is a separate subtask containing an ST encoder and an ST decoder.…”
Section: Introduction (mentioning)
Confidence: 99%
“…[10] were trained with additional Spanish text resources (LDC96L16). "Word Embedding" [6] used FastText word embeddings pre-trained on Wikipedia.…”
Section: Results (mentioning)
Confidence: 99%
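The statement above refers to pre-trained FastText word embeddings as the embedding targets. The following is a minimal, hedged sketch of loading such vectors with the official fasttext Python package; the language code and model file (a Common Crawl + Wikipedia release) are assumptions, not details taken from the cited papers.

import fasttext
import fasttext.util

# Download and load pre-trained 300-dimensional vectors (file name is illustrative).
fasttext.util.download_model('en', if_exists='ignore')   # fetches cc.en.300.bin
ft = fasttext.load_model('cc.en.300.bin')
vec = ft.get_word_vector('translation')                  # 300-dim numpy array
print(vec.shape)                                          # (300,)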
“…Source speech recognition becomes a sub-task, and helps to derive more phonetically informative embeddings of the source speech. Moreover, the jointly trained model can provide good initial model parameters for subsequent fine-tuning of the ST-only task [5,6,7,8]. On the other hand, in the cascading approach, it has been shown that the phone sequence of the source speech derived from a well-trained ASR system performs better than the word sequence in ST [9,10].…”
Section: Introduction (mentioning)
Confidence: 99%
“…; Liu, Spanakis, and Niehues (2020) optimize the decoding strategy to achieve low-latency end-to-end speech translation. (Chuang et al. 2020; Salesky and Black 2020; Salesky, Sperber, and Black 2019) explore additional features to enhance end-to-end models.…”
Section: Related Work (mentioning)
Confidence: 99%