Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021) 2021
DOI: 10.18653/v1/2021.iwslt-1.11
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021

Abstract: This paper describes the submission to the IWSLT 2021 offline speech translation task by the UPC Machine Translation group. The task consists of building a system capable of translating English audio recordings extracted from TED talks into German text. Submitted systems can be either cascade or end-to-end and use a custom or given segmentation. Our submission is an end-to-end speech translation system, which combines pre-trained models (Wav2Vec 2.0 and mBART) with coupling modules between the encoder and deco…

Cited by 18 publications (31 citation statements)
References 30 publications (32 reference statements)
“…With respect to the other segmentation methods, we obtain improvements of 4.5 BLEU over the classical pause-based approach and 2 BLEU over the hybrid approaches of [10] and [11]. Compared to the hybrid method of [14], which also employs pre-trained wav2vec 2.0 models, SHAS achieves an increase of 1.2 BLEU, or 3.8% closer to the BLEU of the manual segmentation. Additional improvements can be observed for all the language pairs, apart from the most high-resourced en-de, by using a multilingual SFC.…”
Section: Results
confidence: 91%
“…On the other side, the first layers are better, but still not optimal, probably due to the low contextual information. We find that the middle layers (11–20) have the most informative representations for use by the classifier. More specifically, the output of the 14th layer achieves the best segmentation, retaining almost 95% of the manual BLEU score.…”
Section: A Appendix
confidence: 99%
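The layer-probing idea quoted above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the random features and the tiny logistic classifier are placeholders for the actual wav2vec 2.0 hidden states and a trained frame classifier.

```python
import numpy as np

# Illustrative sketch: probe which wav2vec 2.0 encoder layer best feeds a
# frame-level segmentation classifier. Shapes and the random stand-in
# features are assumptions; a real run would use the tuple of hidden states
# returned by the pre-trained encoder.
rng = np.random.default_rng(0)
NUM_LAYERS, FRAMES, DIM = 24, 50, 1024  # wav2vec 2.0 Large has 24 layers

# Stand-in for per-layer hidden states, indexed 0 (input) .. 24 (top layer).
hidden_states = [rng.standard_normal((FRAMES, DIM))
                 for _ in range(NUM_LAYERS + 1)]

# Tiny logistic frame classifier: P(frame is inside speech).
W = rng.standard_normal(DIM) / np.sqrt(DIM)

def frame_probs(layer_idx):
    """Speech probability per frame from the chosen layer's features."""
    logits = hidden_states[layer_idx] @ W   # shape (FRAMES,)
    return 1.0 / (1.0 + np.exp(-logits))

# The cited analysis found layer 14 the most informative choice.
probs = frame_probs(14)
assert probs.shape == (FRAMES,)
```

In practice the same classifier would be trained and evaluated once per candidate layer, and the layer whose segmentation yields the highest downstream BLEU (here, layer 14) would be kept.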
“…In recent studies, many speech segmentation methods based on VAD have been proposed for ST. Gaido et al. [12] and Inaguma et al. [13] used the heuristic concatenation of VAD segments up to a fixed length to address the over-segmentation problem. Gállego et al. [14] used a pre-trained ASR model, wav2vec 2.0 [15], for silence detection. Yoshimura et al. [16] used an RNN-based ASR model that treats consecutive blank symbols (" ") as a segment boundary when decoding with connectionist temporal classification (CTC).…”
Section: Related Work
confidence: 99%