Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
DOI: 10.18653/v1/2022.iwslt-1.11

The YiTrans Speech Translation System for IWSLT 2022 Offline Shared Task

Abstract: This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task, which translates from English audio to German, Chinese, and Japanese. The YiTrans system is built on large-scale pre-trained encoder-decoder models. More specifically, we first design a multi-stage pre-training strategy to build a multi-modality model with a large amount of labeled and unlabeled data. We then fine-tune the corresponding components of the model for the downstream speech translation…
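The abstract describes assembling an encoder-decoder model from separately pre-trained components and then fine-tuning the relevant components for speech translation. As an illustration only, the minimal PyTorch sketch below joins a speech encoder to a text decoder and runs one fine-tuning step; all module names, dimensions, and hyperparameters are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for a pre-trained speech encoder (hypothetical architecture)."""
    def __init__(self, feat_dim=80, d_model=512):
        super().__init__()
        # Convolutional subsampling of log-mel features, then Transformer layers.
        self.subsample = nn.Conv1d(feat_dim, d_model, kernel_size=5, stride=2, padding=2)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, feats):                      # feats: (batch, time, feat_dim)
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)                     # (batch, time', d_model)

class TranslationDecoder(nn.Module):
    """Stand-in for a pre-trained multilingual text decoder."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # Causal masking is omitted here to keep the sketch short.
        return self.out(self.decoder(self.embed(tokens), memory))

class SpeechTranslationModel(nn.Module):
    """End-to-end model built from the two pre-trained components."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, feats, tokens):
        return self.decoder(tokens, self.encoder(feats))

# One fine-tuning step on a dummy (audio-features, target-tokens) batch.
model = SpeechTranslationModel(SpeechEncoder(), TranslationDecoder())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
feats = torch.randn(2, 200, 80)                    # dummy log-mel features
tokens = torch.randint(0, 32000, (2, 20))          # dummy target token ids
logits = model(feats, tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 32000), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```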

Cited by 7 publications (9 citation statements). References 25 publications.
“…Previous well-performing systems submitted to the IWSLT offline and low-resource speech translation tracks made use of various methods to improve the performance of their cascade systems. For the ASR component, many submissions used a combination of transformer and conformer models (Zhang et al., 2022; Li et al., 2022; Nguyen et al., 2021) or fine-tuned existing models (Zhang and Ao, 2022; Zanon Boito et al., 2022; Denisov et al., 2021). They managed to increase ASR performance by voice activity detection for segmentation (Zhang et al., 2022; Ding and Tao, 2021), training the ASR on synthetic data with added punctuation, noise-filtering and domain-specific fine-tuning (Zhang and Ao, 2022; Li et al., 2022), or adding an intermediate model that cleans the ASR output in terms of casing and punctuation (Nguyen et al., 2021).…”
Section: Previous IWSLT Approaches
confidence: 99%
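Several of the cited submissions segment long input audio with voice activity detection (VAD) before running ASR. The rough sketch below shows that preprocessing step, assuming the webrtcvad package and 16 kHz, 16-bit mono PCM input; the frame length and aggressiveness are illustrative choices, not settings reported by any of these systems.

```python
import webrtcvad

def vad_segments(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Split raw 16-bit mono PCM into (start_sec, end_sec) speech segments."""
    vad = webrtcvad.Vad(aggressiveness)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per sample
    segments, seg_start = [], None
    for i in range(0, len(pcm_bytes) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm_bytes[i:i + bytes_per_frame]
        t = i / (2 * sample_rate)                               # byte offset -> seconds
        if vad.is_speech(frame, sample_rate):
            if seg_start is None:
                seg_start = t                                   # speech starts
        elif seg_start is not None:
            segments.append((seg_start, t))                     # speech ends
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(pcm_bytes) / (2 * sample_rate)))
    return segments
```

Each returned segment can then be passed to the ASR component separately instead of decoding the full recording at once.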
“…For the ASR component, many submissions used a combination of transformer and conformer models (Zhang et al., 2022; Li et al., 2022; Nguyen et al., 2021) or fine-tuned existing models (Zhang and Ao, 2022; Zanon Boito et al., 2022; Denisov et al., 2021). They managed to increase ASR performance by voice activity detection for segmentation (Zhang et al., 2022; Ding and Tao, 2021), training the ASR on synthetic data with added punctuation, noise-filtering and domain-specific fine-tuning (Zhang and Ao, 2022; Li et al., 2022), or adding an intermediate model that cleans the ASR output in terms of casing and punctuation (Nguyen et al., 2021). The MT components were mostly transformer-based (Zhang et al., 2022; Nguyen et al., 2021; Bahar et al., 2021) or fine-tuned from pre-existing models (Zhang and Ao, 2022).…”
Section: Previous IWSLT Approaches
confidence: 99%
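One of the techniques cited above is an intermediate model that restores casing and punctuation in the raw ASR output before it is passed to the MT component. The cited systems train a dedicated model for this; purely as a placeholder for where such a step sits in a cascade, the toy function below applies simple rule-based cleanup instead of a learned model.

```python
def clean_asr_output(hypothesis: str) -> str:
    """Toy stand-in for an intermediate casing/punctuation model in a cascade.

    Real systems learn this mapping; here we only capitalize the first word
    and make sure the hypothesis ends with sentence-final punctuation.
    """
    text = hypothesis.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text

# Example: raw ASR output -> cleaned text fed to the MT component.
print(clean_asr_output("the system translates english audio to german"))
# -> "The system translates english audio to german."
```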
“…• YI (Zhang and Ao, 2022) submitted primary end-to-end and cascaded systems for all three language directions using large-scale pre-trained models. Starting from pre-trained speech and language models, the authors investigated multi-stage pre-training and task-dependent fine-tuning for ASR, MT and speech translation.…”
Section: Submissions
confidence: 99%
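The submission description above mentions task-dependent fine-tuning of the pre-trained components for ASR, MT, and speech translation. A minimal sketch of one common realization of that idea is to freeze part of a pre-trained model and update only the task-relevant parameters; the choice of which parameters to freeze below is illustrative and is not the YiTrans recipe.

```python
import torch

def finetune_setup(model: torch.nn.Module, freeze_prefix: str = "encoder."):
    """Freeze parameters whose names start with `freeze_prefix` and return an
    optimizer over the remaining (task-specific) parameters."""
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith(freeze_prefix):
            param.requires_grad = False      # keep pre-trained weights fixed
        else:
            trainable.append(param)
    return torch.optim.AdamW(trainable, lr=5e-5)

# Usage with the hypothetical model above, keeping the speech encoder frozen:
#   optimizer = finetune_setup(st_model, freeze_prefix="encoder.")
```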