Findings of the IWSLT 2021 Evaluation Campaign

Anastasopoulos, Antonios; Bojar, Ondřej; Bremerman, Jacob; Cattoni, Roldano; Elbayad, Maha; Federico, Marcello; Ma, Xutai; Nakamura, Satoshi; Negri, Matteo; Niehues, Jan; Pino, Juan; Salesky, Elizabeth; Stüker, Sebastian; Sudoh, Katsuhito; Turchi, Marco; Waibel, Alexander; Wang, Changhan; Wiesner, Matthew

doi:10.18653/v1/2021.iwslt-1.1

Cited by 57 publications

(60 citation statements)

References 59 publications


“…In the past few years, self-supervised speech representation learning has been shown very successful on ASR [10][11][12][13] and ST [14][15][16][17][18] tasks. Recently, it has been expanded to also learn from text, with the emergence of semi-supervised speech-text joint representation learning [19][20][21].…”

Section: Related Work (mentioning)

confidence: 99%

Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation

Jia, Johnson, Macherey et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.

“…We described FBK's participation in the IWSLT2021 Offline Speech Translation task (Anastasopoulos et al, 2021). Our work focused on a multi-step training pipeline involving data augmentation (SpecAugment and MT-based synthetic data), multi-domain transfer learning (KD training first and then fine-tuning on synthetic and native data) and ad-hoc fine-tuning on randomly segmented data.…”

Section: Discussion (mentioning)

confidence: 99%

“…Unlike simultaneous ST, where the audio is translated as soon as it is produced, in the offline setting the audio is entirely available and translated at once. In continuity with the last two rounds of the IWSLT evaluation campaign (Niehues et al, 2019; Ansari et al, 2020), the IWSLT2021 Offline Speech Translation task (Anastasopoulos et al, 2021) focused on the translation into German of English audio data extracted from TED talks. Participants could approach the task either with a cascade architecture or with a direct end-to-end system.…”

Section: Introduction (mentioning)

confidence: 99%

Dealing with training and test segmentation mismatch: FBK@IWSLT2021

Papi, Gaido, Negri et al. 2021

Preprint

Self Cite


This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trained on the available corpora. Differently, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops occurring when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts for both audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.


“…We also evaluated our models on the Business Scene Dialogue Corpus (Rikters et al, 2019) to check whether they worked on conversations. We also added test sets from shared tasks: WMT 2020, 2021 news translation shared tasks (Barrault et al, 2020; Akhbardeh et al, 2021), WMT 2019, 2020 robustness shared tasks (Li et al, 2019; Specia et al, 2020), and the IWSLT 2021 simultaneous translation task (Anastasopoulos et al, 2021). Although some of the test sets are intended for specific translation directions (e.g., En→Ja), we used them for both En→Ja and Ja→En directions for reference.…”

Section: Test Sets (mentioning)

confidence: 99%

JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

Murata, Chousa, Suzuki et al. 2022

Preprint


Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available, compared to such resource-rich languages as English-German. It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0. Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new corpus boosts the accuracy of machine translation models on various domains. The JParaCrawl v3.0 corpus will eventually be publicly available online for research purposes.
