Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
DOI: 10.18653/v1/2021.iwslt-1.2

The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021

Abstract: This paper describes USTC-NELSLIP's submissions to the IWSLT2021 Simultaneous Speech Translation task. We proposed a novel simultaneous translation model, Cross Attention Augmented Transducer (CAAT), which extends conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show CAAT achieves better quality-latency trade-offs compared to wait-k, one of the previous state-of-the-art approaches.
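For context on the baseline named in the abstract, the sketch below illustrates the wait-k read/write schedule (Ma et al., 2019) that CAAT is compared against: the decoder first reads k source tokens, then alternates write/read until the source is exhausted. The function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the wait-k decision policy (Ma et al., 2019), the baseline
# the abstract compares CAAT against. Names are illustrative assumptions.

def wait_k_schedule(num_source_tokens: int, num_target_tokens: int, k: int):
    """Yield a sequence of 'READ'/'WRITE' actions for a wait-k policy."""
    read, written = 0, 0
    while written < num_target_tokens:
        if read < min(k + written, num_source_tokens):
            read += 1          # still allowed to consume more source
            yield "READ"
        else:
            written += 1       # emit the next target token
            yield "WRITE"


# Example: 6 source tokens, 5 target tokens, k = 3
print(list(wait_k_schedule(6, 5, 3)))
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```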

Cited by 11 publications (7 citation statements) · References 21 publications
“…6,7,8] exploited transfer learning from ASR and MT showing, for instance, that pre-training the ST encoder on ASR data can yield significant improvements. On the data side, the most promising approach is data augmentation, which has been experimented via knowledge distillation from a neural MT (NMT) model [9], synthesizing monolingual MT data in the source language [10], multilingual training [11], or translating monolingual ASR data into the target language [10,12,13]. Nevertheless, despite some claims of big industrial players operating in rich data conditions [10], top results at recent shared tasks [13] show that effectively exploiting the scarce training data available still remains a crucial issue to reduce the performance gap with cascade ST solutions.…”
Section: Introduction
confidence: 99%
“…Lastly, we compare our policy with the two winning systems of the last two years (2021, and 2022). The 2021 winner (Liu et al, 2021a) was based on an architecture named Cross Attention Augmented Transducer (CAAT), which was specifically tailored for the SimulST task (Liu et al, 2021b) and still represents the state of the art in terms of low latency (considering ideal AL only). The 2022 winner (CUNI-KIT (Polák et al, 2022)) was based on the wav2vec 2.0 + mBART50 offline architecture reported in Table 1, row 4.…”
Section: Simultaneous Translation
confidence: 99%
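The "ideal AL" mentioned in the excerpt above refers to Average Lagging (Ma et al., 2019), the latency metric such comparisons rely on. Below is a minimal sketch of the token-level version of the metric, under the simplifying assumption that delays are counted in source tokens rather than milliseconds of speech.

```python
# Minimal sketch of Average Lagging (AL; Ma et al., 2019).
# delays[t-1] is g(t): the number of source tokens read before target token t
# was emitted. Token-level delays are an assumption of this sketch.

def average_lagging(delays: list[int], source_len: int, target_len: int) -> float:
    """AL = (1/tau) * sum_{t=1}^{tau} (g(t) - (t-1)/gamma),
    where gamma = target_len / source_len and tau is the first target position
    emitted after the full source has been read."""
    gamma = target_len / source_len
    tau = next((t for t, g in enumerate(delays, start=1) if g >= source_len),
               len(delays))
    lagging = [g - (t - 1) / gamma for t, g in enumerate(delays[:tau], start=1)]
    return sum(lagging) / tau


# Example: wait-3-style delays for 6 source and 5 target tokens
print(average_lagging([3, 4, 5, 6, 6], source_len=6, target_len=5))  # 2.7
```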
“…Computational costs are also inflated by the common practice of simulating the simultaneous test conditions by providing partial input during training to avoid the quality drops caused by the mismatch between training and test conditions (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021; Liu et al., 2021a; Zaidi et al., 2021, 2022). This practice is independent of the decision policy adopted, and typically requires dedicated trainings for each latency regime.…”
Section: Introduction
confidence: 99%
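The "partial input during training" practice described in this last excerpt can be illustrated with a short sketch: each training pair is truncated to a random source prefix, with the target truncated roughly in proportion, so the model sees the same incomplete contexts it will face at inference time. The sampling scheme below is an illustrative assumption, not any cited system's actual recipe.

```python
# Sketch of prefix sampling for simultaneous-translation training: truncate each
# (source, target) pair to a random source prefix and a proportional target
# prefix. The proportional truncation rule is an assumption for illustration.
import random


def sample_partial_example(src_tokens: list[str], tgt_tokens: list[str],
                           rng: random.Random | None = None):
    """Return a randomly truncated (source prefix, target prefix) pair."""
    rng = rng or random.Random()
    src_cut = rng.randint(1, len(src_tokens))          # visible source prefix
    ratio = src_cut / len(src_tokens)
    tgt_cut = max(1, round(ratio * len(tgt_tokens)))   # proportional target prefix
    return src_tokens[:src_cut], tgt_tokens[:tgt_cut]


src = "wir haben das modell gestern trainiert".split()
tgt = "we trained the model yesterday".split()
print(sample_partial_example(src, tgt, rng=random.Random(0)))
```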