Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021) 2021
DOI: 10.18653/v1/2021.iwslt-1.10
ESPnet-ST IWSLT 2021 Offline Speech Translation System

Abstract: This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture. […]
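To make the multi-referenced SeqKD recipe concrete, below is a minimal sketch of the data-construction step, assuming hypothetical teacher decoding functions (the submission trains teacher MT models on different bitext sizes and decodes them to produce pseudo-references; this is an illustration of the general technique, not the authors' code):

```python
from typing import Callable, Iterable

# A teacher maps a source transcript to one pseudo-reference translation.
# In the paper's setting these are MT models trained on different amounts
# of bitext; here they are left abstract (hypothetical).
Teacher = Callable[[str], str]

def build_seqkd_dataset(
    examples: Iterable[tuple[str, str]],  # (audio_path, source_transcript)
    teachers: list[Teacher],
) -> list[tuple[str, str]]:
    """Multi-referenced SeqKD: duplicate each utterance once per teacher,
    pairing every copy with that teacher's pseudo-translation as target."""
    distilled = []
    for audio_path, src in examples:
        for teacher in teachers:
            distilled.append((audio_path, teacher(src)))
    return distilled
```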

Cited by 12 publications (7 citation statements) | References 29 publications
“…This is usually sub-optimal as speakers place pauses inside sentences, not necessarily between them (e.g., hesitations before words with high information content, Goldman-Eisler, 1958). To this end, researchers tried considering not only the presence of speech but also its length (Potapczyk and Przybysz, 2020; Inaguma et al., 2021). Later studies tried to avoid VAD and focused on more linguistically-motivated approaches, e.g., using ASR CTC to predict voiced regions (Gállego et al., 2021) or directly modeling the sentence segmentation (Tsiamas et al., 2022b; Fukuda et al., 2022).…”
Section: Long-form Offline ST
confidence: 99%
“…Merging the short segments helps the ST model utilize the context information. So we follow the algorithm in Inaguma et al. (2021) to merge the short segments after the segmentation.…”
Section: Speech Segmentation
confidence: 99%
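A minimal sketch of one such merging heuristic is given below; the parameter names follow the citing paper (a maximum merged duration and a maximum inter-segment gap), but the exact algorithm and thresholds of Inaguma et al. (2021) may differ:

```python
def merge_short_segments(segments, m_dur=20.0, m_int=1.0):
    """Greedily merge adjacent VAD segments.

    segments : list of (start, end) times in seconds, sorted by start.
    m_dur    : maximum duration of a merged segment (seconds).
    m_int    : maximum silence gap allowed between merged segments.
    """
    merged = []
    for start, end in segments:
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            if gap <= m_int and (end - prev_start) <= m_dur:
                merged[-1] = (prev_start, end)  # extend previous segment
                continue
        merged.append((start, end))
    return merged

# Example: two close segments merge, a distant one stays separate.
print(merge_short_segments([(0.0, 2.5), (2.8, 5.0), (9.0, 11.0)]))
# -> [(0.0, 5.0), (9.0, 11.0)]
```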
“…The pyannote toolkit improves the performance significantly compared to the given segmentation. The merge algorithm from Inaguma et al. (2021) further decreases the WER. We adjust two parameters of the merge algorithm, M_dur and M_int.…”
Section: Cascade Speech Translation
confidence: 99%
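For reference, a minimal pyannote.audio voice-activity-detection pipeline looks roughly like this (API as in pyannote.audio 2.x; the pretrained model name and threshold values are illustrative assumptions, not the citing system's actual configuration):

```python
from pyannote.audio.pipelines import VoiceActivityDetection

# Illustrative configuration; not the submission's actual settings.
pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
pipeline.instantiate({
    "onset": 0.5,             # speech-start decision threshold
    "offset": 0.5,            # speech-end decision threshold
    "min_duration_on": 0.0,   # drop speech regions shorter than this
    "min_duration_off": 0.0,  # fill silence gaps shorter than this
})

vad = pipeline("talk.wav")    # returns a pyannote.core.Annotation
segments = [(seg.start, seg.end) for seg in vad.get_timeline()]
```

The resulting (start, end) list is exactly the input expected by a merging heuristic such as the one sketched above.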
“…In recent studies, many speech segmentation methods based on VAD have been proposed for ST. Gaido et al. [12] and Inaguma et al. [13] used the heuristic concatenation of VAD segments up to a fixed length to address the over-segmentation problem. Gállego et al. [14] used a pre-trained ASR model, wav2vec 2.0 [15], for silence detection.…”
Section: Related Work
confidence: 99%
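As a rough sketch of the CTC-based idea, one can run a pretrained wav2vec 2.0 CTC model and treat long runs of blank predictions as silence. The model name and the ~20 ms frame stride below are assumptions for the base checkpoint, and this illustrates the general approach rather than Gállego et al.'s exact method:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative model choice; Gállego et al. (2021) describe their own setup.
name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

def blank_frames(audio, sr=16000):
    """Return a per-frame boolean mask: True where CTC predicts blank.

    Each output frame covers roughly 20 ms of 16 kHz audio for this model,
    so long runs of True frames can be mapped back to candidate silence
    regions (short blank runs also occur inside speech, between tokens).
    """
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    pred = logits.argmax(dim=-1).squeeze(0)
    return pred == processor.tokenizer.pad_token_id  # CTC blank == pad id
```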