2022
DOI: 10.48550/arxiv.2202.04774
Preprint

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Abstract: Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a meth…

Cited by 4 publications (9 citation statements)
References 13 publications
“…Recently, Tsiamas et al (2022) presented a novel Supervised Hybrid Audio Segmentation (SHAS) with excellent results in limiting the translation quality drop. SHAS adopts a probabilistic version of the Divide-and-Conquer algorithm by Potapczyk and Przybysz (2020) that progressively splits the audio at the frame with highest probability of being a splitting point until all segments are below a specified length.…”
Section: Data Filtering
Confidence: 99%
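
To make the splitting procedure concrete, here is a minimal Python sketch of a probabilistic Divide-and-Conquer split of the kind the quoted passage describes. The per-frame probability array `probs`, the `max_len` bound, and the edge `margin` are illustrative assumptions; the actual SHAS implementation includes further constraints (e.g. minimum segment lengths) not shown here.

import numpy as np

def split_recursive(probs, start, end, max_len, segments):
    # Recursively split [start, end) at the frame with the highest
    # probability of being a splitting point, until every segment
    # is at most max_len frames long.
    if end - start <= max_len:
        segments.append((start, end))
        return
    # Keep a small margin so cuts never fall on a segment boundary
    # (the margin size is an illustrative choice, not from the paper).
    margin = max(1, min(max_len // 10, (end - start) // 4))
    window = probs[start + margin : end - margin]
    cut = start + margin + int(np.argmax(window))
    split_recursive(probs, start, cut, max_len, segments)
    split_recursive(probs, cut, end, max_len, segments)

def pdac(probs, max_len):
    # probs: per-frame probabilities of being a splitting point.
    segments = []
    split_recursive(probs, 0, len(probs), max_len, segments)
    return segments

# Example: 60 s of audio at 50 frames/s, 20 s maximum segment length.
rng = np.random.default_rng(0)
print(pdac(rng.random(3000), max_len=1000))

Each recursive call strictly shrinks the segment, so the procedure terminates with every segment at or below the length bound.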
“…In the context of this competition, however, these limitations do not represent a significant issue. Tsiamas et al (2022) compare SHAS with previous segmentation methods only using models trained on well-formed sentence-utterance pairs. In this work, we validate their findings also on models fine-tuned on randomly segmented data to check: i) whether this fine-tuning brings benefits also with audio segmented with SHAS, and ii) whether the gap between SHAS and other segmentation is closed or not by the fine-tuning.…”
Section: Data Filtering
Confidence: 99%
“…As pointed out in (Tsiamas et al, 2022), the quality of audio segmentation has a big impact on the performance of the speech translation models, which are trained on utterances corresponding to full sentences, often manually aligned, and this rarely happens with an automatic segmentation system.…”
Section: Speech Segmentation
Confidence: 99%
“…• The audio segmentation component is changed into a full neural-based solution combined with pretraining (Tsiamas et al, 2022). The new solution is not only more accurate, but also directly optimized on TED Talks giving the translation model more precise and complete segmentations compared to the generic voice activity detectors.…”
Section: Introduction
Confidence: 99%
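
For illustration, a neural frame-level segmenter of the kind this statement describes could look like the following sketch: a pretrained speech encoder with a small classification head that outputs a per-frame probability of being a good splitting point. The encoder choice (facebook/wav2vec2-base via HuggingFace transformers) and the single-layer head are our assumptions for the sketch, not the exact SHAS architecture.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class FrameSegmenter(nn.Module):
    # Pretrained speech encoder + per-frame binary head (illustrative).
    def __init__(self, encoder_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state  # (B, T, H)
        return torch.sigmoid(self.head(frames)).squeeze(-1)  # (B, T)

model = FrameSegmenter()
with torch.no_grad():
    probs = model(torch.randn(1, 16000))  # 1 second of dummy audio
print(probs.shape)  # per-frame split probabilities, roughly 50 frames/s

These per-frame probabilities are exactly the kind of input the Divide-and-Conquer splitting sketched earlier consumes.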