Interspeech 2022
DOI: 10.21437/interspeech.2022-59

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Abstract: Data scarcity and the modality gap between speech and text are two major obstacles to end-to-end Speech Translation (ST) systems, hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and by optimizing distance metrics that bring the speech and text representations closer together. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZEROS-WOT, a method for zero-shot ST that bridges the modality gap…

Cited by 12 publications (22 citation statements) · References 63 publications
“…The inference method consists of segmentation, a bandpass filter, the Short-Time Fourier Transform (STFT), a Mel spectrogram, a neural network, and thresholding, as shown in the figure. 1) Segmentation: This process slices the audio into smaller segments to reduce the processing load and speed up the response for each segment. A good segmentation process is required because the important information in the audio is mostly not in the same part of a segment [23]. Thus, this research examines two parameters: segment duration and overlap ratio.…”
Section: B. Inference
confidence: 99%
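The segmentation step described in the statement above, slicing audio into fixed-length segments controlled by a duration and an overlap ratio, can be sketched as follows. This is a minimal illustration; the function name and default values are assumptions, not taken from the cited work.

```python
def segment_audio(samples, sample_rate, segment_duration=2.0, overlap_ratio=0.5):
    """Slice a waveform into fixed-length, overlapping segments.

    segment_duration: length of each segment in seconds (illustrative default).
    overlap_ratio: fraction of each segment shared with the next one.
    """
    seg_len = int(segment_duration * sample_rate)
    # Hop size shrinks as the overlap ratio grows.
    hop = max(1, int(seg_len * (1.0 - overlap_ratio)))
    segments = []
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + seg_len]
        if len(chunk) < seg_len:
            break  # drop the trailing partial segment (could also be padded)
        segments.append(chunk)
    return segments
```

With a 2-second window and 50% overlap, consecutive segments share half their samples, so information falling near a segment boundary still appears whole in a neighboring segment.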
“…To build models with manageable size and computation, following Radford et al (2023), we segment the merged-channel conversations into chunks of up to 30 seconds. For this step, we first used an off-the-shelf VAD-based segmentation tool, SHAS (Tsiamas et al, 2022), but we realized that the resulting duration histogram is almost uniform and far from the natural segmentation. Hence, we decided to rely on the manual time annotations as follows.…”
Section: Conversational Multi-turn and Multi-speaker ST
confidence: 99%
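The chunking described above, grouping manually time-annotated utterances into windows of at most 30 seconds, could be implemented with a greedy pass like the sketch below. The data format (a list of `(start, end)` spans in seconds) and the function name are assumptions for illustration, not the authors' code.

```python
MAX_CHUNK_SECONDS = 30.0

def group_utterances(utterances, max_len=MAX_CHUNK_SECONDS):
    """Greedily pack (start, end) utterance spans into chunks whose total
    span (first start to last end) stays within max_len seconds."""
    chunks, current = [], []
    for start, end in utterances:
        # Start a new chunk when adding this utterance would exceed the limit.
        if current and end - current[0][0] > max_len:
            chunks.append(current)
            current = []
        current.append((start, end))
    if current:
        chunks.append(current)
    return chunks
```

Unlike a uniform VAD-driven slicing, this keeps the chunk boundaries aligned with the annotated utterance boundaries, which is closer to the natural segmentation the statement describes.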
“…A common practice for translating long-form audio files is to first segment them into smaller chunks based on voice activity detection (VAD). We compare our MT-MS segmentation approach with two popular VAD-based audio segmenters, i.e., WebRTC (Blum et al, 2021) and SHAS (Tsiamas et al, 2022), on the channel-merged Fisher-CALLHOME test sets. 10 When the audio and reference translation segments are not aligned, as in the case of VAD-based segmentation, the standard process is to first concatenate translation hypotheses and then align and re-segment the conversation-level translation based on the segmented reference translation.…”
Section: MT-MS vs. VAD Segmentation
confidence: 99%
“…It only approximates the realistic setup in which the segmentation would be provided by an automatic system, e.g. Tsiamas et al (2022), and may be partially incorrect, causing more translation errors than the gold segmentation. The simultaneous mode in the Simultaneous Translation Task means that the source is provided gradually, one audio chunk at a time.…”
Section: IWSLT22 En-De Simultaneous Translation Task
confidence: 99%