CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022

Polák, Peter; Pham, Ngoc-Quan; Nguyen, Tuan Nam; Liu, Danni; Mullov, Carlos; Niehues, Jan; Bojar, Ondřej; Waibel, Alexander

doi:10.18653/v1/2022.iwslt-1.24

Cited by 6 publications

(9 citation statements)

References 60 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Specifically, [6] proposed a hold-n policy that removes the last n tokens from the model output, and local agreement that finds the longest common prefix of outputs obtained for two consecutive input contexts. Moreover, [7] showed that varying the chunk size can also be effectively applied along these policies.…”

Section: Latency-quality Trade-offmentioning

confidence: 99%

“…The advantage of the standard beam search is that the model can generate a complete translation for current speech input. However, it is also prone to overgeneration and low-quality translations toward the end of the context [7]. In ASR, the standard beam search with attentional models shown poor length generalization [27].…”

Section: Latency-quality Trade-offmentioning

confidence: 99%

“…All models are evaluated using Simuleval [19] toolkit. For the translation quality, we report detokenized case-sensitive BLEU [37], and for the latency, we report length-aware average lagging (LAAL) [7,38]. In all our experiments, we use beam search with size 6.…”

Section: Modelsmentioning

confidence: 99%

“…However, wait-k cannot directly use beam search, and its direct application to speech input is also complicated [5]. Alternatively, we can leave the model to generate the whole translation for the current context and heuristically decide which portion of the translation is reliable [6,7]. However, relying on attention can lead to over-generation and low-quality translation [7,8].…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Polák¹,

Yan²,

Watanabe³

et al. 2023

Interspeech 2023

View full text Add to dashboard Cite

Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -this scheme cannot directly show a single incremental translation to users. Further, this method lacks mechanisms for controlling the quality vs. latency tradeoff. We propose a modified incremental blockwise beam search incorporating local agreement or hold-n policies for qualitylatency control. We apply our framework to models trained for online or offline translation and demonstrate that both types can be effectively used in online mode.Experimental results on MuST-C show 0.6-3.6 BLEU improvement without changing latency or 0.8-1.4 s latency improvement without changing quality.

show abstract

Section: Latency-quality Trade-offmentioning

confidence: 99%

Section: Latency-quality Trade-offmentioning

confidence: 99%

Section: Modelsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Polák¹,

Yan²,

Watanabe³

et al. 2023

Interspeech 2023

View full text Add to dashboard Cite

show abstract

“…Therefore, we proposed an improved version of the AL metric, which was later independently proposed under name length-adaptive average lagging (LAAL; Papi et al, 2022). To remedy the over-generation problem, we proposed an improved version of the beam search algorithm in Polák et al (2023b). While this led to significant improvements in the quality-latency tradeoff, the decoding still relied on label-synchronous decoding.…”

Section: Quality-latency Tradeoff In Sstmentioning

confidence: 99%

Long-form Simultaneous Speech Translation: Thesis Proposal

Polák

2023

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacifi

View full text Add to dashboard Cite

Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-toend (E2E) systems. Nevertheless, a major limitation of most approaches to E2E SST reported in the current literature is that they assume that the source speech is pre-segmented into sentences, which is a significant obstacle for practical, real-world applications. This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting, i.e., without pre-segmentation. We present a survey of the latest advancements in E2E SST, assess the primary obstacles in SST and its relevance to long-form scenarios, and suggest approaches to tackle these challenges. * The literature on simultaneous speech translation often uses the word "streaming" as an equivalent of "simultaneous" to refer to the translation of an unfinished utterance. In other literature, however, the term "streaming" refers to input spanning several sentences. To avoid confusion, we use "simultaneous" to refer to the translation of an unfinished utterance and "long-form" to refer to input spanning several sentences. 1 We consider only the speech-to-text variant in this work.

show abstract

Subtitle Synchronization Using Whisper ASR Model

Azneed,

Sanas

et al. 2024

2024 International Conference on Power, Energy, Control and Transmission Systems (ICPECTS)

View full text Add to dashboard Cite

CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022

Cited by 6 publications

References 60 publications

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Long-form Simultaneous Speech Translation: Thesis Proposal

Subtitle Synchronization Using Whisper ASR Model

Contact Info

Product

Resources

About