An Empirical Study of End-To-End Simultaneous Speech Translation Decoding Strategies

Nguyen, Ha Thanh; Estève, Yannick; Besacier, Laurent

doi:10.1109/icassp39728.2021.9414276

Cited by 20 publications

(17 citation statements)

References 12 publications

“…Also Zeng et al (2021) integrate the beam search in the decoding strategy, developing the wait-k-stride-N strategy. In particular, the authors bypass output speculation by directly applying beam search, after waiting for k words, on a word stride of size N (i.e., on N words at a time) instead of one single word as prescribed by the standard wait-k. Nguyen et al (2021a) analyzed several decoding strategies relying on different output token granularities, such as characters and Byte Pair Encoding (BPE), showing that the latter yields lower latency.…”

Section: Encoding (mentioning)

confidence: 99%
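
The wait-k-stride-N schedule described in the statement above can be sketched as follows. This is a minimal illustration of one plausible reading of that description (wait for k source words, then alternate between writing N target words and reading N more source words), not the implementation of Zeng et al. The function name `wait_k_stride_n` and its parameters are hypothetical, and the beam search that would run inside each stride is omitted.

```python
def wait_k_stride_n(num_source, num_target, k, n):
    """Return the (source_words_read, target_words_written) state after each write step.

    Assumed policy (hypothetical sketch): read k source words first, then
    repeatedly write n target words and read n more source words. Once the
    source is exhausted, all remaining target words are written at once.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")

    schedule = []
    read = min(k, num_source)   # initial wait of k source words
    written = 0
    while written < num_target:
        if read >= num_source:
            # Full source available: flush the rest of the translation.
            emit = num_target - written
        else:
            # Partial source: emit one stride of at most n words.
            emit = min(n, num_target - written)
        written += emit
        schedule.append((read, written))
        if written < num_target:
            read = min(read + n, num_source)  # read the next stride
    return schedule


# With n = 1 the schedule reduces to the standard wait-k policy.
print(wait_k_stride_n(num_source=6, num_target=6, k=2, n=2))
print(wait_k_stride_n(num_source=4, num_target=4, k=2, n=1))
```

A larger N gives the decoder more words to search over per step, at the cost of reading further ahead before each write.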

“…An alternative approach to simultaneous training is the offline (or full-sentence) training of the system and its subsequent use as a simultaneous one. Nguyen et al (2021a) explored this solution with an LSTM-based direct ST system, analyzing the effectiveness of different decoding strategies. Interestingly, the offline approach does not only preserve overall performance despite the switch of modality, it also improves system's ability to generate well-formed sentences.…”

Section: Encoding (mentioning)

confidence: 99%

“…This small increase in latency, however, allows the model to perform beam search on the stride, which has been shown to be effective in improving translation quality (Sutskever et al, 2014). Decoding more than one word at a time is the approach also employed by Nguyen et al (2021a), who showed that emitting two words increases the quality of the translation without any relevant impact on latency. Another way of applying the wait-k strategy was proposed by Chen et al (2021), where a streaming ASR system is used to guide the direct ST decoding.…”

Section: Another Point Of View (mentioning)

confidence: 99%

Visualization: the missing factor in Simultaneous Speech Translation

Papi, Negri, Turchi

2021

Preprint

Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users' access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems' effectiveness. To this end, we raise the need for a broader performance analysis, also including the user experience standpoint. We argue that SimulST systems, indeed, should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted. In light of this, we highlight which are the goals achieved by the community and what is still missing.

“…The benefits of training a system in similar conditions to the inference setting have been given for granted so far. Although in the literature there are works employing models trained in offline, this has always been motivated by computational limits (Nguyen et al, 2021;. Being aware of the social and environmental impact caused by the high computational costs of the SimulST models (Schwartz et al, 2020), in this work we question this standard approach and ask: Does simultaneous speech translation actually need a simultaneous-trained model?…”

Section: Introduction (mentioning)

confidence: 99%

Does Simultaneous Speech Translation need Simultaneous Models?

Papi, Gaido, Negri et al.

2022

Preprint

In simultaneous speech translation (SimulST), finding the best trade-off between high translation quality and low latency is a challenging task. To meet the latency constraints posed by the different application scenarios, multiple dedicated SimulST models are usually trained and maintained, generating high computational costs. In this paper, motivated by the increased social and environmental impact caused by these costs, we investigate whether a single model trained offline can serve not only the offline but also the simultaneous task without the need for any additional training or adaptation. Experiments on en→{de, es} indicate that, aside from facilitating the adoption of well-established offline techniques and architectures without affecting latency, the offline solution achieves similar or better translation quality compared to the same model trained in simultaneous settings, as well as being competitive with the SimulST state of the art.

“…Most of the previous work used fixed policies. Some of them take fixed-length policy (Nguyen et al, 2021; Ma et al, 2020b) that splits speech at a fixed frequency, for example, to generate one target word every T_s ms (Figure 1 (a)). Other work adopts word-based policy that splits the speech into words and generates one target word whenever a new source word is detected, which calls for an auxiliary source word detector (Ren et al, 2020; Elbayad et al, 2020; Ma et al, 2020b; Zeng et al, 2021; Chen et al, 2021), see Figure 1 (b).…”

Section: Introduction (mentioning)

confidence: 99%
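
The fixed-length policy mentioned in the statement above can be illustrated with a small sketch. It only computes the time points at which a decoder would be asked to emit a target word, assuming one word per fixed time step. The function name `fixed_length_boundaries` and the parameters `duration_ms` and `step_ms` are hypothetical and are not taken from any of the cited systems.

```python
def fixed_length_boundaries(duration_ms, step_ms):
    """Return the times (in ms) at which a fixed-length policy triggers a write.

    Assumption (illustrative only): one write is triggered every step_ms
    milliseconds of incoming speech, plus a final write at the end of the
    utterance if the last step does not land exactly on it.
    """
    if duration_ms <= 0 or step_ms <= 0:
        raise ValueError("duration_ms and step_ms must be positive")

    boundaries = list(range(step_ms, duration_ms + 1, step_ms))
    if not boundaries or boundaries[-1] != duration_ms:
        boundaries.append(duration_ms)  # final flush at end of speech
    return boundaries


print(fixed_length_boundaries(duration_ms=1000, step_ms=280))
```

Because the boundaries depend only on elapsed time, they can fall in the middle of a spoken word, which is the limitation the word-based and adaptive policies in the same statement try to address.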

Learning Adaptive Segmentation Policy for End-to-End Simultaneous Translation

Zhang, He, Wang et al.

2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

End-to-end simultaneous speech-to-text translation aims to directly perform translation from streaming source speech to target text with high translation quality and low latency. A typical simultaneous translation (ST) system consists of a speech translation model and a policy module, which determines when to wait and when to translate. Thus the policy is crucial to balance translation quality and latency. Conventional methods usually adopt fixed policies, e.g. segmenting the source speech with a fixed length and generating translation. However, this method ignores contextual information and suffers from low translation quality. This paper proposes an adaptive segmentation policy for end-to-end ST. Inspired by human interpreters, the policy learns to segment the source streaming speech into meaningful units by considering both acoustic features and translation history, maintaining consistency between the segmentation and translation. Experimental results on English-German and Chinese-English show that our method achieves a good accuracy-latency trade-off over recently proposed state-of-the-art methods.
