SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Ren, Yi; Liu, Jinglin; Tan, Xu; Zhang, Chen; Qin, Tao; Zhao, Zhou; Liu, Tie-Yan

doi:10.18653/v1/2020.acl-main.350

Cited by 141 publications

(254 citation statements)

References 19 publications

Supporting

Mentioning

251

Contrasting


“…We use a simplified version of [6] as our baseline model. While in [6] the simultaneous policy operates on word boundaries generated by a separate model, we simply use the fixed-decision module introduced by [7]. Our choice is motivated by the fact that in [7], a fixed chunk size gave similar quality-latency trade-offs to word boundaries.…”

Section: Methods (mentioning)

confidence: 99%

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Wang, Dousti, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent in generating the translation when analyzing the latency. With this assumption, a system may have good latency quality trade-offs but be inapplicable in real-time scenarios. In this paper, we focus on the task of streaming simultaneous speech translation, where the systems are not only capable of translating with partial input but are also able to handle very long or continuous input. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder, which has shown great success on the streaming automatic speech recognition task with hybrid or transducer-based models. We conduct an empirical evaluation of the proposed model on segment, context and memory sizes and we compare our approach to a transformer with a unidirectional mask.


“…Note that the above-mentioned latency metrics are all proposed for text-to-text simultaneous translation and we use AL in the text track for latency evaluation. Some work extended AP and AL to speech translation (Ren et al., 2020; Ma et al., 2020), but we don't use them because they measure real-time latency, while some submissions calling remote services contain network delay. It is unreasonable to use real-time latency metrics for both the local-running systems and remote-running systems.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%

“…For example, a pre-defined translation of a named entity can be introduced to the MT module. However, controllability is not easy to guarantee for end-to-end simultaneous translation systems (Ren et al., 2020; Ma et al., 2020). It remains a challenge to correct a translation without an intermediate ASR result.…”

Section: Applications (mentioning)

confidence: 99%

Findings of the Second Workshop on Automatic Simultaneous Translation

Zhang, Zhang, He, et al. 2021

Proceedings of the Second Workshop on Automatic Simultaneous Translation


This paper presents the results of the shared task of the 2nd Workshop on Automatic Simultaneous Translation (AutoSimTrans). The task includes two tracks, one for text-to-text translation and one for speech-to-text, requiring participants to build systems to translate from either the source text or speech into the target text. Different from traditional machine translation, the AutoSimTrans shared task evaluates not only translation quality but also latency. We propose a metric "Monotonic Optimal Sequence" (MOS) considering both quality and latency to rank the submissions. We also discuss some important open issues in simultaneous translation.


“…Simultaneous translation, the task of generating translations before reading the entire text or speech source input, has become an increasingly popular topic for both text and speech translation (Grissom II et al., 2014; Cho and Esipova, 2016; Gu et al., 2017; Alinejad et al., 2018; Arivazhagan et al., 2019; Ma et al., 2020; Ren et al., 2020). Simultaneous models are typically evaluated from both quality and latency perspectives.…”

Section: Introduction (mentioning)

confidence: 99%

“…While the translation quality is usually measured by BLEU (Papineni et al., 2002; Post, 2018), a wide variety of latency measurements have been introduced, such as Average Proportion (AP) (Cho and Esipova, 2016), Consecutive Wait length (CW) (Gu et al., 2017), Average Lagging (AL), Differentiable Average Lagging (DAL) (Cherry and Foster, 2019), and so on. Unfortunately, the latency evaluation processes across different works are not consistent: 1) the latency metric definitions are not precise enough with respect to text segmentation; 2) the definitions are also not precise enough with respect to the speech segmentation, for example some models are evaluated on speech segments (Ren et al., 2020) while others are evaluated on time duration (Ansari et al., 2020); 3) little prior work has released implementations of the decoding process and latency measurement. The lack of clarity and consistency of the latency evaluation process makes it challenging to compare different works and prevents tracking the scientific progress of this field.…”

Section: Introduction (mentioning)

confidence: 99%
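
To make the latency metrics named in the statement above concrete, here is a minimal sketch of Average Proportion (AP) and Average Lagging (AL) as they are commonly defined for text-to-text translation. It assumes delays are counted in source tokens, where `delays[t]` is the number of source tokens read before target token `t` is emitted. The function names are illustrative and do not come from any of the cited toolkits, and the speech variants (measured on speech segments or on time duration) are not covered.

```python
from typing import Sequence


def average_proportion(delays: Sequence[int], src_len: int) -> float:
    """AP: mean fraction of the source that has been read per target token."""
    if not delays or src_len <= 0:
        raise ValueError("need at least one target token and a non-empty source")
    return sum(delays) / (src_len * len(delays))


def average_lagging(delays: Sequence[int], src_len: int) -> float:
    """AL: average number of source tokens the system lags behind an ideal
    translator that emits target tokens at a uniform rate."""
    if not delays or src_len <= 0:
        raise ValueError("need at least one target token and a non-empty source")
    tgt_len = len(delays)
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first target token emitted after the whole
    # source has been read; later tokens are excluded from the average.
    tau = next(
        (t for t, g in enumerate(delays, start=1) if g >= src_len),
        tgt_len,
    )
    lag = sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1))
    return lag / tau


# Assumed example: a wait-3 schedule on a 6-token source and 6-token target.
wait3 = [3, 4, 5, 6, 6, 6]
ap = average_proportion(wait3, src_len=6)
al = average_lagging(wait3, src_len=6)
```

Under this schedule AL equals 3, matching the intuition that a wait-3 system stays three source tokens behind when source and target have equal length.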

SIMULEVAL: An Evaluation Toolkit for Simultaneous Translation

Dousti, Wang, et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations


Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario where the model starts translating before reading the complete source input. Evaluating simultaneous translation models is more complex than offline models because the latency is another factor to consider in addition to translation quality. The research community, despite its growing focus on novel modeling approaches to simultaneous translation, currently lacks a universal evaluation procedure. Therefore, we present SIMULEVAL, an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation. A server-client scheme is introduced to create a simultaneous translation scenario, where the server sends source input and receives predictions for evaluation and the client executes customized policies. Given a policy, it automatically performs simultaneous decoding and collectively reports several popular latency metrics. We also adapt latency metrics from text simultaneous translation to the speech task. Additionally, SIMULEVAL is equipped with a visualization interface to provide better understanding of the simultaneous decoding process of a system. SIMULEVAL has already been extensively used for the IWSLT 2020 shared task on simultaneous speech translation. Code will be released upon publication.
