“…End-to-end simultaneous speech translation (SimulST) (Fügen et al, 2007; Oda et al, 2014; Ren et al, 2020; Zeng et al, 2021; Zhang et al, 2022a) outputs the translation while receiving streaming speech input, and is widely used in real-time scenarios such as international conferences, live broadcasts, and real-time subtitles. Compared with offline speech translation, which waits for the complete speech input (Wang et al, 2020), SimulST needs to segment the streaming speech input and translate synchronously based on the speech received so far, aiming to achieve high translation quality under low latency (Hamon et al, 2009; Cho and Esipova, 2016; Ma et al, 2020b; Zhang and Feng, 2022c).…”