“…So far, the adoption of direct ST architectures to address the automatic subtitling task has only been explored in (Papi et al, 2022a). Indeed, all previous works on the topic (Piperidis et al, 2004; Melero et al, 2006; Matusov et al, 2019; Koponen et al, 2020; Bojar et al, 2021) rely on cascade architectures that typically involve an ASR component to transcribe the input speech, a subtitle segmenter that splits the transcripts into subtitles, a timestamp estimator that predicts the start and end times of each subtitle, and an MT model that translates the subtitle transcripts. Cascade architectures, however, cannot access information contained in the speech, such as prosody, which related works have shown to be an important source of information for segmentation into subtitles (Öktem et al, 2019; Virkar et al, 2021; Tam et al, 2022).…”