“…To this end, researchers considered not only the presence of speech but also its length (Potapczyk and Przybysz, 2020; Inaguma et al., 2021). Later studies avoided VAD and focused on more linguistically motivated approaches, e.g., using ASR CTC to predict voiced regions (Gállego et al., 2021) or directly modeling the sentence segmentation (Tsiamas et al., 2022b; Fukuda et al., 2022).…”