“…Such distance metrics include the Euclidean (Tang et al., 2021a), cosine (Chuang et al., 2020), and contrastive (Ouyang et al., 2023), but they typically require some transformation of the representations, such as mean-pooling, while our approach optimizes a distance that does not alter the representation space. Methods to reduce the length discrepancy usually include sub-sampling the speech representation using convolutional length adaptors (Gállego et al., 2021; Fang et al., 2022; Zhao et al., 2022) or character/phoneme-based CTC compression (Xu et al., 2021a). Several methods have also used phonemized text in order to better match the representations in both length and content (Tang et al., 2021a, 2022; Le et al., 2023), but this potentially limits the quality of the text branch due to noise.…”
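
To make the pooling step mentioned in the excerpt concrete, the following is a minimal sketch of how a speech sequence and a text sequence of different lengths might be mean-pooled before a Euclidean or cosine distance is computed. It is not taken from any of the cited works: the function names, the sequence lengths (120 frames, 15 tokens), and the feature dimension (8) are all assumptions chosen for illustration, and random arrays stand in for real encoder outputs.

```python
import numpy as np


def mean_pool(x):
    """Collapse a (length, dim) sequence into a single (dim,) vector."""
    return x.mean(axis=0)


def euclidean_distance(a, b):
    """L2 distance between two vectors of equal dimension."""
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical stand-ins for encoder outputs; shapes are assumptions.
rng = np.random.default_rng(0)
speech = rng.normal(size=(120, 8))  # 120 speech frames, feature dim 8
text = rng.normal(size=(15, 8))     # 15 text tokens, feature dim 8

# Pooling removes the length mismatch, but it also discards all
# position-level structure in both sequences.
speech_vec = mean_pool(speech)
text_vec = mean_pool(text)

print("euclidean:", euclidean_distance(speech_vec, text_vec))
print("cosine:", cosine_distance(speech_vec, text_vec))
```

The sketch shows why pooling counts as a transformation of the representations: the two sequences only become comparable after each has been reduced to a single vector.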