“…Recently, there has been much interest in end-to-end speech translation (ST) models [1,2,3,4,5,6,7], which, compared to traditional cascaded models, are simpler and computationally more efficient, can preserve more acoustic information such as prosody, and can avoid propagating errors from the speech recognition component. Large amounts of annotated data are usually required to achieve good performance with such systems, but supervised training data for ST remain very limited.…”