“…They improved ASR performance through voice activity detection for segmentation (Zhang et al., 2022; Ding and Tao, 2021), training the ASR on synthetic data with added punctuation, noise filtering, and domain-specific fine-tuning (Zhang and Ao, 2022; Li et al., 2022), or adding an intermediate model that cleans the ASR output with respect to casing and punctuation (Nguyen et al., 2021). The MT components were mostly Transformer-based (Zhang et al., 2022; Nguyen et al., 2021; Bahar et al., 2021) or fine-tuned from pre-existing models (Zhang and Ao, 2022). Additional methods used to improve MT performance were multi-task learning (Denisov et al., 2021), back-translation (Ding and Tao, 2021; Zhang et al., 2022; Zhang and Ao, 2022), domain adaptation (Nguyen et al., 2021; Zhang et al., 2022), knowledge distillation (Zhang et al., 2022), making the MT component robust to ASR errors by training it on noisy ASR output (Nguyen et al., 2021; Zhang et al., 2022; Zhang and Ao, 2022), and re-ranking and de-noising techniques (Ding and Tao, 2021).…”
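To make the segmentation step concrete: the cited systems use voice activity detection to cut a long audio stream into utterance-sized chunks before passing them to the ASR component. The following is a minimal, purely illustrative sketch of one simple VAD approach (frame-energy thresholding), not the method of any of the cited papers; the function name, threshold, and frame length are assumptions chosen for the example.

```python
# Hypothetical sketch: energy-based voice activity detection (VAD) that
# splits an audio signal into speech segments for downstream ASR.
# Frames whose mean energy exceeds a threshold count as speech; runs of
# consecutive speech frames become (start, end) sample-index segments.
import numpy as np

def vad_segments(samples, sr=16000, frame_ms=25, threshold=0.01):
    """Return (start, end) sample indices of detected speech regions."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)   # mean energy per frame
    speech = energy > threshold           # boolean speech mask per frame
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i * frame_len         # segment opens on first speech frame
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:                 # close a segment running to the end
        segments.append((start, n_frames * frame_len))
    return segments

# Toy signal: silence, a 220 Hz tone standing in for speech, then silence.
sr = 16000
sig = np.zeros(sr)
sig[4000:8000] = 0.5 * np.sin(2 * np.pi * 220 * np.arange(4000) / sr)
print(vad_segments(sig, sr))
```

Production systems would typically use a trained VAD model and add hangover smoothing so short pauses do not split an utterance, but the segmentation principle is the same.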