Method | Model | Adaptation Setting | Language
Teacher-Student, hard and soft labels [23], [25], [26] | Conformer RNN-T [27], Transformer CTC, RNN-T [19] | News speech, Voice search, Far-field, Telephony, YouTube | English
Teacher-Student, soft labels [4], [5] | TDNN-LSTM [28] | Noise, Far-field | English
Teacher-Student, hard and soft labels [29] | NiN-CNN [30] | Dialects, Children speech | Japanese
Teacher-Student, soft labels [31] | Streaming RNN-T [32] | Multilingual | English, Brazilian Portuguese, Russian, Turkish, Nordic/Germanic
Domain Adversarial Training [6], [33], [34] | TDNN (Kaldi) [35], [36], DNN-HMM | Noise, Channel | English
Domain Adversarial Training [37] | RNN-CTC [38] | Far-field | English
[8], [39… | … | … | …

1) Inspired by recent advances in UDA for Natural Language Processing systems [45], we propose a fine-tuning strategy for speech models in which the self-supervised objective is based on a contrastive loss, described in Section III. Contrary to prior works, which leverage only in-domain self-supervision, we find that in this contrastive setting in-domain-only self-supervision leads to mode collapse of the latent representations, and that mixed source- and target-domain self-supervision is essential.
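To make this contribution concrete, the sketch below illustrates the core idea: a contrastive (InfoNCE-style) self-supervised loss over masked time steps, computed on batches that mix source- and target-domain audio rather than target-domain audio alone. This is a minimal pure-PyTorch sketch, not the paper's implementation: the names TinySSLEncoder, contrastive_loss, and mixed_domain_ssl_step are illustrative, the encoder is a toy stand-in for a pretrained model such as wav2vec 2.0, and the quantization module and the supervised ASR loss of the full fine-tuning recipe are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySSLEncoder(nn.Module):
    """Illustrative stand-in for a pretrained speech encoder (wav2vec 2.0-style):
    a convolutional feature extractor followed by a transformer context network."""
    def __init__(self, dim=64):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.mask_emb = nn.Parameter(torch.randn(dim))      # learned mask token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, wav, mask_prob=0.2):                  # wav: (B, T) raw audio
        z = self.feature_extractor(wav.unsqueeze(1)).transpose(1, 2)  # (B, T', D)
        mask = torch.rand(z.shape[:2], device=z.device) < mask_prob
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(z), z)
        return z, self.context(x), mask                     # targets, contexts, mask

def contrastive_loss(contexts, targets, mask, temperature=0.1):
    """InfoNCE over masked steps: each masked context vector must pick out its own
    time step among all latents of the same utterance (which act as negatives)."""
    losses = []
    for c, z, m in zip(contexts, targets, mask):
        idx = m.nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        q = F.normalize(c[idx], dim=-1)                     # (M, D) masked contexts
        k = F.normalize(z, dim=-1)                          # (T', D) candidate targets
        losses.append(F.cross_entropy(q @ k.t() / temperature, idx))
    return torch.stack(losses).mean()

def mixed_domain_ssl_step(model, source_wavs, target_wavs):
    """The key point of the strategy: the contrastive loss is computed on a single
    batch mixing source- and target-domain audio; target-only self-supervision was
    found to drive the latent representations toward mode collapse."""
    wavs = torch.cat([source_wavs, target_wavs], dim=0)
    z, c, mask = model(wavs)
    return contrastive_loss(c, z.detach(), mask)            # quantizer omitted here

# Dummy usage with 0.25 s of random "audio" per utterance.
model = TinySSLEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
src, tgt = torch.randn(4, 4000), torch.randn(4, 4000)
loss = mixed_domain_ssl_step(model, src, tgt)
loss.backward()
opt.step()
```

Note that the only change relative to standard in-domain self-supervised fine-tuning is the torch.cat that mixes both domains into each batch; everything else is the usual masked contrastive objective.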