“…(1) Parallel corpus of different accents of the same speaker using source and target speech content and time alignment (Finkelstein et al, 2022; Liu et al, 2022; Hida et al, 2022; Toda et al, 2007; Oyamada et al, 2017). (2) Non-parallel corpus of multiple speakers with multiple accents using inconsistent source and target speech content (Wang et al, 2021; Zhao et al, 2018, 2019; Kaneko et al, 2019, 2020a, 2021). Finkelstein et al (2022) used a multi-stage trained TTS model to achieve transfer of North American, Australian, and British accents, and used a CHiVE-BERT pre-training model to enhance the audio effect of accent generation.…”