TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion

Park, Hyun Joon; Yang, Suk Woo; Kim, Jin Sob; Shin, Woo-Seok; Han, Sung Won

doi:10.1109/icassp49357.2023.10096642

Cited by 4 publications

References 15 publications

WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion

Ko, Kim, et al. 2024. Neural Process Lett

Voice conversion (VC) is the task of changing a source speaker's speech to a target voice while preserving the linguistic information of the source speech. Existing VC methods typically use mel-spectrograms as both input and output, so a separate vocoder is required to transform the mel-spectrogram into a waveform. As a result, VC performance depends on vocoder performance, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose WaveVC, a raw audio voice conversion method that is consistent in speech and fundamental frequency (F0). Unlike other methods, WaveVC does not require a separate vocoder and performs VC directly on the raw audio waveform using 1D convolution. This eliminates the performance degradation caused by the vocoder's train-test mismatch. In the training phase, WaveVC employs a speech loss and an F0 loss, using pre-trained networks, to preserve the content of the source speech and generate F0-consistent speech. In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector so that the converted speech follows the F0 contour of the source speech. WaveVC achieves higher performance than baseline methods in both many-to-many and any-to-any VC. The converted samples are available online.
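
The test-phase step described in the abstract, concatenating the source F0 feature with the content embedding, can be illustrated with a minimal sketch. This is not the WaveVC implementation: the function names `log_f0` and `concat_f0`, the log scaling of F0, the treatment of unvoiced frames as 0.0, and the assumption that F0 and content are aligned frame by frame are all hypothetical choices made here for illustration.

```python
import math

def log_f0(f0_hz):
    """Map an F0 value in Hz to log scale (assumed); unvoiced frames (0 Hz) map to 0.0."""
    return math.log(f0_hz) if f0_hz > 0 else 0.0

def concat_f0(content_frames, f0_track):
    """Append one F0 feature to each frame's content embedding (hypothetical helper)."""
    if len(content_frames) != len(f0_track):
        raise ValueError("content and F0 must have the same number of frames")
    return [list(frame) + [log_f0(f0)]
            for frame, f0 in zip(content_frames, f0_track)]

# Hypothetical toy input: 3 frames, each with a 2-dimensional content embedding.
content = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
f0 = [220.0, 0.0, 110.0]  # Hz; 0.0 marks an unvoiced frame

# Each frame now carries its content embedding plus one F0 feature.
decoder_input = concat_f0(content, f0)
```

In this sketch each frame grows by one dimension, so whatever consumes `decoder_input` sees the source pitch alongside the content representation. How WaveVC actually extracts, scales, and aligns F0 is not stated in the abstract.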
