A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The noisy input signal is decomposed into source exemplars, noise exemplars, and their weights, and the converted speech is then constructed from the target exemplars and the weights associated with the source exemplars. In this study, we propose a multimodal VC method that improves the noise robustness of our NMF-based approach. We introduce a combination weight between audio and visual features and formulate a new cost function for estimating audiovisual exemplars. Using joint audiovisual features as source features, conversion performance improves over our previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed through comparisons with a conventional audio-input NMF-based method and a Gaussian mixture model (GMM)-based method.
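As a rough illustration of the exemplar-based decomposition described above, the following NumPy sketch factorizes a noisy input spectrogram over a dictionary of source and noise exemplars, then reconstructs converted speech from paired target exemplars. All names, dimensions, and the random data are invented for illustration, and the plain Frobenius-norm multiplicative updates used here stand in for the paper's actual cost function (which additionally involves the audiovisual combination weight).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: D-dim spectral frames, Ks source / Kn noise exemplars, T frames
D, Ks, Kn, T = 20, 8, 4, 30
A_source = rng.random((D, Ks))      # source exemplars (from parallel training data)
A_noise = rng.random((D, Kn))       # noise exemplars (from the noisy environment)
A_target = rng.random((D, Ks))      # target exemplars, paired frame-by-frame with source
A = np.hstack([A_source, A_noise])  # combined dictionary for the noisy input

X = rng.random((D, T))              # noisy input magnitude spectrogram

# Estimate non-negative activation weights H with standard multiplicative
# updates that decrease ||X - A H||_F (a generic NMF surrogate objective).
H = rng.random((Ks + Kn, T))
for _ in range(200):
    H *= (A.T @ X) / (A.T @ (A @ H) + 1e-9)

# Converted speech: target exemplars weighted by the source-exemplar
# activations only; the noise-exemplar activations are discarded.
Y = A_target @ H[:Ks]
```

The noise exemplars absorb the environmental component of the input, so dropping their activations at reconstruction time is what gives the method its noise robustness.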