ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413391

Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

Abstract: Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which…

Cited by 109 publications (55 citation statements)
References 26 publications
“…We use a multi-speaker emotional speech dataset, ESD [14], to conduct all the experiments. ESD consists of multi-lingual and multi-speaker parallel emotional speech data with five emotions (neutral, happy, sad, angry and surprise), and has been used in emotional voice conversion [49,53] and emotional text-to-speech [54].…”
Section: Methods | mentioning
confidence: 99%
“…Many studies [46,47] have shown that the deep features learned by DNN are more effective and thus more suitable for SER. Meanwhile, recent speech synthesis studies [48,49] also propose to leverage those deep emotional features to characterize different emotional styles over a continuum [50]. These successful attempts have served as the source of motivation for this paper.…”
Section: Speaker-dependent Emotional Style | mentioning
confidence: 99%
“…We first give a comprehensive overview of recent studies on emotional voice conversion in Section 2. We discuss the 15 existing databases [68] in Section 3. We then formulate the design of a novel ESD database for speaker-independent emotional voice conversion, that is also suitable for other speech synthesis tasks, such as mono-lingual or cross-lingual speaker voice conversion and emotional text-to-speech.…”
Section: Speaker a (Happy) | mentioning
confidence: 99%
“…The (Kun et al, 2021). The RAVDESS corpus is an audio-visual database of emotional speech and songs.…”
Section: Formation Of An Emotional Corpus Of Speech Signals | mentioning
confidence: 99%