2020
DOI: 10.1109/access.2020.2988781
Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Abstract: We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, tested on three datasets in different languages: EmoDB (German), IITKGP (Telugu), and SAVEE (English). The optimized learning model is unique in its ability to synthesize emotional speech of acceptable perceptual quality while preserving speaker characteristics. The multilingual model is especially beneficial in scenarios where emotional training data from a specific target speake…

Cited by 19 publications (6 citation statements)
References 67 publications
“…The results section highlights the important feature analysis conducted as part of the dataset creation and validation pipeline. The subjective Mean Opinion Score (MOS) [77] has participants evaluate random speech samples from the dataset for similarity, rated from 1 (no similarity) to 5 (exactly similar) [78], [79]. The audio signals from English and Indic languages are subjected to dynamic time warping before the feature extraction process.…”
Section: Feature Analysis and Discussion
confidence: 99%
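The statement above mentions applying dynamic time warping to audio signals before feature extraction. As a hedged illustration only (the cited papers' actual alignment settings, cost metric, and step constraints are not given here), the classic DTW recurrence over two 1-D sequences can be sketched as:

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal dynamic-time-warping cost between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # best of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# identical sequences align with zero cost
print(dtw_distance([1, 2, 3], [1, 2, 3]))  # 0.0
# a time-shifted copy also aligns with zero cost, unlike Euclidean distance
print(dtw_distance([0, 0, 1, 2], [0, 1, 2, 2]))  # 0.0
```

In practice DTW for speech is run over per-frame feature vectors (e.g. MFCCs) rather than raw samples, with a vector norm as the local cost.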
“…Speaker recognition under noisy conditions in a pipelined manner [19] is combined with computer vision techniques, using a CNN for facial recognition on videos to confirm the speaker. Speech Emotion Recognition (SER) is approached from various perspectives, incorporating techniques such as gated recurrent units (GRU) and multi-head attention [37][38][39]. These methods have been applied to the IEMOCAP and Emo-DB corpora, resulting in improved performance.…”
Section: For SER
confidence: 99%
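The excerpt above names multi-head attention as one of the SER techniques. A toy numpy sketch of scaled dot-product multi-head self-attention over a sequence of acoustic frames (random weights stand in for learned ones; the cited models' actual architectures are not reproduced here) could look like:

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Toy scaled-dot-product multi-head self-attention over a
    (seq_len, d_model) input. Projection weights are random here;
    in a trained SER model they would be learned parameters."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        # numerically stable softmax over each row
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ v[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 32))  # e.g. 20 acoustic frames, 32-dim features
out = multi_head_attention(frames, num_heads=4, rng=rng)
print(out.shape)  # (20, 32)
```

In a GRU-plus-attention SER pipeline, the frames would typically be GRU hidden states, with the attended output pooled and fed to an emotion classifier.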
“…In order to effectively describe the acoustic pattern information of speech in the time-frequency domain, model the long-term correlation of signals, and improve the naturalness of converted speech, a convolutional recurrent neural network with continuous wavelet transform is proposed in this paper. This CRNN model combines the advantages of deep neural networks with signal processing theory: signal processing methods extract acoustic features better suited to the task, while the nonlinear modeling capability of the deep neural network captures both the local characteristics of the spectrum and their long-term correlations, achieving better voice conversion performance [18].…”
Section: Model Theory
confidence: 99%
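The statement above describes a continuous wavelet transform front end feeding a CRNN. As a rough sketch of that front end only (a real-valued Morlet wavelet and these scales are illustrative choices, not the cited paper's configuration), a scalogram can be computed by convolving the signal with scaled wavelets:

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0):
    """Continuous wavelet transform with a real-valued Morlet wavelet.
    Returns a (len(scales), len(signal)) scalogram of magnitudes,
    i.e. a time-frequency representation a CRNN could consume."""
    out = np.empty((len(scales), len(signal)))
    for k, s in enumerate(scales):
        # sample the mother wavelet stretched to this scale
        t = np.arange(-4 * s, 4 * s + 1)
        psi = np.exp(-0.5 * (t / s) ** 2) * np.cos(w0 * t / s)
        psi /= np.sqrt(s)  # energy normalization across scales
        out[k] = np.abs(np.convolve(signal, psi, mode="same"))
    return out

fs = 1000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 50 * t)  # a 50 Hz tone
scalogram = morlet_cwt(sig, scales=[2, 5, 10, 20])
print(scalogram.shape)  # (4, 1000)
```

With w0 = 6, a scale s corresponds to a center frequency of roughly w0·fs/(2πs), so the 50 Hz tone responds most strongly at the largest scale here (s = 20, about 48 Hz). The resulting 2-D scalogram is what convolutional layers would scan for local spectral patterns before recurrent layers model the long-term correlations.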