Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Vekkot, Susmitha; Gupta, Deepa; Zakariah, Mohammed; Alotaibi, Yousef Ajami

doi:10.1109/access.2020.2988781

Cited by 19 publications

(6 citation statements)

References 67 publications

“…The results section highlights the important feature analysis conducted as part of the dataset creation and validation pipeline. [Table fragment: Subjective; Mean Opinion Score (MOS) [77]; participants evaluate random speech samples from dataset for similarity; ranges from 1-5, 1-No similarity, 5-Exactly similar [78], [79].] The audio signals from English and Indic languages are subjected to dynamic time warping before the feature extraction process.…”

Section: Feature Analysis and Discussion (mentioning)

confidence: 99%

Dementia Speech Dataset Creation and Analysis in Indic Languages—A Pilot Study

Vekkot, Prakash, Reddy et al. 2023

IEEE Access

Self Cite


The paper describes the creation, analysis and validation of a multilingual dementia speech dataset for Indic languages. Three popular Indian languages, viz. Telugu, Tamil and Hindi, are considered for the pilot study. Dementia and the associated Alzheimer's disease affect a large section of the Asian population. Though there are promising studies in dementia detection focussed on Western populations, the absence of a clinical dementia dataset for Indian languages forms the primary motivation for this study. This pilot study aims to overcome the challenges associated with data collection and validation in a clinical setting and to deal with situations in which clinical data is not readily available. The Indic dementia dataset is an enacted, non-clinical dataset created from manual translations of the benchmark clinical English DementiaBank dataset. The dataset is validated using features extracted from the benchmark. The feature evaluation revealed a similarity of 92.6% for silences, 92% for mean pitch (Hz), 84.7% for jitter and 90.3% for shimmer. Subjective evaluation was also conducted based on clarity and similarity of utterances with DementiaBank data. An average MOS of 3.9 for clarity of speech and 3.76 for similarity with respect to DementiaBank was obtained across all three languages. A baseline classification using a state-of-the-art deep network architecture gave a maximum of 78% accuracy in dementia detection using the Indic dementia dataset. The pilot experimentation in this work gives promising insights into the development of a multilingual dataset for analysis of clinical speech patterns in early dementia in the Indian population.
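The citation statement quoted for this paper says that audio signals are subjected to dynamic time warping (DTW) before feature extraction. As a rough illustration only, the textbook DTW recurrence on two 1-D sequences can be sketched as below. The function name `dtw_distance`, the absolute-difference local cost and the toy inputs are assumptions made for this sketch; the paper's actual alignment features and settings are not given here.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW cost between two 1-D sequences.

    Hypothetical helper for illustration; not the cited paper's pipeline.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference (an assumption of this sketch)
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Toy example: the second sequence repeats one sample of the first.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW allows one sample to align with several, a sequence and a time-stretched copy of it can have zero alignment cost, which is why it is commonly used to compare utterances of different durations.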

“…Speaker recognition under noisy conditions using pipelined manners [19], is subject to computer vision techniques using CNN for confirming the speaker using facial recognition on the videos. Speech Emotion Recognition (SER) is approached from various perspectives, incorporating techniques such as gated recurrent unit (GRU) and multi-head attention [37][38][39]. These methods have been applied to the IEMOCAP and Emo-DB corpora, resulting in improved performance.…”

Section: For SER (mentioning)

confidence: 99%

A New Approach for Speech Emotion Recognition Using Single Layered Convolutional Neural Network

J, V Vinoth Kumar, Khan et al. 2024

MJCS


Creating a computational device to identify human emotions via voice analysis represents a notable achievement in human-computer interaction, especially within the healthcare domain. We propose a new lightweight model for addressing the challenges of emotion recognition. The model is based on a CNN with modified kernel processing. It performs direct matching to recognize speech emotions across eight categories, using a statistical model, Analysis of Variance (ANOVA), as the kernel for feature extraction and Cosine Similarity Measurement (CSM) as the activation function of the CNN. The proposed model contains eight-folded single-layered intermediate neurons, and each neuron can segregate a speech emotion pattern using CSM from the voice convergence matrix to explore a part of the solution from the whole solution. Experimental results demonstrate that the proposed model outperforms existing multi-layered CNN methods in identifying the emotional state of a speaker.
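To make the idea of matching by cosine similarity concrete, the following is a minimal sketch of nearest-template classification. It is not the authors' model: the ANOVA kernel and the CNN structure are omitted, and the function names, the template dictionary and the 3-dimensional feature vectors are all hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)


def classify_by_template(features, templates):
    """Return the label whose template is most cosine-similar to `features`."""
    return max(templates,
               key=lambda label: cosine_similarity(features, templates[label]))


# Hypothetical templates for two of the eight emotion categories.
templates = {"happy": [0.9, 0.1, 0.3], "sad": [0.1, 0.8, 0.2]}
print(classify_by_template([0.8, 0.2, 0.25], templates))
```

Cosine similarity depends only on the direction of the vectors, so scaling a feature vector by a positive constant does not change the match.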


“…In order to effectively describe the acoustic pattern information of speech in the time-frequency domain, model the long-term correlation of signals, and improve the naturalness of translated speech, a convolutional recurrent neural network with continuous wavelet transform is proposed in this paper. This CRNN model combined with the advantages of neural network, signal processing theory, and depth can use signal processing methods to obtain more suitable for the acoustic characteristics of the task and to make full use of the depth of the neural network nonlinear description ability to the words the local characteristics of spectrum and long correlation model, so as to achieve better performance of discourse transformation [18].…”

Section: Model Theory (mentioning)

confidence: 99%

Research on Discourse Transfer Analysis Based on Deep Learning of Cross-language Transfer

Shen 2022

Scientific Programming


As exchange and communication between countries become more and more frequent, language conversion between them has become a difficult problem. Analyzing the problems in cross-language discourse conversion, and studying the discourse conversion path and innovation motivation based on the deep learning theory of cross-language transfer, has theoretical and practical significance. This paper addresses a technical difficulty in speech conversion methods: effectively utilizing the local pattern information of the signal's time-frequency spectrum and the long-term correlation of the speech signal. A discourse conversion method based on a convolutional recurrent neural network model is proposed. In the model, the extended convolutional neural network is used to model the long-term correlation of speech signals. For fundamental frequency estimation, the prosodic information generated by decomposing the fundamental frequency with the continuous wavelet transform is used as the training target of the estimation model. The experimental results show that speech converted by the proposed convolutional recurrent network model has better quality and intelligibility than speech converted by the comparison method.

