EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Cui, Chenye; Ren, Yi; Liu, Jinglin; Chen, Feiyang; Huang, Rongjie; Lei, Ming; Zhao, Zhou

doi:10.48550/arxiv.2106.09317

Cited by 3 publications

(3 citation statements)

References 44 publications


“…Environmental sound categorization research is scarce compared to other machine learning sound- and image-processing challenges. Recent works have demonstrated this concept's utility in a variety of contexts, such as virtual assistants [22], automatic voice recognition [23], and text-to-speech applications [24]. The classifier employed in these investigations divides these recent works into two kinds.…”

Section: Related Work

mentioning

confidence: 99%

Environmental Sound Classification Based on Transfer-Learning Techniques with Multiple Optimizers

et al. 2022


The last decade has seen increased interest in environmental sound classification (ESC) due to the increased complexity and rich information of ambient sounds. The state-of-the-art methods for ESC are based on transfer learning paradigms that often utilize learned representations from common image-classification problems. This paper aims to determine the effectiveness of employing pre-trained convolutional neural networks (CNNs) for audio categorization and the feasibility of retraining. This study investigated various hyperparameters, such as the learning rate and the number of epochs, and the Adam, Adamax, and RMSprop optimizers for several pre-trained models, such as Inception, VGG, and ResNet. First, the raw sound signals were converted into an image format (log-Mel spectrogram). Then, the selected pre-trained models were applied to the obtained spectrogram data. In addition, the effect of essential retraining factors on classification accuracy and processing time was investigated during CNN training. Various optimizers (such as Adam, Adamax, and RMSprop) and hyperparameters were utilized for evaluating the proposed method on the publicly accessible sound dataset UrbanSound8K. The proposed method achieves 97.25% and 95.5% accuracy on the provided dataset using the pre-trained DenseNet201 and ResNet50V2 CNN models, respectively.


“…Training TTS and SVS systems both require a significant amount of annotated data [9,10,15]. The rapid increase in the amount of multimedia content on the Internet in recent years makes data much more important.…”

Section: Dataset

mentioning

confidence: 99%

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

Huang, Chen, Ren, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

Self Cite


High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the shortage of singing voice data, limited singer generalization, and large computational cost. Existing open corpora cannot meet the requirements of high-fidelity singing voice synthesis because of weaknesses in scale and quality. Previous vocoders have difficulty with multi-singer modeling, and a distinct degradation emerges when generating singing voices for unseen singers. To accelerate singing voice research in the community, we release a large-scale, multi-singer Chinese singing voice dataset, OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer, a fast multi-singer vocoder with generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both training and inference; 2) to capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram), Multi-Singer adopts a singer conditional discriminator and a conditional adversarial training objective; 3) to supervise the reconstruction of singer identity in the spectrum envelopes in the frequency domain, we propose an auxiliary singer perceptual loss. The joint training approach works effectively in GANs for multi-singer voice modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves unseen singer singing voice modeling in both speed and quality over previous methods. A further experiment shows that, combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline. Samples are available at https://Multi-Singer.github.io/. CCS Concepts: • Applied computing → Sound and music computing; • Computing methodologies → Natural language generation.


“…4) FG-TransformerTTS (Chen & Rudnicky, 2021): The fine-grained style control on auto-regressive model Transformer-TTS. 5) Expressive FastSpeech 2 (Ren et al, 2020): The combination of both multi-speaker (Chen et al, 2020b) and multi-emotion (Cui et al, 2021) FastSpeech 2, which adds the speaker and emotion d-vectors extracted by the pretrained discriminative models to the backbone. 6) Meta-StyleSpeech (Min et al, 2021): The finetuned multi-speaker text-to-speech model with meta-learning.…”

mentioning

confidence: 99%

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Huang, Ren, Liu, et al. 2022

Preprint

Self Cite


Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/.

