Estimating the fundamental frequency (F0) from audio is a necessary step in many speech processing tasks, such as speech synthesis, which requires accurate analysis of large datasets, or real-time voice transformation, which requires low computation times. Neural-network approaches have recently been proposed for F0 estimation and outperform previous approaches in terms of accuracy. The work presented here aims to improve on these CNN-based state-of-the-art approaches, especially when targeting speech data. More specifically, we first propose to use the recent PaN speech synthesis engine to generate a high-quality speech database with a reliable ground truth F0 annotation. Then, we propose three variants of a new fully-convolutional network (FCN) architecture that are shown to perform better than other similar data-driven methods, with a significantly reduced computational load that makes them more suitable for real-time purposes.
Glottal Closure Instant (GCI) detection consists of automatically detecting, from the speech signal, the temporal locations of the most significant excitation of the vocal tract. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that performs a regression from the speech waveform to a target signal from which the GCI are easily obtained by peak-picking. However, the ground truth GCI used for training and evaluation are usually extracted from electroglottographic (EGG) signals, which are not reliable and often not available. To overcome this problem, we propose to train our network on high-quality synthetic speech with perfect ground truth. The performance of the proposed algorithm is compared with three other state-of-the-art approaches using publicly available datasets, and the impact of using controlled synthetic or real speech signals in the training stage is investigated. The experimental results demonstrate that the proposed method obtains similar or better results than other state-of-the-art algorithms, and that using large synthetic datasets with many speakers offers better generalization ability than using a smaller database of real speech and EGG signals.
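The peak-picking step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the regression output is a target signal with one smooth pulse per GCI. The triangular pulse shape, the `threshold` and `min_distance` parameters, and the helper names `toy_target` and `pick_peaks` are hypothetical and are not taken from the paper.

```python
def toy_target(length, gci_positions, half_width=3):
    """Build a toy target signal: one triangular pulse per GCI (assumed shape)."""
    target = [0.0] * length
    for p in gci_positions:
        for k in range(-half_width, half_width + 1):
            idx = p + k
            if 0 <= idx < length:
                value = 1.0 - abs(k) / (half_width + 1)
                target[idx] = max(target[idx], value)
    return target


def pick_peaks(signal, threshold=0.5, min_distance=1):
    """Return indices of local maxima at or above `threshold`.

    When two maxima are closer than `min_distance` samples,
    only the stronger one is kept.
    """
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if signal[i] >= threshold and is_local_max:
            if peaks and i - peaks[-1] < min_distance:
                if signal[i] > signal[peaks[-1]]:
                    peaks[-1] = i  # replace the weaker neighbour
            else:
                peaks.append(i)
    return peaks


# Usage: recover the pulse centres from the toy target.
estimated_gci = pick_peaks(toy_target(60, [10, 30, 50]), threshold=0.5, min_distance=5)
```

In practice the threshold and minimum distance would need tuning against the plausible F0 range of the speaker; the values here are placeholders.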
This article presents the results of a collaboration between a composer and researchers in the context of vocal roughness and composing for voice. Our research focused on parametric control of distortion. Specifically, we present a software device that supports the manipulation and control of vocal roughness in real time, using a method based on amplitude modulation and filtering. The compositional interest in working with classically trained opera singers and with vocal distortion led us to initiate research in the signal-processing domain. Our goal was to develop a tool that could facilitate the production of distorted sounds without direct effort on the part of the singer. In this way, the singer can perform a nondistorted or lightly distorted sound, and the software tool will generate or magnify the distortion in real time.
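As a rough illustration of the amplitude-modulation idea named in the abstract above, the sketch below multiplies a signal by a slowly varying gain. It is not the authors' device: the filtering stage is omitted, and the function name `amplitude_modulate` and the parameters `mod_freq` and `depth` are assumptions introduced for this example.

```python
import math


def amplitude_modulate(samples, sample_rate, mod_freq, depth):
    """Apply sinusoidal amplitude modulation to a list of samples.

    Each sample is scaled by 1 + depth * cos(2*pi*mod_freq*n/sample_rate),
    so the gain oscillates between 1 - depth and 1 + depth.
    """
    out = []
    for n, x in enumerate(samples):
        gain = 1.0 + depth * math.cos(2.0 * math.pi * mod_freq * n / sample_rate)
        out.append(x * gain)
    return out


# Usage: modulate a 220 Hz sine at 70 Hz (both values are arbitrary placeholders).
sample_rate = 16000
voice = [math.sin(2.0 * math.pi * 220.0 * n / sample_rate) for n in range(1000)]
rough = amplitude_modulate(voice, sample_rate, mod_freq=70.0, depth=0.5)
```

Setting `depth` to zero leaves the signal unchanged, which gives a simple way to move continuously between the clean and the modulated sound.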
In the singing voice, the fundamental frequency (F0) carries not only the melody, but also the musical style, personal expressivity, and other characteristics specific to the voice production mechanism. F0 modeling is therefore critical for natural-sounding and expressive synthesis. In addition, for artistic purposes, composers also need control over the expressive parameters of the F0 curve, which is missing in many current approaches. This paper presents a novel parametric F0 model for singing voice synthesis with intuitive control of expressive parameters. The proposed approach treats the various F0 variations of the singing voice as separate layers, using B-splines to model the melodic component. This model has been implemented in a concatenative singing voice synthesis system, and its perceived naturalness has been evaluated through listening tests. The validity of each layer is first evaluated independently, and the full model is then compared to real F0 curves from professional singers. The results of these tests suggest that the model is suitable for producing natural and expressive F0 contours.
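The layered idea in the abstract above can be sketched as a smooth B-spline melodic contour with a second layer added on top. This is only an illustration under stated assumptions: the uniform cubic B-spline, the sinusoidal vibrato layer, the MIDI-pitch control points, and all function names are choices made for this example, since the abstract does not specify the model's layers or parameters.

```python
import math


def cubic_bspline(control, samples_per_segment):
    """Evaluate a uniform cubic B-spline over a list of control values."""
    curve = []
    for i in range(len(control) - 3):
        p0, p1, p2, p3 = control[i:i + 4]
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            # Uniform cubic B-spline basis functions (they sum to 1).
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
            b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
            b3 = t ** 3 / 6
            curve.append(b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3)
    return curve


def add_vibrato(curve, frame_rate, rate_hz, depth_semitones):
    """Add a sinusoidal vibrato layer (assumed form) to a pitch curve in semitones."""
    return [
        v + depth_semitones * math.sin(2.0 * math.pi * rate_hz * n / frame_rate)
        for n, v in enumerate(curve)
    ]


def midi_to_hz(midi_pitch):
    """Convert a MIDI pitch number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


# Usage: melodic layer from hypothetical control points, plus a vibrato layer.
melody = cubic_bspline([60, 62, 64, 62, 60, 59], samples_per_segment=10)
f0_hz = [midi_to_hz(v) for v in add_vibrato(melody, frame_rate=100, rate_hz=5.5, depth_semitones=0.3)]
```

Working in semitones keeps the layers additive; the conversion to Hz happens only once, at the end.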