“…Some works used the system to get a spoken version of the song and transform it into singing by incorporating a signal processing stage. For instance, in [22], the synthetic speech was converted into singing according to a MIDI file input, using STRAIGHT to perform the analysis, transformation and synthesis. In [17], an HMM-based TTS synthesiser for Basque was used to generate a singing voice.…”
Section: Singing Synthesis
confidence: 99%
“…The audios generated for one of the five scores have been provided as Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, and 24. Forty-nine Spanish native speakers took part in the test.…”
Section: Subjective Evaluation 4.3.1 MUSHRA Test Setup
confidence: 99%
“…As an alternative, we could take advantage of those approaches which focus on the production of singing from speech following the so-called speech-to-singing (STS) conversion [19][20][21]. These techniques can be applied to the output of a TTS system to transform speech into singing while maintaining the identity of the speaker [18,22]. However, this straightforward approach has been shown to be suboptimal in terms of flexibility and computational cost [18].…”
Section: Introduction
confidence: 99%
“…ting vibrato to the singing expression control generation module. Furthermore, other signal-processing techniques could be considered for the transformation module to better cope with the challenge of generating singing from neutral speech.…”
Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for occasional singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the pitch-scale factors to be reduced, the time-scale factors are not reduced because of the short duration of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness scores of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the obtained singing scores of around 60 validate that the framework can reasonably address occasional singing needs.
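The STS transformation factors mentioned in the abstract can be made concrete with a small sketch: a pitch-scale factor relates the selected unit's spoken F0 to the target note frequency, and a time-scale factor relates the spoken vowel duration to the note duration. The helper names and the example numbers below are illustrative assumptions, not values taken from the paper:

```python
def midi_to_hz(note):
    """MIDI note number -> fundamental frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def sts_factors(spoken_f0_hz, spoken_dur_ms, note_midi, note_dur_ms):
    """Pitch-scale and time-scale ratios needed to map a spoken unit to a note."""
    pitch_scale = midi_to_hz(note_midi) / spoken_f0_hz
    time_scale = note_dur_ms / spoken_dur_ms
    return pitch_scale, time_scale

# A 90 ms spoken vowel at 180 Hz mapped to a 300 ms A3 (MIDI 57 = 220 Hz):
ps, ts = sts_factors(180.0, 90.0, 57, 300.0)
print(round(ps, 2), round(ts, 2))  # -> 1.22 3.33
```

This illustrates why long notes are the harder case: stretching a short spoken vowel to a 300 ms note already demands a time-scale factor above 3, whereas a score-driven unit selection can pick units whose F0 is close to the target note and so keep the pitch-scale factor near 1.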
“…In recent years, speech synthesis techniques have been well developed, e.g., text-reading systems, speech-oriented guidance systems, and singing voice synthesis [1][2][3]. However, these systems synthesize only the linguistic information and cannot handle human emotion.…”
We propose Gaussian mixture model (GMM)-based emotional voice conversion using spectrum and prosody features. In recent years, speech recognition and synthesis techniques have been developed, and an emotional voice conversion technique is required for synthesizing more expressive voices. Common emotional conversion has been based on transforming neutral prosody into emotional prosody using a huge speech corpus. In this paper, we convert a neutral voice into an emotional voice using GMMs. GMM-based spectrum conversion is widely used to modify non-linguistic information, such as voice characteristics, while keeping the linguistic information unchanged. Because the conventional method converts either prosody or voice quality (spectrum) alone, some emotions are not converted well. In our method, both prosody and voice quality are used to convert a neutral voice into an emotional voice, yielding more expressive voices than conventional methods based on prosody or spectrum conversion alone.
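The joint-density GMM mapping behind this kind of conversion can be sketched as follows. This is a minimal illustration on synthetic 1-D data, not the authors' implementation: a GMM is fitted on stacked source/target feature vectors, and conversion is the minimum mean-square-error estimate E[y|x]. In practice x and y would be aligned spectral (e.g. mel-cepstral) and prosodic features from parallel neutral/emotional utterances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for aligned features: target y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=(500, 1))

z = np.hstack([x, y])  # joint feature vectors [x; y]
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(z)

def convert(x_new):
    """MMSE mapping E[y|x] under the joint GMM (Kain/Toda-style)."""
    x_new = np.atleast_2d(np.asarray(x_new, dtype=float))
    d = x.shape[1]
    means_x = gmm.means_[:, :d]
    means_y = gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]
    # Responsibilities p(k|x) from the marginal GMM over x.
    log_resp = np.zeros((len(x_new), gmm.n_components))
    for k in range(gmm.n_components):
        diff = x_new - means_x[k]
        inv = np.linalg.inv(cov_xx[k])
        log_det = np.linalg.slogdet(cov_xx[k])[1]
        log_resp[:, k] = (np.log(gmm.weights_[k])
                          - 0.5 * (np.sum(diff @ inv * diff, axis=1)
                                   + log_det + d * np.log(2 * np.pi)))
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # Mix the per-component conditional means E[y|x, k].
    out = np.zeros((len(x_new), means_y.shape[1]))
    for k in range(gmm.n_components):
        cond = means_y[k] + (x_new - means_x[k]) @ np.linalg.inv(cov_xx[k]) @ cov_yx[k].T
        out += resp[:, [k]] * cond
    return out

# For x = 0 the output should be close to 1.0, since y = 2x + 1.
print(convert([[0.0]]))
```

Converting only the spectrum with such a mapping leaves prosody untouched, which is exactly why the abstract argues for converting prosody (F0, duration) and voice quality jointly.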