This paper describes a speech-to-singing synthesis system that can synthesize a singing voice, given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and comprises three models controlling three acoustic features unique to singing voices: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuation: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of a singing voice by controlling both the singer's formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose naturalness is almost the same as that of actual singing voices.
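The four F0 fluctuation types can be illustrated with a small sketch. Overshoot (an F0 excursion just after a note change) and preparation (a dip just before it) both emerge naturally from the step response of an underdamped second-order system driven by the stepwise melody, while vibrato is a sinusoid and fine fluctuation is low-amplitude noise. The parameter values below (damping ratio, natural frequency, vibrato rate and depth) are illustrative assumptions, not the system's tuned values, and the function names are hypothetical:

```python
import numpy as np

def f0_contour(note_f0_cents, note_dur_s, fs=200.0,
               zeta=0.6, omega=40.0,          # damping ratio / natural freq (assumed values)
               vib_rate=5.5, vib_depth=60.0,  # vibrato: ~5-6 Hz, depth in cents (assumed)
               fine_std=5.0):                 # fine-fluctuation noise, cents (assumed)
    """Sketch of a singing-voice F0 contour generator (values in cents).

    The stepwise melody is passed through a damped second-order system
    whose step response produces overshoot after each note change;
    vibrato is added as a sinusoid and fine fluctuation as noise.
    """
    # Stepwise target melody sampled at frame rate fs
    target = np.concatenate([np.full(int(d * fs), f)
                             for f, d in zip(note_f0_cents, note_dur_s)])
    # Semi-implicit Euler simulation of x'' + 2*zeta*omega*x' + omega^2*x = omega^2*u
    dt = 1.0 / fs
    x = np.zeros_like(target)
    v = 0.0
    x[0] = target[0]
    for n in range(1, len(target)):
        a = omega**2 * (target[n] - x[n - 1]) - 2.0 * zeta * omega * v
        v += a * dt
        x[n] = x[n - 1] + v * dt
    t = np.arange(len(target)) * dt
    vibrato = vib_depth * np.sin(2.0 * np.pi * vib_rate * t)
    fine = np.random.default_rng(0).normal(0.0, fine_std, len(target))
    return x + vibrato + fine
```

With vibrato and noise disabled, the contour settles on each note's pitch but briefly overshoots after a note change, which is the behavior the F0 control model exploits.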
In this paper, we propose a novel area of research referred to as singing information processing. To shape the concept of this area, we first introduce singing understanding systems for synchronizing a vocal melody with its corresponding lyrics, identifying the singer, evaluating singing skills, creating hyperlinks between phrases in the lyrics of songs, and detecting breath sounds. We then introduce music information retrieval systems based on similarity of vocal melody timbre and vocal percussion, as well as singing synthesis systems. Common signal processing techniques for modeling singing voices that are used in these systems, such as techniques for extracting the vocal melody from polyphonic music recordings and modeling the lyrics by using phoneme HMMs for singing voices, are discussed.

Index Terms: Music, singing information processing, singing voice modeling, vocal melody

INTRODUCTION

As research on music information processing [1, 2, 3], including research on music information retrieval [4], has continued to rapidly expand, research activities related to singing have also become more vigorous. Such activities are attracting attention not only from a scientific point of view, but also from the standpoint of industrial applications. Singing-related research is highly diverse, ranging from basic research on the features unique to singing to applied research such as that on the synthesis of singing voices, lyrics recognition, singer identification, retrieval of singing voices, and singing-skill evaluation. In this paper, we refer to this broad range of singing-related studies as singing information processing and introduce examples of these studies, focusing on signal processing techniques for modeling singing voices. Singing possesses aspects of both speech and music, and there are many unsolved research problems from the viewpoint of either field.
For example, singing voices generally fluctuate more than speaking voices, and musical accompaniment, which is closely interlinked with singing, is usually present at a relatively high volume. Because of these characteristics, the automatic recognition of singing is, from a technical point of view, the most difficult class of speech recognition; in fact, the automatic recognition of lyrics in vocals has not yet been fully achieved. Furthermore, from the viewpoint of music recognition and understanding, the large fluctuations and variations in singing cause various difficulties compared to a similar analysis of musical instruments. Technically speaking, there are many difficult and deeply interesting problems in this regard. Similarly, many problems remain in singing synthesis research, since, in addition to conveying content in the form of language as in speaking, singing synthesis requires dynamic, complex, and expressive changes in voice pitch, intensity, and timbre. In this way, the study of singing information processing is a genuine frontier of science. Moreover, while music is an important type of content from the viewpoints o...
A singing-voice synthesis method that transforms a speaking voice into a singing voice using STRAIGHT is proposed. This method comprises three components: an F0 control model, a spectral sequence control model, and a duration control model. These models were constructed by analyzing, through psychoacoustic experiments, the characteristics of each acoustic feature that affects singing-voice perception. The F0 control model generates a singing-voice F0 contour by considering four F0 fluctuations that affect the naturalness of a singing voice: overshoot, vibrato, preparation, and fine (unsteady) fluctuation. The spectral sequence control model modifies the speaking-voice spectral shape into a singing-voice spectral shape by controlling the singer's formant, a prominent peak of the spectral envelope at around 3 kHz, and the amplitude modulation of formants synchronized with vibrato. The duration control model stretches each speaking-voice phoneme duration into a singing-voice phoneme duration based on the corresponding note duration. Results show that the proposed method can synthesize a natural singing voice whose sound quality resembles that of an actual singing voice.
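The spectral sequence control described above can be sketched as two additive modifications of a spectral envelope expressed in dB: a fixed boost centered near 3 kHz for the singer's formant, and a small gain modulation locked to the vibrato phase for the formant amplitude modulation. The Gaussian boost shape, the 12 dB gain, the 400 Hz bandwidth, and the 2 dB modulation depth below are illustrative assumptions, not the paper's measured values:

```python
import numpy as np

def spectral_control(env_db, freqs_hz, vib_phase,
                     center=3000.0, gain_db=12.0, bw_hz=400.0,
                     am_depth_db=2.0):
    """Hedged sketch of singing-voice spectral control.

    env_db    : spectral envelope in dB, sampled at freqs_hz.
    vib_phase : current vibrato phase in radians, so the formant
                amplitude modulation stays synchronized with vibrato.
    Returns the modified envelope in dB.
    """
    # Singer's formant: Gaussian-shaped emphasis around ~3 kHz
    boost = gain_db * np.exp(-0.5 * ((freqs_hz - center) / bw_hz) ** 2)
    # Formant amplitude modulation in sync with the vibrato phase
    am = am_depth_db * np.sin(vib_phase)
    return env_db + boost + am
```

Working in dB makes both effects simple additions; a synthesizer would apply the resulting envelope frame by frame, advancing `vib_phase` at the vibrato rate.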