Crowdsourcing speech recordings offers unique opportunities and challenges for personalized speech synthesis: it allows large quantities of data to be gathered, but with widely varying quality. Manual methods for data selection and cleaning quickly become infeasible, especially when building voices at scale. We present and analyze approaches for data selection and augmentation to cope with this variability. For training sets of different sizes, we assess speaker adaptation by transfer learning, including layer freezing, and sentence selection using the maximum likelihood of forced alignment. The methodological framework uses statistical parametric speech synthesis based on Deep Neural Networks (DNNs). We compare objective scores for 576 voice models, representing all condition combinations. For a constrained set of conditions we also present results from a subjective listening test. We show that speaker adaptation improves overall quality in nearly all cases, sentence selection helps detect recording errors, and layer freezing proves ineffective in our system. We also found that while Mel-Cepstral Distortion (MCD) does not correlate with listener preference across its full range of values, the most preferred voices also exhibited the lowest MCD values. These findings have implications for scalable methods of customized voice building and for clinical applications with sparse data.
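Since MCD is the objective score on which the final comparison rests, a minimal sketch of the standard MCD computation may be useful as a reference. It assumes the reference and synthesized mel-cepstral sequences have already been time-aligned (e.g. by dynamic time warping); the function name and the exclusion of the 0th (energy) coefficient follow common practice rather than this paper's specific pipeline.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean Mel-Cepstral Distortion in dB between two time-aligned
    mel-cepstral sequences of shape (n_frames, n_coeffs).
    The 0th coefficient (frame energy) is excluded, as is common."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```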
In this paper we evaluate how speaker familiarity influences the engagement times and performance of blind children and young adults when playing audio games made with different synthetic voices. We also show how speaker familiarity influences speaker and synthetic speech recognition. For the first experiment we develop synthetic voices of school children, of their teachers, and of speakers unfamiliar to them, and use each of these voices to create variants of two audio games: a memory game and a labyrinth game. Results show that pupils have significantly longer engagement times and better performance when playing games that use synthetic voices built from their own voices. These findings can be used to improve the design of audio games and lecture books for blind and visually impaired children and young adults. In the second experiment we show that blind children and young adults are better at recognising synthetic voices than their visually impaired companions. We also show that the average familiarity with a speaker and the similarity between a speaker's synthetic and natural voice correlate with the recognition rate of that speaker's synthetic voice.
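The relationship reported at the end is a simple per-speaker correlation. As an illustration only, assuming one mean familiarity rating and one recognition rate per speaker (the arrays below are hypothetical placeholders, not the paper's data), a Pearson correlation reproduces the analysis in spirit:

```python
import numpy as np

# Hypothetical per-speaker values: mean familiarity rating and the
# fraction of listeners who correctly recognised that speaker's
# synthetic voice. Real values would come from the listening tests.
familiarity = np.array([4.2, 3.1, 1.5, 2.8, 4.7])
recognition_rate = np.array([0.85, 0.60, 0.30, 0.55, 0.90])

# Pearson correlation coefficient between the two per-speaker series
r = np.corrcoef(familiarity, recognition_rate)[0, 1]
print(f"Pearson r = {r:.2f}")
```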
In this paper we analyse the effect of speech corpus and compression method on the intelligibility of synthesized speech at fast rates. We recorded English and German voice talents at a normal and a fast speaking rate and trained an HSMM-based synthesis system on each speaker's normal and fast data. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to generated speech. Word recognition results for the English voices show that generating speech at the normal speaking rate and then applying linear compression produced the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates, but the linear method was again more successful at very high rates, for both blind and sighted participants. These results indicate that using fast speech data does not necessarily create more intelligible voices and that linear compression can more reliably provide higher intelligibility, particularly at higher rates.
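To make the second and third compression methods concrete, here is a minimal sketch under two assumptions that go beyond the abstract: state durations are modelled as Gaussians with per-state means and variances, and "linear compression" is read as uniformly shortening the frame durations generated at the normal rate. All names are illustrative, not the paper's implementation.

```python
import numpy as np

def interpolate_duration_models(mu_normal, var_normal, mu_fast, var_fast, alpha):
    """Linearly interpolate the Gaussian state-duration models of a
    normal-rate and a fast-rate voice (alpha=0 -> normal, alpha=1 -> fast)."""
    mu = (1.0 - alpha) * mu_normal + alpha * mu_fast
    var = (1.0 - alpha) * var_normal + alpha * var_fast
    return mu, var

def linear_compression(frame_durations, rate_factor):
    """Uniformly scale per-state frame durations generated at the normal
    rate; rate_factor > 1 speeds the speech up. Each state keeps at
    least one frame so no state is dropped entirely."""
    scaled = np.round(np.asarray(frame_durations) / rate_factor)
    return np.maximum(1, scaled).astype(int)
```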