Exemplar-based Speech Waveform Generation

Watts, Oliver; Valentini-Botinhao, Cassia; Espic, Felipe; King, Simon

doi:10.21437/interspeech.2018-1857

Cited by 3 publications

(9 citation statements)

References 14 publications


“…We found that this simple search was sufficient as units are too short to deviate from the target sequence in the course of a single unit [8].…”

Section: Unit Search (mentioning)

confidence: 97%

“…From these indices a sequence of higher dimension acoustic features is created and used for waveform reconstruction. In this section we will summarise the waveform generation method proposed in [8] that forms the basis of the hybrid TTS framework proposed in this paper.…”

Section: Proposed Text-to-speech System With Examplar-based Speech Wa… (mentioning)

confidence: 99%

“…As in the case of other examplar-based approaches, the proposed method selects a sequence of speech segments under two types of constraint: that each unit should be acoustically close to its target (divergence is penalised with a target cost), and that the end of each unit in the sequence should be acoustically similar to the start of the following unit, so that they can be joined without audible artefacts (implemented with a join cost). As mentioned in [8], the target and join components of the combined cost can be regarded as measures of fidelity and fluency respectively, the first scoring how faithfully the desired message is encoded and the second, how fluently it is rendered. In the following subsections we detail how the database of natural speech waveform units is created and how to generate new waveforms from this database.…”

Section: Proposed Text-to-speech System With Examplar-based Speech Wa… (mentioning)

confidence: 99%
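The interplay of target cost and join cost in the statement above can be illustrated with a small dynamic-programming search. This is only a sketch under stated assumptions, not the search used in [8]: the function name `select_units`, the Euclidean distances, and the `join_weight` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def select_units(targets, cand_feats, cand_starts, cand_ends, join_weight=1.0):
    """Pick one candidate unit per target position by minimising
    target cost (fidelity) plus weighted join cost (fluency).

    targets:     (T, D) array of target feature vectors
    cand_feats:  list of T arrays, each (K_t, D), candidate features
    cand_starts: list of T arrays, each (K_t, D), features at unit start
    cand_ends:   list of T arrays, each (K_t, D), features at unit end
    All distances are Euclidean (an assumption made for this sketch).
    """
    # Accumulated cost of the best path ending in each candidate.
    cost = [np.linalg.norm(cand_feats[0] - targets[0], axis=1)]
    back = []
    for t in range(1, len(targets)):
        target_cost = np.linalg.norm(cand_feats[t] - targets[t], axis=1)
        # join[i, j]: distance from the end of previous candidate i
        # to the start of current candidate j.
        join = np.linalg.norm(
            cand_ends[t - 1][:, None, :] - cand_starts[t][None, :, :], axis=2
        )
        total = cost[-1][:, None] + join_weight * join
        back.append(total.argmin(axis=0))
        cost.append(total.min(axis=0) + target_cost)
    # Trace the cheapest path backwards.
    path = [int(cost[-1].argmin())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1], float(cost[-1].min())
```

Setting `join_weight` to zero reduces the search to picking the candidate closest to each target independently, which makes the trade-off between fidelity and fluency easy to inspect.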

“…We then extract spectral features characterising the signal around each of these pitchmarks, through a pitch-synchronous analysis. Following [8], the term frame from now on denotes a pitchmark-centred acoustic feature vector.…”

Section: Acoustic Feature Extraction (mentioning)

confidence: 99%

“…To resolve these issues the current work proposes a hybrid TTS that uses an examplar-based waveform generation method based on smaller units which are determined without phonetic annotation. This waveform generation system was first proposed in [8]; in this paper we integrate it with a TTS acoustic model and present its halfphone variant that was used in [9]. Similar small unit systems have been proposed before, where units are determined without phonetic […] Links to audio samples and code for recreating the systems described here can be found at https://github.com/CSTR-Edinburgh/snickery.…”

Section: Introduction (mentioning)

confidence: 99%


Examplar-Based Speech Waveform Generation for Text-To-Speech

Valentini-Botinhao

Watts

Espic

et al. 2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Self Cite


This paper presents a hybrid text-to-speech framework that uses a waveform generation method based on examplars of natural speech waveform. These examplars are selected at synthesis time given a sequence of acoustic features generated from text by a statistical parametric speech synthesis model. In order to match the expected degradation of these target synthesis features, the database of units is constructed such that the units' target representations are generated from the same parametric model. We evaluate two variants of this framework by modifying the size of the examplar: a small unit variant (where unit boundaries are determined by pitch mark location) and a halfphone variant (where unit boundaries are determined by subphone state forced alignment). We found that for a larger dataset (around four hours of training data) the examplar-based waveform generation variants are rated higher than the vocoder-based system.


Speech Waveform Reconstruction Using Convolutional Neural Networks with Noise and Periodic Inputs

Watts

Valentini-Botinhao

King

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


This paper presents a method for upsampling and transforming a compact representation of acoustics into a corresponding speech waveform. Similar to a conventional vocoder, the proposed system takes a pulse train derived from fundamental frequency and a noise sequence as inputs and shapes them to be consistent with the acoustic features. However, the filters that are used to shape the waveform in the proposed system are learned from data, and take the form of layers in a convolutional neural network. Because the network performs the transformation simultaneously for all waveform samples in a sentence, its synthesis speed is comparable with that of conventional vocoders on CPU, and many times faster on GPU. It is trained directly in a fast and straightforward manner, using a combined time- and frequency-domain objective function. We use publicly available data and provide code to allow our results to be reproduced.
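As a rough illustration of the two inputs named in this abstract, the sketch below builds a sample-level pulse train from a frame-level F0 contour and pairs it with a Gaussian noise sequence. The abstract does not give the construction details, so the function names, the 16 kHz sample rate, the 80-sample hop, and the phase-accumulation approach are all assumptions.

```python
import numpy as np

def pulse_train(f0, sample_rate=16000, hop=80):
    """Sample-level pulse train from frame-level F0 values in Hz.

    Unvoiced frames are assumed to be marked with f0 == 0 and
    therefore produce no pulses.
    """
    f0_samples = np.repeat(np.asarray(f0, dtype=float), hop)
    # Accumulate phase in cycles; a pulse is placed wherever the
    # integer part of the phase increases.
    phase = np.cumsum(f0_samples / sample_rate)
    cycles = np.floor(phase)
    pulses = np.zeros_like(phase)
    pulses[1:] = np.diff(cycles) > 0
    return pulses

def excitation(f0, sample_rate=16000, hop=80, seed=0):
    """Stack the pulse train and a noise sequence as two input channels."""
    pulses = pulse_train(f0, sample_rate, hop)
    noise = np.random.default_rng(seed).standard_normal(len(pulses))
    return np.stack([pulses, noise])
```

In a system of the kind described, a two-channel signal like this would be passed to the learned convolutional filters together with the upsampled acoustic features; that network is outside the scope of this sketch.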


Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks

Valentini-Botinhao

Ribeiro

Watts

et al. 2022

Interspeech 2022

Self Cite


Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text. We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores. We explore both attention and recurrent neural nets to account for the fact that stimuli in a pair are not time aligned. To obtain a large training set we convert listeners' ratings from MUSHRA tests to values that reflect how often one stimulus in the pair was rated higher than the other. Specifically, we evaluate performance on data obtained from twelve MUSHRA evaluations conducted over five years, containing different TTS systems, built from data of different speakers. Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
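One way to picture the conversion from MUSHRA ratings to pairwise preference values is sketched below. The abstract only says the values reflect how often one stimulus was rated higher than the other, so the exact formula here (fraction of listeners preferring one stimulus minus the fraction preferring the other) and the name `pairwise_preferences` are assumptions made for illustration.

```python
import numpy as np

def pairwise_preferences(ratings):
    """Convert per-listener ratings into pairwise preference scores.

    ratings: (listeners, stimuli) array of scores for the same text.
    Returns a (stimuli, stimuli) matrix whose entry [i, j] is the
    fraction of listeners who rated i above j minus the fraction who
    rated j above i, a value in [-1, 1].
    """
    ratings = np.asarray(ratings, dtype=float)
    # diff[l, i, j] = rating of stimulus i minus rating of stimulus j
    # for listener l.
    diff = ratings[:, :, None] - ratings[:, None, :]
    return np.sign(diff).mean(axis=0)
```

Under this definition the score for the pair (i, j) is the negative of the score for (j, i), which matches the kind of target an anti-symmetric model would be expected to reproduce when its two inputs are swapped.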

