2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016
DOI: 10.1109/icassp.2016.7472736

A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

Abstract: In the state-of-the-art statistical parametric speech synthesis system, a speech analysis module, e.g. STRAIGHT spectral analysis, is generally used for obtaining accurate and stable spectral envelopes, and then low-dimensional acoustic features extracted from obtained spectral envelopes are used for training acoustic models. However, a spectral envelope estimation algorithm used in such a speech analysis module includes various processing derived from human knowledge. In this paper, we investigate a deep auto…
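The abstract above is truncated, so the paper's actual architecture, input representation and training recipe are not shown here. As a rough illustration of the general idea (a deep auto-encoder whose bottleneck turns a high-dimensional spectral envelope into a low-dimensional feature), the following is a minimal NumPy sketch under stated assumptions. Everything in it is hypothetical: the 129-bin envelope size (a 256-point FFT), the 129-64-16-64-129 layer sizes, tanh hidden units, the toy "formant bump" envelopes standing in for real FFT spectral envelopes, per-bin normalisation, and plain full-batch gradient descent on mean squared reconstruction error.

```python
import numpy as np

# Assumed sizes for illustration only (not taken from the paper):
N_BINS = 129     # a 256-point FFT gives 129 non-negative frequency bins
CODE_DIM = 16    # bottleneck width = dimension of the extracted feature


def synthetic_envelopes(n_frames, rng):
    """Toy log-magnitude envelopes: three Gaussian 'formant' bumps plus a tilt.

    Stand-in data only; real envelopes would come from an FFT-based analysis.
    """
    freqs = np.linspace(0.0, 1.0, N_BINS)
    centres = rng.uniform(0.05, 0.9, size=(n_frames, 3))
    widths = rng.uniform(0.02, 0.08, size=(n_frames, 3))
    gains = rng.uniform(0.5, 2.0, size=(n_frames, 3))
    bumps = gains[:, :, None] * np.exp(
        -((freqs[None, None, :] - centres[:, :, None]) ** 2)
        / (2.0 * widths[:, :, None] ** 2))
    return bumps.sum(axis=1) - 1.5 * freqs  # shape (n_frames, N_BINS)


class DeepAutoEncoder:
    """Hypothetical 129 -> 64 -> 16 -> 64 -> 129 MLP: tanh hidden layers, linear output."""

    def __init__(self, rng, sizes=(N_BINS, 64, CODE_DIM, 64, N_BINS)):
        self.params = [
            (rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        """Return the activations of every layer, input first."""
        acts = [x]
        last = len(self.params) - 1
        for i, (w, b) in enumerate(self.params):
            z = acts[-1] @ w + b
            acts.append(z if i == last else np.tanh(z))
        return acts

    def encode(self, x):
        """Bottleneck activations: the low-dimensional feature for each frame."""
        h = x
        for w, b in self.params[:2]:  # the two encoder layers
            h = np.tanh(h @ w + b)
        return h

    def train_step(self, x, lr):
        """One full-batch gradient-descent step; returns the loss before the update."""
        acts = self.forward(x)
        err = acts[-1] - x
        loss = np.mean(err ** 2)
        grad = 2.0 * err / err.size  # dLoss/d(output) for the mean squared error
        last = len(self.params) - 1
        for i in reversed(range(len(self.params))):
            w, b = self.params[i]
            if i != last:
                grad = grad * (1.0 - acts[i + 1] ** 2)  # back through tanh
            grad_w = acts[i].T @ grad
            grad_b = grad.sum(axis=0)
            grad = grad @ w.T  # propagate using the pre-update weights
            self.params[i] = (w - lr * grad_w, b - lr * grad_b)
        return loss


rng = np.random.default_rng(0)
envelopes = synthetic_envelopes(256, rng)
# Per-bin mean/variance normalisation before training.
mean, std = envelopes.mean(axis=0), envelopes.std(axis=0) + 1e-8
x = (envelopes - mean) / std

model = DeepAutoEncoder(rng)
losses = [model.train_step(x, lr=0.1) for _ in range(500)]
codes = model.encode(x)  # (256, CODE_DIM) features that could feed an acoustic model
print(f"MSE {losses[0]:.3f} -> {losses[-1]:.3f}, code shape {codes.shape}")
```

In a real system the bottleneck codes would play the role that hand-designed low-dimensional features (e.g. cepstra derived from STRAIGHT envelopes) play in the pipeline the abstract describes, and the decoder half would map predicted codes back to an envelope; this sketch only shows the compression and reconstruction mechanics.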

Cited by 33 publications (19 citation statements); references 21 publications (22 reference statements).
“…modelling [5,6], duration modelling [7], feature extraction [8], and text analysis [9] having been investigated by various groups. It has been reported that DNN-based techniques have improved the quality of synthetic speech significantly; cf.…”
Section: Introduction (mentioning; confidence: 99%)
“…Recently, the adoption of non-parametric models has opened up possibilities for using higher-dimensional, correlated representations. In [12] a restricted Boltzmann machine (RBM) was used to model the spectral envelope distribution, and the reintroduction of neural networks [13] subsequently led to work modelling higher-dimensional representations [14,15] or modelling a conventional cepstral representation whilst optimising a cost function in the waveform domain [16,17].…”
Section: Introduction (mentioning; confidence: 99%)
“…Recently, various neural-network-based models have been proposed to better map the textual features into acoustic ones [2,3]. There are also neural networks that directly model spectral features to avoid artifacts caused by vocoders [4,5].…”
Section: Introduction (mentioning; confidence: 99%)