2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7178816

Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis

Abstract: Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the concerns for applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of unidirectional RNN architecture allows frame-synchronous streaming…

Cited by 247 publications (210 citation statements)
References 28 publications
“…The specifics of the training method will be discussed in section 4. Note that output layers are also recurrent, so that dynamic features are not computed because feedback connections within the layer keep track of the dynamic evolution of outputs [7]. The intuition behind this architecture is that, whilst every output branch is trained, it shares the first linguistic mappings with other branches.…”
Section: Proposed Architecture
confidence: 99%
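To make the role of the recurrent output layer concrete, here is a minimal NumPy sketch (not the authors' implementation; all names and shapes are illustrative) of a unidirectional LSTM followed by a recurrent linear output layer. The feedback connection inside the output layer is what lets the model generate static acoustic features frame by frame, without the explicit dynamic (delta) features and smoothing pass that would otherwise add latency.

```python
# Minimal, illustrative sketch (assumed shapes and names, not the paper's code):
# a unidirectional LSTM step followed by a recurrent linear output layer [7].
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One unidirectional LSTM step; W, U, b stack the i/f/o/g gate parameters."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                 # pre-activations, shape (4n,)
    i = 1.0 / (1.0 + np.exp(-z[:n]))           # input gate
    f = 1.0 / (1.0 + np.exp(-z[n:2*n]))        # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*n:3*n]))      # output gate
    g = np.tanh(z[3*n:])                       # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def recurrent_output_step(h, y_prev, W_hy, W_yy, b_y):
    """Recurrent (linear) output layer: feedback from the previous output frame
    stands in for explicit dynamic (delta) features."""
    return W_hy @ h + W_yy @ y_prev + b_y

def synthesize(X, params):
    """Frame-synchronous pass over linguistic feature frames X of shape (T, d_in)."""
    W, U, b, W_hy, W_yy, b_y = params
    n, d_out = U.shape[1], b_y.shape[0]
    h, c, y = np.zeros(n), np.zeros(n), np.zeros(d_out)
    outputs = []
    for x in X:                                # one frame in, one frame out: streaming
        h, c = lstm_step(x, h, c, W, U, b)
        y = recurrent_output_step(h, y, W_hy, W_yy, b_y)
        outputs.append(y)
    return np.stack(outputs)
```

Because the loop consumes one linguistic feature frame and emits one acoustic frame per step, the whole pass is frame-synchronous and suitable for streaming synthesis.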
“…In the case of speech synthesis, many works have used DNNs and DBNs to perform acoustic mappings and prosody prediction [2,3,4]. Also, Recurrent Neural Networks (RNNs) and their variants, such as the Long Short-Term Memory (LSTM) architecture [5], have proven well suited to the sequence processing and prediction problem, which leads to interesting results in the speech synthesis field, where an acoustic signal of variable length has to be generated from a set of textual entities. Some example works using these structures can be seen in [6,7,8,9]. Prior to deep learning, existing text-to-speech technologies included unit selection speech synthesis [10] and statistical parametric speech synthesis (SPSS) [11].…”
Section: Introduction
confidence: 99%
“…(24) corresponds to the sum of squares of the inverse system output. The definition of the linguistic feature vector used in this paper can be found in [6] and [19]. Log likelihoods of trained LSTM-RNNs over both training and development subsets (60,000 samples).…”
Section: By Assuming
confidence: 99%
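Read literally, and with purely illustrative notation (the cited equation (24) itself is not reproduced here), a sum-of-squares criterion over an inverse-system output ê(t) evaluated over T samples would take the form:

```latex
E = \sum_{t=1}^{T} \hat{e}(t)^{2}
```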
“…The training and development data sets consisted of 34,632 and 100 utterances, respectively. A speaker-dependent unidirectional LSTM-RNN [19] was trained. From the speech data, its associated transcriptions, and automatically derived phonetic alignments, sample-level linguistic features included 535 linguistic contexts, 50 numerical features for coarse-coded position of the current sample in the current phoneme, and one numerical feature for duration of the current phoneme.…”
Section: Experimental Conditions
confidence: 99%
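As a rough illustration of the input described above, the sketch below (hypothetical helper names, not code from the cited experiments) assembles one 586-dimensional sample-level feature vector from 535 binary linguistic contexts, 50 coarse-coded position values, and one phoneme-duration value. The Gaussian coarse coding used here is an assumption; the excerpt only states that the position is coarse-coded.

```python
# Illustrative sketch (assumed coding scheme and names) of building one
# sample-level linguistic feature vector: 535 + 50 + 1 = 586 dimensions.
import numpy as np

def coarse_code_position(pos_in_phoneme, n_bins=50, width=0.1):
    """Coarse coding: Gaussian bumps centred on a grid over [0, 1]."""
    centres = np.linspace(0.0, 1.0, n_bins)
    return np.exp(-0.5 * ((pos_in_phoneme - centres) / width) ** 2)

def make_sample_features(context_ids, pos_in_phoneme, phoneme_duration,
                         n_contexts=535):
    """context_ids: indices of the binary linguistic contexts that are active."""
    contexts = np.zeros(n_contexts)
    contexts[list(context_ids)] = 1.0
    position = coarse_code_position(pos_in_phoneme)        # 50 values
    duration = np.array([phoneme_duration])                # 1 value
    return np.concatenate([contexts, position, duration])  # 586-dimensional

x = make_sample_features({3, 17, 200}, pos_in_phoneme=0.25, phoneme_duration=0.08)
print(x.shape)  # (586,)
```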
“…The disadvantage of using such networks is that they cannot directly model the dependence of each frame of parameters on the preceding ones, which is desirable to mimic the production of human speech. To solve this problem, it has been suggested to include RNNs [21] [22], in which there is feedback from some of the neurons in the network, backwards or to themselves, forming a kind of memory that retains previous states.…”
Section: Long Short-Term Memory Recurrent Neural Network
confidence: 99%
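For reference, the elementary self-feedback described here is the standard simple-RNN recurrence (generic textbook form, with illustrative symbols, not taken from [21] or [22]):

```latex
h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right),
\qquad y_t = W_{hy}\, h_t + b_y
```

Here the hidden state h_{t-1} is the "memory" that retains information about previous inputs and lets each output depend on the sequence history.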