ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413513
Graphspeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis

Abstract: Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one of such successful implementations. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at sentence level. We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under graph neural network framework. GraphSpeech e…

Cited by 14 publications (8 citation statements). References 27 publications (31 reference statements).
“…The frame-by-frame encoding approach encodes each speech frame to a vector. Various methods are used to extract the features of speech frames, such as Convolutional Layer, LSTM, Transformer [16], or build a graph in speech frames to get more detailed features [17]. Then the encoded speech frames can be used in many tasks, such as speech recognition [18] and direct speech translation [19].…”
Section: Related Work
confidence: 99%
“…In addition, generative models benefit NAR neural TTS: a NAR TTS model built on a deep variational autoencoder with a residual attention mechanism subtly refines the text-to-sound alignment (Liu et al. 2021). Over-smoothing is a severe problem that harms the performance of NAR TTS models (Ren et al. 2022).…”
Section: Non-autoregressive Neural TTS
confidence: 99%
“…Traditional neural TTS models usually take phoneme sequences as input, which does not fully exploit the contextual semantic information of the target sentence. Therefore, many works improve the expressiveness of TTS by introducing syntactic information (Liu, Sisman, and Li 2021); these methods explicitly associate input phoneme embeddings with syntactic relations. A word-level semantic representation method based on dependency structure and pre-trained BERT is proposed in (Zhou et al. 2021).…”
Section: Introduction
confidence: 99%
“…Results showed the effectiveness of so-called attentional GNNs in transferring the metric representation learned from training classes to novel classes. Liu et al. [30] demonstrated the application of GNNs to neural speech synthesis, using them to explicitly encode the syntactic relationships among the elements of a sentence. Jung et al. [31] showed how GATs can learn utterance-level relationships between speakers, and how a GAT architecture with residual connections can be adapted to compute utterance-level similarity scores for speaker verification.…”
Section: Related Work
confidence: 99%
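The excerpts above repeatedly refer to graph attention networks (GATs), in which each node's representation is an attention-weighted aggregation over its neighbors. The following is a minimal single-head sketch of that mechanism in NumPy, illustrating the general GAT formulation only; it is not the GraphSpeech implementation, and all function and parameter names here are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # standard LeakyReLU nonlinearity used on GAT attention logits
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention(H, A, W, a):
    """One graph-attention head (illustrative sketch).
    H: (N, F) node features; A: (N, N) adjacency (nonzero = edge,
    self-loops included); W: (F, Fp) shared projection;
    a: (2*Fp,) attention vector."""
    Z = H @ W                          # (N, Fp) projected node features
    out = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        nbrs = np.flatnonzero(A[i])
        # attention logit for each neighbor j of node i, computed from
        # the concatenated pair of projected features
        logits = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                           for j in nbrs])
        alpha = softmax(logits)        # normalize over the neighborhood
        out[i] = alpha @ Z[nbrs]       # attention-weighted aggregation
    return out
```

In a syntax-aware setting such as the one the citing works describe, the adjacency matrix `A` would be derived from a dependency parse of the input sentence, so attention is restricted to syntactically related tokens rather than all token pairs.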