11th ISCA Speech Synthesis Workshop (SSW 11) 2021
DOI: 10.21437/ssw.2021-36

Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

Cited by 7 publications (3 citation statements)
References: 0 publications

“…Growing research interest in expressive speech synthesis for conversational speech has led to efforts addressing two subproblems. On the one hand, recent research has looked into building context-aware neural TTS models by incorporating and conditioning on contextual information such as audio and linguistic features during training [6,7]. This has also led to newly proposed evaluation paradigms, which aim to move from rating naturalness of speech in isolation to rating appropriateness of speech in context [8].…”
Section: Related Work
confidence: 99%
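The conditioning approach described in the excerpt above (injecting contextual features into a neural TTS model during training) can be pictured with a minimal sketch. The module below assumes one common pattern, projecting an utterance-level context vector and adding it to the phoneme encoder states before decoding; it is illustrative only, not the architecture of the paper or of the citing works, and all names and dimensions are hypothetical.

```python
# Minimal sketch of context conditioning in a neural TTS encoder.
# Hypothetical names and dimensions; not the architecture of any cited system.
import torch
import torch.nn as nn

class ContextConditionedEncoder(nn.Module):
    def __init__(self, n_phones=80, d_model=256, d_context=768):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.encoder = nn.LSTM(d_model, d_model // 2,
                               batch_first=True, bidirectional=True)
        # Project the context vector (e.g., an embedding of the previous
        # utterance, textual or acoustic) into the model dimension.
        self.context_proj = nn.Linear(d_context, d_model)

    def forward(self, phone_ids, context_vec):
        # phone_ids: (batch, time) int64; context_vec: (batch, d_context)
        states, _ = self.encoder(self.phone_emb(phone_ids))
        # Broadcast the projected context over time so every encoder frame
        # (and hence the decoder) sees the previous-utterance information.
        return states + self.context_proj(context_vec).unsqueeze(1)

enc = ContextConditionedEncoder()
phones = torch.randint(0, 80, (2, 17))
ctx = torch.randn(2, 768)
print(enc(phones, ctx).shape)  # torch.Size([2, 17, 256])
```

Additive broadcasting is only one of several plausible injection points; concatenation with the encoder outputs or conditioning the decoder directly are equally common choices in context-aware TTS.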
“…Context is a broad term and has different effects on an utterance, e.g., pragmatic context [6], entrainment with a speaking partner, or utterance position in a turn [7]. The effect of context on speech, especially its prosodic realisation, is poorly understood and can be data-dependent [8]. Prosodic cues also show inter-speaker variability [9].…”
Section: Introduction
confidence: 99%
“…Apart from the textual context, it has been reported that acoustic features from the previous sentence can also improve sentence-based TTS [43]. Following this result, a comparative investigation of multiple context representation types for the previous sentence was conducted [44], including textual and acoustic features, utterance-level and word-level features, and representations extracted with a large pre-trained model versus learned jointly with the TTS training. Our work differs from these in two respects: (i) the contextual information used in these models was derived from either isolated sentences or consecutive sentences of a predefined length, while in our work the contextual information is extracted from a variable-length paragraph, which is a self-contained unit of discourse composed of several connected sentences; (ii) cross-sentence linguistic context was used to improve sentence-level or conversation-level speech synthesis in their models.…”
Section: Introduction
confidence: 99%
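As a concrete illustration of the "representations extracted with a large pre-trained model" that [44] compares against jointly learned ones, the sketch below mean-pools the hidden states of a pre-trained BERT over the previous sentence to obtain a single utterance-level context vector. The model choice, pooling strategy, and function name are assumptions for illustration, not the exact setup of the cited work.

```python
# Hedged sketch: utterance-level textual context from a pre-trained model.
# Model choice and mean pooling are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def context_embedding(previous_sentence: str) -> torch.Tensor:
    inputs = tokenizer(previous_sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, n_tokens, 768)
    # Mean-pool over tokens into one fixed-size vector that a TTS model
    # could be conditioned on (cf. the encoder sketch above).
    return hidden.mean(dim=1).squeeze(0)            # (768,)

print(context_embedding("It was raining when she left.").shape)
```

Word-level variants would instead keep the per-token hidden states and align them to the TTS encoder frames, typically via attention, rather than pooling them away.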