Multimodal Speech Synthesis

Schweitzer, Antje; Braunschweiler, Norbert; Dogil, Grzegorz; Klankert, Tanja; Möbius, Bernd; Möhler, Gregor; Morais, Edmilson; Säuberlich, Bettina; Thomae, Matthias

doi:10.1007/3-540-36678-4_27

Cited by 4 publications

(4 citation statements)

References 21 publications

“…For example, in our corpus (Schweitzer et al 2004) fewer than 20% of the 1 Related work by Xanthos (Goldsmith and Xanthos 2009) explores a range of methods for establishing whether it is possible to automatically infer whether segments in a data sample are vowels or consonants (in addition to examining vowel harmony and phonotactic induction). Segmentation, however, is not the focus.…”

Section: Approaches To Segmentation Based On Representations Of the A (mentioning)

confidence: 99%

“…We also use the IMS unit selection corpus (Schweitzer et al 2004), a corpus of German speech, recorded by a professional male and a professional female speaker, and sampled at 16,000 Hz. 13-dimensional MFCCs were computed for the 2776 original speech files of the male speaker-part by means of the MATLAB Auditory Toolbox (Slaney 1998).…”

Section: IMS Unit Selection Corpus (mentioning)

confidence: 99%

A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

Duran, Schütze, Möbius, et al. (2010). Res on Lang and Comput

Self Cite

In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using the previously learned segmentation, different instances of a word can be mapped onto each other with high accuracy when trained on utterance-label pairs for a small set of words.

“…The prosody-syntax interface has been a theme of great interest in the scientific community, as illustrated by the proposals of several algorithms that explain part of the variance of both prosodic constituency and prominence [1,6,8,9]. All these algorithms have been integrated into text-to-speech synthesis systems in order to automatically generate the prosodic information necessary to produce a natural-sounding speech.…”

Section: Introduction (mentioning)

confidence: 99%

“…Three depths of syntactic analysis prior to the obtention of prosodic structure can be identified in these models. Some use a comprehensive parser to analyse the sentences [9], others use a partial syntactic analysis [1], and the last ones use a minimal amount of syntactic information [6,8]. All the algorithms use a set of heuristic rules to obtain prosodic constituents of similar size.…”

Section: Introduction (mentioning)

confidence: 99%

A dynamical model for generating prosodic structure

Barbosa (2006). Speech Prosody 2006

The performance of the Monnin-Grosjean (MG) algorithm for predicting prosodic structure is compared with that of a system of dependency-grammar-based local markers (the DG system). Analyses of Brazilian Portuguese paragraphs read by five speakers reveal that the MG algorithm performs as well as the DG system when V-to-V normalised durations at word and phrase stress boundaries are used as indexes of prominence. These two procedures, however, have proved unsuccessful in dealing with individual variability. To overcome such a limitation, a dynamical model is proposed. By coupling syntactic and regularity constraints, the main advantage of the model is the plausible simulation of speaker variability. Seven simulations were carried out by changing three model parameters: coupling strength, conditional probability of phrase stress placement, and V-to-V duration mean.

Natural Language Generation with Fully Specified Templates

Becker (2006). SmartKom: Foundations of Multimodal Dialogue Systems
