Emphatic Visual Speech Synthesis

Melenchón, Javier; Martínez, Elisa; Torre, Fernando De la; Montero, José Antonio

doi:10.1109/tasl.2008.2010213

Cited by 10 publications (4 citation statements)

References 47 publications (62 reference statements)


“…The suggested structure of corpus of Castilian Spanish used by Melenchón, Martínez, De La Torre, and Montero (2009) consisted of /CVCV/. Their purpose for using such structure was supported by a strong statement, which claims more than 80% of Castilian Spanish words flow /CV/ structure (de Vega, Álvarez, & Carreiras, 1992).…”

Section: Corpus Design (mentioning)

confidence: 98%

A Novel Approach for Allocating Mathematical Expressions to Visual Speech Signals

2015


In this article, visual speech information modeling analysis by explicit mathematical expressions coupled with words' phonemic structure is presented. The visual information is obtained from deformation of lips' dimensions during articulation of a set of words that is called visual speech sample set. The continuous interpretation of the lips' movement has been provided using Barycentric Lagrange Interpolation producing a unique mathematical expression named visual speech signal. Hierarchical analysis of the phoneme sequences has been applied for words' categorization to organize the database properly. The visual samples were extracted from three visual feature points chosen on the lips via an experiment in which two individuals pronounced the aforementioned words. The simulation results show that each individual word can be represented by a mathematical expression or visual speech signal whereas the sample sets can also be derived from the same mathematical expression, and this is a significant improvement over the popular statistical methods.
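The abstract above names Barycentric Lagrange Interpolation as the way sampled lip dimensions are turned into a continuous expression. The following is a minimal sketch of that general technique, not the cited authors' implementation; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def barycentric_weights(xs):
    """Barycentric weights w_j = 1 / prod_{k != j} (x_j - x_k) for distinct nodes xs."""
    weights = []
    for j, xj in enumerate(xs):
        w = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                w /= (xj - xk)
        weights.append(w)
    return weights


def barycentric_interpolate(xs, ys, weights, x):
    """Evaluate the interpolating polynomial at x with the second barycentric formula."""
    num = 0.0
    den = 0.0
    for xj, yj, wj in zip(xs, ys, weights):
        if x == xj:
            # Exactly on a node: return the sample to avoid dividing by zero.
            return yj
        t = wj / (x - xj)
        num += t * yj
        den += t
    return num / den


# Hypothetical samples: a lip dimension measured at three time instants.
times = [0.0, 1.0, 2.0]
lip_height = [0.0, 1.0, 4.0]
w = barycentric_weights(times)
value_between_samples = barycentric_interpolate(times, lip_height, w, 1.5)
```

The weights depend only on the sample times, so they can be computed once and reused for every evaluation point.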



“…The required input information might be derived from an automatic speech recognition (ASR) system if the corresponding acoustic speech is available, or existing (acoustic) text-to-speech synthesis rules can generate the required phoneme and timing sequence. Trajectory formation models have included concatenation [22], [24], [28]-[30], [43]-[46]; interpolation [3], [25], [26], [47]-[49]; probabilistic approaches [20], [23], [50]; and hybrid approaches [27], [51].…”

Section: Related Work (mentioning)

confidence: 99%

Relating Objective and Subjective Performance Measures for AAM-Based Visual Speech Synthesis

Theobald

Matthews

2012

IEEE Trans. Audio Speech Lang. Process.


We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect we find that the subjective score for the entire sequence is subjectively lower than sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue, which is to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
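The abstract above singles out the cost of a dynamic time warp between synthesized and ground-truth parameter sequences as an objective measure. The sketch below shows the textbook DTW recurrence with a Euclidean local distance; the local distance, the step pattern, and the function name are assumptions, since the abstract does not specify the configuration the authors used.

```python
import math


def dtw_cost(synth, truth):
    """Cumulative cost of the best monotonic alignment between two sequences of vectors."""
    n, m = len(synth), len(truth)
    # cost[i][j] holds the best cumulative cost of aligning synth[:i] with truth[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(synth[i - 1], truth[j - 1])  # Euclidean distance (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # synth advances, truth frame is repeated
                cost[i][j - 1],      # truth advances, synth frame is repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Hypothetical one-dimensional parameter tracks: the second is a time-stretched copy.
stretched_cost = dtw_cost([(0.0,), (1.0,), (2.0,)], [(0.0,), (1.0,), (1.0,), (2.0,)])
```

Because the warp absorbs timing differences, a sequence that differs from the reference only in timing incurs no cost, whereas a frame-by-frame error measure would penalize it.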


“…Multimodal affective recognition and synthesis deal with the determination and the simulation of multimodal expressiveness, respectively [1]. In MMHCI, the latter is typically conducted by talking-heads, whose research is mainly focused on the generation of realistic-looking (i.e., human-like) affective avatars [2]-[4]. Talking-heads may be used as a front-end in multimedia applications such as virtual operators, help desks, education tutors, etc.…”

Section: Introduction (mentioning)

confidence: 99%

“…Unit-selection text-to-speech (US-TTS) synthesis [6], which is based on the selection and concatenation of prerecorded speech units coming from a large speech database, is one of the dominant speech synthesis techniques [7]. Although there are several talking-heads including US-TTS (e.g., [4], [5]), there is still no significant research on including large affective speech corpora for the generation of their synthetic speech (e.g., affective speech is obtained from a diphone TTS by prosodic transformation rules [3] or through interactive control from only 1 h speech corpus containing read text [2]). One of the main reasons to this fact is the difficulty of obtaining accurate and reliable labels when dealing with large speech corpora, which become crucial to achieve high-quality synthetic speech [8], [9].…”

Section: Introduction (mentioning)

confidence: 99%

Reliable Pitch Marking of Affective Speech at Peaks or Valleys Using Restricted Dynamic Programming

Alías

Munné

2010

IEEE Trans. Multimedia


The affective communication channel plays a key role in multimodal human-computer interaction. In this context, the generation of realistic talking-heads expressing emotions both in appearance and speech is of great interest. The synthetic speech of talking-heads is generally obtained from a text-to-speech (TTS) synthesizer. One of the dominant techniques for achieving high-quality synthetic speech is unit-selection TTS (US-TTS) synthesis. Affective US-TTS systems are driven by affective annotated speech databases. Since affective speech involves higher acoustic variability than neutral speech, achieving trustworthy speech labeling is a more challenging task. To that effect, this paper introduces a methodology for achieving reliable pitch marking on affective speech. The proposal adjusts the pitch marks at the signal peaks or valleys after applying a three-stage restricted dynamic programming algorithm. The methodology can be applied as a post-processing of any pitch determination and pitch marking algorithm (with any local criterion for locating pitch marks), or their merging. The experiments show that the proposed methodology significantly improves the results of the input state-of-the-art markers on affective speech.

Index Terms: Affective speech, dynamic programming, pitch marking, speech analysis, unit-selection text-to-speech synthesis.

