2004
DOI: 10.1023/b:ijst.0000037076.86366.8d

Trainable Articulatory Control Models for Visual Speech Synthesis

Abstract: This paper deals with the problem of modelling the dynamics of articulation for a parameterised talking head based on phonetic input. Four different models are implemented and trained to reproduce the articulatory patterns of a real speaker, based on a corpus of optical measurements. Two of the models ("Cohen-Massaro" and "Öhman") are based on coarticulation models from speech production theory and two are based on artificial neural networks, one of which is specially intended for streaming real-time…
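The Cohen-Massaro model named in the abstract blends per-segment articulatory targets using time-varying dominance functions, so that neighbouring segments influence each other (coarticulation). The following is a minimal sketch of that general scheme, not the paper's implementation; the exponential dominance shape and all constants are illustrative assumptions.

```python
import numpy as np

def dominance(t, center, alpha, theta, c=1.0):
    """Exponential dominance of one segment, peaking at its temporal center.
    alpha: peak magnitude, theta: decay rate, c: shape exponent (assumed values)."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def cohen_massaro_trajectory(t, segments):
    """Blend segment targets into a single articulatory parameter trajectory.

    segments: list of dicts with keys 'target' (parameter value),
    'center' (segment midpoint in seconds), 'alpha', 'theta'.
    Returns the dominance-weighted average of the targets at each time in t.
    """
    num = np.zeros_like(t)
    den = np.zeros_like(t)
    for seg in segments:
        d = dominance(t, seg["center"], seg["alpha"], seg["theta"])
        num += d * seg["target"]
        den += d
    return num / np.maximum(den, 1e-9)  # guard against division by zero

# Toy example: three phone segments controlling a lip-rounding parameter.
t = np.linspace(0.0, 0.6, 300)
segments = [
    {"target": 0.1, "center": 0.10, "alpha": 1.0, "theta": 20.0},
    {"target": 0.9, "center": 0.30, "alpha": 1.0, "theta": 20.0},
    {"target": 0.2, "center": 0.50, "alpha": 1.0, "theta": 20.0},
]
trajectory = cohen_massaro_trajectory(t, segments)
```

Because the dominance functions overlap in time, the trajectory approaches each target smoothly rather than stepping between them, which is the essential coarticulatory behaviour such models capture.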


Cited by 32 publications (34 citation statements)
References 27 publications
“…The speech files of the 40 sentences were force-aligned using an HMM aligner [29] to guide the talking head lip movement generation procedure [30]. The audio was processed using a 4-channel noise excited vocoder [31] to reduce intelligibility.…”
Section: Methods and Setup (mentioning)
confidence: 99%
“…The speech files of the 40 sentences were force-aligned using an HMM aligner [29] to guide the talking head lip movement generation procedure [30]. The audio was processed using a 4-channel noise excited vocoder [31] to reduce intelligibility.…”
Section: Methods and Setup (mentioning)
confidence: 99%
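The noise-excited vocoder mentioned in these citing studies is a standard channel-vocoder technique: band-pass the speech into a few frequency bands, extract each band's amplitude envelope, and re-impose the envelopes on band-limited noise. A minimal sketch follows, assuming NumPy/SciPy; the band edges, filter orders, and envelope cutoff are illustrative assumptions, not taken from the cited work.

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def noise_vocoder(x, fs, edges=(100, 500, 1300, 2500, 5000)):
    """4-channel noise-excited vocoder sketch.

    Splits x into 4 bands (band edges are assumed, in Hz), extracts each
    band's amplitude envelope, modulates band-limited noise with it, and
    sums the channels. Fine spectral detail is destroyed, which is what
    reduces intelligibility while preserving the temporal envelope.
    """
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    env_lp = butter(2, 30, btype="low", fs=fs, output="sos")  # envelope smoother
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        speech_band = sosfiltfilt(band, x)
        envelope = sosfiltfilt(env_lp, np.abs(speech_band))
        noise_band = sosfilt(band, rng.standard_normal(len(x)))
        out += np.clip(envelope, 0.0, None) * noise_band
    return out / (np.max(np.abs(out)) + 1e-9)  # normalize to avoid clipping

# Usage (hypothetical): y = noise_vocoder(signal, fs=16000)
```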
“…As an alternative to the rule-based control model, we have investigated several data-driven (trainable) methods of generating articulatory parameter trajectories to control the face model [12]. The data-driven models are trained on a corpus of articulatory movements recorded from a human speaker, and learn to reproduce the articulatory patterns.…”
Section: Articulatory Control Models (mentioning)
confidence: 99%
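The data-driven approach described in this citing work amounts to learning a frame-wise mapping from phonetic context features to articulatory parameters measured from a human speaker. A minimal sketch of that general setup, assuming scikit-learn and synthetic stand-in data; the feature encoding, window size, and network shape are illustrative assumptions, not the paper's architecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical shapes: each frame is described by a window of one-hot
# phone identities, and the target is a vector of articulatory parameters
# taken from the recorded motion corpus.
N_FRAMES, N_PHONES, WINDOW, N_PARAMS = 5000, 40, 5, 10

rng = np.random.default_rng(0)
X = rng.random((N_FRAMES, N_PHONES * WINDOW))  # stand-in phonetic features
Y = rng.random((N_FRAMES, N_PARAMS))           # stand-in measured trajectories

# One hidden layer, trained to reproduce the speaker's articulatory patterns
# frame by frame; at synthesis time the predictions drive the face model.
net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300)
net.fit(X, Y)
predicted_trajectory = net.predict(X[:100])    # articulatory parameters per frame
```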