Data-driven multimodal synthesis

Carlson, Rolf; Granström, Björn

doi:10.1016/j.specom.2005.02.015

Cited by 6 publications

(4 citation statements)

References 44 publications


“…For intelligibility assessment, the type of stimuli used varies from isolated bisyllabic words [42], [64], multi-syllable nonsense words [65], isolated real words [39], [66], or sentences [2], [39]. Accuracy can be measured using the word, syllable or phone recognition rate, or it might involve keyword spotting in synthesized sentences [55]. The advantage of shorter stimuli for intelligibility testing is that the accuracy of particular speech gestures can be measured, but it is difficult to gauge the accuracy of modelling the longer term aspects of speech articulation.…”

Section: A Evaluating Visual Speech Synthesizers (mentioning)

confidence: 99%

Relating Objective and Subjective Performance Measures for AAM-Based Visual Speech Synthesis

Theobald

Matthews

2012

IEEE Trans. Audio Speech Lang. Process.


We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
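The dynamic time warp cost mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the local distance, step pattern, or any normalization, so the Euclidean frame distance, the standard three-way step, and the name `dtw_cost` are all assumptions made here for illustration.

```python
import math


def dtw_cost(synth, truth):
    """Cumulative cost of the best DTW alignment between two sequences.

    synth, truth: lists of equal-dimension parameter vectors (one per frame).
    Assumptions (not from the paper): Euclidean local distance, steps
    (i-1, j), (i, j-1), (i-1, j-1), and no length normalization.
    """
    n, m = len(synth), len(truth)
    # cost[i][j] = best cumulative cost aligning synth[:i] with truth[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(synth[i - 1], truth[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # synth frame repeated against truth
                cost[i][j - 1],      # truth frame repeated against synth
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# A purely time-stretched copy aligns at zero cost; a shifted one does not.
truth = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_cost(stretched, truth))
print(dtw_cost([[0.0], [0.0]], [[1.0], [1.0]]))
```

Because the warp absorbs timing differences, a cost of this kind penalizes errors in the shape of the parameter trajectories rather than small misalignments, which is one plausible reading of why it might track viewer perception better than a frame-by-frame error.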


“…For instance, in [29,31] the participants were asked about how natural and realistic are the animations in a five-point scale, and in [17] the individuals were asked to choose the more realistic animation between two different ones, generated by different methods. Other kind of perceptual evaluations are reported in [32,33], where the contribution of the animated avatar to intelligibility of speech in noisy conditions is analyzed. This evaluation approach has the advantage of objectively quantify the perceived quality of the animation.…”

Section: Introduction (mentioning)

confidence: 99%

A comprehensive system for facial animation of generic 3D head models driven by speech

Terissi

Cerda

Gómez

et al. 2013

J AUDIO SPEECH MUSIC PROC.


A comprehensive system for facial animation of generic 3D head models driven by speech is presented in this article. In the training stage, audiovisual information is extracted from audiovisual training data, and then used to compute the parameters of a single joint audiovisual hidden Markov model (AV-HMM). In contrast to most of the methods in the literature, the proposed approach does not require segmentation/classification processing stages of the audiovisual data, avoiding the error propagation related to these procedures. The trained AV-HMM provides a compact representation of the audiovisual data, without the need of phoneme (word) segmentation, which makes it adaptable to different languages. Visual features are estimated from the speech signal based on the inversion of the AV-HMM. The estimated visual speech features are used to animate a simple face model. The animation of a more complex head model is then obtained by automatically mapping the deformation of the simple model to it, using a small number of control points for the interpolation. The proposed algorithm allows the animation of 3D head models of arbitrary complexity through a simple setup procedure. The resulting animation is evaluated in terms of intelligibility of visual speech through perceptual tests, showing a promising performance. The computational complexity of the proposed system is analyzed, showing the feasibility of its real-time implementation.


“…As is well known, it was the forerunner of synthesis by linear prediction (Linear Prediction Coding (LPC) synthesis), in which, although the formants can be determined automatically, the final synthesis is of low quality, while the problem of non-automatic rule extraction remains. Research mainly revolves around components concerning the adequate modelling of the source parameters and the formants [Frölich, 2001;Vincent, 2005], as well as issues of hybrid use of formant synthesis engines and acoustic unit selection [Carlson, 2005;Hertz, 2002]. The inherent difficulty in the original form of formant synthesis concerns not so much the ability to produce the speech signal from its parametric representation as the generation and handling of the parameters themselves by the rules, so that they meet the specifications set by the text.…”

Section: Rule-based synthesis (unclassified)

“…Hybrid techniques refer to attempts to combine the existing approaches efficiently, with the aim of exploiting the advantages that each one offers. The best-known hybrid techniques concern attempts to unify: a) formant synthesis with the help of HMMs [Acero, 1999], b) formant synthesis and synthesis by selection and concatenation of acoustic units [Hertz, 2002;Carlson, 2005], c) articulatory synthesis with the help of HMMs , Toda, 2008] and d) HMM synthesis and synthesis by selection and concatenation of acoustic units ].…”

Section: Hybrid techniques (unclassified)

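Βελτίωση Της Ποιότητας Συνθετικής Φωνής Και Εφαρμογή Σε Σύγχρονα Τηλεπικοινωνιακά Περιβάλλοντα Και Υπηρεσίες [Improving the Quality of Synthetic Speech and Its Application in Modern Telecommunication Environments and Services]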

Karabetsos (Καραμπέτσος)

