Proceedings of the 25th International Conference on Auditory Display (ICAD 2019) 2019
DOI: 10.21785/icad2019.032
Text-driven Mouth Animation for Human Computer Interaction With Personal Assistant

Abstract: Personal assistants are becoming more pervasive in our environments but still do not provide natural interactions. Their lack of expressive realism and of visual feedback can create frustrating experiences and make users lose patience. To address this, we propose an end-to-end trainable neural architecture for text-driven 3D mouth animation. Previous work showed that such architectures provide better realism and could open the door to integrated affective Human Computer Interfaces (HCI). …

Cited by 4 publications (4 citation statements)
References 15 publications
“…Similar to some recent text-driven talking head generation methods [5], [6], [11], our method uses TTS to synthesize the audio track. The goal of a TTS system is to synthesize human-like speech from a natural language text input.…”
Section: A. Text-to-Speech Synthesis
confidence: 99%
“…Therefore, facial landmarks can be used to represent facial characteristics, e.g., face shapes, head poses, and mouth shapes, and it is easy to build a mapping relation between facial landmarks and the facial expression in a photo. Facial landmarks have recently become popular intermediate representations to bridge the gap between the raw audio signal and photo-realistic videos [42], [5], [43], [10], [44]. However, these methods suffer from several limitations: some can only be used for the person present in the training data and not for arbitrary persons [42], some depend on reference videos to provide pose information [42], [44], and some predict no head movement and can only present a static head pose [10].…”
Section: B. Audio-Driven Talking Head Generation
confidence: 99%
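The landmark-as-intermediate-representation idea quoted above can be made concrete with a small sketch. Assuming a dlib-style 68-point layout where the inner-lip contour occupies indices 60–67 (this layout and the specific indices are an assumption, not something stated in the cited papers), a simple mouth-shape feature is the vertical lip gap normalized by mouth width:

```python
import numpy as np

# Assumed dlib-style 68-point layout: inner-lip points at indices 60-67,
# with 60/64 the mouth corners and 62/66 the top/bottom inner-lip points.
def mouth_openness(landmarks: np.ndarray) -> float:
    """Vertical inner-lip gap normalized by mouth width (0.0 = closed)."""
    upper = landmarks[62]                 # top inner lip
    lower = landmarks[66]                 # bottom inner lip
    left, right = landmarks[60], landmarks[64]
    width = np.linalg.norm(right - left)  # mouth-corner distance
    gap = np.linalg.norm(lower - upper)   # lip separation
    return float(gap / width) if width > 0 else 0.0

# Toy frame: mouth 4 units wide, lips 1 unit apart
pts = np.zeros((68, 2))
pts[60] = [0.0, 0.0]
pts[64] = [4.0, 0.0]
pts[62] = [2.0, 0.5]
pts[66] = [2.0, -0.5]
print(mouth_openness(pts))  # → 0.25
```

A text- or audio-driven model in the spirit of the cited works would predict such landmark trajectories per frame and render photo-realistic video from them, rather than compute features from a fixed array as this toy frame does.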