Proceedings of the 2021 International Conference on Multimodal Interaction 2021
DOI: 10.1145/3462244.3479957
HEMVIP: Human Evaluation of Multiple Videos in Parallel

Abstract: In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and reso…

Cited by 20 publications (19 citation statements)
References 12 publications
“…We used a MUSHRA‐like (MUltiple Stimuli with Hidden Reference and Anchor) [ITU15] interface based on [JYW*21]. We had a total of 131 participants, with a minimum of 20 per study (ages 20‐55 years μ = 33.6, σ = 8.1).…”
Section: Methods
confidence: 99%
“…We use a MUSHRA-like [21] (MUltiple Stimuli with Hidden Reference and Anchor) interface commonly used for subjective evaluation of speech-synthesis [44], but here adapted for video interfaces, since such setups have been found to work well for evaluating head motion and hand gestures [7,23,33]. On a single test page, participants are presented with videos of generated gesture-speech from all evaluated models on the same input text sentence.…”
Section: Perceptual Evaluation Methods
confidence: 99%
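The MUSHRA-style paradigm quoted above presents all conditions side by side on one test page, with each rated on a slider. A minimal sketch of how such parallel ratings might be aggregated follows; the condition names and scores are hypothetical, and real MUSHRA analysis (per ITU-R BS.1534) typically also screens out participants who rate the hidden reference too low.

```python
# Hypothetical sketch: aggregating ratings from a MUSHRA-style study
# where each participant scores every condition (including a hidden
# reference and a low anchor) on a 0-100 slider for the same input.
from statistics import median

# Each row: one participant's ratings for all conditions on one page.
responses = [
    {"reference": 92, "model_a": 71, "model_b": 55, "anchor": 18},
    {"reference": 88, "model_a": 64, "model_b": 60, "anchor": 25},
    {"reference": 95, "model_a": 70, "model_b": 48, "anchor": 15},
]

def median_scores(rows):
    """Median rating per condition across participants."""
    conditions = rows[0].keys()
    return {c: median(r[c] for r in rows) for c in conditions}

scores = median_scores(responses)
# The hidden reference should score near the top and the anchor near
# the bottom; models fall in between.
```

Medians are the conventional summary for MUSHRA-style data because slider ratings are bounded and often skewed, making them more robust than means to outlier raters.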
“…One important aspect to evaluate for gesture‐generation systems is the human‐likeness of the generated gestures, which is measured and compared through human perceptual studies, often with comparable stimuli presented side by side as in e.g. [JYW*21, KJY*21, WGKB21]. On the other hand, evaluating other aspects such as the appropriateness and/or specificity of generated gestures in the context of speech and other multimodal grounding information (see Section 6.4) is quite challenging, especially since differences in the human‐likeness of the motions being compared tend to interfere with perceived gesture appropriateness (cf.…”
Section: Key Challenges Of Gesture Generation
confidence: 99%