2019 International Conference on 3D Vision (3DV) 2019
DOI: 10.1109/3dv.2019.00084

Language2Pose: Natural Language Grounded Pose Forecasting

Figure 1: Overview of our model, which uses a joint multimodal space of language and pose to generate an animation conditioned on the input sentence.

Abstract: Generating animations from natural language sentences finds its applications in a number of domains such as movie script visualization, virtual human animation, and robot motion planning. These sentences can describe different kinds of actions, the speed and direction of these actions, and possibly a target destination. The core modeling challenge in this lan…

Cited by 185 publications (154 citation statements)
References: 38 publications
“…For all our experiments, we use CMU MoCap database¹. CMU dataset is a high-quality dataset acquired using optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…In [35], the control signal becomes the 3d human pose predicted by neural nets as a reference for an agent to imitate. In [1], the authors co-embed the language and corresponding motions to a share manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, like 2D human skeleton, one can still expect that there are different motions or different pose corresponding to the same control signal [29], essentially indicating the multi-modality nature of human motion dynamics.…”
Section: Deterministic Human Motion Prediction and Synthesis (mentioning)
confidence: 99%
“…This characteristic provides an ability to manage "unseen words" that do not appear in the paired dataset. Some studies generated actions from descriptions represented by pre-trained word embeddings [2][3] [6][11] [16]. Zhong et al [11], Matthews et al [3], and Lunch et al [2], in particular, generated actions from commands including unseen words.…”
Section: A. Action Generation Using Pre-trained Word Embeddings (mentioning)
confidence: 99%
“…Pioneer efforts such as [1,18,24] mainly resort to encoder-decoder RNN architecture for language-to-pose translation. The work of [2] learns a joint embedding space between sentences and human pose sequences. More recently, [26] applies more sophisticated neural translation network equipped with GANs for text-to-sign prediction.…”
Section: Related Work (mentioning)
confidence: 99%