Language2Pose: Natural Language Grounded Pose Forecasting

Ahuja, Chaitanya; Morency, Louis-Philippe

doi:10.1109/3dv.2019.00084

Cited by 185 publications

(154 citation statements)

References 38 publications

Supporting

Mentioning: 153

Contrasting

“…For all our experiments, we use CMU MoCap database 1 . CMU dataset is a high-quality dataset acquired using optical motion capture systems, containing 2605 trials in 6 categories and 23 subcategories.…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…In [35], the control signal becomes the 3d human pose predicted by neural nets as a reference for an agent to imitate. In [1], the authors co-embed the language and corresponding motions to a share manifold, ignoring the fact that language-to-motion is a one-to-many mapping. Even with a specific control signal, like 2D human skeleton, one can still expect that there are different motions or different pose corresponding to the same control signal [29], essentially indicating the multi-modality nature of human motion dynamics.…”

Section: Deterministic Human Motion Prediction and Synthesis (mentioning)

confidence: 99%

Dynamic Future Net

Chen

Wang

Shao

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

“…This characteristic provides an ability to manage "unseen words" that do not appear in the paired dataset. Some studies generated actions from descriptions represented by pre-trained word embeddings [2][3] [6][11] [16]. Zhong et al [11], Matthews et al [3], and Lunch et al [2], in particular, generated actions from commands including unseen words.…”

Section: A Action Generation Using Pre-trained Word Embeddings (mentioning)

confidence: 99%

Embodying Pre-Trained Word Embeddings Through Robot Actions

Toyoda

Suzuki

Mori

et al. 2021

IEEE Robot. Autom. Lett.

We propose a promising neural network model with which to acquire a grounded representation of robot actions and the linguistic descriptions thereof. Properly responding to various linguistic expressions, including polysemous words, is an important ability for robots that interact with people via linguistic dialogue. Previous studies have shown that robots can use words that are not included in the action-description paired datasets by using pre-trained word embeddings. However, the word embeddings trained under the distributional hypothesis are not grounded, as they are derived purely from a text corpus. In this paper, we transform the pre-trained word embeddings to embodied ones by using the robot's sensory-motor experiences. We extend a bidirectional translation model for actions and descriptions by incorporating non-linear layers that retrofit the word embeddings. By training the retrofit layer and the bidirectional translation model alternately, our proposed model is able to transform the pre-trained word embeddings to adapt to a paired action-description dataset. Our results demonstrate that the embeddings of synonyms form a semantic cluster by reflecting the experiences (actions and environments) of a robot. These embeddings allow the robot to properly generate actions from unseen words that are not paired with actions in a dataset.

“…Pioneer efforts such as [1,18,24] mainly resort to encoder-decoder RNN architecture for language-to-pose translation. The work of [2] learns a joint embedding space between sentences and human pose sequences. More recently, [26] applies more sophisticated neural translation network equipped with GANs for text-to-sign prediction.…”

Section: Related Work (mentioning)

confidence: 99%