2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00177

Practical Cross-modal Manifold Alignment for Robotic Grounded Language Learning

Abstract: We propose a cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items. Our approach learns these embeddings by sampling triples of anchor, positive, and negative data points from RGB-depth images and their natural language descriptions. We show that our approach can benefit from, but does not require, post-processing steps such as Procrustes analysis, in contrast to some of our baselines which requ…
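As a rough illustration of the training procedure the abstract describes, the sketch below pairs a language anchor with positive and negative image embeddings under a triplet margin loss. The encoder architectures, feature dimensions, and names are assumptions for illustration, not the authors' actual models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoders mapping each modality into a shared embedding space.
# Input widths (2048 for pooled RGB-D features, 768 for pooled language
# features) are illustrative assumptions.
class VisionEncoder(nn.Module):
    def __init__(self, in_dim=2048, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x):
        # L2-normalize so distances are compared on the unit sphere
        return F.normalize(self.net(x), dim=-1)

class LanguageEncoder(nn.Module):
    def __init__(self, in_dim=768, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

vision_enc, lang_enc = VisionEncoder(), LanguageEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)

def alignment_loss(anchor_txt, pos_img, neg_img):
    """Anchor: a description; positive: image features of the described
    item; negative: image features of a different item."""
    return triplet(lang_enc(anchor_txt), vision_enc(pos_img), vision_enc(neg_img))

# Example with random stand-in features for a batch of 4 triples
loss = alignment_loss(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 2048))
```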



Cited by 4 publications (7 citation statements) · References 28 publications
“…We adjusted the base VAE network by appending an image encoder built on a pretrained ResNet-50 model (Figure 11, bottom left) [77]. We adopted a technique from cross-modal manifold alignment [78] to align the movement embeddings z_m and face embeddings z_f in the latent space. We used a triplet loss function that attracts embeddings of the same emotion (e.g.…”
Section: Emotion Modification (mentioning)
confidence: 99%
“…Reversing the task to generate images given text descriptions is more complex, but recent state-of-the-art techniques are capable of generating incredibly realistic samples [189]. Nguyen et al. performed manifold alignment on a paired image and text dataset for robot understanding [78]. These techniques are well-matched to the inherently multimodal affordances of robots.…”
Section: Multimodal Machine Learning (mentioning)
confidence: 99%
“…From image and video captioning (Kinghorn, Zhang, and Shao 2019; Wang et al. 2018; Chen et al. 2019) to large-scale pre-training (Lu et al. 2019), learning from vision-language pairs is an active field of research. In this work, we use the manifold alignment approach of Nguyen et al. (2021), in which language and vision representations are projected into a shared manifold that is then used to retrieve relevant objects given a natural language description. The novelty of our work is not in the triplet-loss learning method for multi-modal alignment but in the comparison of transcription-based versus raw-speech methods, and the analysis of performance for end-users.…”
Section: Related Work (mentioning)
confidence: 99%
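A sketch of the retrieval step this citing work describes: once both modalities share a manifold, ranking candidate objects against a query embedding reduces to a nearest-neighbor search. The cosine-similarity choice and function names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, object_embs: torch.Tensor, k: int = 5):
    """query_emb: (d,) embedding of the natural language description;
    object_embs: (N, d) embeddings of candidate objects.
    Returns indices of the k most similar objects in the shared manifold."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), object_embs, dim=-1)
    return sims.topk(min(k, object_embs.size(0))).indices

# Example: pick the best match among 10 random candidate embeddings
best = retrieve(torch.randn(128), torch.randn(10, 128), k=1)
```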
“…Our learning approach is to use manifold alignment with triplet loss (Nguyen et al. 2021) in an attempt to capture a manifold between speech and visual perception. This manifold represents the grounding between query language and objects in a selection task.…”
Section: Approach (mentioning)
confidence: 99%