2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00682

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Abstract: To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task, Embodied Question Answering [1], in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challengin…
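The abstract contrasts navigation policies driven by 3D point clouds, RGB images, or both. As a rough illustration of what such a fused perception-to-action policy can look like, here is a minimal PyTorch sketch; the encoder designs, feature sizes, and four-action space are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a point-cloud + RGB navigation policy.
# All module names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style encoder: shared per-point MLP, then max-pooling."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, pts):            # pts: (B, N, 3)
        feats = self.mlp(pts)          # (B, N, out_dim)
        return feats.max(dim=1).values # order-invariant global feature

class RGBEncoder(nn.Module):
    """Small CNN over RGB frames."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, img):            # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))

class NavPolicy(nn.Module):
    """Fuses both modalities; a GRU picks one of four actions
    (forward, turn-left, turn-right, stop)."""
    def __init__(self, hidden=256, n_actions=4):
        super().__init__()
        self.pc_enc, self.rgb_enc = PointCloudEncoder(), RGBEncoder()
        self.gru = nn.GRUCell(256, hidden)   # 128 + 128 fused features
        self.actor = nn.Linear(hidden, n_actions)

    def forward(self, pts, img, h):
        x = torch.cat([self.pc_enc(pts), self.rgb_enc(img)], dim=-1)
        h = self.gru(x, h)
        return self.actor(h), h        # action logits + next hidden state
```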

Cited by 123 publications (112 citation statements)
References 33 publications

“…Additionally, we diversify the training trajectories by sampling actions from the agent's policy with probability ε, instead of exclusively following the expert trajectories. We use inflection weighting to prevent the policy from simply repeating the previous action (Wijmans et al. 2019). The perturbation is varied from start to end during the course of training, with a constant increase of 0.1 after every E updates.…”
Section: Imitation Learning
confidence: 99%
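The excerpt above combines two standard imitation-learning ingredients: occasionally sampling from the agent's own policy, and inflection weighting, which upweights timesteps where the expert's action changes so that "repeat the previous action" stops being a trivially good policy. Below is a minimal sketch of an inflection-weighted behavior-cloning loss, assuming the inverse-inflection-frequency weight of Wijmans et al. 2019; tensor shapes and the function name are illustrative.

```python
# Sketch: inflection-weighted cross-entropy for behavior cloning.
# Steps where the expert action changes (a_t != a_{t-1}) are upweighted
# by the inverse of their empirical frequency in the trajectory.
import torch
import torch.nn.functional as F

def inflection_weighted_loss(logits, expert_actions):
    """logits: (T, A) policy outputs; expert_actions: (T,) expert actions."""
    inflection = torch.ones_like(expert_actions, dtype=torch.bool)
    inflection[1:] = expert_actions[1:] != expert_actions[:-1]
    # Weight = T / (#inflections) at inflection steps, 1.0 elsewhere.
    w = torch.where(
        inflection,
        expert_actions.numel() / inflection.sum().clamp(min=1),
        torch.tensor(1.0),
    )
    per_step = F.cross_entropy(logits, expert_actions, reduction="none")
    return (w * per_step).sum() / w.sum()
```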
“…In this section, we provide the quantitative and qualitative results of our 3D adversarial perturbations on EQA and EVR through our differentiable renderer. For EQA, besides PACMAN-RL+Q, we also evaluate the transferability of our attacks using the following models: (1) NAV-GRU, an agent using GRU instead of LSTM in navigation [37];…”
Section: Attack Via a Differentiable Renderer
confidence: 99%
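The excerpt describes attacking embodied agents through a differentiable renderer: gradients of the agent's loss flow back through rendering to 3D scene parameters such as textures. A hedged, PGD-style sketch of that pattern follows, where `render`, `agent_loss`, and the L-inf budget are hypothetical stand-ins rather than the cited paper's actual API.

```python
# Sketch: PGD on a 3D texture through a differentiable renderer.
# `render` and `agent_loss` are assumed callables, not a real library API.
import torch

def pgd_attack_3d(texture, render, agent_loss, eps=0.03, alpha=0.005, steps=40):
    """texture: scene texture tensor; returns an adversarial texture."""
    delta = torch.zeros_like(texture, requires_grad=True)
    for _ in range(steps):
        image = render(texture + delta)          # differentiable rendering
        loss = agent_loss(image)                 # e.g. EQA answer loss
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # stay in the L-inf budget
            delta.grad.zero_()
    return (texture + delta).detach()
```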
“…Concurrently, Gordon et al. [15] studied the EQA task in an interactive environment named AI2-THOR [20]. Recently, several studies have been proposed to improve agent performance using different frameworks [9] and point cloud perception [37]. Similar to EQA, embodied vision recognition (EVR) [40] is an embodied task in which an agent is instantiated close to an occluded target object to perform visual object recognition.…”
Section: Introduction
confidence: 99%
“…The Embodied Question Answering (EQA) v1.0 [18] dataset consists of scenes sampled from the SUNCG dataset with additional question–answer pairs. The authors further extended the EQA task to a realistic scene setting by adapting the Matterport3D dataset into their Matterport3D EQA dataset [19]. The Room-to-Room dataset [20] added navigation-instruction annotations to the Matterport3D dataset for the vision-language navigation task.…”
Section: Related Work
confidence: 99%
“…Various 2D approaches have been adapted for 3D data, such as recognition [3,4,5], detection [16], and segmentation [17]. Researchers have proposed a series of embodied AI tasks that define an indoor scene and an agent that explores the scene and answers vision-related questions (e.g., embodied question answering [18,19]) or navigates based on a given instruction (e.g., vision-language navigation [20,21]). However, most 3D recognition-related studies have focused on static scenes.…”
Section: Introduction
confidence: 99%