2020
DOI: 10.1007/978-3-030-58574-7_36
Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping

Cited by 9 publications (8 citation statements) · References 38 publications
“…However, SLAM has two notable shortcomings. First, SLAM is unable to build up an intuitive understanding of the environment (Gupta et al. [69]): ‘These maps are built purely geometrically, and nothing is known until it has been explicitly observed, even when there are obvious patterns.’ New approaches therefore seek to augment SLAM with deep learning [70–74] (see also 3D semantic scene graphs [75–77]). Others seek an alternative to SLAM in deep reinforcement learning [69, 78–83] or deep learning [84].…”
Section: Computer Vision
confidence: 99%
“…The initial step of our model is to encode the source image I_s ∈ R^{3×h×w} with a fully convolutional U-Net encoder that produces a feature map F_s ∈ R^{c×h×w}, preserving the spatial resolution of the source image. Once the feature map F_s is obtained, we perform an inverse projection step to back-project F_s into a latent volumetric tensor Z_s ∈ R^{c×d_s×h_s×w_s}, where d_s, h_s, w_s are the depth, height, and width of the volumetric representation. Instead of reshaping 2D feature maps into a 3D volumetric representation like ENR [13], we found that using an inverse projection step is beneficial for preserving 3D geometry and texture information (cf.…”
Section: Encoding
confidence: 99%
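The inverse-projection (lifting) step quoted above can be sketched roughly as follows. This is a minimal illustrative assumption, not the paper's implementation: the function name, the nearest-neighbour sampling, and the fixed metric voxel grid are all hypothetical, whereas real pipelines typically use differentiable bilinear sampling over learned features.

```python
import numpy as np

def unproject_features(feat_2d, K, grid_min, grid_max, dims):
    """Back-project a 2D feature map (c, h, w) into a voxel volume
    (c, D, H, W): each voxel center is projected into the image plane
    via pinhole intrinsics K and filled with the nearest 2D feature
    (zeros if it projects outside the image). A simplified sketch."""
    c, h, w = feat_2d.shape
    D, H, W = dims
    vol = np.zeros((c, D, H, W), dtype=feat_2d.dtype)
    xs = np.linspace(grid_min[0], grid_max[0], W)
    ys = np.linspace(grid_min[1], grid_max[1], H)
    zs = np.linspace(grid_min[2], grid_max[2], D)
    for di, z in enumerate(zs):
        for hi, y in enumerate(ys):
            for wi, x in enumerate(xs):
                # project voxel center (x, y, z) with the pinhole model
                u = int(round(K[0, 0] * x / z + K[0, 2]))
                v = int(round(K[1, 1] * y / z + K[1, 2]))
                if 0 <= u < w and 0 <= v < h:
                    vol[:, di, hi, wi] = feat_2d[:, v, u]
    return vol

# demo: 1-channel 4x4 feature map whose entries encode the pixel index
feat = np.arange(16, dtype=np.float64).reshape(1, 4, 4)
K = np.array([[2.0, 0.0, 2.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
vol = unproject_features(feat, K, (-1.0, -1.0, 1.0), (1.0, 1.0, 2.0), (2, 4, 4))
```

Because every voxel along a camera ray samples the same pixel, the 2D features are effectively smeared along viewing rays, which is what lets the subsequent 3D network resolve depth from multiple views.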
“…In order to understand the 3D world, an intelligent agent must be able to perform inference about a scene's appearance and shape from unseen viewpoints. Being able to synthesize images at target camera viewpoints efficiently, given sparse source views, serves a fundamental purpose in building intelligent visual behaviour [2,3,4]. The problem of learning to synthesize novel views has been widely studied in the literature, with approaches ranging from traditional small-baseline view synthesis relying on multi-plane imaging [5,6,7,8] and flow estimation [9,10], to explicitly modeling 3D geometry via point clouds [11], meshes [12], and voxels [13].…”
Section: Introduction
confidence: 99%
“…Many of the environments are photo-realistic reconstructions of indoor [52,3,60,9] and outdoor [15,19] scenes, and provide 3D ground-truth labels for objects. These simulated environments have been used to study tasks such as visual navigation and exploration [10,21,18], visual question answering [14], tracking [22], and object recognition [11,62]. In our work, we use a simulated embodied agent to discover objects and fixate its sensors on them to obtain object-centric data for fine-tuning a detector.…”
Section: Active Visual Learning
confidence: 99%