2022
DOI: 10.1007/978-3-031-19836-6_16

Camera Pose Estimation and Localization with Active Audio Sensing


Cited by 8 publications (6 citation statements)
References 95 publications
“…Multi-modal learning has enabled a variety of applications for embodied agents, i.e., autonomously moving agents. For example, there are audio-visual navigation, depth estimation, and camera pose estimation [4]-[12]. In particular, several methods have been proposed to process multiple sound sources by taking advantage of the agent's movement, even though the sound input is binaural [7]-[9].…”
Section: C: Multi-modal Learning For Embodied Agents (mentioning)
confidence: 99%
“…On the other hand, focusing on the fact that most devices are equipped with multiple sensors such as cameras and microphones, various models utilizing multi-modal input have been proposed [4]-[12]. These methods include those that leverage the movement of the devices themselves to analyze scenes with multiple sound sources, even with devices like binaural microphones that have insufficient spatial information.…”
Section: Introduction (mentioning)
confidence: 99%
“…In contrast, we learn camera pose and sound localization solely from self-supervision, obtaining angular predictions without labeled data. Other work uses echolocation sounds to learn representations [27,93] and predict depth maps [19,68] and estimate camera poses [93] using labeled data. In contrast, our proposed approach jointly learns binaural sound localization and camera pose through passive audio sensing, without supervision.…”
Section: Related Work (mentioning)
confidence: 99%
“…We evaluate our method on Replica [79], 12-Scenes [31], and 7-Scenes [20] datasets. Replica contains high-fidelity indoor scenes and is widely used by recent works of NeRFs and localization [80]-[83]. We use the sequences recorded in [80], choosing the first sequence of each scene for training and the second for testing.…”
Section: B Camera Pose Estimation (mentioning)
confidence: 99%