2021
DOI: 10.48550/arxiv.2101.05181
Preprint
Memory-Augmented Reinforcement Learning for Image-Goal Navigation

Lina Mezghani,
Sainbayar Sukhbaatar,
Thibaut Lavril
et al.

Abstract: In this work, we address the problem of image-goal navigation in the context of visually realistic 3D environments. This task involves navigating to a location indicated by a target image in a previously unseen environment. Earlier attempts, including RL-based and SLAM-based approaches, have either shown poor generalization performance or are heavily reliant on pose/depth sensors. We present a novel method that leverages a cross-episode memory to learn to navigate. We first train a state-embedding network in …
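The abstract describes a cross-episode memory built on learned state embeddings. As a minimal illustration of the retrieval step such a memory implies, the sketch below stores state embeddings and looks up the nearest stored state for a new observation by cosine similarity. All names, dimensions, and the retrieval rule are hypothetical assumptions for illustration, not details from the paper.

```python
import numpy as np

def cosine_retrieve(memory: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k stored embeddings most similar to the query."""
    mem_norm = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = mem_norm @ q_norm          # cosine similarity against every stored state
    return np.argsort(-sims)[:k]      # indices of the k highest similarities

# Hypothetical cross-episode memory: 100 stored state embeddings of dimension 16.
rng = np.random.default_rng(0)
memory = rng.standard_normal((100, 16))
# A new observation embedding close to stored state 42 (small perturbation).
query = memory[42] + 0.01 * rng.standard_normal(16)
top = cosine_retrieve(memory, query)
print(top[0])  # index of the closest stored state
```

The design choice sketched here (normalized dot-product retrieval) is one common way to query an embedding memory; the paper's actual memory mechanism may differ.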

Cited by 6 publications (9 citation statements)
References 20 publications
“…With increased data and steps, RL baselines unsurprisingly improve in performance; however, we find that with 5x more data and 10x more compute, RL baselines are still outperformed by NRNS. The low performance of behavioral cloning and RL methods for image-goal navigation is unsurprising [4,15]. This demonstrates the difficulty of learning rewards on low-level actions instead of value learning on possible exploration directions, exacerbating the difficulty of exploration in image-goal navigation.…”
Section: Results
confidence: 99%
“…Navigation tasks largely fall into two main categories [1], ones in which a goal location is known [11,12,13] and limited exploration is required, and ones in which the goal location is not known and efficient exploration is necessary. In the second category, tasks range from finding the location of specific objects [5], rooms [14], or images [15], to the task of exploration itself [2]. The majority of current work [12,15,16,3] leverages simulators [7] and extensive interaction to learn end-to-end models for these tasks.…”
Section: Related Work
confidence: 99%
“…Transformer Memory. Transformers [71] have been shown to do very well on long-horizon embodied tasks like navigation and exploration [11,13,17,24,46,47,57]. The performance gains arise from a transformer's ability to effectively leverage past experiences of the agent [11,24,46,47,57] and also to do cross-modal reasoning [13,17]. Different from these methods, our idea is to use a transformer as a memory model for capturing long-range acoustic correlations for audio-visual separation.…”
Section: Related Work
confidence: 99%