Do Autonomous Agents Benefit from Hearing?

Woubie, Abraham; Kanervisto, Anssi; Karttunen, Janne; Hautamäki, Ville

doi:10.48550/arxiv.1905.04192

Cited by 5 publications (6 citation statements)

References 12 publications


“…Learning to Navigate in 3D Environments: Realistic 3D environments and simulation platforms [1,11,84,76] egocentric visual observations [104,103,66,50,12,14] for performing question answering [43,26,78,25], instructions following [4,19,3], and active visual tracking [60,94,95]. Previous studies have shown audio as a strong cue for obstacle avoidance and navigation [77,44,33,70,64,32,83]. However, these methods are either non-photorealistic, not supporting AI agents, or rendering audio not geometrically and acoustically correct.…”

Section: Related Work (mentioning; confidence: 99%)

Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision

Li, Rahtu, Zhao (2022). Preprint.


This paper focuses on perceiving and navigating 3D environments using echoes and an RGB image. In particular, we perform depth estimation by fusing the RGB image with echoes received from multiple orientations. Unlike previous works, we go beyond the field of view of the RGB camera and estimate dense depth maps for substantially larger parts of the environment. We show that the echoes provide holistic and inexpensive information about the 3D structures, complementing the RGB image. Moreover, we study how echoes and the wide field-of-view depth maps can be utilised in robot navigation. We compare the proposed methods against recent baselines using two sets of challenging realistic 3D environments: Replica and Matterport3D. The implementation and pre-trained models will be made publicly available.



“…Park et al [11] introduced a general-purpose simulation platform based on Unity engine with both auditory and visual observations. While ViZDoom [7] supports in-game stereo sounds, the default audio subsystem is not designed for faster-than-realtime experience collection, and thus can only be used in relatively basic scenarios [12]. To our best knowledge, the version of ViZDoom presented in this work is the first simulation platform that enables accelerated embodied simulation with sounds at tens of thousands of actions per second, enabling large-scale training.…”

Section: Related Work (mentioning; confidence: 99%)

Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems

Hegde, Kanervisto, Petrenko (2021). Preprint. Self-citation.


Humans and other intelligent animals evolved highly sophisticated perception systems that combine multiple sensory modalities. On the other hand, state-of-the-art artificial agents rely mostly on visual inputs or structured low-dimensional observations provided by instrumented environments. Learning to act based on combined visual and auditory inputs is still a new topic of research that has not been explored beyond simple scenarios. To facilitate progress in this area, we introduce a new version of the ViZDoom simulator to create a highly efficient learning environment that provides raw audio observations. We study the performance of different model architectures in a series of tasks that require the agent to recognize sounds and execute instructions given in natural language. Finally, we train our agent to play the full game of Doom and find that it can consistently defeat a traditional vision-based adversary. We are currently in the process of merging the augmented simulator with the main ViZDoom code repository. Video demonstrations and experiment code can be found at https://sites.google.com/view/sound-rl.


“…To our knowledge there has been limited published work in the domains of audio-based navigation and audio source localization for methods based on reinforcement learning. The work in [2] combined deep reinforcement learning with environmental audio information in the context of navigation of a virtual environment by an autonomous agent. They discovered that an agent trained via reinforcement learning and leveraging raw audio information could reach more reliably a sound-emitting target within a maze compared to the case when only visual information was used.…”

Section: Related Work (mentioning; confidence: 99%)

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

Giannakopoulos, Pikrakis, Cotronis (2021). Preprint.


In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human-speaker audio sources within it, in the case where the only available information is the raw sound from the environment, as a simulated human listener placed in the environment would hear it. For this purpose we create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem. We also create an autonomous agent based on the PPO online reinforcement learning algorithm and attempt to train it to solve these environments. Our experiments show that our agent achieves adequate performance and generalization ability in both environments, as measured by quantitative metrics, even when only a limited amount of training data is available or the environment parameters shift in ways not encountered during training. We also show that a degree of knowledge transfer between the environments is possible.
