Image captioning (Anderson et al., 2018b; Vinyals et al., 2015; Xu et al., 2015), visual question answering (Antol et al., 2015; Goyal et al., 2017), and visual dialog (Das et al., 2017a,b) are examples of active research areas in this field. At the same time, visual navigation (Gupta et al., 2017; Shen et al., 2019; Xia et al., 2018) and goal-oriented instruction following (Chen et al., 2019; Fu et al., 2019; Qi et al., 2020b) represent an important part of current work on embodied AI (Das et al., 2018a,b; Savva et al., 2019; Yang et al., 2019). In this context, Vision-and-Language Navigation (VLN) (Anderson et al., 2018c) constitutes a distinctive challenge, as it enriches traditional navigation with visually rich environments and detailed natural-language instructions.