2021
DOI: 10.48550/arxiv.2111.05759
Preprint

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Abstract: Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interaction with the environment as it moves. Recent Transformer-based VLN methods have made great progress by exploiting direct connections between visual observations and the language instruction via the multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector by using an…
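As a rough illustration of the mechanism the abstract describes, the sketch below (a hypothetical PyTorch fragment, not the paper's released code) shows instruction tokens cross-attending over the current visual observations, with an optional variable-length memory of past activations appended to the attended context. Module names, dimensions, and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CrossModalStep(nn.Module):
    """One cross-attention step: instruction queries attend over
    (memory + current view) keys/values. Illustrative only."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instr_tokens, view_tokens, memory_tokens=None):
        # instr_tokens:  (B, L, D) encoded instruction
        # view_tokens:   (B, V, D) current panoramic view features
        # memory_tokens: (B, M, D) past activations; M grows with the episode
        context = view_tokens if memory_tokens is None \
            else torch.cat([memory_tokens, view_tokens], dim=1)
        fused, _ = self.attn(instr_tokens, context, context)
        return fused

step = CrossModalStep()
instr = torch.randn(2, 20, 256)    # toy instruction encoding
views = torch.randn(2, 36, 256)    # toy panoramic observation
mem = torch.randn(2, 5, 256)       # activations kept from 5 earlier steps
out = step(instr, views, mem)      # -> (2, 20, 256)
```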

Cited by 4 publications (9 citation statements)
References: 52 publications
“…Recently, HAMT [42] and Episodic Transformer [43] explicitly modeled the history information by directly encoding all past observations and actions, but this is fairly complex. In contrast, MTVM [5] proposes a Transformer with a variable-length memory framework to model history information explicitly by copying past activations into a memory bank without the need to consider distances in the path.…”
Section: Multimodal Transformers
confidence: 99%
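The memory-bank idea quoted above can be sketched in a few lines: past activations are simply copied into a growing list and concatenated on demand, with no distance or position encoding over the path. This is a hedged illustration under assumed shapes, not MTVM's actual implementation.

```python
from typing import Optional
import torch

class VariableLengthMemory:
    """Stores copies of past activations; length varies with the episode."""
    def __init__(self):
        self.bank = []                         # list of (B, N, D) tensors

    def write(self, activation: torch.Tensor) -> None:
        # Copy (and detach) so later updates cannot alter stored history.
        self.bank.append(activation.detach().clone())

    def read(self) -> Optional[torch.Tensor]:
        # Concatenate along the token axis; no path-distance encoding applied.
        return torch.cat(self.bank, dim=1) if self.bank else None

memory = VariableLengthMemory()
memory.write(torch.randn(2, 1, 256))   # activation from step 1
memory.write(torch.randn(2, 1, 256))   # activation from step 2
print(memory.read().shape)             # torch.Size([2, 2, 256])
```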
“…Remembering history information is essential for an agent to implement correct decision-making during navigation. Inspired by [5], we introduce a memory-based Transformer module as the action policy framework to explicitly model history information. The information generated during reasoning is stored in the form of a scene memory token m_t, where m_t is composed of multiple features concatenated:…”
Section: Memory-based Transformer
confidence: 99%
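The scene memory token described in the excerpt is a concatenation of several per-step features. Below is a minimal sketch assuming a visual, an action, and a state feature; the actual feature set and dimensions belong to the cited paper and are not reproduced here.

```python
import torch

def build_memory_token(visual_feat, action_feat, state_feat):
    # visual_feat: (B, Dv), action_feat: (B, Da), state_feat: (B, Ds)
    # The scene memory token m_t has dimension Dv + Da + Ds.
    return torch.cat([visual_feat, action_feat, state_feat], dim=-1)

m_t = build_memory_token(torch.randn(2, 512),
                         torch.randn(2, 128),
                         torch.randn(2, 128))
print(m_t.shape)   # torch.Size([2, 768])
```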
“…Most neural network models cannot read from and write to a long-term memory component, nor couple such a memory tightly with inference. Memory-network-based methods store extracted features in an external memory and design a pairing-and-reading algorithm, which improves inference over sequential multimodal information and is effective on video captioning [21], vision-and-language navigation [22,23], and visual-and-textual question answering [24] tasks.…”
Section: Feature Extraction
confidence: 99%
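A generic example of the external-memory read-out such methods rely on: stored features are addressed by content (dot-product scoring against a query) and read back as a weighted sum. This is a minimal sketch under assumed shapes, not any specific cited model.

```python
import torch
import torch.nn.functional as F

def memory_read(query, memory):
    # query:  (B, D)    current feature used as the read key
    # memory: (B, N, D) features previously written to external memory
    scores = torch.einsum('bd,bnd->bn', query, memory)   # pairing scores
    weights = F.softmax(scores, dim=-1)                   # soft addressing
    return torch.einsum('bn,bnd->bd', weights, memory)    # read-out vector

read = memory_read(torch.randn(2, 256), torch.randn(2, 10, 256))
print(read.shape)   # torch.Size([2, 256])
```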
“…In addition, as the VLN agent faces structured environments, understanding the topology of the environment is crucial for the success of navigation. However, existing methods either arrange the historical observations in a sequential manner [14,32,42], or adopt complicated modules (e.g., graph neural networks) for modeling environment layouts [10,22,53]. Differently, we develop a Structured Transformer Planner (STP), where the position-embedded visual observations are used as input tokens, and the geometric relations (local connectivity) among the navigation locations are elegantly formulated as the directional attention among input tokens.…”
Section: Introduction
confidence: 99%
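The "directional attention among input tokens" mentioned above can be pictured as an attention mask derived from the navigation graph: each location token may only attend to itself and its connected neighbours. The sketch below is an assumed, simplified rendering of that idea, not the STP paper's exact formulation.

```python
import torch
import torch.nn as nn

def connectivity_mask(adj: torch.Tensor) -> torch.Tensor:
    # adj: (N, N) 0/1 adjacency of navigation locations (1 = connected).
    # Additive mask: 0 where attention is allowed, -inf where it is blocked.
    allowed = adj.bool() | torch.eye(adj.size(0), dtype=torch.bool)
    mask = torch.full_like(adj, float('-inf'))
    return mask.masked_fill(allowed, 0.0)

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
tokens = torch.randn(1, 4, 256)              # 4 position-embedded locations
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
out, _ = attn(tokens, tokens, tokens, attn_mask=connectivity_mask(adj))
print(out.shape)   # torch.Size([1, 4, 256])
```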