“…Because most natural language instructions describe only part of the trajectory to the target position, the agent struggles to interpret them unless they are effectively aligned and grounded with visual features. VLN studies [7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] therefore employ various attention mechanisms, such as visual or textual attention, to align and ground natural language instructions with the input images. However, these attention-based VLN models learn the association between instructions and images only from the limited R2R dataset; consequently, they have difficulty acquiring broader general knowledge of the relationship between natural language instructions and images.…”
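As a rough illustrative sketch (not the formulation of any specific cited work, and with all symbols introduced here for illustration only), such visual attention typically scores each image region feature $v_i$ against an instruction-conditioned hidden state $h_t$ and forms a grounded visual context $c_t$:
\[
\alpha_{t,i} = \frac{\exp\!\big(h_t^{\top} W v_i\big)}{\sum_{j} \exp\!\big(h_t^{\top} W v_j\big)}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, v_i,
\]
where $W$ is a learned projection; textual attention is defined analogously, with the roles of the instruction word features and visual features exchanged.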