We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the TOUCHDOWN task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment, and then identify a location described in natural language to find a hidden object at the goal position. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. Empirical analysis shows the data presents an open challenge to existing methods, and qualitative linguistic analysis shows that the data displays richer use of spatial reasoning compared to related resources. The environment and data are available at https://touchdown.ai.

"the dumpster has a blue tarp draped over the end closest to you. touchdown is on the top of the blue tarp on the dumpster."
LINGUNET: The model correctly predicts the location of Touchdown, putting most of the predicted distribution (green) on the top-left of the dumpster at the center.
TEXT2CONV: The model incorrectly predicts that Touchdown is on top of the car on the far right. While some of the probability mass is correctly placed on the dumpster, the pixel with the highest probability is on the car.
CONCATCONV: The model correctly predicts the location of Touchdown. The distribution is heavily concentrated at a couple of nearby pixels.
CONCAT: The prediction is similar to that of CONCATCONV.
Figure 9. Three of the four models do fairly well; only TEXT2CONV fails to predict the location of Touchdown.

"turn to your right and you will see a green trash barrel between the two blue benches on the right. click to the base of the green trash barrel to find touchdown."
LINGUNET: The model accurately predicts the green trash barrel on the right as Touchdown's location.
TEXT2CONV: The model also predicts the location correctly. The distribution is focused on a smaller area than LINGUNET's, closer to the top of the object. This possibly reflects a learned bias towards placing Touchdown on top of objects, to which TEXT2CONV is more susceptible.
CONCATCONV: The model prediction is correct. The distribution is focused on fewer pixels compared to LINGUNET.
CONCAT: The model prediction is correct. Similar to CONCATCONV, it focuses on a few pixels.
Figure 10. All the models predict the location of Touchdown correctly. A trash can is a relatively common object on which workers place Touchdown in the dataset.

"on your right is a parking garage, there is a red sign with bikes parked out in front of the garage, the bear is on the red sign."
LINGUNET: The model correctly predicts the location of Touchdown on the red stop sign on the right side.
TEXT2CONV: The model predicts the location of Touchdown correctly.
CONCATCONV: The model predicts the location of Touchdown correctly.
CONCAT: The model predicts the location of Touchdown correctly.
Figure 11. All the models predict the location of Touchdown correctly. References to a red sign are relatively common in the data (Figure 8) potentially sim...