“…Combining information from different modalities, such as images and text, yields more informative representations, because the modalities provide complementary insights into the same instances. Several works use both the vision and language modalities, introducing tasks such as visual question answering [1], visual reasoning [2], visual commonsense reasoning [3], visual entailment [4], image captioning [5], image-text retrieval and, inversely, text-image retrieval [6], referring expressions [7], visual explanations [8] and grounding [9], visual-language navigation [10], visual generation from text [11], visual storytelling [12] and its inverse task of story visualization [13], and visual dialog [14].…”