2021
DOI: 10.1016/j.imavis.2021.104194
Interpretable visual reasoning: A survey

Cited by 11 publications (4 citation statements) | References 17 publications
“…Visiolinguistic (VL) learning has been one of the fastest evolving fields of artificial intelligence, especially after the emergence of the Transformer [1], which enabled a variety of powerful architectures. Popular VL tasks such as Visual Question Answering (VQA) [2], Visual Reasoning (VR) [3], Visual Commonsense Reasoning (VCR) [4], Visual Entailment (VE) [5], Image Captioning (IC) [6], Image-Text Retrieval (ITR) and inversely Text-Image Retrieval (TIR) [7], Visual-Language Navigation (VLN) [8], Visual Storytelling (VIST) and Visual Dialog (VD) [9] have benefited significantly from recent transformer-based advancements that follow the pre-train fine-tune learning framework. Pre-training is responsible for fusing generic information regarding visual and linguistic patterns, as well as how those two modalities interact, based on information present in large-scale datasets.…”
Section: Introduction (mentioning)
Confidence: 99%
“…Combining information from different modalities, such as images and text, allows for more informative representations, as the modalities provide complementary insights into the same instances. Several works focus on using both vision and language modalities, introducing tasks such as visual question answering [1], visual reasoning [2], visual commonsense reasoning [3], visual entailment [4], image captioning [5], image-text retrieval and inversely text-image retrieval [6], referring expressions [7], visual explanations [8] and grounding [9], visual-language navigation [10], visual generation from text [11], visual storytelling [12] and its inverse task of story visualization [13], and visual dialog [14].…”
Section: Introduction (mentioning)
Confidence: 99%
“…At the same time, we ensure that our data is robust to perturbations and artefacts by i) controlling for word frequency biases between captions and foils, and ii) testing against unimodal collapse, a known issue of V&L models (Goyal et al, 2017;Madhyastha et al, 2018), thereby preventing models from solving the task using a single input modality. The issue of neural models exploiting data artefacts is well-known (Gururangan et al, 2018;Jia et al, 2019;He et al, 2021) and methods have been proposed to uncover such effects, including gradient-based, adversarial perturbations or input reduction techniques (cf. ).…”
Section: Introduction (mentioning)
Confidence: 99%