Interpretable visual reasoning: A survey

He, Feijuan; Wang, Yaxian; Miao, Xianglin; Sun, Xia

doi:10.1016/j.imavis.2021.104194

Cited by 11 publications

(4 citation statements)

References 17 publications


“…Visiolinguistic (VL) learning has been one of the fastest evolving fields of artificial intelligence, especially after the emergence of the Transformer [1], which enabled a variety of powerful architectures. Popular VL tasks such as Visual Question Answering (VQA) [2], Visual Reasoning (VR) [3], Visual Commonsense Reasoning (VCR) [4], Visual Entailment (VE) [5], Image Captioning (IC) [6], Image-Text Retrieval (ITR) and inversely Text-Image Retrieval (TIR) [7], Visual-Language Navigation (VLN) [8], Visual Storytelling (VIST) and Visual Dialog (VD) [9] have been significantly benefited from recent transformer-based advancements which follow the pre-train fine-tune learning framework. Pre-training is responsible of fusing generic information regarding visual and linguistic patterns, as well as how those two modalities interact, based on information present in large-scale datasets.…”

Section: Introduction (mentioning)

confidence: 99%

The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges

Lymperaiou

Stamou

2023

Preprint


Recent advancements in visiolinguistic (VL) learning have allowed the development of multiple models and techniques that offer several impressive implementations and can now resolve a variety of tasks that require the collaboration of vision and language. Current datasets used for VL pre-training contain only a limited amount of visual and linguistic knowledge, which significantly limits the generalization capabilities of many VL models. External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) can cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid architectures. In the current survey, we analyze tasks that have benefited from such hybrid approaches. Moreover, we categorize existing knowledge sources and types, and then discuss the KG vs. LLM dilemma and its potential impact on future hybrid approaches.



“…Combining information from different modalities, such as images, and text, allows more informative representations, as they provide complementary insights for the same instances. Several works focus on using both vision and language modalities, introducing tasks such as visual question answering [1], visual reasoning [2], visual commonsense reasoning [3], visual entailment [4], image captioning [5], image-text retrieval and inversely text-image retrieval [6], referring expressions [7], visual explanations [8] and grounding [9], visual-language navigation [10], visual generation from text [11], visual storytelling [12] and its inverse task of story visualization [13], and visual dialog [14].…”

Section: Introduction (mentioning)

confidence: 99%

A Survey on Knowledge-Enhanced Multimodal Learning

Lymperaiou

Stamou

2023

Preprint


Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects calls into question the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.


“…At the same time, we ensure that our data is robust to perturbations and artefacts by i) controlling for word frequency biases between captions and foils, and ii) testing against unimodal collapse, a known issue of V&L models (Goyal et al, 2017; Madhyastha et al, 2018), thereby preventing models from solving the task using a single input modality. The issue of neural models exploiting data artefacts is well-known (Gururangan et al, 2018; Jia et al, 2019; He et al, 2021) and methods have been proposed to uncover such effects, including gradient-based, adversarial perturbations or input reduction techniques (cf. ).…”

Section: Introduction (mentioning)

confidence: 99%

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

bescu

Cafagna

Muradjan

et al. 2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
