A survey on knowledge-enhanced multimodal learning

Lymperaiou, Maria; Stamou, Giorgos

doi:10.48550/arxiv.2211.12328

Cited by 2 publications

(4 citation statements)

References 0 publications

“…Web knowledge encompasses both external and internal knowledge [66] and offers a significant advantage. Therefore, the work described in this article relies on the web knowledge rather than common sense knowledge sourced from ConceptNet.…”

Section: Related Work a Multimodal Machine Learning For Memes

mentioning

confidence: 99%

Capturing the Concept Projection in Metaphorical Memes for Downstream Learning Tasks

Acharya, Das, Sudarshan

2024

IEEE Access

Metaphorical memes, where a source concept is projected into a target concept, are an essential construct in figurative language. In this article, we present a novel approach for downstream learning tasks on metaphorical multimodal memes. Our proposed framework replaces traditional methods using metaphor annotations with a metaphor-capturing mechanism. Besides using the significant zero-shot learning capability of state-of-the-art pretrained encoders, this work introduces an alternative external knowledge enhancement strategy based on ChatGPT (chatbot generative pretrained transformer), demonstrating its effectiveness in bridging the intermodal semantic gap. We propose a new concept projection process consisting of three distinct components to capture the intramodal knowledge and intermodal concept gap in the forms of text modality embedding, visual modality embedding, and concept projection embedding. This approach leverages the attention mechanism of the Graph Attention Network for fusing the common aspects of external knowledge related to the knowledge in the text and image modality to implement the concept projection process. Our experimental results demonstrate the superiority of our proposed approach compared to existing methods.

“…Prior surveys in VL learning [36,37,38,39,40,41,42] do not focus on the collaboration between knowledge and deep learning VL models. An exhaustive presentation of the knowledge-enhanced VL (KVL) topic was presented in [43] for the first time. In the current survey paper, we focus on state-of-the-art endeavors involving transformer models for the VL representation, leading to hybrid approaches when combined with external knowledge.…”

Section: Figure

mentioning

confidence: 99%

“…External knowledge sources are divided in two main categories, explicit and implicit [43]. They are both capable of providing factual, commonsense, temporal, lexical or other knowledge senses [44] missing from pre-trained VL models.…”

Section: Types Of External Knowledge

mentioning

confidence: 99%

“…Aligning independent modality representations within a single multimodal embedding is proposed in [97]. The same work introduces extensions of VL pretraining objectives [43] to incorporate commonsense knowledge from [24] as an extra modality, therefore enforcing learning KVL interrelationships. Dynamic commonsense augmentation of image-text training data is a suggested direction, accompanied by learning to reconstruct hidden visual labels based on knowledge facts retrieved from commonsense KBs [98].…”

Section: Visual Commonsense Reasoning (VCR)

mentioning

confidence: 99%

The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges

Lymperaiou, Stamou

2023

Preprint

Recent advancements in visiolinguistic (VL) learning have allowed the development of multiple models and techniques that offer several impressive implementations, able to currently resolve a variety of tasks that require the collaboration of vision and language. Current datasets used for VL pre-training only contain a limited amount of visual and linguistic knowledge, thus significantly limiting the generalization capabilities of many VL models. External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid architectures. In the current survey, we analyze tasks that have benefited from such hybrid approaches. Moreover, we categorize existing knowledge sources and types, proceeding to discussion regarding the KG vs LLM dilemma and its potential impact to future hybrid approaches.
