Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Xue, Hongwei; Huang, Yupan; Liu, Bei; Peng, Houwen; Fu, Jianlong; Li, Houqiang; Luo, Jiebo

doi:10.48550/arxiv.2106.13488

Cited by 3 publications

(2 citation statements)

References 42 publications


“…LXMERT [27] refines the cross-modal part to achieve advanced performance in downstream tasks. More recent models employing a double-stream fusion encoder are ALBEF [88], Visual Parsing [89] and WenLan [90]. In general, double-stream encoders demand training of two transformer models (one for each stream), which is computationally inefficient.…”

Section: Double-stream Fusion Encoder (mentioning)

confidence: 99%

A Survey on Knowledge-Enhanced Multimodal Learning

Lymperaiou

Stamou

2023

Preprint


Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of outermost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.


“…The main difference between patch and grid features is that grid features are extracted from the feature map of a convolutional model while patch features directly utilize a linear projection. Patch features were first introduced by Vision Transformer (ViT) (Dosovitskiy et al, 2021a) and then adopted by VLP models (Xue et al, 2021). The advantage of using patch features is efficiency.…”

Section: B Modality Embedding (mentioning)

confidence: 99%
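The patch-feature idea in the statement above can be illustrated with a minimal NumPy sketch: the image is cut into non-overlapping patches, each patch is flattened, and a single linear projection maps it to a token. This is only an illustration under stated assumptions, not the implementation of ViT or of the cited paper; the function names (`patchify`, `patch_features`), the random weights, and the sizes (224x224 image, 16x16 patches, 768-dimensional tokens) are all hypothetical choices made for the example. Grid features, by contrast, would require running a convolutional backbone first and are not shown.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c): group pixels by patch.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def patch_features(image, weight, bias, patch_size):
    """Project each flattened patch to a token with one linear layer."""
    return patchify(image, patch_size) @ weight + bias


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
patch_size, embed_dim = 16, 768
image = rng.standard_normal((224, 224, 3))
weight = rng.standard_normal((patch_size * patch_size * 3, embed_dim)) * 0.02
bias = np.zeros(embed_dim)

tokens = patch_features(image, weight, bias, patch_size)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of patch tokens
```

The efficiency the statement mentions shows up here as the absence of any convolutional forward pass: the only visual computation before the transformer is one matrix multiplication.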

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Li

Zhang

Zhang

et al. 2022

Preprint


This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or few shot learning tasks. Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation. We believe that this review will be of help for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.


A survey on knowledge-enhanced multimodal learning

Lymperaiou,

Stamou

2024

Artif Intell Rev


Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of outermost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
